Tracing the Changing State of the Union with Text Analysis
U.S. presidents since George Washington have delivered State of the Union addresses each year to describe the nation’s condition and prioritize future action. Can we glean historical patterns from the texts? Do presidents speak similarly in times of war or depression? Do Republicans and Democrats emphasize different words? How does the evolution of American English affect the speeches?
To explore these questions, I performed textual analysis of all of the State of the Union addresses given through 2012. (For those interested, technical details are at the end of the post.) I broke the addresses into words, removed common words like “the” and “and,” and weighted uncommon words more highly. Then I measured the similarity of each pair of presidents by comparing the weighted words they used in common.
This figure shows the pairwise similarities for presidential State of the Union addresses. Blue squares indicate that the presidents used few words in common, while white and red imply more overlap in vocabulary. The red diagonal line is the similarity of each president to himself, which is 100% by definition. I’ve color-coded the presidents by their political party.
Several interesting trends appear. First, the dominant effect is that presidents nearer in time use more similar words. Second, there seems to be a general separation before and after the early part of the 20th century (Wilson-Hoover): there is a great deal of overlap among the early presidents and among the post-WWII presidents, but less similarity between those groups.
We can look for bulk similarity and difference by aggregating the similarities for each president and looking at the range of values (excluding the self-comparison):
Few presidents seem to generate unusual influence on later addresses. However, some presidents use noticeably different language than their peers, including John Adams, James Polk, Warren Harding, and George W. Bush.
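The aggregation described above can be sketched in a few lines. This is only an illustration, not the code behind the figure: the president labels and similarity values are made up, and summarizing each row by its minimum, median, and maximum is an assumption about what "range of values" means here.

```python
from statistics import median

def similarity_ranges(names, sim):
    """Summarize each row of a square similarity matrix, skipping the diagonal."""
    summary = {}
    for i, name in enumerate(names):
        # Exclude the self-comparison, which is always 1.0.
        others = [sim[i][j] for j in range(len(names)) if j != i]
        summary[name] = {
            "min": min(others),
            "median": median(others),
            "max": max(others),
        }
    return summary

# Hypothetical labels and values, for illustration only.
names = ["P1", "P2", "P3", "P4"]
sim = [
    [1.0, 0.6, 0.3, 0.2],
    [0.6, 1.0, 0.5, 0.4],
    [0.3, 0.5, 1.0, 0.7],
    [0.2, 0.4, 0.7, 1.0],
]

summary = similarity_ranges(names, sim)
for name in names:
    print(name, summary[name])
```

A president whose maximum is low relative to his neighbors in time would stand out as using unusual language.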
Finally, each president’s words define a direction in vocabulary space. We can project the presidents’ addresses into two dimensions, spacing the points by their relative distances in vocabulary space:
Presidents that sit closer together in this projection used more similar vocabulary. The points line up mainly by time period rather than by party, so temporal order, not political affiliation, appears to be the dominant influence on the language presidents use in their State of the Union addresses.
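For readers curious how a projection like this works, here is a minimal sketch using only the Python standard library. It swaps in classical (Torgerson) multidimensional scaling, computed with power iteration, which is a simpler relative of the iterative routine in scikit-learn and not the method used for the figure. Turning a similarity into a distance as 1 minus the similarity is an assumption, and the toy points are made up so the result can be checked.

```python
import math
import random

def similarity_to_distance(sim):
    """Assumed conversion: distance = 1 - similarity."""
    return [[1.0 - s for s in row] for row in sim]

def double_center(dist):
    """Turn a symmetric distance matrix into the centered Gram matrix B."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpair(mat, iters=500, seed=0):
    """Power iteration; assumes the dominant eigenvalue is positive."""
    rng = random.Random(seed)
    n = len(mat)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    val = sum(v[i] * sum(mat[i][j] * v[j] for j in range(n)) for i in range(n))
    return val, v

def classical_mds(dist, dims=2):
    """Embed points in `dims` dimensions from their pairwise distances."""
    b = double_center(dist)
    n = len(b)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        val, vec = top_eigenpair(b, seed=k)
        scale = math.sqrt(max(val, 0.0))
        for i in range(n):
            coords[i][k] = scale * vec[i]
        # Deflate: remove the component just found before the next pass.
        b = [[b[i][j] - val * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords

# Toy check: four made-up points on a plane, recovered from distances alone.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist)
```

The recovered coordinates can be rotated or mirrored relative to the originals, which is why only the relative spacing of points in such a plot is meaningful, not the axes themselves.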
I got the State of the Union texts from Project Gutenberg and whitehouse.gov and manually edited them to remove footnotes and commentary like “(applause).” For each president, I combined the texts of all addresses into a single text. I used Python’s NLTK to tokenize and stem the texts and remove stopwords. Then I used gensim to convert the texts to a vector representation, applied TF-IDF weighting to the vectors, and computed the pairwise cosine similarities between the texts. For the projection into 2D, I used scikit-learn’s multidimensional scaling (MDS) routine. William Henry Harrison and James Garfield did not live long enough to deliver a State of the Union address, and I have combined Cleveland’s two terms into a single text.
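To make the weighting and similarity steps concrete, here is a minimal sketch in plain Python. It is not the NLTK and gensim code used for the analysis: the tiny stopword list, the regular-expression tokenizer, the log-base-2 inverse document frequency, and the three toy texts are all assumptions for illustration, and stemming is left out entirely.

```python
import math
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "our", "we", "is"}

def tokenize(text):
    """Lowercase, split into alphabetic words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS]

def tfidf_vectors(docs):
    """Map each document name to a sparse {word: tf-idf weight} dict."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())  # each word counted once per document
    return {
        name: {w: tf * math.log2(n_docs / doc_freq[w]) for w, tf in c.items()}
        for name, c in counts.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up texts standing in for each president's combined addresses.
docs = {
    "A": "The union is strong and the economy is growing",
    "B": "Our economy is growing and jobs are returning",
    "C": "War threatens the frontier and the treaty",
}

vectors = tfidf_vectors(docs)
names = sorted(docs)
similarity = {a: {b: cosine(vectors[a], vectors[b]) for b in names}
              for a in names}
```

Words that appear in every text get a weight of zero under this scheme, so only vocabulary that distinguishes some presidents from others contributes to the similarity.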
There’s more that can be done with these data, including identifying the most significant words that distinguish one address from another and looking for evolution in the individual addresses of specific presidents.