2024-09-15

Content note: This post includes the following topics:

  • Focuses on: US Politics

About That Presidential Debate

If you’ve been on the internet for the past few days, you’ve seen plenty of "They’re eating the dogs, they’re eating the cats" and "Transgender operations on illegal aliens in prison" memes. Reading through those, and through general takes on the presidential debate between Kamala Harris and Donald Trump, gave me the idea to throw a few simple analysis tools at a transcript of the thing and see what happens.

In this post, I will just dryly present what I found and leave any judgment on what these findings imply about the people involved to you, the reader. The script I used for the entire analysis can be found on GitHub.

Amount of talking

Going purely by the number of words said, Trump talked about a third more than Harris, at approximately 8000 versus approximately 6000 words. Suspecting that this might partly be because Harris uses longer words than Trump, I also counted the number of syllables each participant said: at approximately 11000 versus approximately 8700 syllables, the gap narrows slightly to roughly 25% more syllables said by Trump.

The moderators, whom I included in the entire analysis both for fun and as a point of reference, end up at roughly 2200 words (3200 syllables) for David Muir and 949 words (1391 syllables) for Linsey Davis. The gap between the word-based and the syllable-based measure is notably bigger here.
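
For illustration, the tallying itself boils down to something like this minimal Python sketch: it assumes the transcript has already been split into (speaker, text) pairs, and the example lines and names are purely illustrative rather than taken from the actual script.

    # Minimal word-count sketch: tally whitespace-separated words per speaker.
    # The transcript structure and example lines are illustrative only.
    from collections import defaultdict

    def words_per_speaker(turns):
        totals = defaultdict(int)
        for speaker, text in turns:
            totals[speaker] += len(text.split())
        return dict(totals)

    turns = [
        ("HARRIS", "So, I was raised a middle-class kid."),
        ("TRUMP", "They're eating the dogs."),
    ]
    print(words_per_speaker(turns))  # {'HARRIS': 7, 'TRUMP': 4}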

Now that we know how much each candidate said in pure numbers, let’s launch into some (more interesting, I promise!) analysis of what they said:

Language complexity

Running both Flesch-Kincaid readability tests on each participant’s speech reveals a pretty stark difference between the candidates:

The Flesch-Kincaid Grade Level test, which estimates the (US-American) school grade a reader needs to have reached to understand a given text, places Harris at a grade level of 8.08 and Trump at 4.22. For reference, the moderators Muir and Davis end up at 5.89 and 5.99 respectively.

The Flesch Reading Ease test, which works on a scale where a higher score means a simpler text, places Trump at 82.13 ("Easy to read. Conversational English for consumers." according to the Wikipedia article), whereas Harris gets scored at 66.41 ("Plain English. Easily understood by 13- to 15-year-old students.") and Muir and Davis both reach roughly 72 ("Fairly easy to read.").

Note that the gap between Harris and the moderators is a lot smaller in the Flesch Reading Ease test than in the Flesch-Kincaid Grade Level test. The number of syllables per word impacts the former much more than the latter, with the number of words per sentence taking more of a back seat. This implies that the difference mainly comes from Harris using longer sentences rather than longer words.
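
For reference, here are both formulas written out from their standard definitions (the actual script may well compute them differently); the relative weight of the syllables-per-word term versus the words-per-sentence term is what drives the effect just described.

    # Standard Flesch formulas. Note the weights: relative to sentence length,
    # Reading Ease punishes long words (84.6 vs. 1.015) far more heavily than
    # the Grade Level does (11.8 vs. 0.39).
    def flesch_reading_ease(words, sentences, syllables):
        return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

    def flesch_kincaid_grade(words, sentences, syllables):
        return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59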

As a joke, and in reference to a piece that inspired quite a bit of this post, Linus Neumann’s Trolldrossel (German), I also ran every speaker’s part through zlib to see how well their speech compresses. The idea is that DEFLATE, the LZ77-based algorithm zlib uses, works best on input that contains lots of repetition, so the compression ratio would indicate how varied each participant’s speech is. The ratio ended up barely differing between the two presidential candidates, reaching 34.6% of the original size for Trump and 35.9% for Harris. For reference, Muir’s speech compressed to 36.8% of its original size, while Davis’ parts were the least compressible at a ratio of 41.0%. Since this is a joke metric in the first place and the ratio mostly tracks how much each participant said in total (more text simply compresses better), I’d let this slide.
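
The core of this joke metric is tiny; a sketch along these lines (not taken from the actual script) compresses a speaker’s concatenated text and reports the compressed size as a share of the original:

    # Joke metric sketch: DEFLATE a speaker's full text and report the
    # compressed size as a share of the original. More repetitive speech
    # yields a smaller ratio.
    import zlib

    def compression_ratio(text):
        raw = text.encode("utf-8")
        return len(zlib.compress(raw, 9)) / len(raw)

    print(f"{compression_ratio('they are eating the dogs ' * 40):.1%}")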

Most used words

I also took a look at the 20 words each participant uses the most. This count excludes a few "boring" words such as "and" or "then" that would otherwise clog up the list and hide all the fun parts. The full list of "boring" words can be found in the source code.
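
Mechanically, this is just a filtered word count; the sketch below is illustrative, and its stop word set is a stand-in rather than the actual "boring" list from the script:

    # Top-n words per speaker, skipping a (here: made-up) set of "boring" words.
    import re
    from collections import Counter

    BORING = {"the", "a", "an", "and", "then", "to", "of", "that", "is", "in", "it"}

    def top_words(text, n=20):
        # Match letter runs, allowing straight and curly apostrophes ("they're").
        words = re.findall(r"[a-z'’]+", text.lower())
        return Counter(w for w in words if w not in BORING).most_common(n)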

The full list of most used words can be seen in the results file; this segment is more of a "highlight reel" of interesting things I found in the results.

The first finding to immediately jump out at me was that Trump’s most used word by a large margin was "they", which he said no less than 168 times. "They" does not show up in any of the other participants’ most used words.

His next two most used words are also pronouns: he mentions "she" 91 times and "we" 84 times. "We" also shows up in Harris’ most used words at 91 mentions, but no participant other than Trump has "she" among their most used words.

In turn, while Harris’ most used words include both "Donald" (32 mentions) and "Trump" (31 mentions), Trump’s most used words include neither "Kamala" nor "Harris" (nor "Joe" or "Biden", for the record). The moderators both bring up both candidates’ names, but with a different distribution: Muir mentions "Harris" more often (26 to 21 mentions), whereas Davis uses "Trump" more (21 to 17 times).

In order to get a few more interesting words, I ran the analysis again, this time also excluding all pronouns from the list. The only noteworthy result is that even after this filtering, none of the speakers’ most used words include any word directly relating to specific policy, with the exception of Linsey Davis, whose most used words then also include "Abortion", "Issue", "Israel" and "Race". Seeing as she only mentioned these words 3-5 times each, this can likely be chalked up to her having said less overall. A short look at the questions Muir asks confirms this at least for the moderators’ side, as he also brings up specific policy questions.

One last word-use test I ran was computing, for each word, the difference between how often Trump used it and how often Harris did (and vice versa), then printing out the 20 words with the biggest difference in each direction. The immediately visible difference, again, is that Trump mentioned "they" and "they’re" a lot more than Harris, at 159 and 45 more uses respectively.

Other than that, this also shows differences in the candidates’ wording: Harris said "American" 25 more times than Trump, whereas Trump’s part contains the word "Country" 43 more times. A potentially misleading detail is that Trump made 33 more mentions of the word "People". This number seems bigger than it is because this part of the analysis does not take the total number of words said into account: "People" also shows up in Harris’ most used words, with her having said it 47 times. Trump still mentioned "People" slightly more often (about 1% of the words he said were "People"; for Harris it’s roughly 0.78%), but the difference is smaller than the initial number makes it seem. Always remember: A number without context can mean nothing ;D
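
For the curious, this comparison boils down to subtracting two word counters in both directions; the sketch below also includes the per-speaker share that the raw difference leaves out (again illustrative, not lifted from the script):

    # Biggest word-use gaps in one direction, plus the normalisation by each
    # speaker's total word count that the raw difference ignores.
    from collections import Counter

    def biggest_gaps(counts_a, counts_b, n=20):
        return (counts_a - counts_b).most_common(n)  # keeps only positive gaps

    def word_share(counts, word):
        return counts[word] / sum(counts.values())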

If you come up with other analyses to run on the transcript, please do email me! If they’re interesting, they’ll be edited into this post below this note. Other than that, all I can say to end this is to again ask you to draw your own conclusions from this data. I hope to have given you enough context to do so.

Addendum: Implementation details

As a little bonus, let me highlight a few fun technical things I ran into while writing the script:

Both Flesch-Kincaid readability tests use the number of syllables in the text as part of their calculation. Now, googling "syllable count" will lead you down a fun rabbit hole of natural language processing, and a fully algorithmic syllable counter is not really doable, especially not for English. Most solutions just involve using a dictionary where the number of syllables is pre-defined, which is what I ended up doing, copying BigPhoney’s approach of using the CMU Pronouncing Dictionary. The number of syllables in a word can then simply be determined as the number of phonemes with stress markers.
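
A lookup of that kind can look roughly like this; the sketch uses nltk’s copy of the CMU Pronouncing Dictionary, which is an assumption on my part rather than what the actual script does:

    # Syllables = phonemes carrying a stress marker, i.e. ending in a digit
    # in the CMU notation. Requires a one-time nltk.download("cmudict").
    from nltk.corpus import cmudict

    PRONUNCIATIONS = cmudict.dict()

    def cmu_syllables(word):
        phones = PRONUNCIATIONS[word.lower()][0]  # first listed pronunciation
        return sum(1 for phone in phones if phone[-1].isdigit())

    print(cmu_syllables("presidential"))  # -> 4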

In case the CMU dictionary does not include a word, I fall back to a simple hack that counts the number of vowels/vowel groups. This is flawed for words like "date" or "going", but given that the words missing from the CMU dictionary are likely just a small minority, it is not going to make a huge impact. In fact, an early version of the script had a bug where it would never use the CMU dictionary version and still ended up at fairly similar numbers. The only major deviation was in the Flesch-Kincaid Grade Level scores for the moderators, which landed at 6.27 and 6.91 (nice) for the bugged version.
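
The fallback amounts to roughly the following (illustrative, not copied from the script), and it exhibits exactly the flaws just mentioned:

    # Fallback heuristic: every run of consecutive vowels counts as one syllable.
    import re

    def vowel_group_syllables(word):
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    print(vowel_group_syllables("date"))   # 2 (the silent e is over-counted)
    print(vowel_group_syllables("going"))  # 1 ("oi" collapses into one group)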

One interesting bug was that in the first run, the script returned not four, but seven speakers. This is due to inconsistencies in the names the ABC11 transcript gives the speakers: "VICE PRESIDENT KAMALA HARRIS" becomes "VICE PRESIDENT HARRIS" in one spot, "LINSEY DAVIS" is misspelled as "LINDSEY DAVIS" once, and "FORMER PRESIDENT DONALD TRUMP" is simplified to "PRESIDENT TRUMP" in one place. My solution was to simply merge the erroneous names’ data into their correct places.
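
Conceptually, that merge is just a small alias map applied before any counting; something along these lines (the labels are the ones from the transcript, the code itself is illustrative):

    # Normalise the inconsistent speaker labels before aggregating anything.
    ALIASES = {
        "VICE PRESIDENT HARRIS": "VICE PRESIDENT KAMALA HARRIS",
        "LINDSEY DAVIS": "LINSEY DAVIS",
        "PRESIDENT TRUMP": "FORMER PRESIDENT DONALD TRUMP",
    }

    def canonical_speaker(label):
        return ALIASES.get(label, label)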