Exploring Homestuck


Homestuck Character Data


Character	Total Word Count	Distinct Word Count	Type-Token Ratio	Average Words Per Line	Average Word Length
Calsprite	78	3	3.85%	5.86	3.85
Narrator	55744	7501	13.46%	9.92	4.53
John	8212	1263	15.38%	7.04	3.87
Jade	3539	670	18.93%	6.54	3.73
Dave	8909	1769	19.86%	7.31	4.05
Rose	7013	1763	25.14%	8.87	4.37
Karkat	4131	1117	27.04%	9.18	4.31
Terezi	2581	774	29.99%	6.09	4.18
Tavros	1292	478	37.00%	7.18	4.04
Kanaya	1593	605	37.98%	8.80	4.45
Nannasprite	797	341	42.79%	16.96	4.31
Jaspersprite	257	127	49.42%	9.52	3.77
Sollux	205	130	63.41%	8.54	3.87
Davesprite	280	181	64.64%	7.18	4.11

This table documents the speech of each character in Homestuck, as well as that of the omniscient narrator, who may be considered equivalent to the author, Andrew Hussie. Hussie has appeared several times as a self-insert character within Homestuck, although of course this character may say and do things that the real Andrew Hussie would not. This project covers only the data contained in Acts 1 through 4 of Homestuck. All dialogue in Homestuck is carried out through online chats on the fictional instant messaging system Pesterchum. Every character has individual "typing quirks," primarily variations in capitalization and punctuation habits, that distinguish their typing style from everyone else's in the comic. Dialogue is broken up script style, with each line preceded by the writer's chat handle. A line can contain anything from multiple sentences to a single emoticon, and a speaker can have more than one line in a row. The Narrator is the one exception. He writes in ordinary paragraph form, so all line statistics for him apply to sentences: his average words per line is actually his average words per sentence.
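As a rough illustration of how a per-line statistic such as Average Words Per Line could be computed from script-style dialogue, here is a minimal sketch. It assumes each line has the plain form `HANDLE: text`; that format, the function name `words_per_line`, and the sample lines are all invented for this example, and the project's own encoding (it mentions an XML version of Homestuck) may well differ.

```python
from collections import defaultdict


def words_per_line(log_lines):
    """Return {handle: average words per line} for 'HANDLE: text' lines.

    Assumption: the handle is everything before the first colon, and
    words are whitespace-separated, so an emoticon counts as one word.
    """
    word_totals = defaultdict(int)
    line_totals = defaultdict(int)
    for raw in log_lines:
        handle, sep, text = raw.partition(":")
        if not sep:
            continue  # skip lines that carry no handle prefix
        handle = handle.strip()
        word_totals[handle] += len(text.split())
        line_totals[handle] += 1
    return {h: word_totals[h] / line_totals[h] for h in line_totals}


# Invented sample lines, not quoted from the comic.
sample = [
    "EB: hi dave",
    "TG: sup",
    "EB: not much",
    "EB: :)",
]
averages = words_per_line(sample)
print(averages)
```

Splitting on the first colon only keeps emoticons such as `:)` intact in the text portion. For the Narrator, the same function would need sentences rather than chat lines as its input units.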

Type-Token Ratio

The scatterplot above visually represents the relationship between total word counts and distinct word counts, that is, the type-token ratio. The type-token ratio is the percentage of words in a sample of text that are distinct. In this text at least, there seems to be a general inverse correlation between type-token ratio and word count. This makes sense: the more words one uses, the more likely one is to repeat words. For example, Sollux, who has only one conversation and uses a total of only 205 words, has one of the highest type-token ratios at 63.41%. In contrast, the Narrator, who of course has the highest word count at 55,744 words, has the highest count of distinct words but also one of the smallest type-token ratios, only 13.46%. The Narrator and Sollux sit at opposite extremes of both measures, which illustrates the relationship between them. The only character who has both a lower word count than Sollux and a lower type-token ratio than the Narrator is Calsprite, a character who appears only once and uses only the words “Haa,” “Hee,” and “Hoo,” in various combinations. Once this outlier is discounted, high word count corresponds consistently with low type-token ratio.
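The definition and the trend described above can be checked directly against the table. The sketch below is not the project's own code: the function names are invented, lowercasing tokens before counting is an assumption (the original analysis may have treated case differently, given the characters' typing quirks), and Spearman rank correlation is a technique chosen here for illustration, not one the original page uses.

```python
def type_token_ratio(tokens):
    """Distinct words divided by total words, as a percentage."""
    if not tokens:
        raise ValueError("need at least one token")
    # Assumption: case is ignored when deciding whether words are distinct.
    distinct = {t.lower() for t in tokens}
    return 100 * len(distinct) / len(tokens)


# (total word count, distinct word count), copied from the table above.
counts = {
    "Calsprite": (78, 3), "Narrator": (55744, 7501), "John": (8212, 1263),
    "Jade": (3539, 670), "Dave": (8909, 1769), "Rose": (7013, 1763),
    "Karkat": (4131, 1117), "Terezi": (2581, 774), "Tavros": (1292, 478),
    "Kanaya": (1593, 605), "Nannasprite": (797, 341),
    "Jaspersprite": (257, 127), "Sollux": (205, 130),
    "Davesprite": (280, 181),
}
ratios = {name: 100 * d / t for name, (t, d) in counts.items()}


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation using the no-ties formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Leave out Calsprite, the outlier discussed above.
names = [n for n in counts if n != "Calsprite"]
rho = spearman([counts[n][0] for n in names], [ratios[n] for n in names])
print(round(ratios["Sollux"], 2), rho)
```

With Calsprite excluded, the rank correlation between total word count and type-token ratio comes out strongly negative, which is consistent with the trend described above. The no-ties formula is safe here only because every word count and every ratio in the table is distinct.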

However, this correlation isn't the whole story among the characters with comparable amounts of speech. Although Dave has the second highest word count, he has only the fifth lowest type-token ratio. Both John, who has about the same amount of speech as Dave, and Jade, who speaks less than half as much, have a lower type-token ratio and therefore, we can conclude, use a less varied vocabulary than Dave does. Given the sheer volume of Dave's speech, the data indicate that Dave has a fairly extensive vocabulary. Further evidence of Dave’s large vocabulary comes from Davesprite, a future version of Dave.

Some of these quantitative observations back up things that regular readers of the comic have picked up before. In Act 6, far past the material covered by this table, the main characters of Homestuck meet in person for the first time. One of the things that quickly gained attention was that Dave seemed to be using shorter and simpler sentences, as well as fewer and shorter words. Many people speculated that Dave had Dictionary.com open whenever he talked to his friends online. Of course, most people have a higher type-token ratio when they write than when they speak, because they have more time to think about what they want to say.

Karkat also appears to have quite a varied vocabulary. Among the characters with lower word counts, type-token ratio falls in approximately the same order that word count rises. But Karkat has about the same type-token ratio as Terezi, despite speaking more than one and a half times as much. In fact, his speech is more varied than that of Rose, who has roughly 1.7 times his total word count. Davesprite has the highest type-token ratio of all, even though Jaspersprite and Sollux both speak less than he does. If Davesprite’s speech can be taken as a smaller sample of Dave’s speech, this might be a further indication of Dave’s large vocabulary. It is also possible that, as an older, more experienced Dave, Davesprite uses even more varied and sophisticated speech.
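Because type-token ratio falls as a sample grows, comparing Dave with Davesprite, or Karkat with Terezi, is fairer when both samples are cut to the same length. The sketch below shows one simple way to do that by truncating the longer sample to the length of the shorter one. This is an illustrative technique, not the method used to produce the table, and the function names and toy token lists are invented.

```python
def ttr_at(tokens, n):
    """Type-token ratio (as a percentage) of the first n tokens."""
    if n <= 0 or n > len(tokens):
        raise ValueError("n must be between 1 and len(tokens)")
    window = tokens[:n]
    return 100 * len(set(window)) / n


def compare_at_equal_size(tokens_a, tokens_b):
    """Compare two speakers on the first n tokens, n = shorter length."""
    n = min(len(tokens_a), len(tokens_b))
    return n, ttr_at(tokens_a, n), ttr_at(tokens_b, n)


# Toy data: a longer speaker who repeats words, and a very short one.
long_speaker = "the cat sat on the mat and the dog sat on the rug".split()
short_speaker = "a very brief remark".split()

full_ratio = ttr_at(long_speaker, len(long_speaker))
n, a, b = compare_at_equal_size(long_speaker, short_speaker)
print(full_ratio, n, a, b)
```

In the toy data the longer speaker looks less varied over the full sample, but the gap disappears once both are measured on the same number of words. Applying this to the comic would require the underlying dialogue text, which the table alone does not provide.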

Karkat also has one of the higher averages for words per line, which adds some support to the idea that type-token ratio can reflect sophistication of language. On average, Karkat’s lines (9.18 words) are nearly as long as the sentences of the Narrator (9.92 words), the only character more or less required to adhere to ordinary rules of grammar and syntax. The character with the longest lines on average is Nannasprite, which is logical in light of both her character and her narrative function. Nannasprite is the reanimation of John’s dead grandmother, making her demographically more likely to speak in a correct and sophisticated manner. She is also the first proper source of exposition in the story, and therefore prone to long, multi-sentence lines, with John occasionally fulfilling the role of the Watson by asking her questions to prompt her along. Supporting this, Jaspersprite, the second true source of exposition, has the third longest lines after Nannasprite and the Narrator. His lines most likely do not match Nannasprite’s in length because she had already done most of the heavy lifting in explaining things to the reader, and because, as a cat, he is more believable if he does not speak in too sophisticated a manner.