Exploring Homestuck

Topic Modeling

Mallet is software created by Andrew McCallum at the University of Massachusetts at Amherst. It carries out topic modeling, which is a way to find which topics compose a document and in what proportion. In our project, one document is all of the speech spoken by one character, so we have eighteen documents (one each for the narrator and seventeen characters). When we ran the Mallet topic modeling software on these documents, it gave us output in two different forms.

One piece of output that Mallet gives us is the "topic keys" for the documents. A topic in the Mallet sense is a group of words that tend to occur together. Three different topics might therefore look like this to Mallet:

ruck rugby ball tackle try backs forwards

cheese Mozzarella France milk macaroni goat Swiss

gallop hooves saddle horsemanship Jack broke

To Mallet, these words have no semantic meaning, but since they are statistically related to each other in the text of the documents, they are a "topic". It is the job of a human to determine that the first topic relates to rugby, the second to cheese, and the third to horses.

Naturally, a document is composed of multiple topics, and this is true whether the document is the speech of a character in Homestuck, a journal entry, or a scientific article. Therefore, the second piece of output that Mallet gives us is the composition of topics for each document. For example, for the AT.txt document that contained all of AT’s speech, topic 3 from the "topic keys" composed 29.7% of the document, topic 6 composed 16.4%, another topic composed 7.8%, and so on. Every word from the document belongs to one of the seven topics that Mallet has designated, except for the "stop words": common prepositions, articles, and similar words that the Mallet software is programmed to ignore.
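
To make the idea of a document's composition concrete, here is a minimal Python sketch. It is not Mallet's own code: the tiny stop list, the function name `topic_composition`, and the sample word-to-topic assignments are all invented for illustration. It only shows how percentages fall out once each remaining word has been assigned to a topic and stop words have been dropped.

```python
from collections import Counter

# Hypothetical, tiny stop list; Mallet ships a much longer one.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

def topic_composition(assignments):
    """Return {topic_id: percentage} for (word, topic_id) pairs, skipping stop words."""
    counts = Counter(
        topic for word, topic in assignments
        if word.lower() not in STOP_WORDS
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: 100.0 * n / total for topic, n in counts.items()}

# Invented example: four content words and one stop word.
sample = [("ruck", 0), ("the", 0), ("tackle", 0), ("cheese", 1), ("saddle", 2)]
composition = topic_composition(sample)
```

In this toy input, half of the counted words belong to topic 0 and a quarter each to topics 1 and 2; the stop word "the" is not counted at all.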

When we ran Mallet on the documents containing each character's speech, the output surprised us. The words that Mallet grouped into each topic did not share any coherent theme. When we tried to make sense of a topic, the words did not point in any single direction, and we could not assign any of the topics an intelligible meaning.

We may have gotten these results because Mallet does not suit our data well. The webcomic's text is in a conversational style, so it often lacks the enduring, clearly defined topics that a newspaper article about a robbery, for example, would have. Also, because of the conversational style of the data, there is a relatively low number of content words. Phatic expressions and words like 'yeah' and 'uh oh' naturally occur often in a conversation.

However, there are other ways we can run Mallet over the webcomic. We created a separate document for each character, so that there were eighteen documents. It may have been a mistake to divide the text this way, considering that the webcomic is a conversation, which by nature takes place between two or more participants. We mangled conversations by taking each participant's contribution and putting it into a different document. When we read the .txt file created for a character as part of the Mallet input, the semantic flow was not coherent, because the other half of the conversation was missing. This may be the reason that Mallet did not output meaningful topics. With the current data from Mallet, we can only conclude that characters are distinct in some way, because their speech varies significantly in its composition of topics.
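
The difference between the two ways of dividing the text can be sketched in Python. This is only an illustration under assumptions: the record layout (conversation id, speaker, text), the speaker labels "A" and "B", the sample lines, and the helper `group_lines` are all hypothetical and do not reflect how our XML is actually structured.

```python
from collections import defaultdict

# Invented sample lines: (conversation_id, speaker, text).
lines = [
    (1, "A", "did you open the package"),
    (1, "B", "not yet it is locked in the car"),
    (2, "A", "the game is installing"),
]

def group_lines(records, key_index):
    """Join line texts into documents keyed by one field.

    key_index 0 groups by conversation, key_index 1 groups by speaker.
    """
    docs = defaultdict(list)
    for record in records:
        docs[record[key_index]].append(record[2])
    return {key: " ".join(texts) for key, texts in docs.items()}

by_speaker = group_lines(lines, 1)       # one document per character
by_conversation = group_lines(lines, 0)  # one document per conversation
```

Grouping by speaker separates the question in conversation 1 from its answer, while grouping by conversation keeps both halves of the exchange in the same document.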

Before Mallet iterates over the text of the documents, its user must decide how many topics she wants Mallet to create from the data. We tried topic counts ranging from five to thirty. Mallet’s output from pulling eight topics from the documents seems to be the most coherent. The raw data is shown below.
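
Trying several topic counts can be scripted. The Python sketch below only builds the command lines for Mallet's `train-topics` step, one per topic count. The file names (`homestuck.mallet`, `keys_N.txt`, `composition_N.txt`) and the `bin/mallet` path are assumptions, and it assumes every count from five to thirty is tried, which may not match exactly what we ran.

```python
def mallet_train_command(num_topics, input_file="homestuck.mallet"):
    """Build the argument list for one `mallet train-topics` run.

    File names here are placeholders, not the ones used in the project.
    """
    return [
        "bin/mallet", "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", f"keys_{num_topics}.txt",
        "--output-doc-topics", f"composition_{num_topics}.txt",
    ]

# One command per topic count from five to thirty, inclusive.
commands = [mallet_train_command(k) for k in range(5, 31)]
```

On a machine with Mallet installed, each list could be handed to `subprocess.run(cmd, check=True)`, and the resulting topic-keys files compared by eye for coherence.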

First, we present the topic keys, which are the eight topics that Mallet has found to compose the documents. Remember that a topic is a list of words that are statistically related to each other in the text.

0 6.25 time rose sort put idea grist meteor dead dad pretty present stuff room computer sweet beta ago ghost fire

1 6.25 found wait jade man moment appears http modus dream captchalogue years wall deck bro young god green pair decide

2 6.25 john make dave good game house find things long birthday hard lot place day guy made kind work asleep

3 6.25 time future stupid thought human hell code give point feel guess play earth planet dumb mess timeline means friends

4 6.25 card cards left make room cool package hat inside sylladex door full items item cal pm mom station locked

5 6.25 back thing past black skaia book hate white jack red queen king session land set light captchalogue useless medium

6 6.25 hoo hee understand haa conversation word dear working head perfectly question father read build suppose words watching living letters

7 6.25 yeah guess im dont shit thing talk man youre ha kind stuff sort fucking game wait back hey wow
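
For anyone who wants to work with the topic keys programmatically, here is a small Python sketch that splits one line of the listing above into its parts. The function name `parse_topic_key` is our own invention. The second column (6.25 on every line) appears to be a per-topic weight that Mallet prints alongside each topic; the sketch simply stores it as a number without interpreting it.

```python
def parse_topic_key(line):
    """Split one topic-keys line into (topic_id, weight, words)."""
    fields = line.split()
    return int(fields[0]), float(fields[1]), fields[2:]

# Topic 6, copied from the listing above.
line = ("6 6.25 hoo hee understand haa conversation word dear working head "
        "perfectly question father read build suppose words watching living letters")
topic_id, weight, words = parse_topic_key(line)
```

Once parsed, the word lists can be compared across topics, for instance to spot words such as "time" or "captchalogue" that appear in more than one topic.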

Second, we present the composition proportions for each document. Mallet shows what percentage of words in the document came from the first topic, the second topic, etc. This way, we can see what the narrator and each of the characters talks about the most.


Topic 7 Topic 3 Topic 2 Topic 6 Topic 0 Topic 5 Topic 4 Topic 1
29.8% 18.7% 16.5% 14.7% 07.8% 05.6% 04.3% 02.7%


Topic 6 Topic 1 Topic 4 Topic 5 Topic 7 Topic 3 Topic 2 Topic 0
61.3% 08.1% 07.4% 05.2% 04.5% 04.5% 04.5% 04.5%


Topic 3 Topic 7 Topic 5 Topic 2 Topic 0 Topic 6 Topic 4 Topic 1
35.3% 21.6% 18.2% 11.6% 05.1% 05.0% 02.1% 01.1%


Topic 7 Topic 0 Topic 2 Topic 3 Topic 1 Topic 5 Topic 6 Topic 4
34.3% 14.2% 12.3% 09.3% 08.7% 07.5% 06.9% 06.9%


Topic 7 Topic 3 Topic 0 Topic 2 Topic 6 Topic 4 Topic 5 Topic 1
44.2% 18.1% 15.2% 13.6% 03.1% 02.5% 02.1% 01.2%


Topic 6 Topic 3 Topic 7 Topic 0 Topic 2 Topic 1 Topic 5 Topic 4
30.7% 23.5% 13.7% 13.1% 07.1% 05.3% 04.3% 02.2%


Topic 7 Topic 2 Topic 3 Topic 5 Topic 0 Topic 6 Topic 4 Topic 1
29.5% 26.0% 25.6% 05.5% 05.1% 03.4% 02.5% 02.4%


Topic 2 Topic 7 Topic 0 Topic 3 Topic 1 Topic 6 Topic 4 Topic 5
47.8% 36.5% 05.5% 04.7% 02.0% 01.8% 01.0% 00.8%


Topic 6 Topic 2 Topic 0 Topic 3 Topic 7 Topic 4 Topic 5 Topic 1
24.8% 15.7% 14.3% 12.9% 11.4% 07.9% 06.5% 06.5%


Topic 6 Topic 5 Topic 2 Topic 1 Topic 3 Topic 0 Topic 4 Topic 7
36.7% 17.1% 15.1% 08.4% 08.1% 05.3% 05.0% 04.4%


Topic 7 Topic 3 Topic 0 Topic 6
21.5% 19.1% 15.0% 12.5%


Topic 7 Topic 2 Topic 3 Topic 0 Topic 4 Topic 5 Topic 1 Topic 6
52.4% 14.4% 11.8% 08.4% 07.2% 02.5% 01.7% 01.7%


Topic 6 Topic 2 Topic 0 Topic 3 Topic 7 Topic 1 Topic 4 Topic 5
28.9% 20.1% 16.4% 11.1% 08.4% 06.3% 05.0% 03.8%


Topic 4 Topic 1 Topic 0 Topic 5 Topic 2 Topic 3 Topic 6 Topic 7
17.7% 17.7% 17.1% 14.7% 14.5% 10.3% 04.6% 03.4%
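
The composition rows above can also be read into a simple structure for further analysis. The following Python sketch pairs each "Topic N" label with its percentage and picks out the dominant topic. The function name `parse_composition` is hypothetical, and the example uses the second row pair shown above.

```python
def parse_composition(header_line, percent_line):
    """Pair 'Topic N' labels with percentages; returns {topic_id: percent}."""
    topic_ids = [int(tok) for tok in header_line.split() if tok != "Topic"]
    percents = [float(p.rstrip("%")) for p in percent_line.split()]
    if len(topic_ids) != len(percents):
        raise ValueError("header and percentage rows differ in length")
    return dict(zip(topic_ids, percents))

row = parse_composition(
    "Topic 6 Topic 1 Topic 4 Topic 5 Topic 7 Topic 3 Topic 2 Topic 0",
    "61.3% 08.1% 07.4% 05.2% 04.5% 04.5% 04.5% 04.5%",
)
# The topic with the largest share of this document.
dominant = max(row, key=row.get)
```

For this document the dominant topic is topic 6, which accounts for well over half of its counted words.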

We welcome feedback on our results from the Mallet software. Any observations about a topic that seems to be coherent, patterns in the composition of the documents, or suggestions about using the Mallet software would be appreciated!