My colleagues, Dr. Achberger and Junle Ma from CAU, and I are working on a topic modeling project that compares major themes across collections of English and Mandarin academic texts on the same subject. While we're consolidating most of the work and results into an article for submission, a portion of the project is reserved for blog posts: namely, the topic modeling runs that 'didn't work.'
We're using the MALLET software to train our topic models, and MALLET, like most LDA topic modeling tools, makes the major assumption that the user already knows the 'correct' number of topics present in their collection of texts.
Oftentimes this assumption does not hold, and thus begins the trial and error of figuring out what the 'right' number of topics is. What range to test (e.g., small, up to a couple dozen, or large, into the hundreds or thousands) depends on the size of your corpus and the level of detail you want from the topics. For example, are you looking for every possible topic across the texts, or just the major topics, each of which may itself contain several subtopics?
Given our small corpus and interest in broad themes, we ran experimental topic models with between 5 and 20 topics. It very quickly became clear that 15 or more topics were several topics too many.
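For readers who want to script this trial-and-error themselves, here is a minimal sketch. It simply loops MALLET's train-topics command (the same one shown at the end of this post) over several candidate topic counts; it assumes a Unix-style bash shell and an already-built .mallet input file, and the file names mirror the ones we use later.

# Sketch: train a model at each candidate topic count.
# Assumes bash and a prebuilt chinesefull.mallet input file;
# on Windows, run the command once per value (with bin\mallet) instead.
for k in 5 10 15 20; do
  bin/mallet train-topics \
    --input chinesefull.mallet \
    --num-topics "$k" \
    --output-topic-keys "Cfull${k}_keys.txt" \
    --output-doc-topics "Cfull${k}_composition.txt"
done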
Here, I share the results of our topic model runs with 15 and 20 topics, highlighting what a cohesive topic can look like and how our results did not always achieve that cohesion.
[Figure: top key words per topic from the 15- and 20-topic runs (EnglishFullText_15-and-20 and ChineseFullText_15-and-20)]
Let's talk about what does work first. Here are two examples, one from each language, from the 15-topic run, with the top 20 key words for each topic listed. The key words for Topic #3 from the Mandarin run mostly work together to contextualize the state of African (agricultural) economies, and in the English example, words like "migrants", "farms", "embassy", and "vegetables" together form the Chinese Farmers (in Africa) topic.
I picked what I thought were the most cohesive topics for the examples above. From there, the key words become more and more muddled until any interpretable topic definition seems lost. Now we get to what doesn't work.
Topic #8 in Chinese and Topic #10 in English both feature a hodgepodge of words that, while two or three might relate to each other, do not work together as a whole to describe a topic. For example, from the Mandarin group, "field survey" might plausibly relate to "sustainability", but not as readily to "processing plant". The English key words are even harder to interpret, with several loose verbs and adjectives (e.g., "explained", "providing", "greater").
Now this is not to say the above results are ‘garbage’; they just don’t work for our research purposes. As we’re looking for concrete, major themes from our collections of text, it seems we’d be better served running topic models with a smaller number of topics. And that’s exactly what we did.
For those interested, here is the command we fed MALLET to run the topic models (shown for the 15-topic run on the Chinese full text):
bin\mallet train-topics --input chinesefull.mallet --num-topics 15 --output-state Cfull15.gz --output-topic-keys Cfull15_keys.txt --output-doc-topics Cfull15_composition.txt
The command takes as input the base .mallet file that contains all the articles' full text, sets the number of topics at 15, and writes out a model state file, a key word file, and a document composition file. The file names and the number of topics can be edited to change the output names and/or the number of topics tested.
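One step not shown above is building that base .mallet file in the first place, which is done with MALLET's import command before any topics are trained. As a rough sketch (the input directory name is hypothetical, and our actual import settings may have differed):

# Sketch: build the base .mallet file from a folder of plain-text articles.
# --keep-sequence preserves word order, which train-topics requires;
# the directory name here is illustrative.
bin/mallet import-dir --input chinese_articles/ --output chinesefull.mallet --keep-sequence --remove-stopwords

Note that --remove-stopwords applies MALLET's default English stop list, and that Chinese text generally needs to be word-segmented into space-delimited tokens (e.g., with a segmenter such as jieba) before import.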
For further reading on LDA topic models: Blei, David M. “Probabilistic topic models.” Communications of the ACM 55.4 (2012): 77-84.
MALLET: McCallum, Andrew Kachites. “MALLET: A Machine Learning for Language Toolkit.” http://mallet.cs.umass.edu. 2002.