MALLET
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications to text.
To get started with MALLET and Latent Dirichlet Allocation (LDA) topic modelling, we're going to use this tool. For extensive use, however, I recommend the command line.
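If you do move to the command line, a typical MALLET session looks something like the sketch below, which follows the workflow described in Graham, Weingart, and Milligan's tutorial (listed under resources). The directory and output filenames are placeholders; adjust them to your own corpus.

```shell
# Import a directory of plain-text files into MALLET's binary format.
# --keep-sequence preserves word order for the importer;
# --remove-stopwords applies MALLET's built-in English stoplist.
bin/mallet import-dir --input path/to/texts --output texts.mallet \
    --keep-sequence --remove-stopwords

# Train an LDA model. Vary --num-topics and --num-iterations across runs
# to test for consistency, as suggested above.
bin/mallet train-topics --input texts.mallet \
    --num-topics 20 --num-iterations 1000 \
    --output-topic-keys topic_keys.txt \
    --output-doc-topics doc_topics.txt
```

After training, topic_keys.txt lists the top words for each topic, and doc_topics.txt gives each document's proportions across topics.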
A few things to consider when using MALLET:
- Borrowing from Colorado Reed, "a topic is a probability distribution over a collection of words and a topic model is a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure to generate the topics—a generative model. The central goal of a topic is to provide a 'thematic summary' of a collection of documents. In other words, it answers the question: what themes are these documents discussing?" (2).
- MALLET relies on Latent Dirichlet Allocation (LDA), which is probably the most popular topic model across the disciplines. The model describes how documents (or bags of words) obtain their words. In so doing, it makes no assumptions about the order in which words appear in a given document (Reed 2-3). What's more, you can define documents based on your own preferences: a document could be a paragraph in a novel, or it could be an entire novel. That said, it's important to interpret models across all the topics they generate. In the case of MALLET, a cluster of words is only meaningful in relation to the other clusters identified alongside it. Avoid isolating specific topics as if they emerged independently of the rest.
- This particular tool outputs MALLET results in both CSV (for spreadsheets) and HTML (for browsers). You should use them both, as they provide different information, in different structures and formats, and inform each other.
- When using the tool, it is important to run the algorithm several times, changing the number of preferred topics and iterations. Also consider running it with and without stopwords (even if the results with stopwords will seem banal or obvious). This way, you can test for consistency (or interesting anomalies) and iteratively develop the model, seeing what congeals across trials.
- In the humanities, topic modelling and LDA are rarely used to prove anything about texts. Instead, they are vehicles for conjecture and speculation, perhaps prompting us to think about groups of documents in ways we have not considered.
- Keep Ben Schmidt's perspective in mind: "And most humanists who do what I've just done—blindly throwing data into MALLET—won't be able to give the results the pushback they deserve. . . . I don't think I'm alone: and I'm not sure that we should be too enthusiastic about interpreting results from machine learning which we can only barely steer. So there are cases where topic modeling can be useful for data-creation purposes . . . But as artifacts to be interpreted on their own, topic models may be less useful."
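The generative story sketched above—documents "obtain" their words from topic distributions, with no regard for word order—can be made concrete in a few lines of code. This is a toy illustration, not MALLET's actual implementation (which infers topics from real documents via Gibbs sampling); the vocabulary, topic count, and Dirichlet parameter below are invented for demonstration.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def sample_dirichlet(alpha, k):
    """Draw one symmetric Dirichlet sample by normalizing Gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs):
    """Pick an index according to a discrete probability distribution."""
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Toy vocabulary and model sizes (illustrative only).
vocab = ["ship", "sea", "whale", "court", "law", "judge"]
num_topics, doc_length = 2, 8

# Each topic is a probability distribution over the vocabulary ...
topics = [sample_dirichlet(0.5, len(vocab)) for _ in range(num_topics)]
# ... and each document has its own distribution over topics.
doc_topic_dist = sample_dirichlet(0.5, num_topics)

# "Generate" a document: for each word slot, pick a topic, then pick a
# word from that topic. Order plays no role -- hence "bag of words."
document = []
for _ in range(doc_length):
    z = sample_categorical(doc_topic_dist)
    document.append(vocab[sample_categorical(topics[z])])

print(document)
```

Topic modeling runs this story in reverse: given only the documents, it infers which topic distributions most plausibly generated them.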
Some other resources for MALLET and topic modelling:
- Posner, "Very Basic Strategies for Interpreting Results from the Topic Modeling Tool"
- Graham, Weingart, and Milligan, "Getting Started with Topic Modeling and MALLET"
- Underwood, "Topic Modeling Made Just Simple Enough"
- Schmidt, "Compare and Contrast"
- Schmidt, "When You Have a MALLET, Everything Looks Like a Nail"
- Weingart, "Topic Modeling for Humanists: A Guided Tour"
- Templeton, "Topic Modeling in the Humanities: An Overview"
- Blei, "Topic Modeling and Digital Humanities"
- Nelson, "Mining the Dispatch"
- Blevins, "Topic Modeling Martha Ballard’s Diary"