Topic Modeling and Gephi: A Work in Progress
July 21, 2014 5:00 pm
Armed with a corpus of 358 documents pertaining to the Environmental Humanities and two analytic applications new to me (Mallet and Gephi), I have been working on topic modeling for the past few months. With such a wide variety of articles at my disposal, I have come to fully appreciate the utility of text analysis tools for humanities research. While the learning curve may be somewhat steep, the resulting topic collections and visualizations offer insight that would be nearly impossible for one person to reach by reading alone. In this blog post, I will describe the process of using the Topic Modeling Tool and Gephi, as well as preliminary conclusions stemming from the visualizations created thus far.
First, however, a word about the corpus itself. With the aid of Zotero, each item was catalogued from the bibliographies sent in by DEH Workshop participants. Additional items were found through database searches, chiefly on JSTOR. The documents analyzed were mostly journal articles, along with some dissertations, reports and conference summaries. Each discipline of the humanities is represented to some degree in the final corpus, although some disciplines yielded far more documents than others. I made an attempt to include as many “overview”-style articles concerning EH in each discipline as possible, without any publication date cutoff. Below is a summary of the corpus:
|Discipline||Number of Documents||Span of Documents|
1. Creation of the Corpus
Each document was tagged with the discipline it pertained to, or as ‘interdisciplinary’ if crossover was observed, which was often the case. Next, an attempt was made to locate a PDF version of each document (many were accessed through McGill’s library subscription). Unfortunately, this step was limited to articles only, as digital access to most books was unavailable. This reduced the size of the corpus substantially: the Zotero collection holds around 650 items in total, of which 358 made it into the corpus.
Once all available PDF files were collected and appropriately tagged, they had to be converted into plain text format in order to be parsed by the Topic Modeling (TM) Tool. Adobe Acrobat Pro was used to accomplish this, simply through the “save as plain text” option. This method worked well for older and more recent texts alike.
Next, each plain text file had to be “cleaned up”. This process involved removing metadata, passages in other languages, citations, notes, captions, and so on. This was done on a document-by-document basis, as each required different changes. For instance, some document conversions resulted in spelling errors, misplaced spaces and unnecessary characters. With so many documents to clean up, this was the most time-consuming step of the entire project.
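Part of this clean-up is mechanical and could be scripted. The following is only a minimal sketch in Python, not the procedure actually used for this corpus (which was manual): the function name `clean_text` and the specific rules (rejoining end-of-line hyphens, stripping control characters, collapsing spaces, dropping lines that contain only a page number) are assumptions chosen for illustration. Removing citations, captions or foreign-language passages would still need a human eye.

```python
import re

def clean_text(raw: str) -> str:
    """Apply a few mechanical fixes to text converted from PDF (illustrative rules only)."""
    text = raw.replace("\x0c", "\n")                      # form feeds (page breaks) become newlines
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)          # rejoin words split by an end-of-line hyphen
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)  # strip stray control characters
    text = re.sub(r"[ \t\u00a0]+", " ", text)             # collapse runs of spaces and tabs
    lines = [line.strip() for line in text.split("\n")]
    # Drop lines that consist solely of a short number (likely a page number).
    lines = [line for line in lines if not re.fullmatch(r"\d{1,4}", line)]
    # Allow at most one blank line in a row.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```

Note that the hyphen rule will also merge genuine compounds that happen to break at a line end, so the output is worth spot-checking.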
2. Topic Modeling Tool
Once the files were cleaned up, they were run through the Topic Modeling Tool, discipline by discipline. This tool computes topic models using latent Dirichlet allocation, LDA for short. Essentially, this statistical model reveals the hidden structure of the disciplinary corpus, in terms of topics that are present across its entirety, albeit in different proportions in each document. Each topic contains words that have a high probability of appearing in the same context within documents. A more in-depth explanation of the algorithm behind the TM tool can be found here:
“Topic Modeling and Digital Humanities”– Journal of Digital Humanities
“‘The Heart of the Matter’ Topic Modeled”– 4Humanities
“Probabilistic Topic Models”– Communications of the ACM
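To make the idea of LDA a little more concrete, here is a toy sketch in plain Python that fits the model by collapsed Gibbs sampling. It is not the Topic Modeling Tool's code, and it is far too slow for a real corpus; the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and their default values are all assumptions made for illustration. It takes tokenized documents and returns, for each document, the proportion of each topic, plus the most frequent words assigned to each topic.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, n_top=3, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of lists of word tokens."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens of document d assigned to topic k
    topic_word = [Counter() for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                        # total tokens assigned to topic k
    assign = []                                         # current topic of every token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample: topics popular in this document AND fond of this word win.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions per document (each row sums to 1).
    proportions = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [[w for w, c in tw.most_common() if c > 0][:n_top] for tw in topic_word]
    return proportions, top_words
```

Calling `lda_gibbs(docs, n_topics=10)` on a discipline's tokenized documents would mirror the "10 topics per discipline" setup, although the exact topics will differ from run to run and from those produced by the tool.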
The tool’s user interface is extremely straightforward. All the user must do is select the folder of documents to be analyzed under ‘input’ and select where the result files are to be placed under ‘output’. Then, the user indicates how many topics the tool is to provide. Only one small detail caused a lot of initial frustration before I figured out what was wrong: the TM tool does not work if any of the documents contain commas in their file names.
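With a few hundred files, renaming by hand is tedious, so a short script can take the commas out. This is a minimal sketch under my own assumptions: the function name `strip_commas` is hypothetical, commas are simply deleted rather than replaced, and the script refuses to overwrite an existing file.

```python
from pathlib import Path

def strip_commas(folder):
    """Rename every file in `folder` whose name contains a comma; return (old, new) name pairs."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and "," in path.name:
            target = path.with_name(path.name.replace(",", ""))
            if target.exists():
                # Never silently overwrite another document.
                raise FileExistsError(f"{target} already exists")
            path.rename(target)
            renamed.append((path.name, target.name))
    return renamed
```

Running it on a copy of the input folder first is the safer choice, since the renames are applied immediately.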
After about 15 seconds, the topics have been computed, and comma-separated-values (CSV) and HTML files are available. The HTML files illustrate the composition of each document in terms of the proportion of each topic it contains. However, what was used for the purposes of this analysis was the CSV file providing the 10 topics for each discipline. Below is an example of the topic modeling results for the History collection. To reiterate, it demonstrates the topics that are most representative of the entire corpus, the words within which have a high probability of occurring together.
Gephi is a program designed for exploring and visualizing “all kinds of networks and complex systems, dynamic and hierarchical graphs”. For the purposes of this research, Gephi was useful as it allowed the visualization of the results in a non-hierarchical format, simply through links. The primary tutorial I used to understand the functionality of Gephi can be found here.
In order to input the TM results into Gephi, the data had to be converted into XML format, identifying the nodes and edges of the graph. The nodes indicate the particular words from the topics and the edges outline where connections exist (two occurrences of the same word in different disciplines). The creation of such a file was challenging for me, so Prof. Sinclair helped me with this part. The first test run on Gephi produced a visualization that seemed to be exactly what we were aiming for. However, upon further inspection, we realized that some parts were off. For instance, the word “ballet” was central and had multiple connections to various disciplines. We were surprised by this, as this word only appeared once in the topic modeling results. Upon closer examination of the XML file, we realized that certain node identification numbers were being repeated, so Gephi was attaching all “Performing Arts” connections to the term “ballet”. Therefore, although the initial results looked promising, a careful examination of both the XML file and resulting graph was certainly necessary. A sample of the updated file can be seen below.
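For anyone attempting the same conversion, the repeated-ID problem can be avoided by generating the file from a script that hands out each node identifier exactly once. The sketch below is not the file or the method used in this project. It assumes GEXF (Gephi's own XML graph format) as the target, treats both disciplines and words as nodes, and links each discipline to the words in its topics, weighting each link by how often the word occurs. The function name `build_gexf` and the input layout (a dictionary from discipline to a list of words) are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def build_gexf(topic_words):
    """Build a GEXF tree from {discipline: [word, word, ...]} with unique node ids."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="undirected")
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    ids = {}  # (kind, label) -> node id, so no id is ever issued twice

    def node_id(kind, label):
        key = (kind, label)
        if key not in ids:
            ids[key] = str(len(ids))
            ET.SubElement(nodes, "node", id=ids[key], label=label)
        return ids[key]

    edge_count = 0
    for discipline, words in topic_words.items():
        source = node_id("discipline", discipline)
        # One weighted edge per distinct word in this discipline's topics.
        for word, weight in Counter(words).items():
            ET.SubElement(edges, "edge", id=str(edge_count), source=source,
                          target=node_id("word", word), weight=str(weight))
            edge_count += 1
    return gexf

# Example: write a small graph to disk for opening in Gephi.
# tree = ET.ElementTree(build_gexf({"Literature": ["nature", "world"], "Religion": ["world"]}))
# tree.write("topics.gexf", encoding="utf-8", xml_declaration=True)
```

Because a word shared by two disciplines maps to a single node, the overlap between disciplines shows up as that node having more than one edge.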
Once the XML file is opened in Gephi, a rough and messy graph can be seen, illustrating the totality of the topics, with links showing the co-appearance of words.
At this point, the functionality of Gephi is put to use in terms of establishing a layout, ranking and colouring of the connections. The tutorial linked above offers a great step-by-step outline of how to apply ranking parameters, display labels, apply filters, etc. Below are two samples of the visualizations that were created, as well as preliminary interpretations of each.
This visualization most clearly illustrates the links between disciplines. The size of each label is proportional to how many times it appears in the TM results, with the disciplines themselves being the largest. For example, ‘nature’ occurs more often than ‘water’ and is therefore larger in size. The centrality of a node indicates how many disciplines it is connected to. For instance, ‘people’ is a word that occurs in more disciplines than ‘sound’, which only occurs in Performing Arts. The colour of each label indicates the discipline in which it is the most represented. For example, ‘world’ is a word that has occurrences in various disciplines, but the most in Literature and is therefore given a blue label.
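The three visual cues described above (size, centrality, colour) boil down to simple counts, which can be checked independently of Gephi. Here is a rough sketch under assumed names: `word_stats` is hypothetical, and the input is again a dictionary from discipline to the list of words in its topics. Gephi's own ranking and layout settings determine what is actually drawn, so this only approximates the logic.

```python
from collections import Counter, defaultdict

def word_stats(topic_words):
    """For each word: total occurrences, number of disciplines, and the dominant discipline."""
    per_word = defaultdict(Counter)
    for discipline, words in topic_words.items():
        for word in words:
            per_word[word][discipline] += 1
    return {
        word: {
            "size": sum(counts.values()),               # drives label size
            "disciplines": len(counts),                 # more disciplines, more central
            "dominant": counts.most_common(1)[0][0],    # drives label colour
        }
        for word, counts in per_word.items()
    }
```

When a word occurs equally often in two disciplines, `most_common` returns whichever discipline was seen first, so ties should be inspected by hand.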
Some interesting observations stemming from this visualization:
– It appears that the Philosophy and Interdisciplinary areas are the ones with the most overlap. One may argue that the “interdisciplinary” category is very representative of the framework of EH as an emerging field. With this analysis, it seems that interdisciplinary documents utilize a lot of philosophical vocabulary. Perhaps this has something to do with the fact that EH is indeed an emerging field, and research papers primarily concern themselves with outlining its philosophical foundations? Another potential explanation could be that this tag was applied to content that was more theoretical and conceptual to begin with, as it could not be linked to one single discipline.
– ‘Environmental’ and ‘ecological’, although both central, appear the most often in the context of Visual Arts. Perhaps this discipline is the one in which environmentalism is the most explicitly addressed? Other disciplines seem to employ more nuanced aspects of environmentalism, such as ‘nature’, ‘world’ and ‘land’, as seen in Literature.
– Linguistics is clearly the outlier in EH, as seen in its peripheral position, as well as the low number of documents that could be found in the first place.
– Philosophy, Interdisciplinary and Literature are the three disciplines that are the closest in the visualization. Are these three the “big players” in EH research?
– Performing Arts, Religion and Linguistics TM results contain many words that are exclusive to the respective discipline, as opposed to Philosophy, Interdisciplinary and Literature with their significant overlap.
– History seems to be somewhat of an outlier as well, especially compared to the interconnectedness of Philosophy, Interdisciplinary and Literature. While most disciplines’ documents were primarily written over the past 10 to 15 years, with occasional older texts, History documents were consistently spaced out over the years. Perhaps this discipline has developed its own vocabulary for discussing EH since it has been around for the longest?
This visualization utilizes a separate plug-in layout, “Circular Layout”. It also sizes nodes according to how often the words occur in the TM results. In this case, all topic modeling results of a particular discipline are grouped together. This visualization is more useful for quickly seeing what words are linked to each discipline, rather than the links between disciplines. The central interconnections are difficult to trace. Generally speaking, this visualization offers less potential for analysis and interpretation than the previous one. However, seeing as the links are weighted, some interesting connections can be seen:
– ‘Humans’ is most represented in Philosophy, but has significant links to both Interdisciplinary and Visual Arts.
– ‘World’ and ‘nature’ are most represented in Literature, with significant links to Religion and Visual Arts.
The above screenshots are of the PDF files of the Gephi graphs. Within Gephi itself, these graphs are more interactive: when the mouse hovers over a particular word or discipline, the rest of the graph fades out and only the connections of that word can be seen.
It is clear that the interpretation of these visualizations is highly subjective and no definitive conclusions can be drawn. At the beginning of this project, I assumed that solid conclusions would be established in terms of what each discipline addresses and the perspective it offers to EH. However, I realized that the results are much more subtle in nature. There is no clear divide to be found amongst the disciplines. Perhaps this should come as no surprise, seeing as interdisciplinarity is at the core of this emerging discipline. Crossover and overlap between disciplines simply illustrate this quality. However, one cannot say that environmental topics and streams of thought are equally represented across disciplines. The observations above offer clues as to which aspects of the corpus lend themselves to a more in-depth examination of similarities and differences. The visualizations above serve as a stepping stone, being the “big picture” of the topic modeling results, from which more precise interpretations and conclusions can be drawn. The next steps of this project are to compare and contrast specific disciplines against each other, as well as examine how particular environmental terms such as ‘place’ and ‘landscape’ are approached in the humanistic disciplines.
Interactive versions of the visualizations can be found here. Below are the two that were discussed in this post (open in a new page):