Using Topic Modeling for Multilingual Concept Comparison. Evidence from the Hebrew Bible and the Septuagint

In previous work we discussed the value of topic modeling techniques for studying the Hebrew Bible. These techniques extract keywords from raw text and cluster them into discrete topics. The present research explores what can be learned by comparing topics extracted from the Hebrew Bible with topics extracted from its ancient translations, focusing on the conceptual differences between them. For now, we limit our comparison to the Septuagint. In a first step, we review our hierarchical approach to topic modeling and discuss its main uses for data discovery in the Hebrew Bible, along with the results derived from it. We use the MALLET software, a package of natural language processing techniques mainly concerned with semantic problems such as topic modeling and named entity recognition. We use the software to test Latent Dirichlet Allocation (LDA), the most widely used topic modeling technique, to assess the value of dynamically discovered topics in different distributions for the Bible. In a second step, we argue that the same approach can be applied to the Greek translation of the Hebrew books, and present a qualitative assessment of the conceptual specificity of the extracted topics, looking at how useful the clustered terms that constitute a topic are for interpretation. For example, for a topic that can be interpreted as being concerned with kingship, terms such as ‘king’, ‘reign’ and ‘throne’ are useful, while ‘grass’ and ‘woman’ are not. Our goal is to discover for which topic distributions the algorithm finds the most salient topics. In a third step, we turn to a quantitative evaluation of the discovered data, resulting in a discussion of useful similarity measures for comparing the extracted topics in terms of inherent coherence and cross-lingual similarity.
More concretely, we show how source and target topic distributions can be evaluated using standard similarity measures such as cosine or Dice similarity. We also compare training topic models for each language independently with training a bilingual topic model in which topics in both languages are extracted simultaneously. We show the problems we encounter with both approaches, discussing data sparsity for the separate Hebrew and Greek models and overtraining for the bilingual model. Based on this information, we outline viable directions for further research, which provide a modus operandi exportable to other Bible translations. As a fourth and final point, we present the advantages of this multilingual approach to concept comparison.
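
To make the two similarity measures named above concrete, the following is a minimal sketch in Python and not the implementation used in this research. It assumes that the terms of a Hebrew topic and a Greek topic have already been mapped to a shared vocabulary (here, hypothetical English glosses) and that each topic is available both as a list of top terms and as a vector of term weights over that vocabulary. All term lists and weights below are invented for illustration only.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # convention chosen here: an empty topic matches nothing
    return dot / (norm_p * norm_q)


def dice_similarity(terms_a, terms_b):
    """Dice coefficient between two sets of topic terms: 2|A & B| / (|A| + |B|)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical top terms of one Hebrew and one Greek topic, as English glosses.
hebrew_terms = ["king", "reign", "throne", "grass"]
greek_terms = ["king", "throne", "kingdom", "reign"]

# Hypothetical term weights over the shared vocabulary
# (king, reign, throne, grass, kingdom).
hebrew_weights = [0.4, 0.3, 0.2, 0.1, 0.0]
greek_weights = [0.4, 0.2, 0.3, 0.0, 0.1]

dice = dice_similarity(hebrew_terms, greek_terms)
cosine = cosine_similarity(hebrew_weights, greek_weights)
print(f"Dice: {dice:.3f}, cosine: {cosine:.3f}")
```

Dice similarity only asks whether the same terms appear in both topics, whereas cosine similarity also takes the relative weight of each term into account, so the two measures can rank the same pair of topics differently.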