Towards a Comprehensive Computational System for Grammatical Analysis of Ancient Corpora

Current morphologies of ancient Greek focus on annotating attributes of individual words, such as case, number, gender, person, and voice, but omit information described in grammars, including pronunciation, vowel change, euphony of vowels and consonants, syllables, accentuation principles, inflection categorization, and word formation. Starting with openly available digital Greek texts, we apply a rigorous methodology to perform grammatical analyses at scale with human-level precision. These analyses are not done by hand; rather, we encode grammatical rules in a formal language, which the computer applies to the digital texts. We begin by creating a formal model of the source texts that faithfully represents all of their information: letters, diacritical marks, punctuation, manuscript variants, and so on. We proceed by defining a series of models and mappings between those models, such that each subsequent model encodes more information with greater precision than the previous one. As a whole, this creates a formal system for digital texts in which we can validate or refute grammatical hypotheses postulated in existing grammars against primary source texts. Using this system, we then produce a browsable text in which every word is automatically tagged with the grammatical rules that were applied to it. Furthermore, a list of all of these rules, organized like grammars of ancient Greek (e.g., Smyth), shows for each rule all of the words in the texts that were affected by it. This methodology reduces annotation time while maintaining, and even enhancing, the quality of analysis in comparison with hand annotation. This project represents a significant advancement, in both breadth and depth of analysis, over a pilot project presented at the annual meeting of the SBL in November 2015. We broaden our analysis to include the Perseus Digital Library and other sources. We also deepen our analysis by including euphony rules, paradigms, and basic syntactic constituents.
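The layered approach described above can be illustrated with a minimal Python sketch. It is not the project's formal language or implementation: the names Letter, Word, to_letters, accent_rule, and analyze, as well as the rule identifiers such as "accent.acute", are hypothetical and chosen only for illustration. The sketch assumes whitespace-separated words without punctuation, and uses Unicode canonical decomposition (NFD) as a stand-in for a faithful model of letters and diacritical marks. It maps a surface text to a letter-level model, applies one simple accentuation rule, tags each word with the rule applied, and builds an index from rules to the words they affected.

```python
import unicodedata
from collections import defaultdict
from dataclasses import dataclass, field

# Combining accent marks of polytonic Greek after NFD decomposition.
ACCENT_NAMES = {
    "\u0301": "acute",
    "\u0300": "grave",
    "\u0342": "circumflex",  # combining Greek perispomeni
}


@dataclass
class Letter:
    base: str     # the bare letter, e.g. "η"
    marks: tuple  # combining marks attached to it (accent, breathing, ...)


@dataclass
class Word:
    surface: str                               # the word as found in the source text
    letters: list                              # letter-level model of the word
    rules: list = field(default_factory=list)  # ids of the rules applied


def to_letters(surface):
    """Map a surface word to the letter model, keeping every diacritical mark."""
    letters = []
    for ch in unicodedata.normalize("NFD", surface):
        if unicodedata.combining(ch) and letters:
            last = letters[-1]
            letters[-1] = Letter(last.base, last.marks + (ch,))
        else:
            letters.append(Letter(ch, ()))
    return letters


def accent_rule(word):
    """Hypothetical rule: classify a word by the first accent mark it carries."""
    for letter in word.letters:
        for mark in letter.marks:
            if mark in ACCENT_NAMES:
                return "accent." + ACCENT_NAMES[mark]
    return "accent.none"


RULES = [accent_rule]  # further rules (euphony, paradigms, ...) would be added here


def analyze(text):
    """Tag every word with its rule ids and index the words under each rule."""
    words = [Word(s, to_letters(s)) for s in text.split()]
    index = defaultdict(list)
    for word in words:
        for rule in RULES:
            rule_id = rule(word)
            word.rules.append(rule_id)
            index[rule_id].append(word.surface)
    return words, dict(index)
```

Calling analyze("ἐν ἀρχῇ ἦν ὁ λόγος") returns both views described above: each Word carries the identifiers of the rules applied to it, and the index lists, per rule, the words it affected. A fuller system would replace the single rule with a chain of models and mappings, each adding information to the previous one.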