Computational alignment of Greek and Hebrew with Bible translations, using Swahili as a proof of concept

Aligning Bible translations with the original text has a venerable and mixed history, popularised by the hand-coding of groups such as Online Bible and CrossWire. Bibles have been successfully tagged in a few languages, including English, Spanish, Chinese, Russian, and German, but each takes about a decade of a scholar's time. Computational methods are hampered by the limitations of Strong's tags (which do not separate Hebrew pronominal and prepositional affixes), the small Hebrew vocabulary (so that one word has many meanings), differences in textual sources, and a very high proportion of rare words. The Greek New Testament lacks many of these problems, but its smaller size and wider range of textual variants reduce alignment accuracy. Additionally, most translations are not as literal as the King James Version. This project is based on a set of Hebrew and Greek tags that are aligned to academic lexicons (BDB and LSJ) and have separate tags for affixes, with options for the major sets of textual variants. Rare words are linked with more common words of similar meaning during alignment. The target language is stemmed by removing recurring prefixes and endings, and is then put through the Berkeley aligner together with the tags representing the original text. Swahili was chosen for a proof of concept because each verb can have half a dozen one-syllable prefixes and suffixes surrounding a small root, making this an extreme test for computational analysis. A survey of the results suggested several post-processing methods for increasing accuracy, and ways to highlight individual words or phrases that require manual checking. The output can also be used to create a lookup dictionary between Greek/Hebrew words and the translation language. The techniques and tools assembled for this project can be used for other languages, though adaptations for specific languages produce better results. Nevertheless, a quick trial with English and Spanish Bibles produced usable results even without tweaking for the characteristics of those languages. The methodology and code are being released under a public licence as part of the project of Tyndale House, Cambridge.
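
The stemming and lookup-dictionary steps described above can be illustrated with a minimal Python sketch. It is not the project's code: the affix lists, the function names `stem` and `build_lookup`, and the tag string `TAG_LOVE` are all hypothetical, chosen only to show the general idea of stripping recurring one-syllable affixes and then counting which stem most often aligns with each original-language tag. The alignment pairs are assumed to come from an external word aligner.

```python
from collections import Counter, defaultdict

# Hypothetical affix lists for illustration only (not the project's lists).
PREFIXES = ["ni", "tu", "wa", "li", "na", "ta", "me", "ki", "ku", "u", "a", "m"]
SUFFIXES = ["sha", "wa", "ia", "ea", "na", "a"]


def stem(word, min_root=2, max_prefixes=4):
    """Strip up to max_prefixes leading affixes and at most one ending."""
    word = word.lower()
    for _ in range(max_prefixes):
        # Try longer prefixes first so "na" wins over "a" where both match.
        for prefix in sorted(PREFIXES, key=len, reverse=True):
            if word.startswith(prefix) and len(word) - len(prefix) > min_root:
                word = word[len(prefix):]
                break
        else:
            break  # no prefix matched; stop stripping
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            word = word[:-len(suffix)]
            break
    return word


def build_lookup(aligned_pairs):
    """Map each original-language tag to its most frequently aligned stem.

    aligned_pairs is an iterable of (tag, target_word) tuples, assumed to be
    produced by a word aligner run over the tagged and stemmed texts.
    """
    counts = defaultdict(Counter)
    for tag, target_word in aligned_pairs:
        counts[tag][stem(target_word)] += 1
    return {tag: c.most_common(1)[0][0] for tag, c in counts.items()}
```

With these illustrative lists, the verb forms "anapenda", "tulipenda" and "walipendwa" all reduce to the same root, so a tag aligned with any of them contributes to a single dictionary entry. A naive stripper like this will over-stem some words, which is one reason language-specific adaptation and manual checking of flagged items remain useful.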