In 2018, graduate students in Computational Linguistics at the University of Washington addressed the problem of extracting significant information from primary source historical documents using computational methods. The material varied in quality and content and included travel journals, letters, excavation reports and ephemera related to Egyptian archaeology at the end of the 19th and the beginning of the 20th centuries. These significant yet understudied documents give the historian a detailed view of the social, geographical and political history of Egypt. Starting with a preliminary collection of journal texts marked up in TEI/XML with a content tag set capturing named entities, the students used computational techniques to encode and map a much larger corpus of material.
Students working with this primary source material explored how techniques they already knew, such as named entity recognition, domain adaptation and sentiment analysis, could be applied to support work in digital humanities. Challenges included low-resource scenarios and noisy texts, problems that are likely to arise in many other current applications of NLP.