Handwritten Text Recognition for 18th Century Ottoman Turkish Documents
Abstract
Handwritten text recognition (HTR) has emerged as an important method for digital humanities practice in the last decade, especially for the field of history. Transkribus, an EU-funded software package designed to train machine-learning-based HTR models, has seen great success for multiple, primarily European, languages. In this presentation, I will discuss my ongoing work on training an HTR model for eighteenth-century Ottoman Turkish on Transkribus. Having completed two rounds of training, I will comment on the process and the potential of text recognition in Ottoman studies. In my model, I use eighteenth-century bureaucratic documents that contain summaries of events and newspaper articles compiled in the western borderlands of the Empire. I argue that machine-learning-based models for text recognition will expand the horizons of historical research and archival studies, allowing for what I call “Ottoman distant reading”. Distant reading is not a replacement for close reading and engagement with historical materials. Rather, computer-generated transcriptions will allow researchers to assemble large corpora and bring new questions and approaches to historical sources. I am particularly interested in how Ottoman officials in the borderlands conceptualized news and information. Did they signal that the content they received from their spies differed in value from the content they received from their servants or from newspaper translations? Did the location of their sources affect their reliability? These questions and many others would benefit from new approaches. Moreover, there are no tools available for full-text keyword search in Ottoman Turkish documents. In their search queries, researchers are limited to the summaries provided by archivists. These summaries, while helpful, are produced from the perspectives of the archives and may omit some important topics, including those related to marginalized groups and non-hegemonic communities.
In my paper, I will outline the process and challenges of editing and curating documents for model training and reflect on the results of different stages of the training process. Because text technologies, including Transkribus, are developed with European languages in mind, scholars working with under-resourced languages like Ottoman Turkish face additional barriers. While Ottoman Turkish was written right to left in the Arabo-Persian script, modern academic conventions follow transcription guidelines that produce texts in Latin script. The transcription process therefore also involves the addition of vowels and the reversal of the writing direction. These characteristics of Ottoman Turkish complicate model training but in turn advance the frontiers of text technologies.
Discipline
History
Geographic Area
Ottoman Empire
Sub Area
13th-18th Centuries