The Difference that Training Data Makes: A New Query into Book History
Abstract
This paper begins with a discussion of a training data set that marks out chains of transmission (isnads) in a 1.5 billion-word corpus of Arabic texts (spanning ca. 700 to the 20th century). The presenter discusses how she and a team of historians and computer scientists have been working together to train a model to detect isnads (chains of names documenting transmission) across different types of Arabic texts (crucially, not limited to Hadith collections). The historians providing the training data began with an understanding of isnads grounded in decades of work on all genres of Arabic writing. They believed they knew what an isnad was. But weeks of working together and discussing parameters for in-text isnad annotation, plus a test study comparing their annotations of a single text, showed them that their definitions remained at odds. The variety of forms and features of isnads was far greater than they expected, which makes the performance of the computer scientists’ model, trained on their data, all the more remarkable (at present, it achieves 85% precision and 81% recall, with training data generation continuing across 2020). The paper has three main goals, all arising from a larger Arabic digital humanities (DH) project. First, I will consider the ways that “humanist” researchers can and should inform the models guiding our emerging understanding of matters such as isnad identification, text reuse detection, and optical character recognition. The exchanges between computer science and the humanities can be mutually informing. I discuss aspects of project design relevant to maximizing the benefit for both, as well as the challenges we have faced. Second, I want to show what the case of isnad identification can teach us about book history specifically. Our work on isnads forces a rethink of many assumptions in current scholarship on book history (including how isnads relate to one another in large works).
Finally, I propose that a crucial distinguishing feature of DH is its capacity to bring research to the public. To illustrate this, I present a web application that makes use of our project's training data.
Discipline
History
Geographic Area
None
Sub Area
None