Digital Humanities: Creating Big Data Sets and Ensuring Their Reuse into the Future
Roundtable III-4, sponsored by the Centre for Digital Humanities at Aga Khan University, Institute for the Study of Muslim Civilisations, London, and the European Research Council, 2023 Annual Meeting
Friday, November 3, at 8:30 am
This year another Digital Humanities (DH) project will conclude its work with the release of a series of large data sets: a large corpus of Arabic texts; metadata about those texts; text reuse data documenting the connections between them; and other, smaller data sets, for example relating to citation practices. The team members working on the project have used the data for their own specific purposes, focussing primarily on the study of pre-modern book history and the cultural memory of the pre-modern Middle East. The textual corpus, however, extends well beyond the medieval period and includes a wide variety of texts. The texts and associated data sets are, therefore, relevant to the field of Arabic and Middle Eastern Studies as a whole.
It was the goal of this DH project to produce properly encoded data and reliable pipelines, so that the data sets will be used and cited by scholars well beyond the project's conclusion. Project team members have taken time to consider appropriate procedures for releasing and archiving the data, and processes for updating it into the future. In addition, the project's host institution now funds a center focused on Islamicate DH, which will carry on the project's work and expand its reach into new areas.
This roundtable will discuss the data sets and some of the uses that are envisaged for them. The panel will include two project team members, as well as scholars working in a variety of specialisms who are contributing to the corpus, using the data sets, or producing work in collaboration with the new center. All roundtable panelists will be invited to view and experiment with the data sets prior to the panel, and to reflect critically on how they might use the data in their own research or teaching.
It is our hope that the roundtable will prompt a broader discussion about the future of Islamicate DH specifically. In particular, we would like to discuss how research with large textual corpora can move beyond full-text search towards other exploratory digital methods. More broadly, we hope that attendees will be able to reflect on the production of large data sets in the Digital Humanities and on the measures that should be taken to ensure the reuse of data sets beyond the narrow time limits of individual funded research projects.
In the Islamicate Digital Humanities, data sets are becoming larger and more cumbersome. With improvements in the automatic transcription of manuscripts, the publication of ever more tweets and YouTube videos, and the like, Digital Humanities corpora (and the data derived from their study) will be significantly larger in ten years' time. Already the corpus produced by the project that is the subject of this roundtable exceeds five gigabytes in size, and some of the data sets produced by the project team are even larger. This poses enormous challenges to scholarly principles of citation and reproducibility. In order to cite and reproduce digital humanities research, the data sets upon which it is based need to be stored and kept accessible for decades into the future. In order for the work behind digital methods and corpora to be properly credited, we need robust frameworks for citation.
This short presentation will consist of two parts. In the first part, I will outline the data production pipelines developed by our project, the different ways in which we have endeavored to make our data accessible and citable, and the core challenges that we have faced in doing so. In the second part, I will pose a series of questions to the roundtable concerning citation, accessibility, and data preservation in the Digital Humanities. I will further ask participants to consider how we might work collectively as a discipline to set up publication and citation frameworks that properly credit work in the Digital Humanities and ensure that research outputs continue to be useful and reusable well into the future.
As a participant in several Islamicate digital humanities projects oriented, in one way or another, towards the production and use of textual corpora with digital tools, my work has largely focused on the 'supply side' of data ecosystems. This has involved supplying and evaluating training data, but also exploring the shape of the larger Islamicate textual tradition from its origins to the present, and ensuring that our technical efforts reflect both the internal diversity of that tradition and the diverse use cases of scholars and others working with it today. My primary goal in this presentation will be to discuss that process of discovery and data generation, with a particular focus on our current work on handwritten text recognition (HTR) for Islamicate manuscripts, and to reflect on the possible uses, rewards, and challenges that might emerge from the 'datafication' of the manuscript tradition vis-à-vis the technologies on which we are working. I will also speak to my experience working with, and helping to develop, a dedicated user platform for text transcription, and to the challenges of building tools adapted to diverse use cases and diverse audiences around the world. Finally, I will briefly discuss future computational and other quantitatively rooted projects that we might wish to pursue as HTR technology renders more of this vast textual tradition digitally legible.
As PI of the project, I will present a brief summary of the project's aims; its corpus of over 2 billion words of machine-readable Arabic text; our text reuse data, comprising over 2.3 million files documenting relationships between pairs of texts; visualisations of these relationships between books; citation practices data sets; and the blogs, research chapters, and books produced by team members. My main goal will be to highlight resources for the field and to invite scholars to use them. These resources include an application that enables users to search the corpus and to read and download its texts; Zenodo releases of the corpus and text reuse data; book-to-book and other visualisations; and training sessions offered through the new center.
I am joining this roundtable not as a project member but as a scholar who has engaged with the dataset and computational tools such as Text Reuse Detection in my dissertation research. Specifically, I have used Text Reuse Detection to explore where selected ḥadīths appear within this large dataset, in order to trace their circulation history and gain a broader understanding of the evolution of ḥadīth literature, the diversity of the ḥadīth corpus, and its canonization. Working computationally with ḥadīths presents unique challenges owing to their formulaic yet highly diverse nature, but the importance of the ḥadīth corpus to all major fields of the Islamic textual tradition, and thus the vastness of its circulation history, makes computational research especially well suited to this challenge. I will present my work as a case study that illustrates one way in which scholars may engage with the dataset and its tools.
In my presentation, I will detail how I engaged with the dataset and tools on a practical level, how I adapted both to fit my needs and my research question, and the challenges that I faced in doing so and how I addressed them. Moreover, I will present different outputs of my research, such as code, raw data output, and various data visualizations, to discuss new formats of academic scholarship. Finally, I will reflect on how my engagement with this large dataset and computational tools has enabled and informed my research in Ḥadīth Studies, and on which research questions it allowed me to explore that would otherwise have remained unanswerable. I will also address the limitations of large-scale computational research and the importance of complementing it with traditional humanistic methods and close reading in order to make sense of the data: using computational tools, I did not read any less primary source material, but I read different material, and I read it differently.
I will participate in the discussion around Islamicate Digital Humanities projects and what they mean for the teaching of Arabic digital humanities, an area of growing interest both within and outside the Arab/Islamic world. As rich sources of data, such projects provide new avenues for exploration and debate, not just about the specific data, but also regarding the broader issues of reusing data, designing DH projects for a variety of teaching purposes, digital archiving and accessibility, and digital technologies for Islamicate languages. Furthermore, as co-editor-in-chief of the new Journal of Digital Islamicate Research (Brill), I will also highlight how the new journal will benefit from the existence of such projects by opening a space for scholars in this area to publish their work on a specialized platform.
As someone currently leading an initiative at Nile University in Egypt to establish a new school of Digital Humanities and Social Sciences, the first of its kind in the Middle East, I am interested in participating in the discussion around DH projects globally, and particularly in the Arab world. Arabic digital humanities is an area of growing interest both within and outside the Arab/Islamic world. I am also interested in cultural analytics for American Studies from the Arab world, which will help remedy a gap in this area by examining the United States from the perspectives of Arab and Middle Eastern peoples across the world. Unlike first-generation digital humanities research, the focus of cultural analytics is not the preservation and archiving of humanities data, but rather the use of metadata, text, and image analysis to provide meaningful insights into digital corpora. Furthermore, as founding co-editor-in-chief of the new Journal of Digital Islamicate Research (Brill), I will also highlight how the new journal will benefit from the existence of such DH projects by opening a space for scholars in this area to publish their work on a specialized platform.
Our mission is to assemble a corpus of legal and historical texts relevant to Islamic legal canons (interpretive tools), together with data science tools to analyze them. To build the corpus, we rely on and adapt many sources of digital texts, some already machine-readable and others still requiring OCR conversion. My aim in this presentation will be to present the project's work on using digital tools to gain insight into digitized texts of Islamic law and history, and to explore whether and how we can devise shared protocols and formats for treating text as data and for the texts' metadata.