Finding Meaning in 2 Billion Words of Arabic Books

Panel IV-8, sponsored by Aga Khan University-ISMC, Centre for Digital Humanities, 2023 Annual Meeting

On Friday, November 3 at 11:00 am

Panel Description
The premodern Islamic world offers a spectacular opportunity to consider the construction of the cultural stores of knowledge from which groups derive awareness of their unity and particularity. The thousands of texts surviving from the period of 700 to 1500 CE can now be studied in completely new ways using pioneering digital technology that operates much like genetic sequencing or anti-plagiarism software to identify matching textual units. With it, we can gain an unprecedented view of the development of this world-historical tradition, its main patterns of transmission, and the networks that participated in its generation and regeneration across time and across a geography stretching from modern-day Spain to South Asia. This panel presents the fruit of the first research project to develop this technology for Arabic. The project concludes this year, and includes among its accomplishments a corpus of over 2 billion words of Arabic texts in machine-readable format, including: scholarly vetting of over 1,000 texts; the development of text reuse detection and other methods to study this corpus; the creation of data visualisations; the building of an open-access platform to study the relationships between works in the corpus; the innovation of a pipeline that brings these components together, and that is designed for large text corpora; public events, including a major exhibition; and the development of a semester-long course for teaching Digital Humanities methods for all Arabic-script languages. In addition, the project has produced much research, which is the primary focus of this panel. The panelists represent members of the project team. They present the project’s general findings, and also research they have personally undertaken using the project’s data and digital resources.
Through the papers, the panelists seek to illustrate how digital methods can enrich our understanding of history, its authors, and their creations in Arabic books across time and space.
  • The largest Arabic book prior to 1500 is The History of Damascus (Taʾrīkh Madīnat Dimashq, hereafter, TMD), written by Ibn ʿAsākir (d. 571/1176). Totalling over eight million words, it consists of two volumes, treating the history and religious significance of the city and Syria, followed by 72 volumes treating the biographies of the elites who lived in or passed through the region up to Ibn ʿAsākir’s time. How, in premodern times, did he compose such a massive book? What were his sources? An argument has been advanced that he relied upon a library, and its contents are proposed based chiefly on isnāds and citations within the TMD itself. The paper relies on a new data set to propose a different understanding of the source base of the TMD and indeed, the character of the book itself. The first element of the data set is 77,000 isnāds extracted from the book, plus the names he cites (including authors) and the terms by which he cites them. A second element is from search results across the TMD, including for titles and author names. The final element is text reuse alignments between the TMD and 37 books which scholars today say Ibn ʿAsākir used to compile the TMD and for which we have machine-readable files. We use this data to argue that the character of his source base is better represented by notes, excerpts and compilations (e.g., various lists, including those that start with the term tasmiya) than by books sitting on the shelf of a library. Indeed, he rarely cites books, even those we can see he used. Rather, Ibn ʿAsākir spells out what he is doing, thousands of times: relying on many connections and numerous personal relationships running back decades and spanning a vast geography. He assembles pieces of reporting for the TMD from this network, and indeed, would memorialise through the isnāds the community of scholars that it represents. He wants to show us these personal relationships, including their connections back in time to the Prophet. 
We conclude with a description of the many other works by Ibn ʿAsākir that focus on isnāds and which likely provided material for the TMD, including those that are lost and in an unpublished manuscript. These also help us to understand how he worked and more about the character of the TMD itself.
  • Distant reading methods allow us to identify and analyze shared passages (direct text reuse and paraphrase) across our premodern Arabic corpus, but also to identify similarities and differences in topics and style, based on linguistic features of the texts: word frequencies, function words, rare words, and so on. Stylometry, or computational stylistics, deals with the latter. While text reuse and stylometry may work against each other (unacknowledged borrowing affects topic-agnostic quantitative measures such as word frequencies), they can also be complementary. Attribution of anonymous or unacknowledged authorship and genre detection are among the main applications of stylometry, and we have already done some work in that direction for premodern Arabic. These applications and their potential will be explored through the case of the Epistles of the Brethren of Purity (Rasāʾil Ikhwān al-Ṣafāʾ), a collection of 52 treatises plus two other "sister" treatises (Risālat al-Jāmiʿa and Risālat Jāmiʿat al-Jāmiʿa) on a wide range of subjects, purportedly all known disciplines, by one or more anonymous authors supposedly in tenth-century Basra. The authorship, arrangement, and composition date of the Epistles are disputed, and competing theories have been put forward. The widely varying manuscript tradition, attested only from about two centuries after the composition of the Epistles, suggests an open-text tradition from the beginning. Problematic modern editions have further complicated the matter, and the potential of digital scholarship on the Epistles has yet to be realized. As the first application of stylometry to the Ikhwanian corpus, this paper aims to make a two-fold contribution: to these scholarly debates, by considering the stylistic features of the epistles, and to the application of computational linguistics to a corpus of premodern texts in Arabic.
  • Pre-modern Arabic texts (particularly chronicles) bear witness to shockingly violent events. The tradition is often primarily concerned with providing records of battles, sieges, invasions and their outcomes. Less common, but still quite present, are accounts of famines and epidemics and their aftermath. Although modern scholarship has discussed the causes and impacts of famine in the pre-modern Middle East, there has been little critical consideration of the Arabic accounts themselves. Anyone who has read an Arabic chronicle knows that famine is frequently outlined in a formulaic manner, but very occasionally a chronicler provides a longer, more dramatic account. Through text reuse detection, our project has shown how historians frequently recycle material from earlier historical accounts for their narratives. Ibn al-Athīr’s Kāmil fī-l-Taʾrīkh relies heavily on descriptions from al-Ṭabarī’s famous Taʾrīkh, and al-Nuwayrī’s Nihāyat al-Arab recycles material from Ibn al-Athīr’s Kāmil. Other computational methods, such as topic modelling, allow us to see how different events are described across the Arabic textual corpus using the same phrasing (that is, to study how formulaic phrasing is used through a corpus). Through a combination of these methods, this paper will explore how Arabic chroniclers dealt with famine. This will show how, quite contrary to Arabic accounts of battles and sieges, detailed descriptions of famine are rare, and when they do appear they are infrequently reused. This is particularly the case for the most horrific descriptions of famine. These patterns remind us of the humanity of the historians that we study. They lived in a world where famine was a near-constant fear and where an author was likely to witness such an event first-hand or hear about such an event from witnesses. These experiences were likely to impact how they dealt with and reproduced accounts of similar, traumatic events.
  • When studying a text, one is often interested in discovering the sources used to write it. Principally, there are two kinds of evidence one can use to track down a source: citations, in which the author directly points to a source, and text reuse without such acknowledgement, which requires the reader to know the source well enough to recognize it. While a manual approach to source retrieval works well in small-scale settings where the reader knows both the text and its possible sources well, the problem becomes much more complicated and intractable as the number of texts of interest and possible sources increases, as is seen in a corpus the size of the 2 billion word corpus under discussion in these papers. In this paper we will explore the possibility of applying recent developments from natural language processing to learn to replicate human judgments of source attribution. Modern large language models are a mathematical tool used to encode the meaning of texts as points in high-dimensional space. Texts with similar meanings have similar representations. Some models, like BART, can be used to determine how useful a piece of text (here we call it the source) is for generating a target text of interest. In our experiments, we use the BART model as a signal for ranking possible sources by measuring how useful possible sources are for reconstructing the text that employs them. As a case study, we will work with annotated data from al-Maqrizi’s (d. 845/1442) Mawa’iz (perhaps the most extensive premodern topographical history of Egypt) and several sources he used in writing it. The Mawa’iz is an excellent source for this task as the author uses a wide variety of sources, citing some of them but not all. A study of the text in this manner promises to shed light on al-Maqrizi’s citation practices and contemporary attitudes towards citation and reuse.
We compare several forms of model, ranging from basic n-gram retrieval models to reranking with language models and using a generative model to guide a retrieval model based on dense embeddings.
  • This paper will focus on the results of a years-long initiative aimed at recovering the “lost” versions of the Sīrah collected and composed by Muḥammad b. Isḥāq (85/704-150/767 or 159/775-6). Ibn Isḥāq performed from, or produced copies of, a collection of scripts for approximately 149 people, 38 of whom possessed “complete” story collections. The most famous collection is the copy witnessed by Ziyād b. ʿAbd Allāh al-Bakkāʾī (d. 183/799), which is preserved in the redaction and commentary of ʿAbd al-Mālik b. Hishām (d. 218/833). Since the groundbreaking work of Johann Fück, scholars have maintained that there were only superficial differences between witnessed versions, even though scholars such as Ibn Ḥajar al-ʿAsqalānī (d. 852/1449) demonstrated how different ṭuruq of the Sīrah contained significant variations. The results of the project support this premodern contention and demonstrate that there were major differences in the narrative content and structure, characters, and events as they are presented in different witness collections. Working in collaboration with a team of digital humanities scholars, we developed methods for the collection, organization, and presentation of each witness. Textual fragments of each “original” witness version have been extracted from a digital corpus of classical Arabic works using a variety of automated methods, especially the Targeted Isnād Locator (TIL), an application that uses an algorithm to identify authors and extract quotations by isolating naming combinations and a range of transmissive terms in an isnād. The TIL, in combination with other methods of textual extraction, has recovered over 1.5 million words of text for 112 witnesses. After extraction, each quotation is analyzed, annotated, and placed in an order that reflects the “logical” structure of the “original script” as it was theoretically presented to each witness.
The first iteration of the project will be published via an open-access website that allows scholars to analyze the Sīrah corpus using a variety of analytical tools. A prototype reading environment will go online in 2023, presenting the earliest version of the script produced by Ibn Isḥāq, and witnessed by Ibrāhīm b. Saʿd (d. 183/799), while both men were still living in Madīnah before the end of the ʿAbbāsid Revolution. The presentation will highlight the functions of the TIL and the digital reader and demonstrate how the collection of the Sīrah changed over time as Ibn Isḥāq shaped his narrative to reflect the interests and tastes of his various audiences.
  • The Islamic tradition is characterized by a very high level of text reuse, and more or less identical chunks of text reappear in works centuries apart. In fact, many works from the Islamic tradition, albeit clearly attributed to a specific author, contain very few words actually written by that author. In extreme cases, disregarding introductions, de facto none of the actual content was written by the individual figuring as the author. This paper focuses on pre-modern Arabic lexicographical works as an example and contends that such a high degree of text reuse is a function of how pre-modern scholarly fields collectively moved forward. On the one hand, because scholars worked on similar subjects independently and wrote sequels, commentaries, and the like on each other's works, relevant knowledge inevitably ended up scattered across separate works, which furthermore often circulated in different versions. This situation was periodically met with the creation of new works that attempted to force the scattered material into one new and unified context of circulation and hence effectively, and more or less temporarily, “un-scattered” the field. On the other hand, similarly reflecting practical needs, pre-modern authors often betray an interest in the intelligent organization of their material. The intelligent organization of works not only made it easy to navigate them and retrieve the information found in them, but also spared readers the frustration caused by works that were hard to use. Often pursued in one and the same work, both the attempt to bring existing material together in one place and the interest in the intelligent (re-)organization of that material require the reuse of text.
This paper seeks to show how data generated by computational and algorithmic methods of text reuse detection can be used not only to visualize pre-modern efforts of “un-scattering” and intelligent (re-)organization, but also to explore which works succeeded in establishing themselves as stable vantage points for subsequent scholarly activities and thus lastingly kept textual material in circulation in a particular constellation.
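As a rough illustration of the kind of computation that text reuse detection and basic n-gram retrieval involve, the sketch below ranks candidate source passages by how many word n-grams they share with a target passage. It is a toy under stated assumptions, not the project's pipeline: the function names (`word_ngrams`, `jaccard`, `rank_sources`), the whitespace tokenisation, the choice of 3-word n-grams, and the placeholder texts are all hypothetical. Real systems for Arabic would also need orthographic normalisation and passage alignment.

```python
from typing import Dict, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int = 3) -> Set[NGram]:
    """Return the set of word n-grams in a text (naive whitespace tokens)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard overlap of two n-gram sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_sources(target: str, sources: Dict[str, str],
                 n: int = 3) -> List[Tuple[str, float]]:
    """Rank candidate sources by n-gram overlap with the target, best first."""
    target_grams = word_ngrams(target, n)
    scored = [(name, jaccard(target_grams, word_ngrams(text, n)))
              for name, text in sources.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder passages (hypothetical data, single letters standing in for words).
target_passage = "a b c d e f"
candidate_sources = {
    "unrelated": "p q r s",
    "overlapping": "x a b c d y",
}
ranking = rank_sources(target_passage, candidate_sources)
```

Here `ranking` lists the candidate that shares n-grams with the target ahead of the unrelated one. Set overlap of this kind only captures near-verbatim reuse; paraphrase is what motivates the language-model approaches described in the abstracts above.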