A New Corpus for the Islamicate World and Methods for Its Exploration

Panel 179, sponsored by Middle East Medievalists (MEM), 2017 Annual Meeting

On Monday, November 20 at 3:30 pm

Panel Description
The written heritage of the "Islamicate" cultures that stretch from modern Bengal to Spain is as vast as it is understudied and underrepresented in the Digital Humanities. The sheer volume and diversity of the surviving works produced in Persian and Arabic by denizens of these lands in the premodern period make this body of texts ideal for computational forms of analysis. Efforts to utilize these new digital forms of macro-textual analysis and digital scholarship, however, have been stymied by the lack of a reliable corpus. In an effort to address this desideratum, an international group of scholars has created the first version of a machine-actionable scholarly corpus of premodern Islamicate texts. This corpus currently includes 740 million words across 4,300 unique texts in Arabic and 9.3 million words in Persian and is already openly available. The panel will present the corpus to the field of Arabic and Persian studies, explaining how it can be used for various scholarly purposes and sharing the team's long-term vision of how to build the digital infrastructure for the computational study of Islamicate textual traditions. The participants will also present their own individual case studies of texts from the corpus, showcasing a series of digital methods of algorithmic text analysis, including text-reuse detection, stylometry, and topic modeling.
Disciplines
History
Participants
  • Dr. Paul M. Cobb -- Discussant
  • Prof. Sarah Bowen Savant -- Organizer, Presenter
  • Dr. Maxim Romanov -- Presenter
  • Dr. Nancy Khalek -- Chair
  • Dr. Matthew Thomas Miller -- Organizer, Presenter
  • Mr. Elijah Cooke -- Presenter
Presentations
  • Prof. Sarah Bowen Savant
    The Arabic tradition is populated by extremely prolific authors who wrote dozens or even hundreds of works filling many volumes. The historian and exegete al-Tabari (d. 923), for example, wrote a history totaling nearly 1.5 million words and a Qur’an commentary totaling about 2.8 million. These were only two of his many works. Whatever assumptions one makes about rates of work, it is hard to understand how a man could be so prolific, and he was less productive than some authors of later times. How were so many authors so productive, and what strategies did they employ in their works? With text reuse detection methods, we can now see at scale how authors reused past works, often extensively, and also reused their own. In particular, it is now possible to reconsider the picture of a solitary author producing works and to entertain other possibilities, including, for example, something like workshops producing works under the guidance and name of an author. In this paper, I begin with a discussion of the size of the tradition and its growth over time, including the increase from the eleventh century onwards in both the number of highly prolific authors and the number of very large works. I then turn to data showing the largest 1,000 instances of text reuse in our corpus. From this list, I discuss three authors and their works: al-Tabari, Ibn 'Asakir (d. 1176), and Ibn al-Jawzi (d. 1200).
  • Dr. Maxim Romanov
    With about 50 titles, al-Dhahabi (d. 1347) is one of the most prolific Muslim authors. Not only was he prolific, his books are also among the longest in the treasury of Arabic written tradition, particularly his 50-volume “History of Islam” (Ta'rikh al-islam). This monster of a book is understood to be a compilation of earlier sources, and our computational analysis of text reuse (identifying shared passages among texts) shows that from 20 to 40% of the volume of this book consists of quotations. The text reuse detection method, however, allows one to identify quotations only through direct comparison with the actual source of the quotations. A stylometric approach offers a perspective that helps us overcome this limitation. Closely associated with authorship attribution, stylometric analysis (particularly rolling stylometry) allows one to identify text reuse through shifts and changes in the writing style, “the authorial fingerprint”, within the same book. The application of this method does show that “al-Dhahabi’s” style in the early volumes is completely different from the style in the latest ones. Additionally, with our corpus of 4,300 Arabic texts, one can design a large-scale experiment to identify all possible sources of al-Dhahabi’s book through similarities in writing style, rather than through direct quotations. The presentation will begin with a brief explanation of the stylometric approach and will offer two experiments. The first experiment will focus on finding al-Dhahabi in al-Dhahabi’s writings through multifaceted comparison of all his available writings with each other. The second will focus on possible sources of his “History of Islam” identified through the large-scale comparison, and on how the results of stylometric analysis compare with the results of the text reuse detection method.
  • Dr. Matthew Thomas Miller
    In some of the earliest manuscripts of Sana’i’s (d. 1131) divan and ‘Attar’s (d. 1221) Mokhtar-Nameh, the term qalandariyat is applied to a group of poems that have a shared concern with a variety of different antinomian and transgressive figures, settings, and motifs. The central figure of the poetic world of the qalandariyat is the libertine “rogue” or “rascal” (qalandar/qallash/oubash/rend) and its poetic axis is the winehouse (kharabat/mey-khaneh), a heterotopic space in which the poet, adopting the persona of “poet as rogue,” exhorts the readers to reject the pretenses of superficial Islamic piety in favor of a “true infidelity” (kufr-e haqīqī). Several prominent scholars of Persian poetry, however, have questioned whether or not the qalandariyat should be regarded as a generic category. Pace proponents of this view, I will argue in this presentation that the new computational mode of textual analysis called topic modeling corroborates the early manuscript evidence indicating that the monothematic forms of qalandariyat constituted a flexible thematic genre in early Persian poetry. At the methodological level, this study also will demonstrate the utility of topic modeling for the study of medieval Persian poetry and illustrate the various ways in which its results can be leveraged for macro-level computational textual analysis of Persian and Arabic poetry.
  • Mr. Elijah Cooke
    While there are a number of existing collections of Persian and Arabic digital texts online, these collections each have certain limitations. The available digital Persian collection, for example, requires more prose chronicles and philosophical treatises. The collection of digital Arabic texts would more fully represent the Arabic literary tradition if there were more scientific texts and texts written by representatives of smaller Arabic-speaking religious communities. The most efficient way to address these lacunae and develop the Persian and Arabic digital corpora is to develop a robust Optical Character Recognition (OCR) solution for Arabic-script languages. The existing OCR solutions suffer from a variety of critical problems, foremost among which are that they are not open source and that their accuracy rates are notoriously low (often not even achieving 70% accuracy on medieval texts). In this presentation, I will present a new Arabic-script OCR solution that my team has developed, which has achieved accuracy rates over 97% on Arabic-script languages (e.g., Persian and Arabic) and Syriac. This new OCR solution uses a neural network approach that far outperforms earlier segmentation-based Arabic-script OCR models and can be trained to recognize new typefaces on as few as one thousand lines of training data. I will conclude with a presentation of our new OCR pipeline, which automates the entire OCR process, from file submission to post-correction, and thus allows novice users to produce their own digital texts from printed works.