Abstract
When studying a text, one is often interested in discovering the sources used to write it. Principally, there are two kinds of information one can use to track down a source: citations, in which the author points directly to a source, and text reuse without such acknowledgement, which requires the reader to know the source well enough to recognize it. While a manual approach to source retrieval works well in small-scale settings where the reader knows both the text and its possible sources well, the problem becomes far more complicated and intractable as the number of texts of interest and possible sources grows, as is the case with the 2-billion-word corpus under discussion in these papers.
In this paper we explore the possibility of applying recent developments in natural language processing to learn to replicate human judgments of source attribution. Modern large language models are mathematical tools that encode the meaning of texts as points in a high-dimensional space: texts with similar meanings have similar representations. Some models, such as BART, can be used to determine how useful a piece of text (here called the source) is for generating a target text of interest. In our experiments, we use the BART model as a signal for ranking candidate sources by measuring how useful each is for reconstructing the text that employs it. As a case study, we work with annotated data from al-Maqrizi’s (d. 845/1442) Mawa’iz (perhaps the most extensive premodern topographical history of Egypt) and several sources he used in writing it. The Mawa’iz is an excellent test case for this task, as the author draws on a wide variety of sources, citing some of them but not others. A study of the text in this manner promises to shed light on al-Maqrizi’s citation practices and on contemporary attitudes towards citation and reuse. We compare several types of model, ranging from basic n-gram retrieval models to reranking with language models and using a generative model to guide a retrieval model based on dense embeddings.
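To make the scoring signal concrete, the sketch below illustrates one way such a ranking could be computed. It assumes the Hugging Face Transformers library and the publicly available facebook/bart-base checkpoint; the model, preprocessing, and candidate texts shown here are illustrative assumptions rather than the exact setup used in the study.

```python
# Illustrative sketch: rank candidate sources by how well BART can
# reconstruct a target passage when given the candidate as input.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.eval()


def source_score(source: str, target: str) -> float:
    """Return a usefulness score for a candidate source.

    The score is the negative mean cross-entropy of the target tokens
    conditioned on the source; higher values mean the source makes the
    target easier to generate.
    """
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss  # mean token cross-entropy
    return -loss.item()


# Hypothetical candidate passages and target passage for illustration only.
candidates = ["candidate source passage A ...", "candidate source passage B ..."]
target = "passage from the text under study ..."
ranked = sorted(candidates, key=lambda s: source_score(s, target), reverse=True)
```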