KITAB by Blazon Publishing and Media Ltd

Uncovering connections between Arabic texts

The Arabic textual tradition is large and complex, and historically authors frequently made use of earlier texts in their own works. The KITAB project is harnessing the power of technology to detect examples of text reuse across a large corpus of material, helping to build a deeper picture of the relationship between different authors and their books, as Professor Sarah Bowen Savant explains.

The Arabic written tradition is enormously inter-textual, with authors from different periods often using ideas and excerpts from other scholars in their own works. Researchers have historically identified relations between texts by essentially lining them up and reading them side-by-side, without the use of technology, as Professor Sarah Savant explains. “Previously you couldn’t do this in a modern, digital way,” she says. Based at the Aga Khan University in London, Professor Savant is the Principal Investigator of the KITAB project, an ERC-funded initiative which is developing tools to help researchers identify where authors have re-used material from other works. “The term that’s used is text reuse. It signifies repetition, in whole or in part, of one text in another, with or without citation,” she outlines. “It can include paraphrasing, while it can also be very literal. The ways in which it occurs affect how we are able to find it, and also the ways in which we are able to measure it.”

An algorithm called passim is central to this, enabling researchers to identify cases where authors borrowed language and ideas from other works. Created by David Smith, a computer scientist at Northeastern University, it operates in ways broadly similar to the anti-plagiarism software that is used in many universities across the world, as it looks for common sequences. “The original idea with KITAB was that we can use software to find similarities in the machine-readable Arabic files widely available online today,” says Professor Savant.
Adapting passim for Arabic and the large written tradition has had its challenges. “We have also had to work to tailor passim to show us meaningful patterns of text re-use,” acknowledges Professor Savant.
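As a rough illustration of what looking for common sequences can mean, the following Python sketch compares two passages by the word trigrams they share. It is a simplified stand-in and not passim's implementation; the function names, the trigram size and the sample sentences are all assumptions made for this example.

```python
from typing import Set, Tuple

Ngram = Tuple[str, ...]


def word_ngrams(text: str, n: int = 3) -> Set[Ngram]:
    """Return the set of overlapping word n-grams in a text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(a: str, b: str, n: int = 3) -> Set[Ngram]:
    """Return the n-grams that occur in both texts."""
    return word_ngrams(a, n) & word_ngrams(b, n)


def reuse_score(a: str, b: str, n: int = 3) -> float:
    """Fraction of the n-grams in `a` that also occur in `b` (0.0 to 1.0)."""
    grams_a = word_ngrams(a, n)
    if not grams_a:
        return 0.0
    return len(grams_a & word_ngrams(b, n)) / len(grams_a)


# Hypothetical sample passages (English stand-ins for two related texts).
earlier = "the caliph marched to the city and took it by force"
later = "it is said that the caliph marched to the city in that year"

common = shared_ngrams(earlier, later)   # 4 shared trigrams
score = reuse_score(earlier, later)      # 4 of the 9 trigrams in `earlier`
```

Here four of the nine trigrams in the earlier passage recur in the later one, even though the later passage carries no citation. The sketch only counts exact matches; it does not attempt to normalise spelling or to align the matching passages.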

KITAB project

A large corpus of material has been assembled in the project, dating from around the 8th century right up to the 20th century. Altogether, the corpus comes to approximately 1.5 billion words. By harnessing the power of technology, researchers aim to build a deeper picture of the relationship between different texts. “We’re building a large corpus, while we’re also looking to reveal connections between these texts,” says Professor Savant.

The texts themselves address a wide variety of different topics. “There are lots of texts pertaining to the religious history of Islam, on theology and lots of commentaries on the Quran. An enormous quantity of the texts that we have in the corpus pertain to the Prophet,” continues Professor Savant. “What there’s not a lot of are works on groups that were not part of the main narratives of early Islamic history. These groups were not part of the major scholarly apparatus of Sunni or Shi’i Islam.”

There is a degree of survival bias in the corpus, as, for example, it’s more difficult to obtain texts on the less prominent groups in early Islamic history. It’s important in the project to build a broadly representative corpus, so Professor Savant and her colleagues are working on ways to add texts that treat all branches of knowledge in Arabic. “Our aim is to look at as many texts as possible,” she says.

A large number of cases where there is a relationship between two texts have already been identified, and now the research team is focusing its attention on the most interesting examples. One particular interest is isnads. “In Arabic books we commonly find isnads, which are chains of transmission. These isnads are also found in histories and works of geography; they’re all over the Arabic textual tradition,” explains Professor Savant. “The Islamic textual tradition involves a lot of citations and as an author, who you cite matters.”

This does not mean that authors invariably cited all of the different scholars that influenced their own work, however. For example, Professor Savant is studying a historian from the 10th century, who was generally known for making numerous citations, yet in a specific case he did not credit a major influence. “He didn’t cite a scholar who lived about a generation earlier than him at all, even though he took a substantial portion of his work on the reign of one of the caliphs from him,” she says.

This kind of issue is fairly common in book history. “On the other hand, sometimes the citations are so numerous and complex that we can’t figure out what they actually mean,” outlines Professor Savant. “Citation is a part of the work, while a lot of texts are also more distantly related, where authors are writing about the same topics and using common language, but it’s very difficult to say who borrowed from whom.”

There may also be variations between versions of ostensibly the same works, another area of interest in the project, with researchers comparing differences mathematically. The corpus itself is regularly updated and published for the wider research community. “Other researchers can also use the corpus, for example researchers in computational linguistics,” says Professor Savant.

The methods that have been developed in the project could also be applied to analyse texts in other languages, a possibility that Professor Savant is keen to explore in the future. “These methods, including the ways in which we built the corpus, the OCR and the text reuse detection methods, are re-usable for other languages,” she continues. “One goal is to do the same kinds of work with other languages used by Muslims, such as Persian and Urdu for example. We’ve also been asked if we could do similar work with French, and the answer is absolutely yes.”

The project’s work with Arabic is still ongoing, and Professor Savant has been sharing their results at conferences and events within the wider research community. A number of articles are being written as part of the project, and a virtual reading environment is being developed, which will provide a forum for discussion and debate. “We will give users the ability to see our texts and text reuse data, and will use technology to showcase the results of team members. We’ll have research articles written by team members, and we’ll display them on our website,” says Professor Savant.

Each of the dots on this graph represents a book within the OpenITI corpus. They are plotted according to the century in which the author died (x-axis) and the percentage of the work that consists of transmissive chains (y-axis). The red line represents the middle two quartiles for all books (blue = histories only).

Manuscript Köprülü 01589 from the Süleymaniye Library in Istanbul. Members of the KITAB project are transcribing folios such as these ones.

The KITAB project is developing analytical tools to study the relationships between texts. This Power BI dashboard shows the many books in the project’s corpus that align to a major anthology of the 14th century entitled The Ultimate Ambition in the Arts of Erudition by Shihāb al-Dīn al-Nuwayrī (d. 732/1332). A user can click on a dot in the top part of the visualization and an alignment between al-Nuwayrī’s book and an earlier book will appear in another panel.

EU Research

www.euresearcher.com

KITAB

Knowledge, Information Technology, and the Arabic Book

Studying the formation and development of the written Arabic tradition with digital methods

Project Objectives

The KITAB project uses the latest digital technology to investigate the history of the Arabic textual tradition, one of the most prolific in human history. Its chief interest lies in how authors assembled works out of earlier ones. The circulation of texts will show how cultural memory was negotiated and shaped in the medieval Islamic world (ca. 700-1500).

Project Funding

KITAB is an initiative funded by the European Research Council (ERC). Grant agreement ID: 772989. The project also has funding from the Qatar National Library and the Andrew W. Mellon Foundation.

Project Partners

• Qatar National Library
• Northeastern University
• University of Maryland
• University of Vienna
• Leipzig University

Contact Details

Professor Sarah Bowen Savant
KITAB
Aga Khan University (International) in the United Kingdom
Aga Khan Centre
10 Handyside Street
N1C 4DN London
United Kingdom
T: +44 207 380 3843
E: sarah.savant@aku.edu
W: http://kitab-project.org

Professor Sarah Bowen Savant

Professor Sarah Savant is a cultural historian specialising in Iran and the Middle East, based at the Aga Khan University. She is the author of The New Muslims of Post-Conquest Iran: Tradition, Memory, and Conversion, which won the Saidi-Sirjani Book Award, given by the International Society for Iranian Studies on behalf of the Persian Heritage Foundation.