Let’s Get Technical — Text and Data Mining Support at the University of Chicago Library

By Jessica Harris (Electronic Resource Management Librarian, University of Chicago Library) <jah1@uchicago.edu> and Kristin E. Martin (Director of Technical Services, University of Chicago Library) <kmarti@uchicago.edu>

Column Editors: Kyle Banerjee (Sr. Implementation Consultant, FOLIO Services) <kbanerjee@ebsco.com> www.ebsco.com www.folio.org and Susan J. Martin (Chair, Collection Development and Management, Associate Professor, Middle Tennessee State University) <Susan.Martin@mtsu.edu>
Abstract

The University of Chicago Library has long supported research projects using non-consumptive text and data mining (TDM) methods, including acquiring datasets, licensing platforms, and directing users to APIs. As the number and breadth of research projects utilizing TDM has grown, the Library has adapted to support this growing area of research. The article will cover the evolution of the process to support TDM within the Library, including the licensing work, types of resources available, and expertise needed to have a successful TDM program.
Introduction

The University of Chicago is a mid-sized doctoral university classified as Very High Research Activity. As of fall 2021, graduate students outnumber undergraduates, with FTE numbers of 10,279 and 7,618, respectively. Researchers have long had an interest in being able to explore corpora of text and data, both for exploratory purposes and for specific research questions. To help understand the types of research the Library needs to support, we’ll begin with a definition of Text and Data Mining (TDM). A good overview, including a comprehensive definition, can be found at the Carnegie Mellon University Libraries website: “The extraction of natural language works (books or articles, for example) or numeric data (i.e., files or reports) and use of software that read and digest digital information to identify relationships and patterns far more quickly than a human can.”1 The use of automated tools to process large volumes of digital content can identify and select relevant information and discover patterns or connections. In general parlance, text mining extracts information from natural language (textual) sources, while data mining extracts information from structured databases of facts. TDM incorporates both types of sources. TDM is also sometimes called non-consumptive research (i.e., not a “close reading” of the content). Recent legal rulings have supported this type of computational, non-consumptive research as acceptable under U.S. copyright law, allowing corpora of text still under copyright, such as the full set of content available within HathiTrust, to be available for text mining.2
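To make the definition above concrete, here is a minimal, purely illustrative sketch of text mining: counting term frequencies across a small corpus to surface patterns that would take a human reader far longer to tally. The three-sentence corpus and the crude tokenizer are assumptions for illustration, not part of any actual University of Chicago project.

```python
from collections import Counter
import re

# A tiny stand-in corpus; a real TDM project would load thousands of
# licensed full-text documents rather than three hand-written sentences.
corpus = [
    "The library licensed the dataset for text mining.",
    "Researchers mine text to find patterns in the dataset.",
    "Text mining extracts patterns far faster than close reading.",
]

def tokenize(text):
    # Lowercase and keep runs of letters: a deliberately crude tokenizer.
    return re.findall(r"[a-z]+", text.lower())

# Aggregate term frequencies across the whole corpus at once --
# the "non-consumptive" step: no document is read closely.
counts = Counter(tok for doc in corpus for tok in tokenize(doc))

# The most frequent terms hint at what the corpus is "about".
print(counts.most_common(5))
```

At scale, the same pattern (tokenize, aggregate, rank) underlies far more sophisticated analyses such as collocation detection or topic modeling.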
Initial TDM research at the University of Chicago focused on collecting data sets or compiling corpora of text. For example, the Library is a member of the Linguistic Data Consortium (LDC),3 and collects linguistic data sets, first on CD-ROM and later through downloadable websites. Early on, vendors often restricted licensing of TDM content to specific research projects; files could not be shared with the full University community.

Researchers frequently approached these projects with significant technical expertise and might even develop their own technical infrastructure to host and analyze the data directly, such as the ARTFL project of French-language texts.4 Researchers approached the Library with specific requests to obtain data sets digitized by commercial vendors. The Library then negotiated special access to obtain the data sets, which were often delivered via mail on hard drives. Large-scale research projects, such as the Knowledge Lab,5 focused on collecting comprehensive data sets of full text from large scholarly publishers and metadata from academic research databases. While some providers were accommodating in supplying desired datasets, as the TDM market has grown, many viewed it as an opportunity to monetize a new research method, and pricing could be unaffordable. Other providers had concerns about sharing full data, viewing it as a loss of intellectual property, or, if aggregating content from other rights-holders, did not have the rights to supply the data.

The Situation

As the idea and use of non-consumptive research grew, new types of users began approaching the Library for services, including many graduate students and some undergraduates as well. The Electronic Resources Management (ERM) Librarian would partner with a subject specialist to help identify the research question and the types of resources needed, and then work with content providers to attempt to obtain the data. Content providers often lacked timely methods for supplying the data, leaving students disappointed, as they could not wait months for requests to go through a licensing review. Others, more technically savvy, might attempt to script the download of large quantities of articles from databases and licensed websites, violating the University’s license agreements and potentially interrupting access for the entire campus community. Clearly, things needed to change.

New mechanisms for obtaining data for TDM began to gain more traction. First, some content providers were willing to supply the full-text XML and files of large, purchased archives, particularly for public domain materials, and license them for the entire university community. Other providers began offering APIs for TDM purposes on their websites. This allowed researchers to extract the data they wanted and the publishers
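The contrast between a sanctioned API harvest and the ad hoc scraping described above can be sketched as follows. The endpoint, parameters, and API key are hypothetical, not any real provider's interface; the point is the structure: paginated requests, an identifying key, and deliberate throttling so the harvest stays within license and rate-limit terms.

```python
import time
import urllib.parse

# Hypothetical publisher TDM API -- the base URL, parameter names,
# and key are illustrative assumptions, not a real provider's service.
BASE_URL = "https://api.publisher.example/tdm/v1/articles"

def build_request_url(query, page, api_key):
    """Assemble one paginated full-text request for the (hypothetical) API."""
    params = {"q": query, "page": page, "apikey": api_key}
    return BASE_URL + "?" + urllib.parse.urlencode(params)

def harvest(query, api_key, fetch, pages=3, delay=1.0):
    """Page through results politely, pausing between requests so the
    harvest does not trip the provider's rate limits -- unlike bulk
    scraping of a licensed website, which can cut off a whole campus.

    `fetch` is any callable that takes a URL and returns its payload,
    so the harvesting logic can be tested without network access."""
    results = []
    for page in range(1, pages + 1):
        results.append(fetch(build_request_url(query, page, api_key)))
        time.sleep(delay)  # throttle: at most one request per `delay` seconds
    return results
```

Passing the HTTP client in as `fetch` keeps the sketch self-contained and lets the pagination and throttling logic be exercised with a stub.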
Against the Grain / April 2022