Let’s Get Technical — Subject Heading Prediction

By Tris Shores (Developer, PredictiveBIB project) <trishores@gmail.com>

and Alisha Taylor (Monograph & Media Cataloging Coordinator, University of Illinois at Urbana-Champaign, and former cataloger at Ingram Books) <alisha@illinois.edu>

Column Editors: Kyle Banerjee (Sr. Implementation Consultant, FOLIO Services) <kbanerjee@ebsco.com> www.ebsco.com www.folio.org

and Susan J. Martin (Chair, Collection Development and Management, Associate Professor, Middle Tennessee State University) <Susan.Martin@mtsu.edu>

Introduction

One of the most labor-intensive aspects of original cataloging is Library of Congress subject heading (LCSH) assignment, as it requires familiarity with subject area terminology, interpretation of subject authority records, and construction of compound subject headings with subdivisions. Since the LCSH vocabulary numbers in the hundreds of thousands of terms and can be quite esoteric, specific, or dated, catalogers may find it easier to categorize a book’s content using terms (aka descriptors) drawn from a simpler and more contemporary controlled vocabulary relevant to a collection’s subject areas. Descriptors (from a custom vocabulary) and LCSH for two books are shown below:

Assuming it’s faster for catalogers to come up with descriptors than LCSH terms, this article describes a technique for automated prediction of LCSH based on a cataloger’s selection of book descriptors. At first glance, the extra step of assigning descriptors appears to slow down the cataloging process, but in reality the extra step is akin to taking the time to chop up a potato before eating it. This technique is especially relevant to organizations that create original bibliographic records in a production environment where time-savings and reduced complexity are important considerations.

Predicting LCSH Using a Descriptor-LCSH Map

At the core of this technique is a descriptor-LCSH map (Map), which associates book descriptors with LCSH. Descriptors are not only drawn from a controlled vocabulary; each descriptor should also have a one-to-many relationship with LCSH (in other words, the descriptor vocabulary should be more general than LCSH).

The following Map fragment, portrayed as a concept map, shows the descriptors ‘Combat’, ‘Paranormal’, and ‘Outer space’ individually mapped to various LCSH:

A Boolean AND search for LCSH associated with the descriptors ‘Combat’ and ‘Paranormal’ returns: ‘Women superheroes’ and ‘Yoda (Fictitious character from Lucas)’.
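The Boolean AND search described above can be sketched as a set intersection. This is a minimal illustration, not the PredictiveBIB implementation: the dictionary-of-sets layout and the function name `predict_lcsh` are assumptions, and the Map fragment contains only associations stated in this article's text.

```python
# Map fragment limited to associations mentioned in the article text.
# The dict-of-sets layout is an assumption made for illustration.
DESCRIPTOR_LCSH_MAP = {
    "Combat": {
        "Women superheroes",
        "Yoda (Fictitious character from Lucas)",
        "Space warfare",
    },
    "Paranormal": {
        "Women superheroes",
        "Yoda (Fictitious character from Lucas)",
    },
    "Outer space": {"Space warfare"},
}


def predict_lcsh(descriptors, descriptor_map):
    """Return the LCSH mapped to every given descriptor (Boolean AND)."""
    lcsh_sets = [descriptor_map.get(d, set()) for d in descriptors]
    if not lcsh_sets:
        return set()
    return set.intersection(*lcsh_sets)


result = predict_lcsh(["Combat", "Paranormal"], DESCRIPTOR_LCSH_MAP)
print(sorted(result))
# ['Women superheroes', 'Yoda (Fictitious character from Lucas)']
```

An unknown descriptor maps to an empty set here, so it yields no suggestions rather than an error; a real service might prefer to report the unrecognized descriptor instead.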

Another Map fragment, portrayed as a Venn diagram, shows the descriptors ‘People’, ‘Young’, ‘Animals’, ‘Relationship’, ‘Paranormal’, and ‘Religion’ individually mapped to various LCSH:

A Boolean AND search for LCSH associated with the descriptors ‘People’, ‘Animals’, and ‘Relationship’ returns ‘Human-animal relationships’ and ‘Aviculture’.

A mature Map is likely to contain hundreds of descriptors mapped to thousands of LCSH, but can be rapidly queried by a computer to extract the associated LCSH for a given set of descriptors (using a Boolean AND search). One implementation strategy is to incorporate the Map in a cloud API service. Catalogers would use the cloud service by making an API request that sends the descriptors for a book and an LCSH type (main topic, subtopic, geographic, or chronological), and in return receive a list of auto-suggested LCSH (of the requested type) ranked by LCSH usage popularity. Optimally, cataloging software will automate the cloud API calls on behalf of catalogers.
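The request handling described above (AND search, filter by LCSH type, rank by popularity) might look like the following on the service side. Everything here is a hedged sketch: the function name `suggest_lcsh`, the in-memory tables, the type tags, and the usage counts are all hypothetical stand-ins, not the actual cloud API or its data.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the cloud service's data.
DESCRIPTOR_MAP = {
    "People": {"Human-animal relationships", "Aviculture"},
    "Animals": {"Human-animal relationships", "Aviculture"},
    "Relationship": {"Human-animal relationships", "Aviculture"},
}
# Type tags are assumed for illustration only.
LCSH_TYPE = {
    "Human-animal relationships": "main topic",
    "Aviculture": "main topic",
}
# Usage counts are invented illustrative values, not real statistics.
LCSH_USAGE = Counter({"Human-animal relationships": 12, "Aviculture": 3})


def suggest_lcsh(descriptors, lcsh_type):
    """Return LCSH of the requested type shared by all descriptors,
    most frequently used first (ties broken alphabetically)."""
    lcsh_sets = [DESCRIPTOR_MAP.get(d, set()) for d in descriptors]
    candidates = set.intersection(*lcsh_sets) if lcsh_sets else set()
    of_type = [t for t in candidates if LCSH_TYPE.get(t) == lcsh_type]
    return sorted(of_type, key=lambda t: (-LCSH_USAGE[t], t))


print(suggest_lcsh(["People", "Animals", "Relationship"], "main topic"))
# ['Human-animal relationships', 'Aviculture']
```

Requesting a type with no matching terms (for example `"geographic"` in this fragment) simply returns an empty list.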

Just as important, catalogers may enrich the Map by making an API feedback call on completion of each book cataloged to report the book’s assigned descriptors and LCSH (complex subdivisions must be separated and tagged by LCSH type). Consequently, as books are cataloged the Map grows organically using crowdsourced cataloger-generated metadata. Crowdsourcing allows the map to expand more quickly, but requires contributing catalogers to use a common descriptor vocabulary. Different collection types (e.g., medical, legal, public library) would have different descriptor vocabularies and the Maps would be very different. Some institutions or consortiums may prefer to adopt a private descriptor vocabulary and a private Map (internally crowdsourced) to retain control over those components.

Automated Construction of the Descriptor-LCSH Map

The following steps outline how automated processing of cataloger-assigned descriptors and LCSH terms can extend the Map and enhance LCSH prediction. On receipt of a cataloged book’s descriptors and LCSH terms, a computer will:

1. Increase the set of LCSH terms to also include variant terms.

2. Identify all possible descriptor-LCSH pair combinations by pairing up each descriptor with each LCSH term.

3. Mark all descriptor-LCSH pairs as tentative since the validity of each pair is conditional on a semantic association between a descriptor and its paired LCSH term.

4. Use a word/phrase thesaurus to help evaluate the validity of each descriptor-LCSH pair. For a book with descriptors ‘Outer space’, ‘Combat’, and ‘Romance’, and the LCSH term ‘Space warfare’, semantic analysis would assess the pairs as follows:
   a. Descriptor ‘Outer space’ and LCSH term ‘Space warfare’: valid pair due to a common word with same contextual meaning.
   b. Descriptor ‘Combat’ and LCSH term ‘Space warfare’: valid pair since combat and warfare are synonyms.
   c. Descriptor ‘Romance’ and LCSH term ‘Space warfare’: invalid pair since they are unrelated semantically.

5. Use pattern matching to help evaluate the validity of each descriptor-LCSH pair by looking for consistent usage of the pair across a large dataset of bibliographic records (see step 7). Using the preceding example, pattern analysis might assess the pairs as follows:
   a. Descriptor ‘Outer space’ found in 98% of records that contain the LCSH term ‘Space warfare’: valid pair due to high consistency.
   b. Descriptor ‘Combat’ found in 95% of records that contain the LCSH term ‘Space warfare’: valid pair due to high consistency.
   c. Descriptor ‘Romance’ found in 42% of records that contain the LCSH term ‘Space warfare’: invalid pair due to low consistency.

6. Only validated descriptor-LCSH pairs are added to the Map. LCSH type tags (main topic, subtopic, geographic, or chronological) are preserved.

7. A usage counter for each validated (and invalidated) descriptor-LCSH pair is incremented to assess occurrence frequency and facilitate future consistency checks.

8. A usage counter for each LCSH term is incremented to track LCSH usage frequency (popularity).

Since the Map auto-updates in real time, newly added LCSH terms are immediately available as candidates for auto-suggestion. As a quality control measure, an updated Map can be scoped to the cataloging institution that effected the change pending a manual review and approval for general release (analogous to a Git pull request). It is also prudent for participant institutions to restrict API feedback calls to experienced catalogers so that inexperienced catalogers are prevented from modifying the Map but are still able to request LCSH and leverage the knowledge of more experienced catalogers.
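A rough sketch of the pairing, counting, and consistency parts of this process (steps 2, 5, 6, 7, and 8) follows. It deliberately leaves out variant-term expansion (step 1) and thesaurus-based semantic analysis (step 4). The class name `MapBuilder`, the 90% threshold, and the minimum record count are assumptions chosen for illustration; the article gives only example percentages, not a cut-off.

```python
from collections import Counter
from itertools import product

CONSISTENCY_THRESHOLD = 0.9  # assumed cut-off, not stated in the article
MIN_RECORDS = 3              # assumed minimum evidence before validating


class MapBuilder:
    """Accumulates cataloger feedback and derives validated pairs."""

    def __init__(self):
        self.pair_count = Counter()  # step 7: (descriptor, LCSH) usage
        self.lcsh_count = Counter()  # step 8: LCSH usage (popularity)

    def add_feedback(self, descriptors, lcsh_terms):
        """Record one cataloged book's descriptors and LCSH terms."""
        descriptors, lcsh_terms = set(descriptors), set(lcsh_terms)
        for term in lcsh_terms:
            self.lcsh_count[term] += 1
        # Step 2: pair every descriptor with every LCSH term.
        for pair in product(descriptors, lcsh_terms):
            self.pair_count[pair] += 1

    def consistency(self, descriptor, term):
        """Share of records containing `term` that also had `descriptor`."""
        total = self.lcsh_count[term]
        return self.pair_count[(descriptor, term)] / total if total else 0.0

    def validated_map(self):
        """Steps 5 and 6: keep only pairs with consistently high co-usage."""
        result = {}
        for descriptor, term in self.pair_count:
            enough = self.lcsh_count[term] >= MIN_RECORDS
            if enough and self.consistency(descriptor, term) >= CONSISTENCY_THRESHOLD:
                result.setdefault(descriptor, set()).add(term)
        return result


builder = MapBuilder()
builder.add_feedback(["Outer space", "Combat", "Romance"], ["Space warfare"])
builder.add_feedback(["Outer space", "Combat"], ["Space warfare"])
builder.add_feedback(["Outer space", "Combat"], ["Space warfare"])
print(builder.validated_map())
```

With these three made-up feedback records, ‘Outer space’ and ‘Combat’ each co-occur with ‘Space warfare’ in every record and are validated, while ‘Romance’ appears in only one of three and stays out of the Map.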

Self-Evolving Prediction Algorithm

This technique for auto-suggesting LCSH is a prediction algorithm, with the Map as its core component. Since the Map auto-updates following API feedback calls, the prediction algorithm is self-evolving. With periodic manual review of auto-validated Map pairs, the algorithm’s evolution is semi-supervised.

PredictiveBIB, an experimental desktop cataloging app that generates MARC & BIBFRAME records, connects to a cloud API service for LCSH prediction as described above. The app was used to catalog 373 public library books. The resulting Map has 574 unique descriptors (drawn from a custom vocabulary), 1030 unique subject headings, and 4087 descriptor-LCSH pairs, indicating that each descriptor is associated with about 7 LCSH terms on average. Of the LCSH terms, 14% were not assigned by catalogers, indicating that these are variant LCSH terms that the prediction algorithm injected and associated with at least one descriptor.
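The averages quoted above can be checked with a few lines of arithmetic. The count of injected variant terms is an estimate derived here from the rounded 14% figure, not a number reported by the project.

```python
# Figures reported for the PredictiveBIB Map.
unique_descriptors = 574
unique_subject_headings = 1030
descriptor_lcsh_pairs = 4087

# Average number of LCSH terms associated with each descriptor.
avg_lcsh_per_descriptor = descriptor_lcsh_pairs / unique_descriptors
print(round(avg_lcsh_per_descriptor, 1))  # 7.1

# Approximate number of variant terms injected by the algorithm,
# estimated from the rounded 14% share (an approximation).
approx_injected_variants = round(0.14 * unique_subject_headings)
print(approx_injected_variants)  # 144
```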

LCSH Prediction Examples

The auto-suggested LCSH terms (drawn from the Venn diagram Map fragment) for each descriptor combination are:

Alternative Descriptor Uses

A static descriptor-LCGFT lookup table can be used to predict Library of Congress Genre/Form Terms. For example, the descriptor ‘Paranormal’ is related to the following three LCGFT:

Auto-suggestion of LCGFT based on a cataloger’s selection of descriptors will reduce the likelihood that inexperienced catalogers will omit applicable genre/form terms.

Descriptors, if added to a bibliographic record, lend themselves to a simple human-readable public-domain classification scheme and can also be leveraged by an ILS-based readers’ advisory service. Using a MARC example, descriptors that are hierarchically organized (e.g., ǂa for book type, ǂb for book categories, ǂc for book sub-categories, ǂd for mood terms, ǂe for tempo terms) might look like:

907 __ ǂa Novel ǂb Suspense ǂc War ǂc Spy ǂd Grim ǂe Fast-paced
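As a small illustration of how cataloging software might assemble such a field, the sketch below renders ordered subfields in the display notation used above. The function name `format_907` is hypothetical, and the output is a human-readable string, not a binary MARC record.

```python
SUBFIELD_DELIMITER = "ǂ"  # display form of the MARC subfield delimiter


def format_907(subfields):
    """Render (code, value) pairs as a 907 field with blank indicators.

    `subfields` is an ordered list so that repeated codes (e.g. two
    ǂc sub-categories) keep their original sequence.
    """
    parts = [f"{SUBFIELD_DELIMITER}{code} {value}" for code, value in subfields]
    return "907 __ " + " ".join(parts)


descriptors = [
    ("a", "Novel"),       # book type
    ("b", "Suspense"),    # book category
    ("c", "War"),         # book sub-category
    ("c", "Spy"),         # book sub-category (repeated subfield)
    ("d", "Grim"),        # mood term
    ("e", "Fast-paced"),  # tempo term
]
print(format_907(descriptors))
# 907 __ ǂa Novel ǂb Suspense ǂc War ǂc Spy ǂd Grim ǂe Fast-paced
```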

Conclusion

Automated LCSH prediction empowers inexperienced catalogers to take on one of the most challenging areas of cataloging and accelerates cataloging. However, the quality of predicted LCSH is dependent upon accurate book assessment and descriptor selection by contributing catalogers, as well as automated & manual Map review processes. Subject heading prediction is independent of bibliographic record format, material-format, language, organization type, subject heading vocabulary, or collection subject area.

Resources

Florida Institute for Human & Machine Cognition, Concept Maps.

Lang, Ruth Emmie. Beasts of Extraordinary Circumstance. New York: St. Martin’s Press, 2017.

Pepys, Samuel. The Diary of Samuel Pepys. New York: Alfred A. Knopf, 2018.

PredictiveBIB website, https://predictivebib.org. Accessed 26 June 2021.
