The Call of the Wild: New Roles for Librarians and Publishers
By Stephen Rhind-Tutt (Coherent Digital, LLC, Alexandria, VA) <rhindtutt@coherentdigital.net>
Overview
The first website was created on August 6th, 1991.1 Since then, there’s been unbroken growth in the amount and types of web content, expanding the initial plain text to images, audio, video, social media posts, and ever more exotic types. Much of this new content is “wild” from a librarian or publisher standpoint: it lacks DOIs, ISBNs, ORCIDs, MARCs, PIDs, and other tools that are standard for journals and books. In many cases, websites and items on them are undated, have no identifiable author, and lack titles.
In this issue, we bring this wild content into focus. We define it, quantify its importance, identify challenges associated with it, and make suggestions as to how librarians and publishers might respond to this “call of the wild.” The need is urgent; critical information is hard or impossible to find and, in many cases, disappearing.
How Much Wild Content is There and How Much is it Used?
The amount of wild content in the world is hard to quantify but clearly astronomical. In 2016, Google identified over 130 trillion web pages,2 and the number has grown substantially since then. Today, the number of pages available in Google’s index averages 50 billion,3 suggesting that only a fraction of 1% of the pages Google has identified are indexed. Furthermore, this is only the surface web. The deep web includes sites that require registration and are therefore invisible to search engines. It is estimated to be at least twenty-five times larger than the surface web, implying the existence of quadrillions of web pages.
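To make the arithmetic behind these estimates explicit, here is a minimal Python sketch. It uses only the figures quoted above (130 trillion pages identified, roughly 50 billion indexed, and a deep web at least twenty-five times the size of the surface web); the function and constant names are illustrative, and the figures themselves are taken as given rather than independently verified.

```python
# Figures quoted in the text; treated here as assumptions, not verified data.
PAGES_IDENTIFIED_2016 = 130e12  # pages Google reported identifying in 2016
PAGES_INDEXED = 50e9            # approximate size of Google's index
DEEP_WEB_MULTIPLIER = 25        # deep web assumed at least 25x the surface web


def indexed_share(indexed: float, identified: float) -> float:
    """Return indexed pages as a percentage of identified pages."""
    return 100.0 * indexed / identified


def deep_web_pages(surface_pages: float, multiplier: float) -> float:
    """Return a lower-bound estimate of deep web pages."""
    return surface_pages * multiplier


share = indexed_share(PAGES_INDEXED, PAGES_IDENTIFIED_2016)
deep = deep_web_pages(PAGES_IDENTIFIED_2016, DEEP_WEB_MULTIPLIER)

print(f"Indexed share of identified pages: {share:.3f}%")
print(f"Deep web lower bound: {deep / 1e15:.2f} quadrillion pages")
```

On these inputs the indexed share comes to well under a tenth of one percent, and the deep web lower bound lands in the low quadrillions, which is consistent with the claims in the paragraph above.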
Wild content on the surface web is heavily used — 62% of the world’s population uses it, averaging more than six hours a day.4 Ninety-five percent of American teenagers use YouTube.5 More than 50% of Internet users aged 16 to 24 use it weekly for learning.6
How does this compare to “tame content” in the library and publishing world? WorldCat has 540 million bibliographic items.7 The Library of Congress has 173 million items, including 4.2 million recordings and 1.9 million moving-image items.8 More than 3.2 million of these items are available online. One of the largest indexes of journal articles and books, the Bielefeld Academic Search Engine (BASE), has 315 million items.9 ResearchGate claims 135 million publication papers.10 Even the highest of these estimates represents less than 1% of what’s out there. The rest is wild and untamed: uncataloged, undiscoverable, uncitable, prone to link rot, and likely to disappear.
The same is true of usage. Similarweb reported that ProQuest had 17.1 million monthly users in September 2022; JSTOR had 26.8 million; ScienceDirect, 100 million; and ResearchGate, 128 million.11 These numbers are dwarfed by visits to wild content: YouTube alone claims 13 billion monthly visits.12
The number of users and the amount of content will continue to grow. A third of humans have yet to become Internet users, and the population of the world is growing by almost a percent per year.13 New formats and types of media will be invented, and existing media will grow faster as technology makes them easier and faster to create. The cost of storage will decline, scanners will become cheaper and easier to use, and outputs of all kinds will grow.
Skeptics say that most of this content has little or no relevance to the academy, learning, or scholarship. My colleagues and I disagree and will make our case by focusing on four types of wild content. We’ll show how wild content is frequently used for research and learning, and we’ll explain why the traditional skills of publishers and librarians must urgently be applied to the tasks of improving the organization, discovery, curation, and preservation of this material.
We focus on grey literature, video, open educational resources, and primary sources.
Grey Literature
In this issue, Toby Green documents the world of grey literature. Historically, this material has been interesting but marginal. As Toby shows, that’s changed. He identifies over 9,000 reputable organizations publishing “some $33 billion of research a year,14 which is larger than the $25 billion STM publishing sector.”15 The reports, briefings, guides, conference papers, news reports, and other non-book and non-journal material are all of direct and vital importance, as indicated in this graphic from 2015.
As much as 88% of this content has been through a form of peer review17 and is recognized as being “high informational content.”18 Often the material is more current and more inclusive than what’s published in journals and books.19 Toby discusses efforts to tame this material by curating it, enriching it with metadata, and preserving it.
Open Educational Resources
Our second author is Andrea Eastman-Mullins, who documents another type of wild content: Open Educational Resources (OER). She points out that “even before COVID, 78% of all U.S. academic libraries supported course materials,20 and nearly half listed OER as a top priority.”21
Just as traditional journal formats are no longer sufficient to accommodate the needs of researchers, traditional textbooks serve students and teaching faculty less effectively than these new resources. Studies have shown this to be true across a wide range of courses, including medicine,22 anatomy,23 and physics.24 YouTube videos are praised as being “of high quality, reliability, and rich content.”25
Studies also point out the challenge of finding quality.26 In a study of patient advice videos, only 10% were found to be accurate.27 Pew found that 63% of YouTube users found material that seemed to be “obviously false or untrue.”28 There’s high quality material in the wild, but it demands the traditional librarianship skills of evaluation, selection, curation, and preservation if it’s to be useful.
The savings from taming this content and making it useful are substantial. SPARC estimates that adopting OER has already saved students an average of $160 per course, or over $1 billion worldwide so far.29
Video and Related Media
Jessica Lawrence-Hurt examines the increasing use of video in the academy in more depth and discusses continued barriers to its adoption. An idea of the potential impact of video can be found in a study by Mohamed Ahmed Mady and Said Baadel, which showed that YouTube and social networks in general significantly improved learning outcomes across a wide range of disciplines, with more than 95% of education majors using YouTube at least once a week, and 82% reporting a positive experience.30
Jessica points out how certain disciplines require video in order to explain concepts or demonstrate practices. At the same time, existing library and publishing systems often deprecate such work — for example, when a peer review submission system cannot accommodate video or alternative media. She presents yet another case for why librarianship and publishing are needed to make materials in this space useful.
Primary Sources
The fourth type of wild content is discussed by Genevieve Croteau, who documents the Saving Ukrainian Cultural Heritage Online (SUCHO) project and how librarians mobilized in a matter of days to rescue an extraordinary number of websites. This story serves to remind us that enormous challenges can be overcome by innovative thinking and collaboration.
Wikipedia defines a primary source as “an artifact, document, diary, manuscript, autobiography, recording, or any other source of information that was created at the time under study. It serves as an original source of information about the topic. In journalism, a primary source can be a person with direct knowledge of a situation, or a document written by such a person.”31 Primary sources don’t need peer review; they need only be verified for authenticity.
This category includes speeches, news reports, political pamphlets, social media posts, and born-digital content. It’s been growing exponentially and promises to grow more. There are multiple instances of large-scale loss of archives in this space. In 2009, for example, Yahoo shut down Geocities, resulting in the loss of some seven million personal websites, and in 2019 they closed and removed materials from the Yahoo Groups platform.32 In 2022, the EPA announced the closure of its web archive.33
Personal websites are particularly at risk. Consider Threedecks.org,34 an extraordinary collection of materials in maritime history; the Manuals Directory,35 which has over 700,000 instructional manuals through time; or the William Blake Archive.36 These are just a few of countless high-value, at-risk websites.
An analysis of a LibGuide from a major academic institution, with links to hundreds of carefully selected primary sources relevant to research and learning, revealed that fewer than 50% of the URLs still worked a year later.37 The remaining sites had disappeared, and many had not been captured by the Internet Archive or other initiatives. Gary Price and Curtis Michelson examine the challenge of link rot in depth.
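A link audit of this kind can be approximated with a few lines of code. The following Python sketch is not the method used in the analysis cited above, which is unpublished; it simply illustrates one way to measure what share of a list of URLs still responds. The function names, the User-Agent string, and the sample URLs are all hypothetical, and the demonstration at the end uses a stub checker so that it runs without network access.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Iterable


def url_is_live(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers a HEAD request with a status below 400."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-audit-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        # HTTP errors, DNS failures, timeouts, and malformed URLs all count as dead.
        return False


def link_rot_report(
    urls: Iterable[str], checker: Callable[[str], bool] = url_is_live
) -> dict:
    """Check each distinct URL once and summarize how many are still live."""
    results = {url: checker(url) for url in urls}  # the dict removes duplicates
    total = len(results)
    live = sum(results.values())
    return {
        "total": total,
        "live": live,
        "dead": total - live,
        "live_share": live / total if total else 0.0,
        "dead_urls": [url for url, ok in results.items() if not ok],
    }


# Offline demonstration: hypothetical URLs and a stub in place of real requests.
sample_status = {
    "https://example.org/guide-a": True,
    "https://example.org/guide-b": False,
    "https://example.org/guide-c": True,
    "https://example.org/guide-d": False,
}
report = link_rot_report(sample_status, checker=lambda url: sample_status[url])
print(f"{report['live']} of {report['total']} links live ({report['live_share']:.0%})")
```

In practice, some servers reject HEAD requests or block automated clients, so a real audit would likely need a fallback to GET, retries, and a manual review of the links reported as dead. A page that still responds may also have changed its content, which a status check alone cannot detect.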
This problem is particularly acute in the Global South, where sites have lower funding. And the problem is endemic; one study found that 69% of librarians report broken links in their collections and spend 1.3 hours a week fixing them.38 Sites as important as the African National Congress archive have largely disappeared: the outline of the site39 has been preserved by the Internet Archive, but the in-depth content is missing, much like an ancient ruin with only the pattern of the walls still visible.
Responding to the Challenges
Finally, Gary Price and Curtis Michelson provide more detail on how and where to find quality items in the wild. They provide an effective summary of how librarians can help make grey literature, OERs, video, and primary sources useful for teaching and research. Best of all, they provide tools that can be used immediately by librarians to improve existing efforts in this space.
Summary
The average academic library spends 59% of its materials budget directly on books and journals, along with another 35% on databases, most of which are aggregations of and indices to books and journals. Only 6% of the budget goes to “other resources.”40 This spend is hugely disproportionate to the size, usage, and importance of non-book/non-journal content.
The need for librarians and publishers to help tame wild content is real. More than half of the citations in this article are from “wild” sources and have no DOIs. If you’re reading this issue, I suspect you regularly consult reports and recommendations from ARL, CNI, Ithaka, CLIR, and others. You use Ebsco’s price survey, and you subscribe to the ATG podcast or ARL’s Day in Review. We rely on these publications, but will we be able to find them tomorrow? Many appear on the web with no DOIs, no release dates, no author names… Will future researchers even know these items once existed? We need to make these items and other examples of important wild content part of the scholarly record — and the call is urgent.
Richard Ovenden, librarian at one of the oldest libraries in the world, reminds us of our collective responsibility to future generations: “Digital preservation is becoming one of the biggest problems we face. If we do not act now our successors in future generations will rue our inaction.”41
The Gutenberg Bible was the first substantial book printed from movable type in the West. Some 50 copies of this landmark work survive.42 The web is similarly a vital step in human knowledge, but no one thought to capture a copy of the very first webpage. It wasn’t copied until the following year,43 so we can only assume what it looked like. Just as our ancestors developed systems to organize, preserve, and disseminate print materials, we must do the same for wild content.
Endnotes
1. Business Insider. This is what the first website ever looked like, Alyson Shontell, Jun 29, 2011. https://www.businessinsider.com/flashback-this-is-what-the-first-website-ever-looked-like-2011-6
2. The 2008 estimate was retrieved from the Google Blog. Posted by Jesse Alpert & Nissan Hajaj, Software Engineers, Web Search Infrastructure Team, July 25th, 2008, at https://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html. The 2016 estimate is from https://www.kevin-indig.com/ Growth Memo Newsletter, posted July 29th, 2020 at https://www.kevin-indig.com/googles-index-is-smaller-than-we-think-and-might-not-grow-at-all/
3. https://www.worldwidewebsize.com/. This work was carried out as a Master’s thesis project at the Faculty of Arts of Tilburg University, within the ILK Research Group. It shows the actual size of the Google index fluctuating between 20 Bn and 50 Bn webpages.
4. The Global State of Digital 2022, Hootsuite, https://www.hootsuite.com/resources/digital-trends, page 9.
5. Teens, Social Media and Technology, 2022. Report by Pew Research Center, August 10, 2022. https://www.pewresearch.org/internet/2022/08/10/teens-social-media-and-technology-2022/
6. The Global State of Digital 2022, Hootsuite, https://www.hootsuite.com/resources/digital-trends, page 56.
7. Retrieved from Wikipedia, November 23rd, 2022. https://en.wikipedia.org/wiki/WorldCat
8. https://www.loc.gov/about/general-information/, Retrieved November 23rd, 2021
9. Searches conducted on November 23rd, 2021 at https://www.base-search.net/
10. Retrieved from Researchgate on November 23rd, 2022, at https://www.researchgate.net/
11. Statistics from Similarweb.com as of 11/13/22.
12. Most popular websites worldwide as of November 2021, by total visits (in billions), Statista. Retrieved Nov 13th, 2022 from https://www.statista.com/statistics/1201880/most-visited-websites-worldwide/
13. The Global State of Digital 2022, Hootsuite, https://www.hootsuite.com/resources/digital-trends, page 20.
14. Lawrence, A. Influence seekers: The production of grey literature for policy and practice. Information Services & Use, vol. 37, no. 4, 2017, pp. 389-403. https://doi.org/10.3233/ISU170857
15. Johnson, R., Watkinson, A. & Mabe, M. The STM Report, 5th edition: An overview of scientific and scholarly publishing, STM. Netherlands. 2018. Retrieved from https://policycommons.net/artifacts/1575771/2018_10_04_stm_report_2018/2265545/ on 30 Sep 2022.
16. Lawrence, A. Influence seekers: The production of grey literature for policy and practice. Information Services & Use, vol. 37, no. 4, 2017, pp. 389-403. https://doi.org/10.3233/ISU170857
17. Amanda Lawrence, Julian Thomas, John Houghton & Paul Weldon (2015) Collecting the Evidence: Improving Access to Grey Literature and Data for Public Policy and Practice, Australian Academic & Research Libraries, 46:4, 229-249, DOI: https://doi.org/10.1080/00048623.2015.1081712
18. Managing grey literature: technical services perspectives. Edited by Michelle Leonard and Susan E. Thomas in collaboration with Core Publishing, Chicago, ALA Editions, 2022, 136pp., ISBN 978-0-8389-4881-1 (soft cover), 978-0-8389-3821-8 (PDF).
19. As of 11/23/22 Policy Commons had materials from over 162 countries including over 850 African organizations. https://policycommons.net/search/?i=organizations&region=Africa
20. Libraries Play a Key Role in Campus OER Adoption, Library Journal Website, Ex Libris, Feb 3rd, 2020. https://www.libraryjournal.com/story/libraries-play-a-key-role-in-campus-oer-adoption
21. Ithaka S+R US Library Survey 2019, April 2, 2020, Jennifer K. Frederick, Christine Wolff-Eisenberg. DOI: https://doi.org/10.18665/sr.312977 https://sr.ithaka.org/publications/ithaka-sr-us-library-survey-2019
22. 2020-05-22. “Learning ENT” by YouTube videos: perceptions of third professional MBBS students. Ajeet Kumar Khilnani, Rekha Thaddanee, Gurudas Khilnani. https://www.ijorl.com/index.php/ijorl/article/view/2213 DOI: https://dx.doi.org/10.18203/issn.2454-5929.ijohns20202211
23. Ayman G. Mustafa, Nour R. Taha, Othman A. Alshboul, Mohammad Alsalem, Mohammed I. Malki, “Using YouTube to Learn Anatomy: Perspectives of Jordanian Medical Students”, BioMed Research International, vol. 2020, Article ID 6861416, 8 pages, 2020. https://doi.org/10.1155/2020/6861416
24. Integrating Physics in Disaster Risk Reduction (DRR) through YouTube Videos, Jhoanne Catindig, Maricar S. Prudente, Mark Joseph Orillo, IC4E 2020: Proceedings of the 2020 11th International Conference on E-Education, E-Business, E-Management, and E-Learning, January 2020 Pages 244–48 https://doi.org/10.1145/3377571.3377629
25. ONCOLOGY| VOLUME 145, P181-189, NOVEMBER 01, 2020, Can YouTube English Videos Be Recommended as an Accurate Source for Learning About Testicular Self-examination? Ismail Selvi, Numan Baydilli, Emre Can Akinsal. Published: August 10, 2020 DOI: https://doi.org/10.1016/j.urology.2020.06.082
26. Archives of Physical Medicine and Rehabilitation, VOLUME 101, ISSUE 12, P2087-2092, DECEMBER 01, 2020. Suitability of YouTube Videos for Learning Knee Stability Tests: A Cross-sectional Review, Myungeun Yoo, MD, Juntaek Hong, MD, Chan Woong Jang, MD. Published: June 25, 2020. DOI: https://doi.org/10.1016/j.apmr.2020.05.024
27. Youtube as a Source of Patients’ and Specialists’ Information on Hemorrhoids and Hemorrhoid Surgery Authors: Sturiale, Alessandro; Dowais, Raad; Porzio, Felipe C.; Brusciano, Luigi; Gallo, Gaetano; Morganti, Riccardo; Naldini, Gabriele. Source: Reviews on Recent Clinical Trials, Volume 15, Number 3, 2020, pp. 219-226(8). Publisher: Bentham Science Publishers. DOI: https://doi.org/10.2174/1574887115666200525001619
28. Many Turn to YouTube for Children’s Content, News, How-To Lessons, Aaron Smith, Skye Toor, and Patrick Van Kessel, Pew Research Center, November 7th, 2018. https://tinyurl.com/4ctzc4dm
29. Hilton, J. Open educational resources, student efficacy, and user perceptions: a synthesis of research published between 2015 and 2018. Education Tech Research Dev 68, 853–876 (2020). https://doi.org/10.1007/s11423-019-09700-4
30. Mady, Mohamed Ahmed, and Said Baadel. “Technology-Enabled Learning (TEL): YouTube as a ubiquitous learning aid.” Journal of Information & Knowledge Management 19.01 (2020): 2040007.
31. https://en.wikipedia.org/wiki/Primary_source
32. Yahoo Groups Is Winding Down and All Content Will Be Permanently Removed, Jordan Pearson, October 16th, 2019. Motherboard: Tech by Vice. https://www.vice.com/en/article/8xwe9p/yahoo-groups-is-winding-down-and-all-content-will-be-permanently-removed
33. EPA Archive Retiring July 2023, retrieved from EPA Web Archives on November 23rd, 2022 at https://archive.epa.gov/#:~:text=EPA%20Archive%20Retiring%20July%202023&text=Ongoing%20efforts%20to%20share%20a,is%20performed%20every%20four%20years.
34. https://threedecks.org/
35. https://www.manualsdir.com/
36. https://blakearchive.org/
37. Unpublished research conducted by Coherent Digital staff, November 2022.
38. Lawrence, A. (2017) Influence seekers: The production of grey literature for policy and practice, Information Services & Use, 37 (4) 389-403
39. https://web.archive.org/web/20040318070055/http://www.ufh.ac.za/collections/
40. Library & Information Spend Predictions for 2021, Ingenta, Results of Telephone Survey Research Study undertaken by Ipsos MORI and PCG. Retrieved from https://www.ingenta.com/wp-content/uploads/Library-Info-Spend-Study-2021-FinalPublic-v3-002.pdf
41. Ovenden, Richard. “Burning the books,” John Murray, 2021. 308pp. P 229.
42. The Morgan Library & Museum website, retrieved on November 23rd, 2022. https://www.themorgan.org/collection/Gutenberg-Bible
43. Business Insider. This is what the first website ever looked like, Alyson Shontell, Jun 29, 2011. https://www.businessinsider.com/flashback-this-is-what-the-first-website-ever-looked-like-2011-6