Grey Literature is Booming. It’s Time to Turn it into an Asset.
By Toby Green (Co-Founder, Coherent Digital, Paris, France) ORCID: 0000-0002-9601-9130 <Toby.green@coherentdigital.net>
Ahead of COP27, five major climate reports have been released, one with new data showing that temperatures in Europe are rising faster than anywhere else. They generated headlines around the world. Citing one of them, UN Secretary-General António Guterres tweeted “we are headed for economy-destroying levels of global heating.” Each was vetted by experts before release, yet none was published in a peer-reviewed journal. Each is free to download, yet none is available through publication supply channels or specialist discovery services. Each needs to be preserved in the scholarly record, but how?
These are not edge cases; they are but the tip of a growing mountain of vital research that’s being posted to websites around the world as grey literature. Whilst the scholarly communications community, rooted in academia, has been wrestling with the challenges of pivoting books and journals to open access, a growing community of researchers, based in non-academic organizations, has quite quietly been posting its results and discoveries on open websites, as with the five reports mentioned above.
Grey literature is “wild”: it lacks the metadata of “tame” formally published works, making it hard to find and catalog. It’s open, but it’s at risk of link rot and lacks preservation protocols. As this article will show, grey literature is growing in volume and significance. Taming it for the scholarly record will take effort, time, and resources. Let’s step back to understand why.
To qualify for inclusion in the scholarly record, Dougherty proposed that an item must advance or summarise knowledge, have an identifiable author, be issued through an academic publisher, be cataloged by a university library, appear in curated research databases, and belong to a recognised discipline.1 OCLC defines the scholarly record as “published outcomes of scholarly enquiry” such as “journal articles and monographs,”2 even though others recognise that formats have, of late, become much more diverse, encompassing “protocols, code and data.”3 Let’s put these definitions to the test.
In 2019, just a year after Dougherty made his proposition, two London-based professors published a podcast which summarised two papers authored by economists from the Bank of England and four UK “Russell Group” universities. The papers had made headlines in the UK media, including the Financial Times, and were cited in a blog — which has a following larger than most journals — run by a professor from University of London’s Royal Holloway. Looking for any of these items in subject databases and library catalogs will be in vain because, as with the five climate reports, no journal or monograph publishers were involved in publishing this impactful research and commentary.4
According to Dougherty and OCLC, this content doesn’t qualify for the scholarly record — which suggests to me that their definitions need updating to take into account how digital tools are changing the ways knowledge is published, as the examples above illustrate.
Dougherty’s definition made sense before the digital era because the cost of self-publishing and dissemination in print was beyond the means of most. Organizations and authors had little choice but to find a publisher for their works. Equally, the cost of organising and maintaining archives meant that only institutions could offer readers meaningful and useful libraries of published materials. So, it’s no surprise that publishers and libraries were central to the creation and maintenance of the analog scholarly record.
In fact, the scholarly record took a village to create. Besides authors creating, publishers selecting and librarians collecting, booksellers and agents developed an efficient, near-global, supply chain that carried publications to libraries around the world. To reduce administration costs and speed delivery, publishers, booksellers, agents, and librarians co-developed processes (e.g., ICEDIS) and metadata standards with unique identifiers (ISBNs in 1969, ISSNs in 1975). In parallel, secondary services and catalog systems emerged to ease the challenge of discoverability. A whole industry was born.
Since 2000, digitization has driven an end-to-end transformation of the industry. It enabled new features like persistent identifiers (PIDs) for content (DOIs), authors (ORCID iDs), and their institutions (Ringgold). Standardized content capture opened the door to data and text mining. Dark archives have been established to guarantee preservation. Crucial to the efficient working of what is now a more technically complex scholarly record is detailed, standardized, machine-readable metadata. This metadata not only drives discovery and simplifies cataloguing, it also enables impact evaluation. A complex system was born.
Not all scholarly publications were captured in the analog scholarly record. Some institutions chose to self-publish because doing so had advantages, such as control over branding, timing, and pricing. Of these, some, like OECD and Brookings, established in-house “presses” that used the same metadata standards and supply chains as mainstream publishers to channel their publications to libraries. However, others, especially smaller organizations, didn’t. Because they eschewed publishing norms and supply chains, their content was hard to source and missing from secondary discovery services, which was frustrating for librarians and readers alike. It was this informally published content that gave rise to the term “grey literature.”
Grey Literature
In 1984, Wood coined the term to describe material “which is not available through normal bookselling channels ... leading to problems for the producers of secondary services, for librarians who wish to collect it and for end users.” Whilst noting that grey literature had “variable standards of editing and production, poor publicity, poor bibliographic control, and poor availability in libraries,” Wood rejected as “mistaken” the belief that grey literature was “essentially ephemeral and of local interest only” because “it contains information likely to be of use to a considerable number of people.”5
It is often thought that grey literature hasn’t been peer-reviewed. This is a big misunderstanding because, as the climate examples illustrate, more than 60% is reviewed by experts prior to release.6 So, no wonder Wood reckoned grey literature “a costly public asset going largely to waste.” How costly? A recent estimate put it at $33BN a year,6 which is larger than the $25BN STM publishing sector.7
In 2010, the Prague definition8 attempted to build on Wood — but the additions, for me, add nothing to the essential and defining characteristic of grey literature: that it is hard to find, capture and use.
The essential problem identified by Wood in 1984 is unchanged today: poor bibliographic control. However, Wood would probably be shocked by the scale of today’s “public asset going to waste” because there has been a significant increase in the supply of grey literature. Let’s look at these two issues, bibliographic control and supply, starting with supply.
Supply
More researchers lead to more content,9 and since the 1980s, the number of researchers in OECD countries has tripled as higher education expanded.10
Yet the number of jobs in academia barely changed. Taking UK data as representative of the OECD group of countries: in the 1980s, around 15% of freshly-minted PhDs could expect to work in academia. By the 2000s, this had fallen to around 3%.11 So, if not into academia, where did this growing number of highly trained, research-capable people go? And did they undertake research when they got there?
Some went into industry and government, but the third sector (comprising NGOs and think tanks) was booming. Since the end of WWII, there has been strong growth in the number of new third sector organizations (see chart) and these organizations must have been hiring researchers because they have been busy releasing research content. Policy Commons, which indexes grey literature from 9,000 IGOs, NGOs, think tanks, and research centres drawn from around the world, shows 55% more grey literature was released in 2020 compared with 2010 (287,545 items and 184,514 respectively).
Unlike their cousins in academia, researchers in government, industry, and third sectors don’t have to publish in books and journals to further their careers. They are free to work with their employers to self-publish their research as reports, papers, and other digital-first formats — their careers will not perish otherwise. In the field of policy alone, I estimate that each year sees around 400,000 newly published items of grey literature — equivalent to ~10% of the world’s entire journal output. If it remains outside the scholarly record, this is a lot of knowledge going to waste.
The Incredible Complexity of Bibliographic Control
Today, desktop publishing, web 2.0 tools and websites make it easy for anyone to self-publish. As Clay Shirky, an early internet “guru” and professor at the Interactive Telecommunications Program at NYU, said in a 2012 interview: “Publishing is not evolving. Publishing is going away. Because the word ‘publishing’ means a cadre of professionals who are taking on the incredible difficulty and complexity and expense of making something public. That’s not a job anymore. That’s a button. There’s a button that says ‘publish,’ and when you press it, it’s done.”12
Shirky was half right. It is indeed easy to press a button and publish something online. The problem is that most people who press that button are not from that cadre of professionals who understand the incredible complexity of preparing content so that it’s discoverable and useful for its readers. They don’t know how to ensure it is included in supply chains that lead to specialist discovery services and a place in the scholarly record. Nor do they understand, any more than Shirky’s interviewer did, that it isn’t “done” until the work has been safely preserved for the long run. It’s beyond ironic that links to Shirky’s interview, published in the blog, Findings, returned a “404 — page not found” within months of its publication when the blog closed and went offline.
Like Findings’ publisher, most organizations that produce grey literature have no access to the cadre of professionals who understand the incredible complexity of the bibliographic control that’s required if the content is to be easily found, cited, and protected from link rot.6 So, it’s hardly a surprise that 75% of links to grey literature cited in scholarly journals are broken or lead to the wrong content.13
What makes this problem worse is that the ease of using Shirky’s “button” has tempted some organizations to switch from working with professional publishers to self-publishing on their websites. One of the five climate reports mentioned above is an example of this. Published annually since 1998, and now posted for free on the agency’s website, the IEA’s World Energy Outlook abandoned ISBNs and ISSNs in 2021. The IPCC used to publish with CUP but transitioned to “grey” after 2014. What was formally published is now grey literature; what previously took its place in the scholarly record is now missing.
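To make the idea of link rot concrete, here is a minimal Python sketch of how a list of cited links might be audited. It is an illustration only, not the method used in the study cited above: the function names (http_status, classify, audit_links) and the status categories are assumptions made for this example. Note that a status check can only spot dead links; a link that resolves but now serves the wrong content would still need a human, or a comparison against an archived copy.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the final HTTP status for url, or None if no answer was received."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return None


def classify(status: Optional[int]) -> str:
    """Map a status code onto the coarse categories used in this sketch."""
    if status is None:
        return "unreachable"
    if 200 <= status < 300:
        return "ok"
    if status in (404, 410):
        return "broken"
    # Anything else (e.g. 403, 405, 5xx) needs a closer look.
    return "uncertain"


def audit_links(
    urls: Iterable[str],
    probe: Callable[[str], Optional[int]] = http_status,
) -> dict:
    """Classify every URL; probe can be swapped out for offline testing."""
    return {url: classify(probe(url)) for url in urls}
```

Calling audit_links with a list of cited URLs returns a dictionary mapping each URL to one of the four labels. The probe parameter exists so the classification logic can be exercised without touching the network.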
Now, you might imagine that grey literature can be quickly found via public search engines that scan open websites. The trouble with public search engines is that they rank content with poor metadata on low-traffic websites poorly, so grey literature is usually crowded out by content from “optimized” websites run by digital marketers.14 Besides, public search engines tailor results to each user’s “bubble” of preferences, attitudes, and even location. Results can change from day to day as algorithms evolve.15,16 This is why most scholars and students still turn to specialist search engines where, of course, grey literature is absent.17 It’s no wonder that researchers at Concordia undertaking a literature review recently used Twitter to appeal to “the crowd” for grey literature recommendations.18
Persuading thousands of grey literature-producing organizations to employ a cadre of professionals to take on the incredibly difficult and complex and expensive work of publishing their content to the standards needed for the scholarly record is unlikely to work. My conversations with major IGOs and NGOs tell me that they are, if anything, less likely to employ staff with publishing skills, preferring instead to employ communications and social media experts. As the examples of the IEA and IPCC illustrate, non-academic organizations are moving away from scholarly publishing norms not towards them. So, what’s to be done to stop this valuable content going to waste?
If they can’t be persuaded to tame their own content, we’ll have to tame it on their behalf, and this is what we’re doing with Policy Commons. Taking a leaf out of DOAJ’s book, we start by selecting organizations that produce trustworthy content. Once an organization has been accepted and entered into Policy Commons’ organizational directory, we begin the “taming” process by harvesting content from its website. The central challenge is to find elements on the website and in the content itself to construct a standardized metadata record for each item harvested. We look for author names and publication dates. We look for a summary to serve as an abstract and, if we can’t find one, we “calculate” one. We look for tags and keywords and use AI to sort the content into pre-defined topics. Thumbnail cover images are generated, and items that pass successfully through our filters are associated with their producing organization (rather as articles are associated with journals). The full text is poured into our search engine, which, now that it has ingested over 3.5 million items, is incredibly powerful; some users report their expectations are exceeded every time they look for content. Finally, we deposit a copy of every item in a dark archive, accessible should the link to the original version break.
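The record-building step described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Policy Commons’ actual code: the Record fields, the build_record and first_sentences helpers, and the choice to split author strings on semicolons are all assumptions made for the example. The meta tag names it looks for (citation_title, citation_author, citation_publication_date, og:title, og:description, article:published_time) are the widely used Highwire-style and Open Graph conventions. When no summary is found, the sketch “calculates” one in the crudest possible way, by taking the first two sentences of the full text.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional


class MetaTagCollector(HTMLParser):
    """Collects <meta> name/property values and the <title> text of a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}          # tag name -> list of content values
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = (attr.get("name") or attr.get("property") or "").lower()
            content = (attr.get("content") or "").strip()
            if key and content:
                self.meta.setdefault(key, []).append(content)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


@dataclass
class Record:
    title: str
    authors: List[str]
    date: Optional[str]
    abstract: str
    abstract_is_generated: bool
    organization: str


def first_sentences(text: str, limit: int = 2) -> str:
    """Crude stand-in for a 'calculated' summary: the first few sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:limit])


def build_record(html: str, full_text: str, file_name: str, organization: str) -> Record:
    """Build one metadata record, falling back step by step when a field is missing."""
    parser = MetaTagCollector()
    parser.feed(html)

    def first(*keys):
        for key in keys:
            if parser.meta.get(key):
                return parser.meta[key][0]
        return None

    # Last resort for the title is the file name, a weakness the article admits to.
    title = (first("citation_title", "og:title")
             or "".join(parser.title_parts).strip()
             or file_name)
    raw_authors = parser.meta.get("citation_author") or parser.meta.get("author") or []
    authors = [a.strip() for raw in raw_authors for a in raw.split(";") if a.strip()]
    date = first("citation_publication_date", "article:published_time")
    found = first("description", "og:description")
    abstract = found or first_sentences(full_text)
    return Record(title, authors, date, abstract, found is None, organization)
```

Flagging generated abstracts (abstract_is_generated) and recording where a title came from are cheap ways to make later quality checks easier, since the fallbacks are exactly where errors creep in.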
The metadata we’re generating is “good enough” but still far from perfect. Sometimes file names appear as the item’s title. Author names and dates can be hard to find and extract. Summaries can be jumbled nonsense. We continue to refine our “taming” tools and welcome feedback and suggestions on how we can improve. To date, we’ve tamed just over 3.5 million items from 8,500 IGOs, NGOs and think tanks and nearly 100,000 items from the websites of the 600 largest municipal authorities across northern America. As a first step towards including this content in the scholarly record, every item is now discoverable by Google Scholar.
Over the past two decades, publishers and librarians have been focussed on capturing research findings from the academy — mainly in books and journals — to create a digital scholarly record that’s overlaid with sophisticated discovery systems for use by the academy. At the same time, they are attempting to pivot a $25BN industry to open access so the scholarly record becomes an asset not just for the academy but also for society at large.19
In parallel, and largely ignored, a growing number of researchers at non-academic institutions and organizations have been using digital publishing tools to post their research findings — as reports and papers — openly, via their websites. This is also a “$25BN” information industry, but, as I’ve shown, this grey literature is missing both from specialist discovery systems and library collections. As Wood saw, capturing grey literature for the scholarly record is incredibly complex and difficult. With Policy Commons, we are attempting to take on the challenge so grey literature will no longer be a wasted asset.
Endnotes
1. Dougherty, MV. Defining the Scholarly Record. In: Correcting the Scholarly Record for Research Integrity. Research Ethics Forum, vol 6. Springer, Cham; 2018. https://doi.org/10.1007/978-3-319-99435-2_2
2. Lavoie, B et al., The Evolving Scholarly Record. OCLC Research. Dublin, Ohio. 2014. https://doi.org/10.25333/C3763V
3. Tay, A. http://musingsaboutlibrarianship.blogspot.com/2022/06/diversity-of-scholarly-record-push-to.html
4. Links to all the items mentioned in this paragraph can be found in this List https://policycommons.net/lists/228/wait-whattheres-lots-of-vital-stuff-missing-from-the-scholarly-recorduksg-lightning-presentation/
5. Wood, DN. Management of Grey Literature. K.G. Saur. 1990 https://doi.org/10.1515/9783111514598.61
<https://www.charleston-hub.com/media/atg/>