
The Dangerous Complacency of “Web Archiving” Rhetoric

By Clifford Lynch (Executive Director, Coalition for Networked Information) <clifford@cni.org>

The World Wide Web turned 30 years old in 2021. During the past three decades, it has vastly evolved and changed in character. A huge number of information resources and services are accessible through the web, but not genuinely part of it; they share few of the characteristics typical of web sites in the 1990s. Indeed, “the web” has become a sloppy shorthand for a hugely diverse universe of digital content and services that happen to be accessible through a web browser (though more and more of these services are used through custom apps, particularly on mobile devices). We no longer genuinely understand the universe this shorthand signifies, much less what “archiving” it means or what purposes this can or cannot serve. This brief article will expand these themes in a little more depth; I hope to do a more extended examination of the issues elsewhere in the future.

My friend Carol Mandel has contributed a companion paper in this issue that makes a powerful and optimistic case for approaching this new digital universe from a collection development rather than a more mechanistic archiving perspective: first decide what institutions want to collect and why, and only then begin to work out how to manage the mechanics of acquisition, access, and preservation. She provides a great inventory of some of the potential treasures waiting there. I hope this article will complement Carol’s argument by providing some insights into the variety of the materials that might be considered by collectors, their technical characteristics, and how these might shape both the goals and the limits of the collecting efforts.

The early web consisted of relatively static sites containing interlinked HTML files (both within a site and from site to site), plus links to other files such as graphic images, PDFs, and the like. Two users coming to a site at the same time had the identical experience. Rapidly, sites expanded to include forms-based interfaces to the so-called “deep web” (basically databases), so that one might look up transit schedules, descriptive records for books, news articles, court records, or any number of other database records. Already, with that early development, we saw that trying to capture the look of sites providing access to databases was entirely different from capturing the databases behind the sites, setting up the first of many gaps between “web archiving” and the much broader issues of collecting and preserving content and services accessible through the web.

It’s critically important to understand that there are still a large number of important web sites that are conceptually consistent with the early web. We know what to do with these. The Internet Archive, various national libraries, and more tightly targeted web capture programs based at many universities (most commonly in collaboration with the research libraries there) have collected, captured, and preserved these sites, and continue to do so. This is important, essential work, and it is being done quite well. In recent years, we’ve seen the evolution of an infrastructure for distributed, collaborative web archives, and a growing body of studies analyzing the properties of this infrastructure and the archives that exist within it (see, for example, the work on the Memento system and the work of Michael Nelson and his colleagues). This work must be supported, and it must continue. It’s absolutely necessary. In fact, there’s good news here: a substantial amount of the material institutions will want to collect resides in this neighborhood of the web. In particular, much of the modern “grey literature” essential to so many scholarly disciplines resides in these web sites and files such as PDF documents linked to them. It would be very valuable to try to map, measure, or quantify this. But we must understand that this work is far from sufficient to address the collection development and preservation challenges offered by the universe so casually thought of as “the web” today. In fact, one of my concerns is that the success of these organizations in what we might think of as “traditional” web archiving has given rise to a good deal of complacency among much of the cultural memory sector. And much worse: the broader public has a sense that everything is being taken care of by these organizations; no crisis here! This is particularly troublesome because we will need the support of the broad public in changing norms and perhaps legal frameworks to permit more effective collecting from the full spectrum of participants in this new digital universe.

Before looking at the range of content and services that are now accessible through the web (but are not part of the web as we’ve historically understood it), I want to highlight a few additional technical shifts in how people interact with the web. While historically web pages were static, today they are assembled dynamically through complex JavaScript computations that interact with various remote sites and services, and the viewing of a page from one second to the next is typically non-repeatable, not because the base content has changed, but because all of the surrounding advertising, scaffolding, running heads, lists of other interesting articles, etc., have changed. When archiving a copy of a page today, there are delicate, complex technical decisions embedded in the archiving processes about when to store the results of these computations, and when to store instructions for computations that would be done as part of rendering the archived page during retrieval from the archive. Perhaps the most game-changing new technology, however, is personalization, which is now pervasive in web-based services. This can be explicit, based on who you are, the preferences you declare, and your past history with the service you are interacting with, or it can be extremely opaque and unpredictable, based on who the service thinks you are, what demographics it thinks you match, what it thinks your history of interacting with other sites has been, and even perhaps based on matches to data such as your credit history, voting records, or the like. Facebook is perhaps the poster child for this latter kind of personalization.
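The trade-off just described, between storing the results of a page’s computations and storing instructions for re-running those computations at retrieval time, can be caricatured in a minimal sketch. This is purely illustrative; the names are hypothetical, and no real crawler reduces the decision to a single flag:

```python
from dataclasses import dataclass, field

@dataclass
class RenderedSnapshot:
    # Strategy 1: freeze the *result* of the page's JavaScript
    # computations at capture time; playback shows exactly this HTML.
    url: str
    captured_html: str

@dataclass
class ReplayCapture:
    # Strategy 2: store the base resources plus the scripts, and
    # re-run the computations when the archived page is retrieved.
    url: str
    base_html: str
    scripts: list = field(default_factory=list)

def archive_page(url, base_html, rendered_html, scripts, freeze=True):
    """The archivist's decision point: store results, or store instructions."""
    if freeze:
        return RenderedSnapshot(url, rendered_html)
    return ReplayCapture(url, base_html, list(scripts))
```

Either choice loses something: the frozen snapshot cannot show how the page would have rendered for a different visitor, while the replay capture depends on remote services that may no longer answer, which is one reason these decisions are so delicate in practice.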

Think about this: if you are trying to collect or preserve Facebook, or even a news site, what are you trying to do? You could try to collect content that a specific individual or group of people posted or authored. You could try (though this is probably hopeless, given the intellectual property, terms of use, and privacy issues) to capture all the (public) content posted to the site. But if the point is to understand how the service actually appeared to the public at a given time, what you want to know is what material is shown to visitors most frequently. And, with personalization, to genuinely and deeply understand the impact of such a service on society, what you really need to capture is how the known or imputed attributes of visitors determined what material they were shown at a given point in time.1 These challenges are at the core of current debates about the impact of social media on society, the effects of disinformation and misinformation and the failure of social media platforms to manage such attacks, and related questions. It is clear, though, that both laws and terms of service are aligned to prevent a great deal of effective and vital work here, both in preservation and in researching the impact of these services; many social media platforms seem actively hostile to the accountability that preserving their behavior might bring. Part of the challenge will be to shift law and public policy to enable this work.

Let me conclude this survey with a look at some of the diversity of content that is broadly and sloppily considered part of “the web” but diverges in some crucial senses from the early web that I previously described.
All of this content, all of these services, need to be carefully examined by collection developers rising to Carol Mandel’s challenge to thoughtfully and selectively collect from the web, and also by digital archivists, scholars, preservationists, and documentalists seeking to capture context, presentation, disparate impacts, and the many other telling aspects of the current digital universe.

Today’s “web” includes the following content services (and this is only a selective and incomplete list). It’s interesting to note how many are now only secondarily accessible via web browser, with preference given to apps, including apps that live on “smart TVs,” “smart cars,” “smart phones,” and the like. It is also worth noting that there is a new generation of content services emerging that are increasingly less accessible via web browser, even on a secondary basis, and are even more sequestered gardens. These will present new challenges for collection and for preservation. And we don’t have authoritative data collection services drawing timely maps of this volatile landscape.

• Social media services: Facebook, Twitter, Instagram, Reddit, TikTok, etc.

• Shopping services: Amazon, but also a myriad of specialized sites for other materials, as well as Amazon competitors like Walmart.

• News sites: Note this includes not only “traditional media” like NBC, CNN, The Wall Street Journal, The New York Times, Bloomberg, etc., but also services that just select pointers to traditional media (“news arrangers”), e.g., Google News, Apple News, etc. The selecting and indexing services are part of tracking the impact of the news on society. And the division between streaming news and broader streaming services has blurred, particularly around major sports events or news broadcasts.

• “Content creator” driven sites, offering a wide range of subscription-based materials, such as OnlyFans, Substack, Patreon, and the like.

• Streaming services: Music (Apple, Pandora, Spotify, etc.) and movies and video (Netflix, HBO, and a horde of other competitors). Understanding how to think about documenting and preserving these services and their content is a poorly examined problem, particularly when juxtaposed against the constantly shifting libraries of materials that they can offer their customers for streaming.

Almost all the services mentioned above are immense corporate projects and products, many of which are not friendly to memory organizations; it will often be genuine and challenging work to acquire these materials. Creators, would-be public intellectuals, activists, and journalists/documentalists of all sorts are also flocking to digital environments (perhaps the traditional web, perhaps not); what these people and groups are actually doing in this environment is not well understood or well tracked, as far as I know. William Gibson’s observation that “the street finds its own uses for things” resonates here. These developments should be of great interest to collection developers and others looking to the now and to the stories critical to the future.

Endnotes

1. Clifford A. Lynch, “Stewardship in the ‘Age of Algorithms,’” First Monday 22, no. 12 (2017), http://firstmonday.org/ojs/index.php/fm/article/view/8097/6583.
