V33#6 December, 2021/January, 2022 Full Issue

The Dangerous Complacency of “Web Archiving” Rhetoric

By Clifford Lynch (Executive Director, Coalition for Networked Information) <clifford@cni.org>

The World Wide Web turned 30 years old in 2021. During the past three decades, it has evolved vastly and changed in character. A huge number of information resources and services are accessible through the web, but are not genuinely part of it; they share few of the characteristics typical of web sites in the 1990s. Indeed, “the web” has become a sloppy shorthand for a hugely diverse universe of digital content and services that happen to be accessible through a web browser (though more and more of these services are used through custom apps, particularly on mobile devices). We no longer genuinely understand the universe this shorthand signifies, much less what “archiving” it means or what purposes this can or cannot serve. This brief article will expand on these themes in a little more depth; I hope to do a more extended examination of the issues elsewhere in the future.

My friend, Carol Mandel, has contributed a companion paper in this issue that makes a powerful and optimistic case for approaching this new digital universe from a collection development perspective rather than a more mechanistic archiving one: first decide what institutions want to collect and why, and only then begin to work out how to manage the mechanics of acquisition, access, and preservation. She provides a great inventory of some of the potential treasures waiting there. I hope this article will complement Carol’s argument by providing some insights into the variety of the materials that might be considered by collectors, their technical characteristics, and how these might shape both the goals and the limits of the collecting efforts.

The early web consisted of relatively static sites containing interlinked HTML files (both within a site and from site to site), plus links to other files such as graphic images, PDFs and the like. Two users coming to a site at the same time had the identical experience.
Rapidly, sites expanded to include forms-based interfaces to the so-called “deep web” (basically databases), so that one might look up transit schedules, descriptive records for books, news articles, court records, or any number of other records held in databases. Already, with that early development, we saw that trying to capture the look of sites providing access to databases was entirely different from capturing the databases behind the sites, setting up the first of many gaps between “web archiving” and the much broader issues of collecting and preserving content and services accessible through the web.

It’s critically important to understand that there are still a large number of important web sites that are conceptually consistent with the early web. We know what to do with these. The Internet Archive, various national libraries, and more tightly targeted web capture programs based at many universities (most commonly in collaboration with the research libraries there) have been collecting, capturing, and preserving these sites, and they continue to do so. This is important, essential work and it is being done quite well. In recent years, we’ve seen an infrastructure for distributed, collaborative web archives develop, along with a growing body of studies analyzing the properties of this infrastructure and the archives that exist within it (see, for example, the work on the Memento system and the work of Michael Nelson and his colleagues). This work must be supported, and it must continue. It’s absolutely necessary.

In fact, there’s good news here: a substantial amount of the material institutions will want to collect resides in this neighborhood of the web. In particular, much of the modern “grey literature” essential to so many scholarly disciplines resides in these web sites and in the files, such as PDF documents, linked to them. It would be very valuable to try to map, measure or quantify this.

But we must understand that this work is far from sufficient to address the collection development and preservation challenges offered by the universe so casually thought of as “the web” today. In fact, one of my concerns is that the success of these organizations in what we might think of as “traditional” web archiving has given rise to a good deal of complacency among much of the cultural memory sector. And much worse: the broader public has a sense that everything is being taken care of by these organizations; no crisis here! This is particularly troublesome because we will need the support of the broad public in changing norms and perhaps legal frameworks to permit more effective collecting from the full spectrum of participants in this new digital universe.

22 Against the Grain / December 2021 - January 2022

Before looking at the range of content and services that are now accessible through the web (but are not part of the web as we’ve historically understood it), I want to highlight a few additional technical shifts in how people interact with the web. While historically web pages were static, today they are assembled dynamically through complex JavaScript computations that interact with various remote sites and services, and the viewing of a page from one second to the next is typically non-repeatable, not because the base content has changed, but because all of the surrounding advertising, scaffolding, running heads, lists of other interesting articles, etc., have changed. When archiving a copy of a page today, there are delicate, complex technical decisions embedded in the archiving processes about when to store the results of these computations, and when to store instructions for computations that would be done as part of rendering the archived page during retrieval from the archive.

Perhaps the most game-changing new technology, however, is personalization, which is now pervasive in web-based services. This can be explicit, based on who you are, the preferences you declare, and your past history with the service you are interacting with, or it can be extremely opaque

<https://www.charleston-hub.com/media/atg/>

