Op Ed — Opinions and Editorials
Op Ed — Web Archiving: The Dream and the Reality
by Joseph Puccio (Collection Development Officer, Library of Congress) <jpuc@loc.gov>
Here is the web archiving dream. It is a hundred years from now, and a researcher wants to know how all aspects of an important topic from today were understood, communicated and debated. Using a nearly comprehensive collection of archived websites and other content, that researcher is able to be more or less in the moment of a century past and have access to all relevant subject matter. News, published reports and books, videos, podcasts, photographs, journals, music and social media postings are available. Using that access to archived content, the researcher is able to fully understand the contemporary information and communications encompassing that issue.
Here is the reality. Although the Internet has now been in wide use for a quarter of a century, web archiving as a technology and as a means of access to past digital content is still a work in progress. Many areas must be improved if we hope to preserve the historical record for future generations.
What Should Be Preserved?
The landscape is enormous and in a constant state of change. Internet Live Stats reports that there are over 1.8 billion websites, but the majority are inactive. Fewer than 200 million sites are active, according to Netcraft. There is a continuing deluge of new websites being established and existing sites going dark. Individual sites can be very complex — some are fantastically vast. For instance, a Google search of site:espn.com returns millions of results.
What portions of this environment should be preserved in their current state for that researcher a century from now? What will be of value? Selecting web content for the future user is a difficult task. Who will do it? The Internet Archive, a non-profit organization, began archiving the Internet in 1996, stepping in to fill a void by capturing web content and making it subsequently available via its Wayback Machine. It has done far more than any other entity to capture aspects of the web, collecting very widely but not necessarily according to a strategic collection development plan. The Internet Archive has also provided an archiving capability to over 600 other institutions and organizations through its Archive-It program, although many of those partners use the service to archive their own web presence.
Technical Limitations
A number of limitations currently stand as challenges to web archiving. One might assume that when a website is crawled, everything within a domain (such as espn.com, noted above) is captured. For large sites, however, it is not possible to crawl the entire domain in a single visit. Pages are missed, and the crawler attempts to harvest them on its next scheduled visit, based on a pre-determined frequency that ranges from daily to annually.
Crawlers also can have difficulty capturing video content, the files of which are far larger than textual material and are often embedded as streaming media from third-party sites. In addition, websites can include a robots.txt file to provide guidance regarding limitations on crawling of the site. Some, in essence, state, “Do not crawl here.” Beyond that, site managers may monitor and block crawlers, given that the load of crawling a site can itself strain the site’s ability to serve its content to users.
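As a minimal illustration of how a well-behaved crawler honors those directives, the sketch below uses Python’s standard urllib.robotparser module. The sample rules and the “ExampleArchiveBot” user agent are hypothetical, not drawn from any particular site.

```python
# A minimal sketch of how an archiving crawler might check robots.txt
# before fetching a page. The sample rules and the "ExampleArchiveBot"
# user agent are illustrative only.
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

for path in ("/scores/today.html", "/private/draft-notes.html", "/search?q=archive"):
    allowed = parser.can_fetch("ExampleArchiveBot", path)
    print(f"{path}: {'crawl' if allowed else 'skip (disallowed by robots.txt)'}")
```

In this sketch, only the first path would be fetched; the other two are skipped because the sample rules disallow them.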
Another impediment to web archiving is that crawlers cannot easily get past the paywalls and sign-ins that many sites require. Unless special arrangements are made with the website owner, harvesting from such sites is not possible.
Many websites provide access to information held in databases, for example, a staff directory on an agency’s site. Information from the underlying database is generated on the fly in response to specific search queries. A standard web harvest can only traverse the links provided within a site and cannot capture the database itself.
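To make that limitation concrete, here is a minimal sketch of the link-following logic a basic harvester relies on, written with only Python’s standard library; the seed URL is a placeholder. Because the crawl frontier is built solely from anchor links found in fetched pages, results that exist only behind a search form or database query never enter it.

```python
# Minimal sketch of a link-only harvester (Python standard library only).
# The seed URL is a placeholder. Pages reachable only via database queries
# or form submissions never appear in the frontier, so they are never captured.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def harvest(seed, limit=10):
    frontier, seen = [seed], set()
    while frontier and len(seen) < limit:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip unreachable pages
        extractor = LinkExtractor()
        extractor.feed(html)
        # Only pages that are explicitly linked join the frontier.
        frontier.extend(urljoin(url, href) for href in extractor.links)
    return seen

if __name__ == "__main__":
    print(harvest("https://example.com/"))
```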
Sites that continuously add tremendous amounts of content, such as news websites, pose another challenge for web archiving. Daily crawls of such sites have proven to be inadequate. However, some success has been realized by archiving content distributed through RSS feeds, which provide a continuing stream of links to new content as it is added to a site.
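As a rough illustration of that approach, the sketch below polls a single RSS feed with Python’s standard library and records any item links it has not seen before. The feed URL is a placeholder, and a production archiving pipeline would hand those links to a proper crawler rather than fetching them directly here.

```python
# Minimal sketch of RSS-driven capture (Python standard library only).
# The feed URL is a placeholder; a real archiving system would pass the
# new links to a crawler for capture rather than stopping at discovery.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/news/rss.xml"  # placeholder
seen_links = set()  # in practice this would be persisted between runs

def poll_feed():
    with urllib.request.urlopen(FEED_URL, timeout=10) as response:
        tree = ET.parse(response)
    new_links = []
    # Standard RSS 2.0 layout: <rss><channel><item><link>...</link></item>
    for item in tree.getroot().iter("item"):
        link = item.findtext("link")
        if link and link not in seen_links:
            seen_links.add(link)
            new_links.append(link)
    return new_links

if __name__ == "__main__":
    for link in poll_feed():
        print("new item to capture:", link)
```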
Rights
Even though many websites are openly available, the content is not necessarily free to use and re-use. Of course, there are sites that are explicitly in the public domain or state that they are free of restrictions on use and re-use. In other cases, it is likely that someone, or some entity, owns the intellectual property rights for the site’s content. Although the Library of Congress has in place a notice and permissions process for its crawling and subsequent display of websites, most other web archiving programs do not follow such a procedure.
To date, there has not been a precedent-setting court case in the United States centered on web harvesting and the use of captured content. The risk to collecting institutions that do not use a permissions process is that legal precedent may eventually be set that will severely limit use of captured content. In the meantime, web archiving programs that do not seek permissions respond to takedown requests from content owners as they arise.