Guest Editors: Edward A. Fox, Robert M. Akscyn, Richard K. Furuta, and John J. Leggett
Digital Libraries
This year marks the 50th anniversary of Vannevar Bush’s seminal article [3] that paved the way for fields like information retrieval and hypertext. Bush’s efforts helped raise the level of public support for scientific research at the end of World War II. Today we have a similar opportunity to invest in research and education to bring the world closer together, improve our environment, and fulfill the age-old dream of every human being: gaining ready access to humanity’s store of information. This era and what we are building go by many names, including Cyberspace, Global Information Infrastructure, Infobahn, Information Age, Information (Super)Highway, Interspace, and Paperless Society. They are all supported by networking (e.g., the Internet). However, their essence is information. Information is what flows over the networks, what is presented to us by our consumer electronics devices, what is manipulated by our computers, and what is stored in our libraries.
COMMUNICATIONS OF THE ACM
April 1995/Vol. 38, No. 4
Libraries exist in many forms and are of many types. In computing, code libraries have been a part of the world of software engineering. Object libraries are part of object-oriented programming efforts. With multimedia technology we now have image libraries, audio libraries, and even digital video libraries. We might also think of libraries when we refer to collections that now reside in databases, knowledge bases, text bases, gopherspace, or the World-Wide Web (WWW).
At the same time, traditional libraries devote an ever-increasing share of their budgets to electronic services, whether in the form of CD-ROMs, online public access catalogs, or online databases. This trend will continue, as digital storage costs go down relative to the cost of library shelf-space, and as electronic services become more useful, affordable, available, and usable. Other forces are encouraging the demand for and supply of electronically accessible information, like the hunger for news and learning, the pride of authoring as evidenced by local or vanity publishing, the desire to collaborate or at least share with colleagues, the pressures of reorganization and restructuring, the ubiquitous presence of electronic publishing, the excitement of exploring in an expanding sea of information, and the push to use new technological tools. These are some of the many reasons that humanity is now working toward the grand challenge of a World Digital Library System. Here and throughout this special issue, we present a snapshot of that process.
History, Extreme Perspectives, and Definitions
For various reasons, “digital library” has stuck as the term to use for this field. Indeed, the majority of articles presented in this issue mention the phrase prominently. As we consider many of the discussions and activities in this area over the period 1991-1993 [5], we note a shift from “electronic library” to “digital library” as the preferred term, perhaps following the growing interest in digital networks, digital audio, and digital video relative to electronic publishing. In addition to a variety of other activities around the world, U.S. government legislation and a number of funding initiatives were launched in 1993 with the digital library as a prominent theme, and journal special issues began to appear on the topic [7]. In 1994 there were numerous talks, panels, tutorials, workshops (e.g., [10]), and conferences (e.g., [9]) on digital libraries.
Discussions of digital libraries often begin by picking an extreme point on a spectrum or scale (see also Levy and Marshall in this issue). Yet digital libraries will simply shift the point of equilibrium in each of these scales. For example, some claim we are finally “Beyond Paper,” which is the title of a recent work on the Adobe Acrobat product family [2]. Unhappily (from an ecological perspective), increased use of computers has contributed to increased use of paper; but tools are finally emerging that may ameliorate that trend. For instance, CD-ROMs have eliminated much of the demand for printed encyclopedias, large reference works, and computer manuals; in one application of digital libraries to education there are already a number of paperless courses (see http://ei.cs.vt.edu/EIproj.html). We hope use of paper will decrease but doubt it will disappear.
Others claim we will no longer need printed books or journals (see also articles by Levy and Marshall and by Marchionini and Maurer). In spite of jokes about people being unwilling to curl up in bed with computers to read novels, Voyager does sell individual books on diskette for PowerBook computers, Project Gutenberg makes available a large number of out-of-copyright volumes, and many CD-ROMs contain large collections of books. More progress has been made with electronic journals, however. Initial and individual efforts to launch new journals as electronic services (e.g., the Online Journal of Clinical Trials by AAAS and OCLC) are giving way to large commercial ventures. For example, Elsevier’s TULIP project, with some 40 journals about materials science and engineering available as page images, will soon lead to electronic access to over a thousand of their journals.
Another extreme statement calls for elimination of intermediaries (search intermediaries, librarians, retailers, distributors, and others) who interfere in the process of interchange between authors and readers (whose roles will also be blurred as a universal hypertext library system evolves), as is considered by Wiederhold in this issue.
Though the co-authors of this article are extensively involved in electronic and other publishing activities, and make significant personal use of hypertext systems (e.g., KMS [1]), they all have a growing appreciation of the value of talented intermediaries and believe that considerable research is required to incorporate their knowledge into expert library systems. Finally, some assume that with digital libraries everything will be in digital form (see Levy and Marshall). As in all of the cases mentioned here, we see a shift in the indicated direction that will certainly lead to dramatic changes, but we must heed the lessons of history, which show us that new technologies rarely completely supplant the old, and that new points of balance eventually are achieved.
The phrase “digital library” evokes a different impression in each reader. To some it simply suggests computerization of traditional libraries. To others, who have studied library science, it calls for carrying out the functions of libraries in a new way, encompassing new types of information resources; new approaches to acquisition (especially with more sharing and subscription services); new methods of storage and preservation; new approaches to classification and cataloging; new modes of interaction with and for patrons; more reliance on electronic systems and networks; and dramatic shifts in intellectual, organizational, and economic practices. To many computer professionals, a digital library is simply a distributed text-based information system (see article by Croft in this issue), a collection of distributed information services (see article by Wilensky), a distributed space of interlinked information (see Schatz et al.), or a networked multimedia information system. It may have materials that are mostly from outside an organization, that are generally of high value, and that have had special electronic services added to their quality during creation, collection, organization, and/or use [10]. To modern-day users of the WWW it suggests more of the same, with sure-to-come improvements in performance, organization, functionality, and usability. Hypertext researchers recall Bush’s vision of linked multimedia objects that encompass humanity’s store of information [3]. Those studying collaboration technologies see digital libraries as the space in which people communicate, share, and produce new knowledge and knowledge products. Those working on education technology see digital libraries as support for learning, whether formal or informal (see Marchionini and Maurer).
The metaphor of the traditional library is both empowering and constraining. We have acknowledged the value of talented intermediaries and recognize the importance of the knowledge systems they have evolved over centuries of handling and managing traditional collections. Much of the power of the digital library is the flexibility it permits in allowing processing of our collections of tangible objects and their electronic representations.
However, the knowledge developed over the years is quite flexible too, and it is feasible, perhaps even desirable, to apply it also to the collection of things without direct physical analogs, for example, algorithms, real-time data feeds, computational states, relationships among versions of a physical object showing the historical progression of an idea, multimedia annotations, and tours.
In this issue we embrace all of these perspectives. We adopt the pragmatic approach of letting exemplary pilot, research, and development projects provide an operational definition; having descriptions of key supporting technologies provide insight into future trends; and offering a set of in-depth feature articles that attempt to explain digital libraries in the light of interface and retrieval techniques, education, needs of information analysts, and the world of scholarly publishing.
In this Issue
We have developed this special issue as a nourishing meal, with appetizers, main courses, and desserts aplenty. We invite you to indulge in all of the appealing works, to think deeply about their implications, to connect them with your own plans and activities, and
to build your own view of digital libraries. We are pleased that toward the end of this issue the ACM Publications Board has laid out its plans for electronic publishing (Denning and Rous), along with interim statements about copyright policies and guidelines for authors who submit works for ACM to publish. These statements expand on earlier discussions (e.g., [4]) and will have a direct impact on you, the computing profession, and the world of electronic publishing. What is particularly exciting in light of this special issue is that the cornerstone of ACM Publications is its emerging digital library, which will support a wide range of new as well as replacement services. We hope that this issue will help prepare you for the new world of ACM publishing!
We open with our most visually interesting article, which shifts the focus to users and visualizing information. Rao et al. discuss a variety of tools, systems, and studies at Xerox PARC that illustrate the future style of rich interaction users will have with digital libraries. They provide a human-computer interaction perspective on the field of information retrieval. As in the Envision project (see Heath et al.), they strive to empower users to manage the vast amount of information that will be available in digital libraries, but go further in applying visualization to complex processes like clustering and in incorporating interaction into more aspects of user sessions.
Having introduced a number of approaches to information retrieval and visualization, we continue with a selection of short pieces about supporting technologies for digital libraries. Bell et al. discuss an information retrieval system available by anonymous FTP and how compression techniques make it suitable for handling large text collections. Croft describes how information retrieval methods, enhanced with artificial intelligence techniques, can be used in digital libraries. Fox summarizes a range of efforts to make computer science reports readily available, leading into French et al.’s discussion of the WATERS project and Lagoze and Davis’ explanation of the Dienst system. Please avail yourselves of these computer science report services, and help build and use an integrated worldwide virtual computer science report library. Finally, concluding this section is a short piece on a powerful approach to handling spatial (e.g., GIS) information (by Kacmar and Jue).
The section on projects spans the range of pilot digital library efforts from small to large. Huser et al. describe technologies and approaches to constructing a large encyclopedia. These will be a crucial part of future digital libraries: developing a knowledge base and object database with the aid of text analysis and parsing, network editing and enrichment, and
automatic generation of presentations on demand. Merrill et al. explain a 135-GB university information system exploiting CD-ROM technology. Heath et al. describe a user-centered discipline-level research project to build a digital library prototype from the ACM literature, and present a new style of interface for managing search results (see also [7]). Entlich et al. give an overview of the CORE project to build a digital library in chemistry from American Chemical Society publications.
Continuing this section is a collection of pieces regarding the NSF/ARPA/NASA $24.4-million Digital Library Initiative. First is Olson’s “An Appreciation of Laurence Rosenberg,” whose many years of devoted service to NSF ended while he was working on this initiative. Short overviews of the six projects funded through this initiative follow, listed in alphabetical order by institution name. These cover a broad range of digital library content areas and types, technical approaches, system architectures, and research problems. For each, pointers are given to project home pages on the WWW, so readers can track progress over the four-year period of each award.
The last discussions in this section deal with national libraries. Purday details the efforts of the British Library to apply digital technology. The several pilot initiatives described will pave the way for broader access to the vast holdings of this key institution. Becker then briefly summarizes efforts under way at the U.S. Library of Congress. The activities at these two libraries are illustrative of national efforts in France, Japan, Singapore, and other countries.
The feature articles presented here give definition and perspective to the field of digital libraries. Since most people think of education when libraries are discussed, we begin with Marchionini and Maurer’s coverage of teaching and learning with digital libraries. These authors are experienced in library and information science, electronic publishing and journals, hypertext systems and collections, information resources, evaluation, and using scientific data in education. They provide a rich insight not only into digital libraries but also into the future of learning. The next two articles force us to think deeply about digital libraries, how they relate to current work practices, traditional libraries, and the revolution now under way in electronic publishing.
Levy and Marshall focus our attention on information workers and their needs, making it clear what support digital libraries must provide and why we will need ongoing bridges among digital libraries, objects in the real world, and people’s communicative processes. They help us enlarge our view of digital libraries to include facilitation of richer collaboration, varied media, and works that are more transitory than
archival publications. Wiederhold pushes us even further, to consider the whole enterprise of scholarly and electronic publishing. He points out the new capabilities that digital libraries will provide and the implications of networked information interchange for the world of publishing, and predicts shifts in roles, responsibilities, revenues, and services.
Concluding the issue are the ACM Electronic Publishing articles, which illustrate many of the points raised earlier. Denning and Rous continue Wiederhold’s analysis of shifts in the role of publishers. They summarize current practice in scientific publishing, list the many breakdowns now visible, and lay out ACM’s response to these challenges: a digital library, two tracks of publications, experiments, new services, and careful development of policies and guidelines.
Areas and Technologies
This issue touches on many aspects of digital libraries. Table 1 is a short list of phrases that are mentioned here or in related works. Each area can be studied on its own, but special insights are gained by considering an area in the context of digital libraries. Digital libraries developed with all of these topics properly considered will come much closer to providing full service.
For full-service, large-scale digital libraries to develop and interoperate with others, a fair degree of standardization is required. Such standards must rest on some agreed-upon framework and reference model, building upon careful definition of requirements [8]. Clearly the suite of standards needed for interoperability is more comprehensive than that needed for interchange, and even in that more limited arena, there are deficiencies. If we consider the most crucial aspects of digital libraries on the one hand, and the collection of international standards that have been approved on the other, we often find no standard where one is needed, and several available where one might do. We urge corporations, funding agencies, national libraries, networking groups, and research and development teams involved with digital libraries to work together with standards bodies to correct this situation. Even greater integration and collaboration is needed if humanity accepts the challenge of constructing a World Digital Library System.
Let us consider briefly some of the key supporting technologies that are discussed in this article and in this issue. First is the field of electronic publishing. Tools like Adobe Acrobat and its Distiller or Capture facilities allow us to directly convert documents to be printed on a PostScript printer to electronic ones, and also add value in terms of miniature or thumbnail sketches of pages, hypertext links, and search services [2].
Standards like SGML facilitate dual-use publishing—for paper and more flexible electronic delivery—and are one of the pillars of the success of the WWW. Thus, Mosaic (see Schatz et al.) depends on HTML, which is defined by
an SGML Document Type Definition. We see SGML used in the CORE (see Entlich et al.) and Envision (Heath et al.) projects and as an ongoing foundation for ACM publishing plans (see [4] and the three pieces in this issue). Clearly, significant changes in publishing are occurring as a result of electronic publishing technology and its blending with networked information (see Wiederhold).
Second is the field of hypermedia. Even more radical changes will occur in the publishing process when authors learn to write with the aid of powerful hypertext systems and to collaborate through them on articles (as was done with this piece). As scholarly activities yield rich hyperbases and knowledge bases, multiple uses will be made through the help of intelligent presentation systems that generate personalized organizations and render multimedia object collections in the most suitable fashion (see Huser et al.). With next-generation multimedia systems, highly motivational video resources will be transformed and connected into applications for education, training, reference, and entertainment (see Christel et al.).
Third is the field of education and digital libraries. Marchionini and Maurer provide an insightful overview of this area. We see education as one of the main applications of the Envision project and note that the Digital Library Initiative was tailored to take place at leading universities. Many of the producers of CS reports are educational institutions, and experience with projects to disseminate these reports electronically (see the articles by Fox, French et al., and Lagoze and Davis) has shown that most of their usage is connected with education.
Fourth and last is the broad field of data and information management. Specialized technologies are needed for spatial and geographic information (see Kacmar and Jue and Smith and Frew), compression (see Bell et al.), and multimedia information (see Christel et al.). Database management methods, whether extended relational (see Wilensky) or object-oriented (see Huser et al.), not only are needed to support direct use of data collections (see Merrill et al.) in digital libraries, but also will help handle catalogs, royalty administration, usage logs, security control, and other services. Text analysis and information retrieval techniques are crucial for converting, indexing, representing, searching, and presenting desired information (see Croft). Human-computer interaction methods for handling information are of particular importance to help users more effectively search, learn about, organize, and utilize the vast stores that will be found in digital libraries (see Rao et al.). They are emerging as powerful aids when a user-centered design approach is adopted (see Heath et al.).
Table 1. Areas of study, attributes, contents, features, issues, roles
Abstracting               Education-support               Object-oriented
Accessibility             Electronic publishing           OCR
Agents                    Ethnographic study              OODB support
Annotation                Filtering                       Personalization
Archive                   Geographic information system   Preservation
Billing, charging         Hypermedia                      Privacy
Browsing                  Hypertext                       Publisher library
Catalog                   Image processing                Repository
Classification            Indexing                        Scalability
Clustering                Information retrieval           Searching
Commercial service        Intellectual property rights    Security
Content conversion        Interactive                     Sociological study
Copyright clearance       Knowledge base                  Storage
Courseware                Knowbot                         Standard
Database                  Library science                 Subscription
Diagrams (e.g., CAD)      Mediator                        Sustainability
Digital video             Multilingual                    Training support
Discipline-level library  Multimedia stream playback      Usability
Distributed processing    Multimedia systems              Virtual (integration)
Document analysis         Multimodal                      Visualization
Document model            National library                World-Wide Web
Economic study            Navigation
As a snapshot, this issue is necessarily incomplete in
its characterization of digital libraries. Many important projects and perspectives have been omitted. Here we give some pointers to aid further exploration, and of course we encourage interested readers to attend the numerous conferences and workshops scheduled in this field, many sponsored by or in cooperation with ACM and its SIGs. One early journal special issue is introduced in [6]. It includes articles on copyright and intellectual property rights, a subscription model for handling funds transfer related to digital libraries, a description of the evolution of the WAIS search system in general and its interfaces in particular, an overview of the Right Pages system and its use of OCR and document analysis algorithms, and an early overview of the Envision system [7]. We note that to many, intellectual property rights issues and ways to obtain revenue streams to sustain digital libraries are the most important open problems. The largest digital library conference makes its proceedings available over the WWW [9]. These contain many insightful discussions, proposals of new research ideas, descriptions of base technologies, and explanations of how the broad concept of a digital library fits in with the needs of specific user communities and the information they require. Readers can find a variety of works on agents, architectures, catalogs, collaboration, compression, document analysis from OCR and page images, document structure, electronic journals, heterogeneous sources, knowledge-based approaches, library science, numerical data collections, object stores, and organizational usability. For more details on the origins of the Digital Library Initiative, and for a variety of perspectives on open research problems, we refer the reader to [5]. This work also has numerous pointers to people, projects, institutions, and other reference works in the area. For a perspective on the role the computer industry should have in this field, see [10]. 
This report outlines IBM’s perspective on key supporting technologies and on the unique challenges highlighted by the emergence of digital libraries. We expect considerable interest from the corporate sector as well as from government agencies in this important area of information technology.
For lack of space, we have had to omit many publications on networking and storage technologies, sociological and ethnographic studies, library and information science, OCR and document analysis or conversion, and rights management. These and other works are needed to round out the discussion of digital libraries. However, we encourage you to read the rest of this issue as a good starting point for your future studies of this important field. We invite you to not only use but also help in the creation of a future World Digital Library System!
References
1. Akscyn, R., McCracken, D., and Yoder, E. KMS: A distributed hypermedia system for managing knowledge in organizations. Commun. ACM 31, 7 (July 1988), 820-835.
2. Ames, P. Beyond Paper: The Official Guide to Adobe Acrobat. Adobe Press, Mountain View, Calif., 1994.
3. Bush, V. As we may think. Atlantic Monthly 176 (July 1945), 101-108.
4. Fox, E.A. ACM press database and electronic products: New services for the information age. Commun. ACM 31, 8 (Aug. 1988), 948-951.
5. Fox, E.A. Sourcebook on digital libraries: Report for the National Science Foundation. Tech. Rep. TR-93-35, Computer Science Dept., VPI&SU, Blacksburg, Va., 1993. Available by anonymous FTP from directory pub/DigitalLibrary on info.cs.vt.edu or at http://fox.cs.vt.edu/DLSB.html
6. Fox, E.A., and Lunin, L. Introduction and overview to perspectives on digital libraries. J. Am. Soc. Inf. Sci. 44, 8 (Sept. 1993), 441-443. (Guest editors’ introduction to special issue.)
7. Fox, E.A., Hix, D., Nowell, L., et al. Users, user interfaces, and objects: Envision, a digital library. J. Am. Soc. Inf. Sci. 44, 8 (Sept. 1993), 480-491.
8. Gladney, H., Fox, E.A., Ahmed, Z., et al. Digital library: Gross structure and requirements: Report from a March 1994 workshop. In Proceedings of the 1st Annual Conference on the Theory and Practice of Digital Libraries: Digital Libraries ’94 (Texas A&M Univ., College Station, Tex., June 1994), pp. 101-107. Longer versions available as IBM Research Report RJ9840, IBM Almaden Research Center, May 1994, or Virginia Tech Dept. of Computer Science Tech. Rep. 94-25, June 1994, or by anonymous FTP from directory pub/DigitalLibrary on info.cs.vt.edu as RJ9840.ps
9. Schnase, J.L., Leggett, J.J., Furuta, R.K., and Metcalfe, T. Proceedings of the 1st Annual Conference on the Theory and Practice of Digital Libraries: Digital Libraries ’94 (Texas A&M Univ., College Station, Tex., June 1994). Available in electronic form at http://atg1.wustl.edu/DL94.
10. Slonim, J. Networked information systems as digital libraries. Summary Report from IBM Academy Digital Library Workshop, Briarcliff Manor, New York, Sept. 12-13, 1994. IBM Academy of Technology, 1995.
About the Guest Editors:
EDWARD A. FOX is an associate professor of computer science at Virginia Tech where he also serves as associate director for research of the Computing Center. His research interests include digital libraries, electronic publishing, information storage and retrieval, and multimedia systems and their uses in education. Present address: Dept. of Computer Science, Virginia Tech, 562 McBryde Hall, Blacksburg, VA 24061-0106; email: fox@vt.edu
ROBERT M. AKSCYN is co-founder and president of Knowledge Systems, a 14-year-old software spinoff from Carnegie Mellon University that specializes in distributed hypermedia technology for wide-area networks. He is also an adjunct professor in the School of Computer Science at CMU. His research interests include enterprise-wide collaboration and large-scale commercial digital library technologies. Present address: Knowledge Systems, RD2 #213A Evans Rd., Export, PA 15632; email: rma@centro.soar.cs.cmu.edu
RICHARD K. FURUTA is an associate professor in the Department of Computer Science and director of the Hypermedia Research Laboratory at Texas A&M University. His current research interests, besides digital library systems, include hypermedia systems and models, structured documents and electronic publishing, computer-supported collaborative work, and management systems for 3D, gesture-based user interfaces. Present address: Dept. of Computer Science, Texas A&M University, College Station, TX 77843-3112; email: furuta@cs.tamu.edu
JOHN J. LEGGETT is an associate professor of computer science at Texas A&M University, where he directs the Center for the Study of Digital Libraries. His current research interests include digital library systems, collaborative hypermedia systems, and human-computer interaction. Present address: Dept. of Computer Science, Texas A&M University, College Station, TX 77843-3112; email: leggett@cs.tamu.edu
© ACM 0002-0782/95/0400