ONLINE PERSPECTIVE by ROGER K. SUMMIT
Keynote Address Delivered at the "ONLINE 1980" Meeting in San Francisco, November 12, 1980
Looking through the program for ONLINE '80 gives one the impression that online information retrieval is a highly specialized, esoteric and rather complex activity better left to experts and high priests. Take, for example, the following titles:

• "Maximizing the Use of ERIC"
• "Code and Document Type Searching in Social Science and Business Databases"
• "Online Derivation and Use of CAS Registry Numbers"
• "Search Strategy for Toxicologic Computerized Information Retrieval"

Who would realize that behind all of this apparent complexity there exists in the computer a fairly simple file structure and set of data processing sequences which are used more or less universally across all systems and files to provide access to information? The many ways in which information can be represented (numerical data, telegraphic word lists, coding structures, narrative abstracts) and the many ways a given concept can be expressed tend to obscure the fundamental and rather elegant internal processes involved in online information retrieval systems.

In the early 1960s online retrieval systems as we know them today did not exist. Furthermore, what is taken for granted today in terms of systems and databases was in no way obvious at that time, nor were the ways and means of bringing such a system into widespread use. To provide a bit of perspective for the many papers which follow in this most interesting conference, let me identify and describe some of the key objectives, decisions and gratuitous events that guided online systems and searching to the point where we find them today. Because of my familiarity with DIALOG, its development will be my point of departure.

Setting the Stage

Why do we use computers for information retrieval?
Let us move back in time to the 1950s to gain perspective. We are all familiar with manual solutions to information retrieval. The Dewey decimal system provides a correspondence coding between physical shelf location and subject by classifying books into a single subject category, and shelving them accordingly. The problem is that a book or document may pertain to several subjects. The familiar library card catalog provides an answer to this problem. As we know, in this case a document is assigned one or more subject headings. Cards are produced for each heading and the cards are sorted together to provide an alphabetical subject index in card form. But what do you do with journals or magazines where each article may pertain to a different subject? The obvious answer is to index each article and either carry the result in a card catalog (which is likely to outgrow any space designed to contain it) or publish it in a bound volume such as the Readers' Guide to Periodical Literature; thus the book catalog and the evolution of abstracting and indexing services.

There are, however, several problems in the retrieval of information from manual systems. With the Dewey approach of faceted classification, to discover the proper category to access, the user's point of view must parallel that of the indexer or classifier for retrieval to occur. If the document can be classified several different ways, it may be missed entirely. Furthermore, for a manual system to be manageable at all, the number of subject headings in the overall classification scheme must be rather severely limited, or the problem of discovering the proper category to look under approaches the difficulty of finding the relevant document with no classification. With a limited number of subject headings, the number of documents in any single category becomes large and necessitates much serial scanning.

How exciting it must have been for Allen Kent and J. W. Perry of Western Reserve University to contemplate a mechanical solution to information retrieval in the design of the Searching Selector in the mid-1950s, as shown in the first slide. The woman at the right has programmed the device to search 10 questions simultaneously from the punched paper-tape library in the foreground. With a sufficient match between descriptor codes and codes in the paper-tape library, the Flexowriter will type out the serial numbers of identified documents. Moreover, the library can be as large as the patience of the young woman.

Slide 1: The Western Reserve University Searching Selector is programmed for a search of encoded metallurgical literature. Unit at left is a Flexowriter, which "reads" the encoded "library" from punched paper tape. The unit at the right is programmed for search of ten questions. Documents from the "library" bearing on the questions are then identified when the Flexowriter automatically types out the serial number corresponding to those documents.

This device is one of the earliest approaches to electro-mechanical bibliographic information retrieval and is a functional precursor to the batch-search computer systems developed and used with first and second generation computers. Several organizations developed and/or provided batch search services using first and second generation computers in the late 1950s and in the early 1960s including, among others, Western Reserve University, NASA, Defense Documentation Center, the National Library of Medicine, and MIT. Batch-search systems are characterized by tape or punched-card input, printer output, and single job processing.
In batch searching a set of queries
is coded by combining keywords or descriptors in Boolean fashion on punched cards and matching each query against each record (a bibliographic citation) on a large, sequential tape file.
If there is a match, the record is
listed; if not, the search proceeds to the next record.
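
The batch procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy "tape," the descriptor sets, the query names and the `batch_search` function are all hypothetical, and queries are limited to AND and NOT conditions to keep the sketch short. It is not a reconstruction of any of the systems named above.

```python
# Hypothetical toy "tape": (accession number, descriptor set) in sequential order.
TAPE = [
    (1001, {"satellites", "russia", "orbits"}),
    (1002, {"satellites", "telemetry"}),
    (1003, {"russia", "metallurgy"}),
]

# Hypothetical batch of queries. "all_of" terms are ANDed together;
# a record carrying any "none_of" term is rejected (NOT).
QUERIES = {
    "Q1": {"all_of": {"satellites", "russia"}, "none_of": set()},
    "Q2": {"all_of": {"satellites"}, "none_of": {"russia"}},
    "Q3": {"all_of": {"satellites", "russia", "metallurgy"}, "none_of": set()},
}


def batch_search(tape, queries):
    """Make one sequential pass over the tape, testing every query on every record."""
    hits = {name: [] for name in queries}
    for accession, descriptors in tape:
        for name, query in queries.items():
            has_all = query["all_of"] <= descriptors
            has_excluded = bool(query["none_of"] & descriptors)
            if has_all and not has_excluded:
                hits[name].append(accession)  # a match: the record is listed
    return hits


RESULTS = batch_search(TAPE, QUERIES)
for name, accessions in RESULTS.items():
    print(name, accessions)
```

Every record is read whether or not it matches, so the cost grows with the size of the file times the number of queries, and the searcher sees nothing until the whole pass is finished. Query Q3, with one AND too many, comes back empty.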
The deficiencies of batch searching are widely known: if the query is too specific (i.e., too many ANDs), the result is likely to be zero; if slightly less specific, a large portion of a large file can be printed. Any subsequent revision of the search must be resubmitted and reprocessed, often requiring days or weeks to get the next set of results. Furthermore, a batch of searches on one of the computers of the day could take several hours to process. NASA, one of the leaders of the day in sponsoring the development of retrieval systems, required, for example, eight hours on an IBM 1410 to process a batch of searches against its file of 200,000 aerospace report citations.

Enter Online Retrieval

Several events occurred in 1964 allowing many organizations to dream the impossible dream of online retrieval. In our case they were:

• The King report "Automation and the Library of Congress" (1) (published in 1963) was reviewed at Lockheed, where it tickled the imagination of a Lockheed executive vice president
• A prototype online retrieval system called CONVERSE (2), which utilized an RCA computer with random access disks and an online input device, had been successfully tested at Lockheed
• IBM announced its third generation computer series, the IBM/360, in April of that year
• A proposal, "LMSC Information Storage and Retrieval Study Plan," was submitted to management in April 1964

The King report provided the precipitant stimulus; the CONVERSE experiment provided demonstrable credibility; the IBM third generation hardware with mass random access storage and interactive processing capability provided the means; and the proposal to Lockheed management resulted in the reorganization and funding necessary to establish the Lockheed Information Retrieval Laboratory in January of 1965. The objective of the new organization was stated in the proposal as follows: "Establish a program to investigate the information retrieval problem. Initial emphasis will be placed on a particular problem area, namely, the library problem which includes storage, retrieval, dissemination and display elements which are applicable to other information retrieval areas." An IBM 360/30 computer with 32K of internal memory, a 400 megabyte data cell mass storage device, and two 5.5 megabyte disks were installed in November of 1965. This configuration, shown in Slide 2, was little more powerful than the microprocessors of today, and was to be the development environment for DIALOG.

Birth of DIALOG

The design of DIALOG originated from a belief that proper use of third generation computer technology could overcome most, if not all, of the problems of manual and mechanical retrieval approaches described previously.
One of the problems with batch searching was the necessity to formulate extremely complex expressions for all but the simplest searches. Why not design a language that would allow the searcher to break down a complex formulation into a series of simpler steps which could be executed one at a time, and then provide a facility to combine the results of the individual steps into the more complex formulation? If the technique works in problem solving, perhaps it would work in searching. Furthermore, such a design philosophy would allow the searcher to know the quantitative result of each step which could help to guide formulation of the next step.

To assist in vocabulary selection, it was decided to provide some form of alphabetical index display showing posting counts and related term counts. Furthermore, it seemed desirable to allow the searcher to display "hits" at any time to provide for intermediate validation and/or search redirection. To facilitate this "cut and try" philosophy, it was necessary to save the results of each search statement, and to allow this result to be treated as a single element in any subsequent statement. This feature of recursion is probably the single most powerful aspect of online searching, and yet it was the most controversial in our design discussions. Some of the group argued for the saving of only a single set for the sake of economy.

Finally, the system must provide a quick response, and allow the results of any search to be printed in bulk offline. In addition, the procedures had to be simple to understand and efficient in execution. If a system could be designed to fulfill these objectives, it was felt that we could leap-frog over the existing batch-oriented systems, and establish ourselves as a predominant force in the then infant field of information retrieval.

The first consideration was the design of the language. How could it be designed in a manner that was easy to learn and yet powerful in result? Computer languages of the time were oriented to number manipulation, and were relatively complex and difficult to learn. But why not design "commands" to be used by the non-specialist that would themselves call up detailed computer programs to perform the specified operation?
The idea of providing a command to define a functional processing step, and an operand or data string to tell the command what to process and how, seemed to have the power and generality that were required. There needed to be just five basic commands as shown in the third slide.

EARLY DIALOG COMMANDS

BEGIN - TO INITIALIZE THE USER AREA
EXPAND (TERM) - TO DISPLAY THE TERMS ALPHABETICALLY NEAR AN INPUT TERM, WITH POSTING COUNTS
SELECT (TERM) - TO INDICATE TERMS TO BE USED IN SEARCH. SUCH TERMS WERE ASSIGNED SET NUMBERS FOR EASY REFERENCE LATER
COMBINE (SETS) - TO PROVIDE FOR BOOLEAN COMBINATION OF SETS. THE RESULT OF A COMBINE WAS A SET THAT COULD BE USED IN SUBSEQUENT COMBINES
DISPLAY, TYPE, PRINT (SET/FMT/ITEMS) - TO OUTPUT INDIVIDUAL CITATIONS/ABSTRACTS TO CRTS, TYPEWRITERS, OR OFFLINE PRINTOUTS, RESPECTIVELY
END - ENDED THE SEARCH AND REQUESTED THE USER TO ENTER AN EVALUATION OF THE SEARCH RESULTS

In parallel with design of the commands, the overall file structure had to be considered. Sequential search techniques as used in batch systems simply could not perform responsively in an online system. The alternative, which was selected, was an inverted file structure. An inverted file is like a concordance or a back-of-the-book index wherein every word is associated with a list of record numbers (accession numbers) of the citations which contain that word. It is produced by processing the sequential or linear file to extract accession number/keyword pairs.
These pairs are then sorted into word order, with the accession numbers for each word being placed in a list or string. Such an arrangement would allow a Boolean query for Russia and satellites, for example, to be processed by matching the string of accession numbers associated with "Russia" with the string associated with "satellites." If the same accession number appears in both strings, the citation it identifies must contain both words. An OR condition is accomplished by merging the strings; a NOT by removing from the first string any accession numbers it shares with the second. The result of any query statement is a set of accession numbers.

Aha! Such an approach would not only allow for rapid searching, but would also provide the user with a count of the hits prior to any access to the sequential (linear) file. Furthermore, utilizing the random access capability of disk storage devices, we could build an index to the linear file which would allow any item to be immediately called up and displayed. Another thought: why not build an index to the inverted file as well, which would allow the indexing vocabulary to be displayed to assist the user in selecting terms? Finally, if the index contained the word count, the user would have an immediate idea of the utility of the word as a search word.
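
For readers who think more easily in code, the inverted file and the saved-set idea can be sketched roughly as follows. This is a minimal modern illustration in Python, not the original DIALOG implementation: the three-record linear file, the function names (`build_inverted_file`, `and_merge`, `select`, `combine`) and the convention of numbering saved sets from 1 are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical linear file: accession number -> citation text.
LINEAR_FILE = {
    1: "Russia launches satellites",
    2: "Weather satellites in orbit",
    3: "Russia grain harvest",
}


def build_inverted_file(linear_file):
    """Extract (word, accession) pairs, sort them into word order,
    and gather the accession numbers for each word into one list."""
    pairs = sorted(
        (word, accession)
        for accession, text in linear_file.items()
        for word in set(text.lower().split())
    )
    inverted = defaultdict(list)
    for word, accession in pairs:
        inverted[word].append(accession)  # lists stay in ascending order
    return dict(inverted)


def and_merge(a, b):
    """AND: walk two sorted accession lists in step, keeping the common numbers."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


INDEX = build_inverted_file(LINEAR_FILE)
SETS = []  # every result is saved and can be reused in a later statement


def select(term):
    """Save the accession list for one term; return (set number, hit count)."""
    SETS.append(INDEX.get(term.lower(), []))
    return len(SETS), len(SETS[-1])


def combine(op, m, n):
    """Combine saved sets m and n; the result is itself saved as a new set."""
    a, b = SETS[m - 1], SETS[n - 1]
    if op == "AND":
        result = and_merge(a, b)
    elif op == "OR":
        result = sorted(set(a) | set(b))
    elif op == "NOT":
        drop = set(b)
        result = [accession for accession in a if accession not in drop]
    else:
        raise ValueError("unknown operator: " + op)
    SETS.append(result)
    return len(SETS), len(result)


S1 = select("Russia")        # set 1
S2 = select("satellites")    # set 2
S3 = combine("AND", 1, 2)    # set 3: citations containing both words
S4 = combine("OR", 1, 2)     # set 4: citations containing either word
S5 = combine("NOT", 1, 3)    # set 5: reuses set 3, itself the result of a COMBINE
```

Each call returns a hit count without touching the linear file at all, and because every result is saved under a number, set 3 can serve as an operand for set 5. That reuse is the recursion feature discussed earlier.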

How elegant! I can still remember the morning in January of 1966 when the whole concept came together and the preliminary design specification for DIALOG was completed.

Key System Development Milestones

Between 1966 and 1970 there were five key events which provided the project internal viability within Lockheed, and external visibility around the world:

• NASA prototype and development contracts
• COSATI panel demonstration
• ESRO and AEC development contracts
• ERIC services contract
• ASIS exhibit

With the encouragement of Mel Day, we submitted a proposal to NASA in early 1966 for a prototype online retrieval service which resulted in the award of a $20K contract to install and operate a remote terminal at Ames Research Center utilizing DIALOG to access the NASA file of 260K citations. DIALOG first became operational on the file in November 1966 and the remote terminal, an IBM 2260 display terminal, was installed at NASA in April 1967. The controller for the display, incidentally, was too large to be transported up the staircase to the second story library and had to be installed by knocking out a window casing and raising the controller by crane. The installation cost nearly equalled the cost of the project itself. The results of this project are reported in Reference 3.

Online retrieval proved to be popular among the scientific staff at Ames. One of the few problems which arose came from a librarian who complained that there was so much demand for searching, she had been forced to forego a committee meeting and several coffee breaks to keep up with the backlog. The NASA/Ames prototype contract was a key event in that it not only established the viability of the concept with Lockheed management, but it proved that people would voluntarily use a terminal to communicate with a computer to retrieve information. We learned that both librarians and engineers could understand the use of Boolean operators (and, or, not) in searching a database. Moreover, this contract gave the project the exposure needed to attract future opportunities.
The NASA/RECON development contract first suggested a realistic possibility of information retrieval as a formal line of business at Lockheed.

In 1968 COSATI (the Government Interagency Committee on Scientific and Technical Information) invited several online retrieval systems to demonstrate their capabilities on a file of project descriptors, and produced a film of the demonstrations under the auspices of Battelle Memorial Institute. The COSATI demonstration could be likened to an invitational state-of-the-art conference. The conference was attended by Lockheed, Mead, SDC, and Computer Corporation of America. Only Mead and Lockheed demonstrated online retrieval systems. The SDC and CCA systems were forerunners of database management systems.

In the interest of time I can show only about 5 minutes of this film. While viewing, remember this was 1968 state-of-the-art. It is surprising how far the approach has survived and has reappeared in other subsequently developed systems. Also note the foresight of the narrator at the end of the film in predicting international networks of computers and users. The young innocent you see demonstrating the DIALOG system seems to bear little resemblance to the surly, shifty-eyed characters associated with the business today. But that perhaps is the difference between Research and Development and competitive commercial enterprise.

There were several classic one-liners that came earlier in the film from the narrator:

• "These systems are simple enough to be operated by almost anyone. Your secretary can probably do it better than you - she can type faster."
• During the Data Central demonstration, one query resulted in 26 hits. The narrator comments, "This search is obviously too broad, better narrow it down."

As a result of the COSATI demonstration in 1968, Harvey Marron of the U.S. Office of Education (USOE) became convinced that their ERIC file of educational research would be well-served by this new technology. A series of contracts initiated services on this database to some six USOE-sponsored terminal sites between 1970 and 1972. These contracts marked the shift in our emphasis from that of systems developer to that of service vendor. The European Space Research Organization and Atomic Energy Commission software installation contracts were our final development contracts, and they reinforced our decision to shift to services in that they diverted significant amounts of human resources from database loading and system enhancements which were needed to support the services environment.

Online had in no way yet captured the imagination of the information community, however.
The 1969 ASIS Meeting in San Francisco included a special online bazaar which was not nearly as interesting to attendees as Doug Engelbart's word-processing system (Augmented Human Intellect), which was also demonstrated. It was soon after this meeting that the first database supplier contract was struck for a commercial database called PANDEX. I remember the negotiation: Dick Kolin of Crowell, Collier and Macmillan wanted considerable up-front money and a percentage of gross for any use of PANDEX online, probably a result of his role as a New York publisher. We finally settled for royalties of something on the order of $10 per hour and 5¢ per offline print. Little did either of us know we were setting an industry contracting standard.

Key Milestones in the 1970s

In 1970 DIALOG began to provide a true online retrieval service. In addition to the ERIC centers at Stanford University and in Washington, D.C. (ERIC contained 12,000 citations at the time), we initiated an inhouse service at Lockheed on the AEC, NASA, and PANDEX files. 1970 included several other interesting events:

• First Computer Communications, Inc. (CCI) terminal arrived (heavy, but portable; allowed dialup)
• ERIC was demonstrated at the White House Conference on Children (55K citations)
• First transoceanic demonstration of online retrieval held (Paris to Palo Alto searching the Nuclear Science Abstracts database)

By most measures, 1970 was the year, one decade ago, that marked the true beginning of third party online retrieval service (i.e., service to organizations who were neither the supplier of the database nor the operator of the computer system). Developments followed thick and fast during the early 1970s.

1971

• Free-text or proximity searching was added to DIALOG
• System Development Corporation (SDC) survey on the potential of online services was conducted
• Two additional ERIC sites were added - C.E.C. and RISE
• Council for Exceptional Children (C.E.C.) database was added to DIALOG

I remember my shock at seeing the SDC survey sent by Carlos Cuadra to some 8,000 organizations to inquire about their interest in online searching: which databases, how much they would pay, etc. The "secret" of the potential for online searching was out, it appeared. This survey served to convince my management at Lockheed that competition was nipping at our heels and that we must redouble our efforts to maintain our position in the field. Ironically, it turned out, unbeknownst to us at the time, that only 80 of the 8,000 questionnaires were ever returned, and that SDC almost decided not to enter the online arena as a result.

1972

• Search-save feature was added to DIALOG
• Dialup service initiated
• CALSPAN and GE San Jose signed up for service
• Competitive National Technical Information Service (NTIS) award was received for service to 5 terminals

A project report written by Bob Donati, then and still Manager of our New York Office, summarized 1972 achievements:

                              1 Jan. 1972    31 Dec. 1972
Number of terminals                     6              16
Subscribing organizations               5              13
Daily search hours                     10              35
Records online                       100K            900K

1973

• Several databases were added: TRANDEX from C.C.M., Psychological Abstracts from the American Psychological Association, AGRICOLA (a name adopted much later) from the National Agricultural Library, and Science Abstracts from INSPEC
• First advertisement was prepared - 15 exposures during 1973
• First Users' Meeting held in our New York Office
• Washington Office opened with Rick Caputo

The most significant development in 1973, however, was the interconnection of DIALOG and the TYMSHARE communications network. This connection provided a means for potential customers in over 40 U.S. cities to access DIALOG through a local telephone call at a flat charge of $10.00 per hour for telecommunications. The significance of such a service is that it ultimately has allowed the concentration of world-wide demand at a single processing center. Such a concentration provides the rationale for the offering of many small and specialized databases which otherwise would not be viable. This TYMSHARE service must be noted as one of the critical occurrences in the evolution of online searching.

1974 saw a continuing buildup of databases with the addition of Predicasts, ISI, IFI/PLENUM, and Chemical Abstracts databases. The final event of importance in 1974 was the award of a grant from the National Science Foundation to study the utility of online searching in the public library. The significance of this grant was that it provided a potential avenue for use of online retrieval services by the general public.

Let me skip to the present. From the humble beginning on the IBM 360/30 computer, the service has grown until it now is operated by two large-scale IBM computers with access to over 150 disk drives. For those of you who will not have a chance to visit us in Palo Alto, slides 4 and 5 show what the computer facility looks like today. The processors are two large-scale IBM 303X series computers which are connected to 150 200-megabyte disk drives.

Lessons Learned

We learned several important lessons during this development period as we became more sophisticated in the ways of computers, telecommunications, and online services. Our first lesson resulted from erroneously assuming that the mere idea of online retrieval was so obvious that the world would beat a path to our door. The better mousetrap notion was soon dispelled.
At an early project review meeting, one of Lockheed's senior scientists stated that such a system would never be popular because it was too cumbersome and inefficient. If he wanted information he called Dr. X or Professor Y, which was much easier than "talking" to a computer. One of the earliest public announcements of DIALOG was at a California Library Association meeting held at Stanford in April of 1967. I described, among other things, how online retrieval could bring dramatic changes to the role of the reference librarian. Not only were there few questions, it was clear that the overall reaction was one of ho-hum.

We gained technical insight into the mysteries of telecommunications early in the game when we learned that communications outages were not caused by birds roosting on telephone lines, even though this had been suggested by a TYMSHARE employee at that time. We also determined that cosmic ray exposure as a result of shipping tapes by air is not a major cause of tape-read failures during database loads, in spite of one programmer's insistence to the contrary. It turned out that every time we had a read failure on a database tape, the programmer had checked the manner of shipment and, sure enough, the tape had been shipped by air. What irrefutable logic. (Of course all the tapes, good and bad, were shipped by air.)

We learned to be very careful of our phraseology in online messages. One message read, "DIALOG will be available Thanksgiving day." A new user called to complain about our inconsideration in requiring her to come to the office on Thanksgiving to use the system. As she further explained, she had been dialing up every day of the past week only to encounter the availability message, after which she hung up.

Several times the "customer is always right" adage was reconfirmed. One example occurred when Ann Hubbard (subsequently Ann Caputo) received a call on our Customer Services line from a man with a thick German accent asking if our databases had anything on "corporate morality." As our Washington Representative, Rick Caputo, sometimes called in jest with a similar accent to make ridiculous demands of the group, Ann was not sure whether this was a bona fide call or not. Although skeptical, she went along with the conversation, seeking better definition of the question. Finally she decided she had been duped again and responded, "Knock it off, Rick, you turkey!" Came back the reply, "Excuse me, but vat iss un turkey?" It turned out that it had been a legitimate inquiry.

Finally, we ourselves became victims of a customer's highly tuned sense of humor. While reviewing the evaluation sheets from a successful Users' Meeting, I came across one with nothing but scathing remarks, low ratings and general dissatisfaction with us, our service, the databases we offered and our policies. While reading I hoped that the writer would have the courage to sign the evaluation so that we could contact him. Finally came the signature: Carlos Cuadra.

It is difficult to find a fitting close for a talk that attempts to cover so much in so little time, but to continue reporting the development of the online service would read a bit like the Old Testament of the Bible: Gregg Payne begat ABI/INFORM; Engineering Index begat COMPENDEX; CAS begat CA SEARCH; Eric Boehm begat America: History and Life; Leo Chall begat Sociological Abstracts; IAC begat Magazine Index; and these databases nourished the users of online retrieval services who flourished in the world as a result of their enlightenment. Nonetheless, let me encourage you, brethren and sistern, to go forth unto this conference to gain an ever greater understanding, and thence out into the world to spread this good word to those unfortunates who have been unable to attend.

REFERENCES
(1) King, Gilbert W., Automation and the Library of Congress, Library of Congress, Washington, D.C., 1963
(2) "An On-Line Technical Library Reference Retrieval System," American Documentation, January 1966
(3) Summit, Roger K., Remote Information Retrieval Facility, National Aeronautics and Space Administration (NASA CR-1318), Washington, D.C., April 1969