[EN] "Multilingual Information and Retrieval Systems Technology and Applications" | Dr. Ulrich Kampf


Multilingual Information and Retrieval Systems Technology and Applications

Dr. Ulrich Kampffmeyer PROJECT CONSULT GmbH

IMC Congress, Brussels 1993

Dr. Ulrich Kampffmeyer

- VOI Verband Optische Informationssysteme, Roßdorf / Darmstadt, German Association of Manufacturers and Resellers of Digital Optical Media, Systems and Software (Chairman of the Board)

- PROJECT CONSULT Unternehmensberatung Dr. Ulrich Kampffmeyer GmbH, Wachenheim, Hamburg, Darmstadt

Abstract

This paper on multilingual information and retrieval systems with optical mass storage describes the technical principles of software design. The different layers and modules, from the user interface via transformation modules, thesaurus modules and fulltext interpretation to database management, are explained in detail. Two examples of multilingual document imaging systems are presented:

- wfBase: a multilingual press and commerce information system based on four ISDN nodes in Switzerland;

- HEMIS: a multilingual information system for CD-ROM distribution on environmental institutions, projects and programmes of the UN Environmental Programme UNEP/HEM.

Contents

1. The Importance of Multilingual Software Systems With Optical Storage Media for the European Economic Region
2. Software Design
2.1 Structural and Other Requirements for Multilingual Software
2.2 User Interface and Application
2.3 Transformation Modules
2.4 Selection Lists
2.5 Thesauri
2.6 Fulltext Translation
3. Sample Applications
3.1 wfBase
3.2 HEMIS
4. Outlook and Summary

IMC Congress, Brussels © Copyright PROJECT CONSULT GmbH 1993

1. The Importance of Multilingual Software Systems With Optical Storage Media for the European Economic Region

Europe 1993 is a catch-phrase that is often heard. But opening the borders and removing trade barriers will not eliminate the cultural and language differences between countries. These differences are a concern for all firms and organizations that operate in more than one country. Overcoming the language barrier is not simply a matter of lexical comprehension and translation. It involves many levels of differing interpretations, meanings in various contexts, and adaptation of specialized vocabulary. In business and commerce, mere translation is not enough; the unwritten laws of the target specialist language must be adhered to. In addition, the organization working across national boundaries must take into account differing units of measure, currency, and conventions (date formats, addresses, orthography).

Multilingual software is a requirement wherever users require access to the same information regardless of the nature of the source. This is particularly the case for:

- Trading firms
- Service firms
- International authorities and institutions
- Manufacturers with suppliers and subcontractors in more than one country
- Communications firms
- Banks
- Insurance companies
- Authorities and bodies such as police, air-traffic control, disaster relief organizations, environmental monitoring agencies, etc.
- Others


English is often used as a de facto communications standard. However, the use of a language that is foreign to its speakers can lead to misinterpretations and misunderstandings when the user is not familiar with the exact meaning, interrelationships, and contextual significance of terms and phrases. A "working knowledge" of a language is not enough.

As software and the information underlying it become more complex, the support provided by the software must become friendlier and more comprehensive. This is especially true for the user interface, information on current actions, status messages, context-sensitive information (especially on user mistakes or critical program branches) and help screens. The latter must be available in index form as well as context-sensitively. Modern "Windows"-oriented programs generally include these features. However, as in most programs, the information they contain is available in only one user language.

Most standard software today comes from a few leading software houses in the United States. Consequently the software and documentation are available in English, or American English, first. The various language versions are then translated from the original English version. The translations become available with more or less delay in various release levels, depending on the relative importance of the national market. In such standard software, the screens, associated texts, etc. are contained in the main body of the program, making translation with adaptation of the screens and texts a very complex undertaking. Even when different users access identical information, they cannot change the user language while the program is running. Instead, the complete target language version must be started, at considerable cost in time.

In addition, most standard software lacks the integrated database or resource management components that enable the administration of different language and function modules, not to speak of the creation and maintenance of such modules. Thus, "traditional" standard software has a built-in bias against multilingual use. This article examines database information systems that are suitable for multilingual applications.

2. Software Design

Like the ability to load modular program segments and functions separately, multilingualism must be designed in from the start. It is well-nigh impossible to modify finished software to support multilingual operation. In such cases, it makes more sense to completely redesign the software using modern tools.

2.1 Structural and Other Requirements for Multilingual Software

Multilingual software is subject to the following design criteria (Fig. 1):

a) Modular design with clear logical and software partitioning of the various levels (user interface, main program, resources, transformation modules, database, etc.). Interaction is controlled through messages and global variables.


b) The program segments responsible for execution may not contain any text components; all texts must be referenced by variables. The user can switch from one language to another using a global variable.

c) Texts are kept in resource libraries and accessed by variables. The libraries must be simple to maintain and the texts must be accessible and loadable in the application during runtime.

d) All parts of an application must have defined interfaces. This is particularly the case for the user interface, the actual application itself, the operating system and all additional application modules.

e) The application, user interface and operating system must support variable text field lengths and positions, since these can vary greatly from language to language.

f) The application, operating system, screen and printer drivers, and the database must support a variety of fonts, character sets, sort orders, data formats, etc. This in turn requires support from the underlying operating system.

Multilingual Software - Design Principles

- Modular design with clear separation of user interface, operating system and application (database)
- Every text component has to be referenced by a key variable in the application
- Resource libraries easy to link and to maintain (e.g. with a text editor)
- Defined interfaces between the user interfaces, operating system and application modules
- Variable text field positions and field lengths in the user interface modules of the application
- Support of different sets of fonts, language-specific characters, keyboard layouts, date formats, etc. by the underlying operating system

Fig. 1: Multilingual Software Design Criteria

Such a multilingual application is thus divided into several inter-communicating modules and levels (Fig. 2). The actual application program, which can be part of the database, uses messages and global variables to control the language selection, display and printing, and the search and conversion functions.


[Figure: Principles of language display during runtime. The diagram connects the screen, the application, the resources and the database. The resources comprise the language resources (German, English, French, Spanish; example text field "Please select ..."), the transformation modules, the thesauri and the selection lists. The language selector (Lx) in the application defines which resource is used for display and how the information in the data field is represented.]

Fig. 2: Language display during runtime of a multilingual program

The variable "Lx" ("Language Resource") determines which texts will be displayed, and which transformation modules and selection lists will be used to control an entry or search in a selected language. The information in the database itself is not changed, but only the screen display and printout. Figure 3 shows the levels of a multilingual application.
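The role of the selector can be illustrated with a minimal sketch. It is not taken from wfBase or HEMIS; the names `LANGUAGE_RESOURCES`, `COUNTRY_LIST` and `Screen`, the language codes and the German texts are assumptions made for this illustration only. It shows that a language change during runtime alters the display while the stored record stays the same.

```python
# Hypothetical language resources: the same keys exist in every language.
LANGUAGE_RESOURCES = {
    "de": {"PROMPT": "Bitte wählen ...", "LABEL_COUNTRY": "Land"},
    "en": {"PROMPT": "Please select ...", "LABEL_COUNTRY": "Country"},
}

# Hypothetical selection list: numeric database value -> display text.
COUNTRY_LIST = {
    "de": {1: "Schweiz", 2: "Belgien"},
    "en": {1: "Switzerland", 2: "Belgium"},
}


class Screen:
    """Renders a stored record in the language named by the selector lx."""

    def __init__(self, lx):
        self.lx = lx

    def set_language(self, lx):
        # Switching only changes the selector; the record is not converted.
        self.lx = lx

    def render(self, record):
        texts = LANGUAGE_RESOURCES[self.lx]
        country = COUNTRY_LIST[self.lx][record["country"]]
        return [texts["PROMPT"], f'{texts["LABEL_COUNTRY"]}: {country}']


record = {"country": 1}    # the database stores the numeric value only
screen = Screen("de")
german_view = screen.render(record)
screen.set_language("en")  # language change during runtime
english_view = screen.render(record)
```

In this sketch the selector is an attribute of the screen object; the paper's global variable would play the same role for the whole application.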

[Figure: Layers and modules of multilingual software, arranged on levels 1 to 4. Modules shown: user interface (Windows, Presentation Manager, X-Windows, etc.), user interface (application), application, language resources, transformation modules, selection lists, thesauri, language interpreter, IRS Information Resources Management, operating system, database and driver.]

Fig. 3: Multilingual software levels (1-4) and modules

Level 1 essentially handles the presentation of the information, Level 2 converts the information from one language to another, Level 3 manages the access information and handles searches, and Level 4 manages the "documents" (datasets, images, graphics, etc.) on optical storage media. This article will not go into Level 4, the "IRS" Information Resources Management Program, in greater detail (compare Kampffmeyer, Ulrich: "Combined WORM and Magneto-optical Mass Storage Devices and Procedure-Oriented Information Processing Systems", GI Gesellschaft für Informatik, Arbeitskreis "Datenbanken", Conference at the University of Oldenburg, Germany, on Feb. 19, 1990). Levels 1-3 and their components are explained below.

2.2 User Interface and Application

The user interface depends to a large extent on the underlying operating system. Many operating systems are not up to the demands of multilingual software, since they do not allow for reconfiguration during runtime and do not support international character sets and formats. Operating systems with graphic user interfaces, like Microsoft Windows and OS/2 Presentation Manager, and operating systems based on X-Windows (OSF Motif, OpenLook, etc.) are suitable. These systems allow control of the screen largely independently of the actual operating system itself.

There is a fundamental difference between

a) the standard Windows interface, and

b) the application-specific interface implemented on the basis of this interface.

The application-specific interface uses the tools provided by the standard interface to represent the functions of the application. A graphic user interface has numerous advantages: a gentler learning curve, integrated help functions, and simple operation by mouse, menus, or key combinations. Another advantage of Windows interfaces is the unrestricted user sizing of windows and other displays. It would be impracticable to give all displays of a multilingual application their own user interface, since this would severely limit the number of compatible screen and printer drivers. The application's user interface should use standard Windows interface routines wherever possible.

The user interface (Windows as well as application) of a multilingual application should include the following (see Figs. 4 and 5):

a) Change of key assignments for differing language keyboard layouts during runtime by the application

b) Change of screen display during runtime by the application

c) Display of language-specific character sets (e.g. German: ä, ö, ü, ß; French: é, è, ê, ç; Spanish: Í, ñ, ¿, ¡; Danish: å, æ; Hungarian: ÿ, ý, ï; Greek: α, β, γ, etc.)

d) Change of formats, as for date, currency, time, etc. during runtime

e) Automatic adaptation of screens and fields to differing text lengths, special symbols, fonts, etc., under the given monitor resolution

f) Language-specific context-sensitive help based on the cursor position, current program status and the feasible or just completed action

g) Modules loadable during runtime without leaving the program


Operating System and User Interface Requirements for European Software

The operating system and window interface must support several features to enable switching the language during runtime:

- Change of keyboard setting and screen display during runtime via external program
- Enhanced keyboard setting with special characters of European languages (ç, ê, æ, å, ø, ä, etc.)
- Support and change during runtime of date and time formats
- Graphic interface with virtual window architecture to allow different sizes of screens and fields while changing the language
- Context-sensitive help in relation to the actual position of the cursor

Fig. 4: Operating system and user interface

User Interface (Application) Requirements

The user interface has to support several functions to enable change of language during runtime:

- Object-oriented software
- Change of screens, settings and styles during runtime
- Dynamic positioning of fields
- Automatic adaptation to different field lengths
- Controllable by the application program
- Loadable modules during runtime for messages, windows and help texts
- Dynamic data and message interchange with operating system and user interface, application program and database

Fig. 5: User interface

The most important feature of a multilingual application is convertibility during runtime, without having to load and start another program and without changing the screen and screen information content (Fig. 4).


The text components are kept in separate files, called "resource libraries" (Fig. 2). The resource libraries can be loaded on the fly by the language selection variable Lx. For language resources to be usable, all texts in a program that are going to be displayed or printed must be referenced by an unambiguous key variable in the appropriate library. Resource libraries must exist for:

a) All static texts in dialogue boxes and masks. These are texts which are associated with a given dialogue box and do not change.

b) Dynamic texts in dialogue boxes and masks. These are texts which change, appear, or disappear according to status (messages). This includes the graying out of inoperative or unavailable functions on menus and buttons.

c) Help texts which appear automatically or interactively.

d) Error messages, system messages and other operation-related messages.
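A resource library of this kind can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the format used by any particular product: the JSON strings stand in for separate resource files, and the key names, the class `ResourceLibrary` and the helper `missing_keys` are hypothetical. The consistency check is an addition of this sketch, intended to show one way of keeping libraries simple to maintain.

```python
import json

# Stand-ins for separate resource files, one per language (contents invented).
LIBRARY_FILES = {
    "en": '{"DLG_SEARCH_TITLE": "Search", "MSG_NOT_FOUND": "No documents found",'
          ' "HELP_SEARCH": "Enter a search term."}',
    "de": '{"DLG_SEARCH_TITLE": "Suche", "MSG_NOT_FOUND": "Keine Dokumente gefunden",'
          ' "HELP_SEARCH": "Geben Sie einen Suchbegriff ein."}',
}


class ResourceLibrary:
    """Holds the texts of exactly one language at a time."""

    def __init__(self, files):
        self._files = files
        self._texts = {}
        self.lx = None

    def load(self, lx):
        # Load the library for language lx at runtime, replacing the current one.
        self._texts = json.loads(self._files[lx])
        self.lx = lx

    def text(self, key):
        # Every text is referenced by its unique key; an unknown key raises KeyError.
        return self._texts[key]


def missing_keys(files, reference_lx):
    """Report, per language, keys of the reference library that are absent."""
    reference = set(json.loads(files[reference_lx]))
    report = {}
    for lx, raw in files.items():
        absent = reference - set(json.loads(raw))
        if absent:
            report[lx] = sorted(absent)
    return report


library = ResourceLibrary(LIBRARY_FILES)
library.load("en")
english_message = library.text("MSG_NOT_FOUND")
library.load("de")  # switch language without restarting the program
german_message = library.text("MSG_NOT_FOUND")
```

The program code only ever mentions keys such as `MSG_NOT_FOUND`, so a translator can work on the library files without touching the program.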

Language Resources Requirements

Language resources are used for displaying texts related to the unique keys in the application:

- Loadable modules for each language
- Every entry in the language resource is referenced by a unique key which may be used by different applications and the database itself
- Language resources are needed for every text on an entry or search screen form, every message, every help text, and icons adapted for each country
- Editor or tools for translation support

Fig. 6: Language resources

Many applications use icons and buttons to simplify option selection. If they bear text or abbreviations (such as "B" for bold), these must be converted when the language is changed (thus, in German "F" for "Fett" = bold). For this reason these icons and buttons should likewise be kept in dedicated resource libraries instead of being managed directly in the program. The same applies to icons with graphics, where the graphics do not convey the same meaning in a different language area or country.


Object-oriented programming languages and databases often support the use of loadable resources, making them preferable to traditional programming tools. The right choice of tools is important for the creation of applications based on a programming language or database. The application is the overarching, integrating component of the system as a whole (compare Figs. 2 and 7). The application contains not only the usual data-processing algorithms and input/output modules, but also the control and selection of language resources (transformation modules, selection lists, thesauri, help texts, messages, screen layout and display, etc.).

Application Characteristics

- Numeric keys for every text entry related to the screen display and database fields
- Direct control of database and user interface
- Object-oriented, message-driven program
- Transformers, selection lists, thesauri, language interpreters and language resources as loadable modules
- Database as loadable module or client-server communication via SQL

Fig. 7: Components of the application

Object-oriented programs with a "message" concept, such as Microsoft Windows, allow continuous control of the resources used and the condition of the screen. Direct communication should be set up for control of the modules on level 2 (Fig. 3). SQL can be used as a standardized interface for communication with the database in which the actual information is kept and managed. All modules on levels 2, 3, and 4 (Fig. 3) should be directly accessible or loadable during runtime.

2.3 Transformation Modules

The numerical information in the database is stored in a format that can be converted as needed for a given loaded language resource. This conversion is controlled by the variable "Lx" (Fig. 2). Transformation modules are considerably easier to implement than text translators, since they work by exact rules and with numeric values only (Fig. 8).


Transformation Module Types

Transformation modules are used for the display transformation of numeric values of the database:

- Transformation of date formats (supported by operating system)
- Transformation of time formats (supported by operating system)
- Transformation of addresses (position of postal codes, etc.)
- Transformation of units of measure (litre to gallon, km to mile, etc.)
- Transformation of international standardized nomenclature (country and city names, etc.)
- Transformation of user-defined values (see selection lists, etc.)

Fig. 8: Transformation modules
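Two of the module types in Fig. 8 can be sketched as follows. This is a minimal illustration under assumptions: the selector values "en_US" and "de", the table names and the functions `show_date` and `show_quantity` are hypothetical, and the stored units (litres, kilometres) are chosen for the example. Only the display is converted; the stored values, which also determine the sort order, remain untouched.

```python
from datetime import date

# Display patterns per language selector (selector names are assumptions).
DATE_PATTERNS = {
    "en_US": "{m:02d}/{d:02d}/{y:04d}",   # month-day-year
    "de": "{d:02d}.{m:02d}.{y:04d}",      # day-month-year
}

LITRES_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

# Stored values are assumed to be litres and kilometres.
UNIT_RULES = {
    "en_US": {"volume": ("gal", 1 / LITRES_PER_US_GALLON),
              "distance": ("mi", 1 / KM_PER_MILE)},
    "de": {"volume": ("l", 1.0), "distance": ("km", 1.0)},
}


def show_date(stored, lx):
    """Format a stored date for display; the four-digit year survives 1999/2000."""
    return DATE_PATTERNS[lx].format(d=stored.day, m=stored.month, y=stored.year)


def show_quantity(stored_value, kind, lx, digits=2):
    """Convert a stored quantity to the display unit of language lx."""
    unit, factor = UNIT_RULES[lx][kind]
    return f"{stored_value * factor:.{digits}f} {unit}"


# Sorting uses the stored values, never the display strings.
stored_dates = sorted([date(1999, 12, 31), date(2000, 1, 1), date(1993, 3, 4)])
german_dates = [show_date(d, "de") for d in stored_dates]
```

For example, `show_date(date(1993, 3, 4), "en_US")` yields `03/04/1993`, while the same stored date is shown as `04.03.1993` under the selector "de".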

The most important standard transformation modules are:

a) Date formats

This module toggles the display format of dates between American (month-day-year) and European (day-month-year). This function is often supported by the operating system directly, and allows use of either the months' full names or their abbreviations. The transformation module should be designed to cope with the transition from pre-2000 dates into the next century. This is important for all data which must be retained for several years. The date transformer module must also ensure proper sorting during display.

b) Time formats

The same applies to time-display formats. For firms active on an international scale, data is best stored in "Coordinated Universal Time" format (UTC). Date and time transformation modules can be set up to check whether the system's internal time setting is correct (the current date and time must always be later than that of the last document saved; calibration with standard working hours and days, etc., in order to be able to determine system down time if necessary).

c) Address conversion

Address-format conversion affects printouts more than it does on-screen displays. Addresses in Europe are not standardized, and use a variety of sequences of street, house number, and postal code. This transformer module recognizes the country of the addressee and selects the appropriate address format for printouts.


d) Units of measure

This is an important requirement for international trade and manufacturing companies. For example, in the oil industry large quantities of different types of oil and petroleum products are transported and handled daily. Measurement values, and with them customs and tax rates, constantly fluctuate, depending on the type of product and its specific weight and even on the ambient temperature. In cross-border trade the units of measure as well as of currency must be automatically converted. The most important categories are units of currency, distance, weight, and volume.

e) EDI data

Standardised electronic data interchange (EDI), such as EDIFACT, allows entire business transactions to be handled electronically, without paper originals. The data is archived digitally. For display and printouts, EDI codes are converted into text. This conversion can be made language-specific through a language control variable. With EDI data it is necessary to know what version of a given EDI application the data will be converted with.

Further transformation modules can be added to cover other requirements for specific industries and applications, for example converting product codes into text.

2.4 Selection Lists

Graphic interfaces like Microsoft Windows support single and multiple selection lists (Fig. 9). With single selection lists only one item on a given list can be marked and processed. With multiple selection lists, one or more items can be selected.

Selection Lists Characteristics

Selection lists are an easy way to translate information and to save storage capacity:

- The list displays a text on the screen related to a database value
- Every entry in a selection list refers to a value which is related to a database field
- Every entry in the different language versions of a list refers to the same value
- The database has to store only the numeric value of the entry
- Selection lists can be used as single and multiple-choice lists
- Selection lists help to standardize nomenclature in multinational and multilingual organizations

Fig. 9: Selection Lists
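The characteristics in Fig. 9 can be made concrete with a small sketch. The list contents, the language codes and the function names `to_values`, `to_texts` and `find` are assumptions for illustration. An entry made through the German list is stored as numbers and can be displayed and retrieved through the English list.

```python
# Hypothetical selection list: every language version maps the same numeric
# values to its own display texts.
SECTOR_LIST = {
    "en": {10: "Banking", 20: "Insurance", 30: "Trade"},
    "de": {10: "Banken", 20: "Versicherungen", 30: "Handel"},
}


def to_values(chosen_texts, lx):
    """Data entry: map the texts picked on screen to the stored numeric values."""
    reverse = {text: value for value, text in SECTOR_LIST[lx].items()}
    return sorted(reverse[text] for text in chosen_texts)


def to_texts(values, lx):
    """Display: map stored numeric values to the texts of language lx."""
    return [SECTOR_LIST[lx][value] for value in values]


def find(documents, value):
    """Retrieval: compare numbers only, independent of any display language."""
    return [doc_id for doc_id, values in documents.items() if value in values]


documents = {
    "doc1": to_values(["Banken", "Handel"], "de"),   # multiple selection
    "doc2": to_values(["Insurance"], "en"),          # single selection
}
```

Here `documents["doc1"]` holds `[10, 30]`, `to_texts(documents["doc1"], "en")` gives `["Banking", "Trade"]`, and `find(documents, 10)` returns `["doc1"]` regardless of the language in which the entry was made.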


Selection lists offer several advantages over regular text-entry fields in database applications:

a) No typing mistakes.

b) Selection lists keep the database uniform and ensure that entries can be easily found again. Since the user must decide from among a set of given expressions, entries are standardized.

c) The database stores only a reference number which refers to a text resource. This keeps space requirements low, and different text resources can be accessed depending on the language variable. Retrieval is faster, since the system must search only through predefined numbers instead of text sequences.

d) Multiple selection lists facilitate the multiple allocation of a document and allow the user to select a number of related items if he/she is unsure about the allocation to a single one.

Point c) above is the most important factor for multilingual applications. The use of reference numbers allows linkage to multiple lists in different languages. The reference numbers can also be used to limit access, so that only cleared items are shown in a search. Selection lists also facilitate data entry through the use of presettings for recurring entries.

Selection lists can be created with standard text editors. However, this should be done only by authorised persons, since changes to, and especially deletions of, entries in a selection list can compromise the consistency of the database. Strict update and maintenance rules are a must for distributed systems and resources. Selection lists with restricted vocabulary are the ideal medium for standardising terminology within a company and for creating multilingual software systems. Multilingual systems should avoid free text entry wherever possible and use selection lists whenever feasible.

2.5 Thesauri

This term has widely differing meanings. In its original meaning it refers to a defined specialist terminology, broken down hierarchically from the general down to the precise. The terms differ clearly from one another and are distributed over several hierarchical levels. A generic term at one level branches into a number of more precise terms on the level below it. All terms at a given level should be at a similar level of detail. However, in many word-processing programs the "thesaurus" is simply a utility showing possible synonyms. This familiar kind of thesaurus is completely unrelated to the structured terminology system described above, as for example defined by the International Organization for Standardization (ISO) for single- and multilanguage thesauri.


Thesauri

Thesauri offer a hierarchically structured and crosslinked nomenclature:

- One field on the screen may be represented by a structured hierarchical thesaurus
- Similar to a selection list, the thesaurus displays a text related to a database value
- The thesaurus offers navigation and interpretation tools
- The thesaurus is a database in itself, which relates numeric values to texts and provides additional structure by hierarchic order and crosslinks
- The structure of thesauri is standardized by ISO
- The same thesaurus may be used by different applications

Fig. 10: Thesauri

Seen from the outside, the thesauri we are discussing here for multilingual systems act similarly to selection lists (Fig. 10, compare also Fig. 9). First a list of generic terms is displayed (ISO Top Term; TT). Once a top term has been selected, the more precise terms subordinated to it are shown on a second list (ISO Narrower Term; NT). When one of these is selected it forms the new generic term (ISO Broader Term; BT) for the next level of narrower terms (Fig. 11). This strict hierarchy is fully applicable to only a few subject areas. Therefore, the ISO standard provides for crosslinks. These link terms from different levels and branchings independently of their position in the hierarchy. This is easier to follow in a program than it is to describe in print.

An electronic thesaurus is referenced by numbers in the program just as a selection list is (see section 2.4). However, unlike a selection-list entry, a thesaurus entry includes not only a "unique identifier" number in the database, but also flags which specify its display position (level and branching in the hierarchy) and the type (and if necessary direction) of links. The links allow a term to be associated with more than one top or broader term in other branchings, as well as the linkage of a broader term to several narrower terms in other branchings, regardless of the position in the hierarchy. The use of different links (uni-directional, bidirectional, broad-to-narrow, narrow-to-broad, additional reference, synonym, etc.) makes it easier to navigate in such a system.

In principle the electronic thesaurus is an entire database application, which stands between the user interface and the database proper. The database proper stores only the unique identifier. If this identifier references a "narrower term", all associated broader terms up to the top term can be found using its links and hierarchical position.
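The interplay of hierarchy and crosslinks can be sketched with two small tables. The identifiers, the single broad-to-narrow crosslink and the function names are assumptions chosen for this illustration; a real thesaurus would also record link types and display positions as described above.

```python
# Hypothetical thesaurus: unique identifier -> identifier of its broader term.
# None marks a top term (TT).
HIERARCHY = {
    1: None,
    2: 1, 3: 1,
    4: 2, 5: 2,
    6: 3, 7: 3,
}

# Hypothetical broad-to-narrow crosslinks as (broader, narrower) pairs:
# term 5 is also offered under term 3, outside its own branch.
CROSSLINKS = {(3, 5)}


def narrower_terms(uid):
    """Narrower terms (NT) of uid in the strict hierarchy."""
    return sorted(t for t, parent in HIERARCHY.items() if parent == uid)


def list_below(uid):
    """List shown after selecting uid: hierarchical NTs plus crosslinked terms."""
    linked = {target for source, target in CROSSLINKS if source == uid}
    return sorted(set(narrower_terms(uid)) | linked)


def broader_chain(uid):
    """Broader terms (BT) from uid up to the top term, hierarchy only."""
    chain = []
    parent = HIERARCHY[uid]
    while parent is not None:
        chain.append(parent)
        parent = HIERARCHY[parent]
    return chain


def all_broader(uid):
    """All broader terms reachable via hierarchy and crosslinks."""
    found, todo = set(), [uid]
    while todo:
        current = todo.pop()
        parents = {source for source, target in CROSSLINKS if target == current}
        if HIERARCHY[current] is not None:
            parents.add(HIERARCHY[current])
        for parent in parents - found:
            found.add(parent)
            todo.append(parent)
    return sorted(found)
```

With these tables, the database proper would store only the identifier 5 for a document; `broader_chain(5)` recovers `[2, 1]` from the hierarchy, and `all_broader(5)` additionally finds term 3 through the crosslink.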


[Figure: Thesauri, hierarchy and crosslinks. Upper panel, "The Hierarchical View of the Thesaurus": each term carries a unique identifier and a position in the hierarchical view (Top Term, Broader Term, Narrower Term). The pairs of unique identifier and position shown are 1 / 1000, 2 / 1100, 3 / 1200, 4 / 1110, 5 / 1120, 6 / 1210 and 7 / 1220. Lower panel, "The Network Structure of the Thesaurus": the same terms connected by crosslinks independent of the hierarchical position.]

Fig. 11: Hierarchy and virtual linkages (crosslinks)

An electronic thesaurus is represented internally as a network (relational system), but to the outside as a hierarchy. Thus, the composition of a list of terms depends not only on the broader term, but also on the links and the route taken to get to the broader term. Unlike with a selection list, the lists displayed by an electronic thesaurus can differ from situation to situation.

In addition to assisting in navigation by displaying the selection lists specific to a previously selected broader term, a database-supported thesaurus can also be used in "specialist" or "beginner" mode. When entering information, a specialist mode is best, which allows entry of a narrower term or an abbreviation directly, with the system determining the associated broader terms without having to go through the hierarchy. However, users who are inexperienced with hierarchical selection lists or with the subject content of the thesaurus are better off doing their searches in beginner mode, whereby the system analyses users' text input, looks for a match in the thesaurus, and if in doubt shows a synonym list and help text suggesting a repeat attempt or a more closely defined query. Such a "global search" can also draw on further fields or other resources of the thesaurus.

Fig. 12 shows how a number of "slices" are assigned to the reference keys of the thesaurus database. Each of the language slices contains all information on the hierarchical and network structure of the terms, since this will differ from language to language (narrower or broader terms, different semantic fields). However, regardless of the differences among the languages, the same information must be clearly accessible in all. Therefore, the unique identifier is assigned not only the term itself (main keyword), but also acronyms (e.g. "NASA"), homonyms (words that sound the same but have different meanings), synonyms, plural forms, explanatory notes, etc. This information is also accessed during a global search.

The "language slices" need not necessarily contain foreign languages; they can also contain different aspects of a single language. This is particularly useful for specialist languages. Thus, one slice can contain the regular colloquial language, with only two or three levels and accessible to everyone, while another slice can contain the terminology for a specialist field broken down into more levels and accessible only to those working in that field. This allows control of the extent, depth, and accessibility of information.

[Figure: "Slice" model of a multilingual thesaurus. German, French and English language slices share the same unique IDs (A1, A2, A3, ... An). In each slice, every unique ID carries: IDs of predecessors, IDs of successors, position in hierarchy, main key word, synonyms, homonyms etc., and help text.]

Fig. 12: Slice structure of a multilingual thesaurus
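The slice model and the two search modes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the structure of any actual product: the class names `SliceEntry` and `SliceThesaurus`, the field names and the sample terms are all made up for this example. Each language slice maps the same unique IDs to its own keyword, links, alternates and help text; `global_search` stands in for beginner mode and `broader_terms` for specialist mode.

```python
from dataclasses import dataclass, field


@dataclass
class SliceEntry:
    """One term as seen in one language slice (field names are illustrative)."""
    main_keyword: str
    predecessors: list[int] = field(default_factory=list)  # broader terms and crosslinks
    successors: list[int] = field(default_factory=list)    # narrower terms
    alternates: list[str] = field(default_factory=list)    # synonyms, acronyms, plurals
    help_text: str = ""


class SliceThesaurus:
    """Hypothetical container: slices[language][unique_id] -> SliceEntry."""

    def __init__(self) -> None:
        self.slices: dict[str, dict[int, SliceEntry]] = {}

    def add(self, language: str, unique_id: int, entry: SliceEntry) -> None:
        self.slices.setdefault(language, {})[unique_id] = entry

    def global_search(self, language: str, text: str) -> list[int]:
        """Beginner mode: match input against keywords, alternates and help text."""
        needle = text.casefold()
        hits = []
        for unique_id, entry in self.slices.get(language, {}).items():
            names = [entry.main_keyword, *entry.alternates]
            in_names = any(needle == name.casefold() for name in names)
            in_help = bool(needle) and needle in entry.help_text.casefold()
            if in_names or in_help:
                hits.append(unique_id)
        return sorted(hits)

    def broader_terms(self, language: str, unique_id: int) -> list[str]:
        """Specialist mode: derive all broader terms of a directly entered term."""
        entries = self.slices[language]
        seen, queue, found = {unique_id}, [unique_id], []
        while queue:
            current = queue.pop(0)
            for parent in entries[current].predecessors:
                if parent not in seen:  # crosslinks may lead to the same term twice
                    seen.add(parent)
                    queue.append(parent)
                    found.append(entries[parent].main_keyword)
        return found


def build_sample() -> SliceThesaurus:
    """Made-up sample terms; the IDs only imitate the numbering style of Fig. 11."""
    t = SliceThesaurus()
    t.add("en", 1000, SliceEntry("Environment", successors=[1100]))
    t.add("en", 1100, SliceEntry("Water", [1000], [1120], ["Waters"],
                                 "Inland and marine waters"))
    t.add("en", 1120, SliceEntry("Rivers", [1100], alternates=["Streams"]))
    t.add("de", 1000, SliceEntry("Umwelt", successors=[1100]))
    t.add("de", 1100, SliceEntry("Wasser", [1000], [1120], ["Gewässer"]))
    t.add("de", 1120, SliceEntry("Flüsse", [1100]))
    return t
```

Because every slice is keyed by the same unique ID, a hit found through a German synonym can be displayed with the English or French main keyword simply by reading the same ID from another slice.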

In addition to the modular slice structure, an electronic thesaurus database offers many advantages:

a) A standardized, controlled vocabulary ensures unambiguous and complete retrieval of all correctly entered information.
b) Entry errors are prevented.
c) Selection lists and help functions assist the user in finding his or her way through extensive, many-layered specialist vocabularies.
d) Functions like "global search" enable searches to include synonyms, homonyms, acronyms and other references as well as the help text itself.
e) The organization and structure of thesauri are internationally standardized.
f) A thesaurus database acts as a pre-processor, saving time in searches in the database proper, since only short, unambiguous numerical references need be searched and evaluated. The thesaurus then converts the unique identifiers for display.
g) Thesaurus databases can be run on a PC LAN, thus reducing the workload on the central database and information resources management (IRS; see below and Fig. 3).

If the system includes optical-systems management software in addition to the thesaurus database and the database proper, it has a three-level database hierarchy (compare Figs. 3 and 26):

a) Database for one or more thesauri (local or central)
b) Database for managing unique identifiers to selection lists and thesauri and for managing database entries (numerical, alphanumeric, date, time, Boolean variables, etc.)
c) Information retrieval and access system (IRAS). As a rule, this is a non-standard database for managing WORM (write-once) media, erasable, rewritable and M/O optical media, or read-only media (CD-ROM).

A standard database (preferably relational) can be used for the thesaurus database as well as for the database proper. Full-text databases are not suitable for this type of application (Fig. 13).

Database Characteristics
- A standard relational database may be used to manage data (except for language interpretation)
- Support of an optical disk information retrieval system for mass data management
- Standard fulltext databases are not usable

Fig. 13: Database characteristics
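The pre-processor role of the thesaurus database can be illustrated with a standard relational database. The following is a minimal sketch using Python's built-in `sqlite3` module; the table names `term` and `document`, the function `find_documents` and all sample rows are assumptions made for this example and are not taken from any of the systems described here. A query term in one language is first resolved to its numerical reference, the database proper is then searched by that number only, and the number is finally converted back into a label in the display language.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up thesaurus terms and documents."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE term (
            unique_id INTEGER, language TEXT, label TEXT,
            PRIMARY KEY (unique_id, language)
        );
        CREATE TABLE document (
            doc_id INTEGER PRIMARY KEY, subject_id INTEGER,
            entry_language TEXT, title TEXT
        );
        """
    )
    conn.executemany("INSERT INTO term VALUES (?, ?, ?)", [
        (1100, "en", "Water"), (1100, "de", "Wasser"), (1100, "fr", "Eau"),
        (1200, "en", "Air"), (1200, "de", "Luft"), (1200, "fr", "Air"),
    ])
    conn.executemany("INSERT INTO document VALUES (?, ?, ?, ?)", [
        (1, 1100, "en", "Water quality report"),
        (2, 1100, "de", "Bericht zur Wasserqualität"),
        (3, 1200, "fr", "Rapport sur la qualité de l'air"),
    ])
    return conn


def find_documents(conn, term, query_language, display_language):
    # Step 1: the thesaurus resolves the entered term to its unique identifier.
    row = conn.execute(
        "SELECT unique_id FROM term WHERE language = ? AND label = ? COLLATE NOCASE",
        (query_language, term),
    ).fetchone()
    if row is None:
        return []
    subject_id = row[0]
    # Step 2: the database proper is searched by the numerical reference only.
    docs = conn.execute(
        "SELECT doc_id, title, entry_language FROM document "
        "WHERE subject_id = ? ORDER BY doc_id",
        (subject_id,),
    ).fetchall()
    # Step 3: the identifier is converted into the display language.
    label_row = conn.execute(
        "SELECT label FROM term WHERE unique_id = ? AND language = ?",
        (subject_id, display_language),
    ).fetchone()
    label = label_row[0] if label_row else str(subject_id)
    return [(doc_id, label, title, language) for doc_id, title, language in docs]
```

In this sketch, `find_documents(conn, "wasser", "de", "fr")` returns both the English and the German document on the subject, labelled "Eau", while the titles stay in their language of entry.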

2.6 Fulltext Translation

The electronic interpretation and translation of running text requires very different strategies from those described up until now. Transformation modules, selection lists and thesauri can be combined in a system as desired, since they all work by the same rules: Numerical identifiers are transformed into predefined expressions in defined ways. A system capable of analysing running text is difficult to combine with these modules. It is an independent and complex software system made up of many component parts (Figs. 14 and 15).


Language Interpreter Characteristics
The language interpreter contains different modules which allow translation and interpretation of fulltext databases.
- Dictionaries provide information for the direct translation of nouns (singular, plural, conjunctions, etc.)
- Statistical modules support the interpretation of the noun inside a text
- Linguistic modules support the interpretation of the grammatical context
- Comparison modules combine the different strategies of interpretation
- Presentation modules display the answer to a query in the chosen language as translated fulltext
- Inverted file and cache modules optimize access

Fig. 14: Components of a language translation system

[Figure: Language Interpreter Structure. Components shown: user interface (entry, query, display); language interpreter with dictionary, statistic, linguistic, comparison and presentation modules and inverted file; database.]

Fig. 15: Structure of a language translation system


a) Dictionaries contain the individual words in their different forms (plural, singular, declined, conjugated, irregular verb forms, etc.). As a rule the dictionary will constitute a database application of its own. However, it is completely different in structure, makeup and content from the thesaurus discussed above.
b) Statistics modules analyse the occurrence and composition of words and combinations of words.
c) Linguistic and grammatical-analysis modules are the most difficult part. They must contain all the rules and comparative examples required to analyse syntax. Pattern recognition and fuzzy logic techniques are often used for this purpose.
d) The results of a), b) and c) above are combined, evaluated and interpreted in a comparison module. The comparison module is designed so that intermediate results of one module can be returned to another module for evaluation. This gives rise to an iterative process with a relatively high rate of recognition in texts on specific subjects for which there are electronic dictionaries containing the subject terminology.
e) Due to their architecture, traditional databases are not very effective at time-consuming text analysis. To speed things up, special cache and inverted file modules are often used as intermediaries.
f) Presentation modules handle the correct on-screen presentation of the translated text. They work with information from the dictionary module, the evaluated text from the database, and the inverted file system.

The running text interpretation system described here can also be used to evaluate queries in regular text. Fig. 15 shows the processing path for a query. To convert a text retrieved from the database, the system goes through the modules in the same way from bottom to top. The system shown here is just one possible configuration. Since this technology is very new, many other approaches are being investigated.
This particular approach has the advantage that different modules with differing evaluation strategies can be consulted simultaneously. Furthermore, each module can be dedicated to certain languages or vocabularies, and accessed automatically by the comparison module as needed. The interpretation and translation of a text is very time-consuming, and usually possible only on very fast interactive computers. Complex systems such as the one described should not be confused with simple translation aids.

Traditional full-text databases are seldom suitable for such systems. Standard database software uses a strategy of leaving out filler words, adjectives, adverbs, etc. in order to save memory space and increase database speed. However, a language interpretation system needs all of the information contained in the text, since otherwise coherent, context-adequate translation is not possible. "Language Interpreter" database systems have enjoyed initial successes with the UN and the European Commission.

The choice of a system for multilingual database applications is still simple at this point:

a) For document-oriented (facsimile) systems, applications with controlled vocabularies, and systems intended to bring about a standardization of use, the transformation, selection list and thesaurus approach is the right choice.
b) For full-text applications which will not go into full use within the next three to four years, the approach described in this section should be attempted or at least examined.

At present there is no commercial software immediately available for either application, nor are off-the-shelf solutions likely to become available in the future, since the nature of the application and the vocabulary will be subject to constant change. However, in my opinion an approach as shown in Fig. 3 is ideal. It combines the different transformation and interpretation components in one level where they work in parallel. They link the user interface with the database proper. This integrative approach combines the advantages of all of the techniques named, which can then be used individually or in combination as needed.
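The remark in this section that a language interpreter needs every word of the text, whereas standard full-text software drops filler words, can be made concrete with a small inverted file. The sketch below is purely illustrative: the filler-word list, the function names and the sample sentence are assumptions, and real inverted file modules are considerably more elaborate. It builds a positional inverted file once with and once without filler words and then tries to rebuild the sentence from each.

```python
from collections import defaultdict

# Illustrative filler-word list; real systems use much longer, language-specific lists.
FILLER_WORDS = {"the", "a", "of", "is", "not", "in"}


def build_inverted_file(text: str, drop_fillers: bool) -> dict[str, list[int]]:
    """Map each word to the list of positions at which it occurs in the text."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        if drop_fillers and word in FILLER_WORDS:
            continue  # the word and its position are lost for good
        index[word].append(position)
    return dict(index)


def reconstruct(index: dict[str, list[int]]) -> str:
    """Rebuild as much of the running text as the inverted file still holds."""
    slots = {pos: word for word, positions in index.items() for pos in positions}
    return " ".join(slots[pos] for pos in sorted(slots))


sentence = "the permit is not valid in the canton"
full_index = build_inverted_file(sentence, drop_fillers=False)
lean_index = build_inverted_file(sentence, drop_fillers=True)
```

Here `reconstruct(full_index)` returns the original sentence, while `reconstruct(lean_index)` yields only "permit valid canton": the negation is gone, so a context-adequate translation is no longer possible from the reduced index.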

3. Sample Applications

We will now look at multilingual information and retrieval systems from the user's point of view. Fig. 16 lists three sample applications; two of them, wfBase and HEMIS, are described in detail below.

Application Examples
- HYPARCHIV: Standard optical filing software for Microsoft Windows in 9 languages
- wfBase: Distributed press and commercial information system in 4 languages based on ISDN nodes (wf, Switzerland)
- HEMIS: Meta-database and information system for environmental data, information, programmes, methods, etc. for CD-ROM distribution (UNEP/HEM, worldwide)

Fig. 16: Application examples

a) wfBase: Press and economics information in a distributed document-imaging system
b) HEMIS: Environmental information on CD-ROM

3.1 wfBase


wfBase was developed specially for the Swiss Institute for Commercial Development (German "Wirtschaftsförderung", hence "wf"). It has been in operational use since 1992. The Swiss Institute for Commercial Development is located in Zürich, with offices in Geneva, Bern and Lugano. Prior to the introduction of wfBase, dossiers on political events, economic data, and the like were kept independently at all four locations. The goal of wfBase is to enable access by all Institute users to all press articles, periodicals, and Institute documents, independent of the language of data entry (Figs. 17 and 18).

wfBase
wf Schweizer Wirtschaftsförderung (Swiss Institute for Commercial Development), Zürich - Geneva - Bern - Lugano

The wf owns one of the largest archives on commercial and political topics in Switzerland. It provides information to politicians, journalists and its commercial members representing all major companies of Switzerland.
- Optical filing system for press and commercial documents (scanned and created via word processor, spreadsheet, etc.)
- Distributed system linked via SwissNet 2 (ISDN)
- Access for wf employees and third-party partners via multilingual graphical user interface (ISDN and telephone modem)
- Database with quadrilingual thesaurus
- Access to information independent of the language in which it was entered
- Several million documents stored on M/O jukeboxes (2 times 50 gigabytes)
- Integrated office communication with word processing, spreadsheet, fax, library management, electronic mail, accounting, address database, etc.

Fig. 17: wfBase - Features


[Figure: wfBase Storage and Communications Layout. wf users with hard-disk caches at four sites access images, files and descriptors: Zürich (read / write / create), Lugano, Geneva and Bern (read / create). The DB server, archive server and communications server are located in Zürich and connected via ISDN (SwissNet 2). Further elements shown: Novell NetWare, one jukebox for external use and one for internal use (addresses, library, dossiers), and external users with hard-disk caches connected via ISDN and telephone modem.]

Fig. 18: wfBase - System configuration with internal and external users and information management in two jukeboxes (Zürich)


wfBase also integrates other applications besides document management under its graphical user interface, such as word processing and spreadsheet applications, address and library management, billing for outside users, electronic faxing and mailboxes, etc. The wfBase system makes use of some HYPARCHIV modules, but is otherwise an independent application with client-server architecture and a relational database on an OS/2 server. The MS Windows workstations are linked together in a Novell network. Outside users can access wfBase by modem, query documents ("subsets"), and display and print them locally or have wfBase fax the documents to them. The four wfBase locations are linked by SwissNet 2 (ISDN). This powerful network allows the transmission of compressed scanned facsimiles. Two jukeboxes store scanned facsimiles, locally generated data, and incoming faxes. The system is highly error-tolerant and largely fail-safe.

At the heart of wfBase is the database with a quadrilingual (German, French, Italian, English) thesaurus for subject-area classification. The thesaurus includes over 2000 subject areas, organized hierarchically and in a linked structure over four levels.

wfBase Multilingual Thesaurus
The two images show different views of the thesaurus for thematic keywords (here in German). The thesaurus supports the user with navigation, jump functions, short-key entries, synonym retrieval and other techniques for easy-to-use access.

[Figure: two screenshots of the thesaurus mask "Sachgebiet" (subject area), taken from the Online '92 presentation.]

Fig. 19: wfBase multilingual thesaurus, showing two windows of the thesaurus screen. The left shows the branching from a broad term to a list of narrower terms. The thesaurus contains the subject areas covered in the dossiers.


In addition to the thesaurus, there are selection lists for other fields and fields for text and data entry. The database enables the user to locate documents regardless of the language in which they were entered. However, the system displays documents only in their language of origin; in a multilingual country like Switzerland it is not necessary to translate the contents of documents, as users are expected to be multilingual as a matter of course. Instead, the objective of wfBase is to improve communication between office locations, standardize addresses and documentation, eliminate redundancies, and provide third parties (members of the wf's supporting organizations) with a simple, time-saving and cost-effective means of access.

3.2 HEMIS

Within the United Nations Environmental Programme, or UNEP, there is an organization called UNEP/HEM (Harmonization of Environmental Measurement) which is responsible for the harmonisation of environmental monitoring methods, plans, projects and information. Since 1990 a project has been underway at the Munich UNEP/HEM office to implement an information and meta-database system for UNEP/HEM, called HEMIS (= HEM Information System). HEMIS is intended to provide an overview of:

a) Current global and national environmental projects by the UN and other international and world organizations
b) Institutions, research emphases, periodicals, and key personnel
c) Methodology, reference materials, etc.
d) Databases, data formats, data quality, access, etc.

The information contained in HEMIS is meta-data compiled from widely varying sources (Figs. 20 and 24).

HEMIS
United Nations Environmental Programme, Harmonization of Environmental Measurement (UNEP / HEM), Nairobi / Munich

The UNEP / HEM Office harmonizes nomenclature, measurements and other information used worldwide in environmental projects. This task will be supported in the future by the HEMIS meta-database and information system, a multilingual CD-ROM-based PC system.
- Multilingual thesauri for scientific nomenclature, countries, climates, etc. with references, links, synonyms, homonyms, acronyms and wildcard functionality
- Harmonization of nomenclature by standardized access to information
- Hyperlinks, guided tours and global search facilities together with the thesaurus enable easy access to the information independent of the language of entry
- CD-ROM-based worldwide distribution

Fig. 20: HEMIS - Information and meta-database system of the UN environmental organization UNEP/HEM


The goal is to harmonize access to heterogeneous information of varying quality and extent from varying sources. HEMIS is made up of two component systems:

a) One system will be installed in Munich. With it, all information can be collected, processed and managed, and its contents made readily accessible. The system is intended to be able to create reports (printouts) selected from its database and to create CD-ROM databases.
b) The other will handle worldwide distribution of extracts from HEMIS in Munich by CD-ROM in regularly updated editions.

The two component systems will have differing user interfaces, databases, etc. System a) is a production system that will generally be used only by UNEP/HEM employees. System b) is designed to provide information internationally on environmental projects, prevent parallel developments, and supply basic project and database data, even if the information is not available in the user's own language.

1

The HEMIS CD-ROM will be made as attractive as possible so that it is widely used, and so that other institutions not associated with the UN will be motivated to supply data for the system (Fig. 21).

[Figure: Harmonization and Distribution of Information via HEMIS. Examples of sectoral / regional / specialized sources of environmental meta-data (INFOTERRA, EARTHWATCH, UNEP, ESA, EEA-TF, WMO, UN, NGOs, GEMS, IAEA, governments, others) are linked through HEMIS to the users. High-level data model: institutions, programmes, databases, classification systems, methods / models, persons.]

Fig. 21: Information harmonisation and distribution by HEMIS. Data on paper, diskette and CD is read into the stationary HEMIS, selected and formatted, classified semi-automatically or manually following a defined nomenclature (thesauri), and finally distributed in the form of printed reports on specific subjects or on CD-ROM. This figure shows only a representative sample of the participating organizations.

1 At this writing (late 1992) HEMIS is still at the design and prototype stage. Not all components have been implemented as yet.

The major components of both the stationary and the CD-ROM HEMIS systems are a number of electronic thesauri, structured as shown in Figs. 11, 12, and 22.

The Internal Structure of the HEMIS Thesaurus

A  Unique identifier (ID): numeric; one entry; 8 digits; unique reference key for the descriptor database.
B  IDs of predecessors (ISO TT, BT, links): numeric; up to 64 entries; 8 digits each (sequence of digits); internal management; bidirectional.
C  IDs of followers (ISO NT, links): numeric; up to 255 entries; 8 digits each (sequence of digits); internal management; unidirectional.
D  Main descriptor for display in the hierarchy of the thesaurus: alphanumeric; one entry; up to 20 characters (due to display restrictions); retrievable via hierarchical selection list and global search.
E  Position of "D" in the hierarchy: numeric; one entry; up to 8 digits (max. of 8 hierarchy levels); for screen display in the hierarchical thesaurus only.
F  Synonyms, acronyms, homonyms, interpretations, etc. of "D": numeric; up to 255 entries; up to 40 characters each (sequence of texts); retrievable via global search.
G  Explanation: alphanumeric; one entry; up to 255 characters; available as context-sensitive help function.

Fig. 22: Structure of the HEMIS thesaurus for geographical units, climate zones, subject areas, and other hierarchically structured reference keys. For an explanation of the entries in the first row see Section 2.5 and Fig. 12.
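The field limits given in Fig. 22 translate naturally into a record type with a validation routine. The following Python sketch is only one possible reading of the figure: the class name `ThesaurusRecord`, the attribute names and the method `problems` are hypothetical, "8 digits" is interpreted as "at most 8 digits", and the position in the hierarchy is assumed to be stored as a string of digits.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusRecord:
    """One thesaurus entry with the columns A to G of Fig. 22 (names are illustrative)."""
    unique_id: int                                             # A: unique reference key
    descriptor: str                                            # D: main descriptor
    position: str = ""                                         # E: position, digits only
    predecessor_ids: list[int] = field(default_factory=list)   # B: broader terms, links
    follower_ids: list[int] = field(default_factory=list)      # C: narrower terms, links
    alternates: list[str] = field(default_factory=list)        # F: synonyms, acronyms, ...
    explanation: str = ""                                      # G: help text

    def problems(self) -> list[str]:
        """Return a list of violated limits; an empty list means the record is valid."""
        found = []
        ids = [self.unique_id, *self.predecessor_ids, *self.follower_ids]
        if any(not 0 < i < 10**8 for i in ids):
            found.append("IDs must be positive and have at most 8 digits")
        if len(self.predecessor_ids) > 64:
            found.append("at most 64 predecessor IDs")
        if len(self.follower_ids) > 255:
            found.append("at most 255 follower IDs")
        if len(self.descriptor) > 20:
            found.append("descriptor limited to 20 characters")
        if len(self.position) > 8 or (self.position and not self.position.isdigit()):
            found.append("position must be up to 8 digits")
        if len(self.alternates) > 255 or any(len(a) > 40 for a in self.alternates):
            found.append("at most 255 alternates of up to 40 characters each")
        if len(self.explanation) > 255:
            found.append("explanation limited to 255 characters")
        return found
```

A record such as `ThesaurusRecord(1120, "Rivers", "112", [1100], [], ["Streams"])` passes all checks, whereas a descriptor longer than 20 characters is reported, mirroring the display restriction noted in the figure.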

The thesauri and selection lists are part of both HEMIS systems. In the stationary system they are used in making key words for data sets, documents, graphics, images etc., and for searching and compiling data. If information is supplied on computer media in pre-agreed formats, some of the key-word creation process can be done by the system automatically. In the CD-ROM version the thesauri, selection lists and all other entries are used only for researching and compiling information. The HEMIS CD-ROM version has a multi-layer modular structure (see Fig. 23).


[Figure: HEMIS system layout with multilingual user interfaces. Components shown: user interface (e.g. English) with additional user interfaces in different languages; query by example with standard variables (alphanumeric, numeric, etc.); global search; thesauri and selection lists; guided tours; links with language translator; numeric keys related to thesauri and selection lists; database of guided tour links; hyperlinks (part of the stored objects); descriptor database (field-oriented database); information retrieval and access system (IRAS); objects (texts, images, datasets).]

Fig. 23: HEMIS system layout with multilingual user guidance and search. The user interfaces in the various languages make up the first layer. The next layer is composed of modules for different search and navigation strategies, likewise language-specific. In addition to a database, HEMIS has prearranged "guided tours" and "links". The information and documents on the CD-ROM are managed by an Information Retrieval and Access System (IRAS).

In addition to searching for certain key words or terms, HEMIS also offers navigation assistance in the form of prearranged "guided tours" and individual links. A global database search takes a certain amount of time, but it does allow the user to use the system without prior knowledge of what contents lie behind a given field in the search mask. The user interface can be toggled among different loadable languages, as can the thesauri, selection lists, links and guided tours. Free text input and scanned-in documents are not translated. HEMIS is intended to provide the initial information; the user can then consult the source institutions, databases, or publications for more in-depth information. Fig. 24 shows the proposed starting screen of the HEMIS prototype with the button fields for moving to the main subject-area screens.


[Figure: Start screen of the HEMIS prototype (HEMIS Environmental Information System). Buttons: Institutions, Programmes, Databases, Methods, Ref. Material, Guided Tours, Thesaurus (Subject, Location, Region), Help, Exit. Language selection: English ("Choose"), Français ("Choisir"), Deutsch ("Wählen Sie").]

Fig. 24: HEMIS starting screen (suggested CD-ROM version)

4. Outlook and Summary

The development of multilingual information and retrieval systems has only just begun.

Conclusions: Multilingual Information and Retrieval Software - The European Challenge for 1993
- Multilingual software is a must for all companies and organizations working in different European countries
- The American software industry is presently unable to supply multilingual software
- This is a window of opportunity for European software companies
- Multilingual software helps to bridge the national barriers within Europe
- Multilingual software is intelligent object-oriented programming using databases and information management systems as a framework for huge masses of coded and non-coded information

Fig. 25: Summary of the most important arguments for multilingual software


In this article, the following arguments have been advanced (Fig. 25):

a) Multilingual software is a necessity for all organizations with Europe-wide or world-wide activities, for which a single "company language" is undesirable or impracticable.
b) Multilingual software is available in its basic features as standard software, but as a rule it must be modified for the specific application before it can be used to full benefit (compare wfBase, 3.1, and HEMIS, 3.2).
c) Multilingual retrieval software can be used for accessing large quantities of data or documents on digital optical storage media.
d) Multilingual thesauri encourage standardization in document classification, enable clear and structured access to documents, and support searches for documents not in the user's own language.
e) Multilingual fulltext retrieval and translating systems are in use in prototype form. Combined with other techniques, such as thesauri, they will make easy-to-use information systems feasible in the future.
f) Multilingual software is a market opportunity for European software and systems firms.
g) Multilingual retrieval and information systems can be used to advantage in almost all areas of business and administration which extend beyond national and cultural boundaries.
