Data, metadata and master data: the triple strategy for data warehouse projects


Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
09.07.2009
http://www.ibm.com/developerworks/ru/library/r-nci/index.html

Abstract

The concept of data warehousing emerged in the early 1990s. Since then, many data warehouse implementation projects have been carried out, but not all of them have been completed successfully. One of the most important causes of failure is the problem of a common interpretation of data meaning and of data cleansing, alignment and reconciliation. This article shows that, in order to implement an enterprise data warehouse (EDW), three interrelated projects for data, metadata and master data integration should be performed simultaneously.

Introduction

Large companies have been implementing data warehouses since the mid-1990s. The earlier projects cannot be considered unsuccessful, as they solved the tasks requested of them, in particular, providing the company's management with consistent, reliable information, at least in some areas of the company's business. However, the growth of companies, changes in legislation, and increased needs for strategic analysis and planning require further development of the data warehouse implementation strategy.

By now, companies have understood that a successful data warehouse requires a centralized system for master data and metadata management. Unfortunately, these projects are still performed separately. It is generally assumed that the development of an enterprise data warehouse is a project of integrating data from disparate sources. At the same time, the sources contain not only data but also master data, as well as metadata elements. Typically, large companies start a data warehouse project without allocating funds and resources for metadata and master data management. Project sponsors, the steering committee and other decision makers usually try to allocate funds for the project phase by phase and tend to implement these three projects sequentially. As a result, the projects' budgets, completion periods and quality do not meet the initial requirements, because of the need to change and improve the IT systems built in the previous project.

In most cases, the need for a data warehouse project arises from the demands of business users who are no longer able to bring together data from different information systems. That is, it is precisely the requirements of business users that define, first of all, the information content of the future data warehouse. The data sources for the future DW are transactional databases (OLTP), legacy systems, file storage, intranet sites, archives, and isolated local analytic applications. First, you need to determine where the required data are located. Since, as a rule, these data are stored in different formats, you must bring them to a single format. This task is performed by a fairly complex system of data extraction, transformation and loading (ETL) into the data warehouse. The ETL procedures cannot be accomplished without an accompanying analysis of metadata and master data. Moreover, the practice of data warehouse implementation has shown [1] that the metadata created and imported from various sources in fact drive the entire process of data collection.
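As a minimal sketch of the format-unification step that ETL performs, the following Python fragment extracts customer records from two hypothetical sources (a CSV export and a JSON feed; all file names and field names here are invented for illustration) and transforms them into one common format before loading:

```python
import csv
import json

# Common target format for the warehouse: (customer_id, name, country)

def from_crm_csv(path):
    """Extract from a hypothetical CRM export with columns id;name;country."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=";"):
            yield (int(row["id"]), row["name"].strip(), row["country"].upper())

def from_billing_json(path):
    """Extract from a hypothetical billing feed with differently named fields."""
    with open(path, encoding="utf-8") as f:
        for rec in json.load(f):
            # Transform: rename fields and normalize the country code.
            yield (int(rec["cust_no"]), rec["full_name"].strip(),
                   rec["country_iso"].upper())

def load(rows, target):
    """Load: here we simply append unified rows to an in-memory list;
    a real ETL tool would write them to the central data warehouse."""
    target.extend(rows)

warehouse = []
load(from_crm_csv("crm_customers.csv"), warehouse)
load(from_billing_json("billing_customers.json"), warehouse)
```

Even this toy pipeline shows why metadata analysis cannot be skipped: the mapping from "cust_no" to "id" is exactly the kind of knowledge that must be captured before the transformation can be written.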


Master data management

Reference data and master data include glossaries, dictionaries, classifiers, indices, identifiers, and codifiers. In addition to standard dictionaries, directories and classifiers, each IT system has its own master data required for the system's operation. As long as several information systems operate in isolation from each other, the problems caused by differences in master data usually do not arise. However, if you have to combine the reported data of two or more different systems, the discrepancies in master data make it impossible to merge tables directly. In such cases, a "translator" is required that brings the codes stored in multiple tables to a unified form. In addition, master data, although infrequently, does change, and the consistent update of master data in all information systems is a challenge. Thus, there is a need to establish a master data management system, which helps coordinate master data changes in various information systems and simplifies data integration from these systems.
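A minimal sketch of such a "translator": the crosswalk table below (all system names and codes are hypothetical) maps each system-local code to a unified code, so that records from two systems can be merged on a common key:

```python
# Crosswalk from each system's local country codes to one unified code.
# All codes below are invented for illustration; in practice this table
# is maintained by the master data management system, not hard-coded.
CROSSWALK = {
    ("crm", "RUS"): "RU",
    ("crm", "GER"): "DE",
    ("billing", "643"): "RU",   # the billing system uses numeric codes
    ("billing", "276"): "DE",
}

def unify(system: str, local_code: str) -> str:
    """Translate a system-local code to the unified code, or fail loudly."""
    try:
        return CROSSWALK[(system, local_code)]
    except KeyError:
        raise ValueError(f"No mapping for {local_code!r} in system {system!r}")

# Records from the two systems can now be merged on the unified code:
assert unify("crm", "RUS") == unify("billing", "643") == "RU"
```

Failing loudly on an unknown code, rather than passing it through, is deliberate: an unmapped code is precisely the master data conflict the management system must resolve.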

Metadata management

The prototype for metadata management was the Data Dictionary / Directory System, designed for the logical centralization of information about data resources and intended to serve as a tool for enterprise data resource management [2]. Data sources, including transactional systems, contain metadata in an implicit form. For example, the table names and the column names in the tables are technical metadata, while the definitions of the entities that are stored in the tables represent business metadata. Application statistics gathered by monitoring systems should be classified as operational metadata. Relationships between project roles and database access rights, including administration rights, as well as data for audit and change management, usually belong to the project metadata. Finally, business metadata are the most important piece of metadata; they include business rules, definitions, terminology, a glossary, data origin and processing algorithms.

Many sources contain metadata elements, but almost never a full set of metadata. The result is a mismatch in the reports provided by different source systems. For example, in one report the production volume may be calculated in dollars, in another in pieces, and in a third on a total weight basis. That is, the same field "production volume" may contain quite different data in different reports. Such a mismatch of data meaning in reports forces companies to develop and implement an integrated system of unified indicators, reports, and terminology.
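To make the four metadata categories concrete, here is a minimal sketch (all field names and values are invented for illustration) of how one warehouse table might be described in a metadata repository:

```python
# One metadata repository entry for a hypothetical table; the split into
# business, technical, operational and project metadata follows the text above.
production_volume_metadata = {
    "business": {            # the meaning, in the language of business
        "definition": "Monthly production volume per plant",
        "unit": "pieces",    # fixing the unit removes the report mismatch
        "business_rule": "Counted when goods leave the assembly line",
    },
    "technical": {           # how and where the data are stored
        "table": "FACT_PRODUCTION",
        "columns": ["PLANT_ID", "MONTH_ID", "VOLUME_PCS"],
    },
    "operational": {         # usage statistics from monitoring systems
        "avg_queries_per_day": 42,
        "last_load": "2009-06-30",
    },
    "project": {             # audit and change management
        "metadata_steward": "J. Doe",
        "change_ticket": "DW-1234",
    },
}
```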

Data, metadata and master data interrelations

The DW structure consists of three main information levels: detailed, summary and historical data, as well as their accompanying metadata [3]. Now it is clear that this list should be complemented by master data. The relationships between data, metadata and master data can be visualized as a triangle (Pic. 1). As can be seen from the figure, all the relationships fall into three pairs:

• Data – metadata
• Data – master data
• Metadata – master data

Consider each pair in more detail.


Pic. 1. Data, metadata and master data interrelations

Data and metadata

The interdependence of data and metadata can be shown in the following example. We can assume that any book contains data. The library catalog card is the metadata that describes the book. The collection of cards is the library catalog, which can be treated as a set of data (a database). A card should be filled in according to certain rules, which are specified in a book on librarianship (meta-metadata). This book should also be placed in the library, and its own catalog card must be prepared and placed in the appropriate drawer of the catalog cabinet, where you can also find the rules for using the catalog. The question of whether those rules are meta-meta-metadata we leave to the reader as homework.

If you already hold the book you need in your hands, you do not need a catalog card for it. Most home libraries have no catalog, because the owners know their library: they are its creators and its users, and they act as librarians if someone asks for a book. But a large public library cannot operate without a catalog. The situation in enterprise information systems is not so simple and not so obvious. Despite the fact that the first publications on the need for data dictionary systems appeared in the mid-1980s, corporate resources are still designed, developed and operated in isolation, without a unified semantic space. For libraries, this situation would mean that the reader in one library could not even tell whether the required book exists in another library.

In 1995 an article was published [4] which stated that successful data integration requires establishing and maintaining a metadata flow. In the language of library users this discovery sounds something like: "Libraries need to share information about books in a single format." It is now clear that this requirement needs to be refined, since metadata are generated at all stages of the development and operation of information systems. Business metadata are generated at the initial stage of system design. They include business rules, definitions, terminology, a glossary, and the origin of data and processing algorithms, all described in the language of business. At the next stage, logical design, technical metadata appear, such as the names of entities and the relationships between them. Table names and column names also belong to the technical metadata and are determined at the physical design stage of the system.


Metadata of the production stage are operational metadata, which include statistics on computing resource usage, user activity and application statistics (e.g., frequency of execution, number of records processed, component-wise analysis). Metadata that document the development efforts, provide data for project audit, assign metadata stewards, and support change management belong to the project metadata.

Operational metadata are the most undervalued. Their importance may be demonstrated by the example of a large company that provides customers with a variety of Web services. At the heart of the company's IT infrastructure resides a multi-terabyte data warehouse, around which custom local client databases and applications are built. The marketing department receives clients' orders, the legal department manages contracts, and the architecture team develops and provides documentation for project managers to hand over to outsourced development. When a customer's contract ends, everybody is looking for new clients, and nobody takes the time to inform administrators that the customer's application and database can be deleted. As a result, the overhead for archiving and maintaining unused data and applications keeps growing. In addition, the development of new versions is significantly hampered, since the unused protocols and interfaces must still be supported.

Operational metadata management provides administrators and developers with information about how frequently applications are used. Based on this information, one can identify unused applications, data and interfaces, whose removal from the system will significantly reduce the costs of its maintenance and future upgrades.
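A minimal sketch of that idea, assuming operational metadata are available as per-application usage records (the record layout and the 180-day idle threshold are invented for illustration):

```python
from datetime import date, timedelta

# Hypothetical operational metadata: application name -> last execution date,
# as it might be accumulated by a monitoring system.
last_used = {
    "client_portal_acme": date(2008, 11, 3),
    "billing_reports": date(2009, 7, 1),
    "legacy_export": date(2007, 2, 14),
}

def retirement_candidates(usage: dict, today: date,
                          idle: timedelta = timedelta(days=180)):
    """Return applications not executed within the idle window; these are
    candidates for decommissioning, which cuts maintenance costs."""
    return sorted(app for app, last in usage.items() if today - last > idle)

print(retirement_candidates(last_used, date(2009, 7, 9)))
# -> ['client_portal_acme', 'legacy_export']
```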

Data and master data

In relational databases designed in accordance with the requirements of normalization, one can identify two different types of tables. Some contain, for example, the list of goods and their prices (master data). Other tables contain information on purchases (the data). Without going into the jungle of definitions, this example shows the difference between data and master data: multiple purchases can be made every second in a large store, but the prices and names of goods, at least for now, do not change every second.

Master data in relational databases perform several functions. They help reduce the number of data entry errors and support more compact data storage through the use of short codes instead of long names. In addition, master data are the basis for standardization and normalization of data. On the one hand, the presence of a corresponding nationwide classifier will inevitably affect the structure of the database. On the other hand, the process of bringing data to third normal form, as a rule, leads to internal codifiers. Although ISBN or SIN codes are unique and could serve as primary keys in relational databases, in practice additional local codifiers are often created.
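A minimal sketch of this table split in SQLite (the schema and values are invented for illustration): the goods table is the slowly changing master data, while the purchases table is the fast-growing data that references it by a short internal code:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Master data: changes rarely, one row per product.
    CREATE TABLE goods (
        goods_id INTEGER PRIMARY KEY,   -- internal codifier, not the long name
        name     TEXT NOT NULL,
        price    REAL NOT NULL
    );
    -- Data: grows with every purchase, stores only the short code.
    CREATE TABLE purchases (
        purchase_id INTEGER PRIMARY KEY AUTOINCREMENT,
        goods_id    INTEGER NOT NULL REFERENCES goods(goods_id),
        quantity    INTEGER NOT NULL,
        sold_at     TEXT NOT NULL
    );
""")
con.execute("INSERT INTO goods VALUES (1, 'Paper, A4, 500 sheets', 4.20)")
con.execute("INSERT INTO purchases (goods_id, quantity, sold_at) "
            "VALUES (1, 3, '2009-07-09 10:15')")

# Reports join the data back to the master data to recover readable names.
for row in con.execute("""SELECT g.name, p.quantity, g.price * p.quantity
                          FROM purchases p JOIN goods g USING (goods_id)"""):
    print(row)   # product name, quantity sold, purchase total
```

Storing only goods_id in every purchase row is what gives the compactness and error reduction mentioned above: the long name exists once, in the master data.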

Metadata and master data

There are various definitions and classifications of master data: by source, by management method, by the kind of data classified. For the purposes of this work, we may assume that master data include codifiers, dictionaries, classifiers, identifiers, indices and glossaries (Table 1). A classifier, for example the bank identification code (BIC), is managed centrally by an external organization (the Bank of Russia), which provides the rules for forming the code. In addition, a classifier may determine the rules of code usage. For instance, the reuse of bank identification codes by payment participants is allowed one calendar year after the date of their exclusion from the BIC classifier of Russia, but not before the Bank of Russia draws up the consolidated balance sheet of payments made using avisos for that calendar year. The BIC code does not contain a control number.

A three-stage hierarchical classification is adopted in the All-Russia Classifier of Management Documentation: the class of forms (the first two digits), the subclass of forms (the second two digits), the registration number (the next three digits), and the control number (the last digit). A classifier can contain the rules for control number calculation or code validation algorithms.

Table 1. Master data types

Metadata in classifiers are the rules for calculating the control number, the description of the hierarchical classification, and the usage regulations for identification codes.

An identifier (e.g., ISBN) is managed by authorized organizations in a non-centralized manner. Unlike classifier codes, identifier codes must follow the rules of control number calculation. The rules for composing the identifier are developed centrally and are maintained through the requirements of standards or other regulatory documents. The main difference from a classifier is that a complete list of identifier values either is not available or is not needed at the system design phase; the working list is updated with individual codes during system operation. The difference between identifiers' metadata and classifiers' metadata lies in their different behavior at various stages of the system life cycle. Identifiers' metadata must be defined at the design stage, when the identifiers are not yet filled with individual values. New identifiers may appear during system operation; in some cases they do not match the existing metadata, and the metadata should be revised to eliminate possible misinterpretation of the new identifier values.

A dictionary (e.g., a phone book) is managed by an outside agency. The code numbering (the telephone number) is not subject to any rules. A dictionary's metadata are less structured, as is the phone book itself. However, they are also necessary. For example, if the organization provides several different communication methods (work phone, home phone, mobile phone, e-mail, instant messaging tools, etc.), the system administrator can describe the rules for sending a message in case of a system failure.

A codifier is designed by developers for the internal purposes of a specific database. As a rule, neither checksum calculation algorithms nor coding rules are designed for a codifier. The encoding of the months of the year is a simple example of a codifier. Despite the absence of external rules, the encoding is carried out in accordance with the designer's concept and often contains rules (metadata) in an implicit form. For example, for payments over the New Year the month of January can be entered into the codifier again as a 13th month.

An index may be just a numeric value (for example, a tax rate) that is derived from an unstructured document (an order, law, or act). It would be unreasonable to include the numeric value directly in an algorithm, since changing it would require finding all its occurrences in the program text and replacing the old value with the new one. Therefore indices, isolated in separate tables, are an important part of the master data. The metadata of indices define the scope of their application, time limits and restrictions.

Glossaries contain abbreviations, terms and other string values that are needed during the generation of forms and reports. The presence of these glossaries in the system provides a common terminology for all input and output documents. Glossaries are so close in nature to metadata that it is sometimes difficult to distinguish them.

Thus, master data always contain business metadata and technical metadata. Most of the technical and business metadata are created during the understanding phase of the metadata management life cycle [5]. Project metadata arise during the development phase and, to a lesser extent, during the operation phase (e.g., assigning metadata stewards). Operational metadata are created and accumulated during the operation of the system.
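Returning to identifiers for a moment: the control number rule for ISBN-10 is public and fixed by the standard, so a system can validate new codes at entry time even though the full list of identifier values is never known at design time. A minimal sketch:

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Validate an ISBN-10 control number: the weighted sum
    1*d1 + 2*d2 + ... + 10*d10 must be divisible by 11 ('X' stands for 10).
    This rule is identifier metadata, fixed at design time, while the
    identifier values themselves appear only during system operation."""
    digits = isbn.replace("-", "").upper()
    if len(digits) != 10:
        return False
    total = 0
    for position, char in enumerate(digits, start=1):
        if char == "X" and position == 10:
            value = 10          # 'X' is allowed only as the check character
        elif char.isdigit():
            value = int(char)
        else:
            return False
        total += position * value
    return total % 11 == 0

print(isbn10_is_valid("0-306-40615-2"))  # a well-formed ISBN-10 -> True
```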

Components of the Enterprise Data Warehouse

An enterprise data warehouse (EDW) transforms the data, master data and metadata from disparate sources and makes them available to users of analytical systems as a single version of the truth. Data sources are usually described as transactional databases, legacy systems and various file formats, as well as other sources whose information must be provided to end users (Pic. 2). The components of the enterprise data warehouse are:

1. ETL tools used to extract, transform and load data into a central data warehouse (CDW);
2. The central data warehouse, designed and optimized for reliable and secure data storage;
3. Data marts, which provide efficient user access to data stored in structures that are optimal for specific users' tasks.

The central repository includes, above all, three repositories:

1. A master data repository;
2. A data repository;
3. A metadata repository.

The scheme does not include an operational data store, the staging area, the data delivery and access tools, business applications and other components of the EDW that are not relevant at this level of detail.


Pic. 2. Components of Enterprise Data Warehouse

After several unsuccessful attempts to create a virtual data warehouse, the need for a data repository became unquestionable. In a virtual DW architecture, a client program receives data directly from the sources, transforming them on the fly. The simplicity of this architecture is paid for by the time spent waiting for query execution and data transformation. The query result is not saved, and the next identical or similar request requires re-conversion of the data, which led to the abandonment of virtual DWs and to the creation of data repositories.

The present situation with metadata and master data resembles that of virtual DWs. Metadata and master data are used intensively during data extraction, transformation and loading. The cleansed data are saved in the data warehouse; the metadata and master data, however, are discarded as waste material. Creating a repository of metadata and master data significantly reduces EDW implementation costs and improves the quality of information support for business users by reusing consistent metadata and master data from a single source.

Example of the existing approach

An illustration of the existing approaches to data integration is presented in the paper [6] on a master data implementation in a bank. The bank spent more than six months reengineering its planning and forecasting process for performance management. The bank's vice-president explained that the success of the master data management initiative was due to the fact that the team focused on solving a local problem, avoiding the "big bang", by which he meant the creation of an enterprise data warehouse. In his view, the creation of an enterprise master data management system is a long, difficult and risky job. As the next step, it is planned to create a bank reporting system based on the integration of core banking systems, in order to use more detailed data that are compatible with the general ledger. This will create a financial data repository, which should become the main source for all financial reporting systems and will support drill-down analysis.



A careful reading of the article leads to the following conclusions. First of all, this project did not provide for the integration of enterprise data and covered only the reengineering of the planning and forecasting process. The created data repository appears to be a narrowly thematic data mart and is not capable of supporting common analytical techniques, such as drill-down analysis. In contrast, an enterprise data warehouse provides consistent enterprise data to a wide range of analytical applications. In practice, a single version of data can be provided only by an enterprise data warehouse that works in conjunction with enterprise master data and metadata management systems. The article [6] describes how to create a single version of the truth for metadata only in the financial reporting area. Thus, the project team implemented a metadata and master data management system for one specific area of activity. The team deliberately avoided enterprise-wide solutions: neither an enterprise data warehouse nor a metadata or master data management system was implemented. Statements that an enterprise data warehouse cannot be implemented in practice are refuted by projects performed by IBM employees on a regular basis.

This project is a typical "fast win"; its main objective is to demonstrate a quick little success. At this stage, no one thinks about the price of application redesign, redevelopment, and integration into the enterprise infrastructure. Unfortunately, we increasingly have to address the effects of the activity of "quick winners" who avoid complicated, lengthy, and therefore risky decisions. It should be clarified that small successful projects are quite useful as pilots at the beginning of a global project, geared to demonstrating a workable solution in a production IT environment. In this situation all the pros and cons should be weighed. In the worst case, all the results of the pilot project may be rejected due to incompatibility with the enterprise architecture of information systems. In EDW development the compatibility problem is particularly acute because of the need to coordinate not only the interfaces and data formats, but also the accompanying metadata and master data.

The practical realization of the triple strategy

The data warehouse, as corporate memory, should deliver unified, consistent information, but usually does not, due to conflicting master data and the lack of a common understanding of the data's meaning. Known solutions treat metadata and master data analysis as part of the data integration project, without establishing metadata and master data management systems. Implementations of metadata and master data management systems are usually regarded as separate projects, performed after the data warehouse implementation (Pic. 3). The drawbacks of such solutions are the insufficient quality of the information delivered to data warehouse end users, due to the lack of consistent metadata and master data management, and the extra expenditure on data warehouse redesign needed to align the existing data integration processes with the requirements of the new metadata and/or master data management systems. The result is the inefficiency of these three systems, the coexistence of modules with similar functionality, waste through duplicated functionality, a rising development budget, a high total cost of ownership, and user frustration due to discrepancies in data, metadata and master data.



Pic. 3. One of the existing workflows

The master data, metadata and data integration projects, performed sequentially in any order, cannot provide the business with the required quality of information. The only way to solve this problem is the parallel execution of three projects: metadata integration, master data integration and data integration (Pic. 4).

1. Enterprise metadata integration establishes a common understanding of the meaning of data and master data.
2. Master data integration eliminates conflicts in data and metadata coding across the various information systems.
3. Data integration provides end users with data as a single version of the truth, based on consistent metadata and master data.

The coordinated execution of these three projects delivers a corporate data warehouse with improved quality at lower cost and in less time. The proposed approach increases the quality of the information delivered from the data warehouse to business users, and consequently provides better support for decisions based on that information. The three integration projects (for data, metadata and master data), performed in parallel, allow a coordinated architecture design, a consistent environment, coherent life cycles and interrelated core capabilities for the data warehouse, the metadata management system, and the master data management system.



Pic. 4. Workflow according to the triple strategy

In practice there are many ways, methods, and approaches which assure the success of the parallel, coordinated execution of the three projects of data, metadata and master data integration:

1. Arrange the data warehouse, metadata integration and master data integration projects as a program.
2. Adopt the Guide to the Project Management Body of Knowledge as the world-wide recognized project management standard.
3. Select the spiral development life cycle.
4. Gather functional and non-functional requirements to select suitable core capabilities for the data warehouse, for metadata, and for master data.
5. Select an environment:
   a. for the data warehouse: data sources, ETL, data repository, staging area, operational data store, application data mart, departmental and regional data marts, and analytical, reporting and other applications;
   b. for metadata: a Managed Metadata Environment with six layers: sourcing, integration, data repository, management, metadata marts, and delivery;
   c. for master data: upstream, MDM core, downstream.
6. Select an architecture design:
   a. a centralized data warehouse architecture;
   b. a centralized metadata architecture;
   c. a centralized master data repository.
7. Select life cycles:
   a. a life cycle for data, for example: understand, extract, transform, load, consolidate, archive, deliver;



   b. a life cycle for metadata: development, publishing, ownership, consuming, metadata management;
   c. a life cycle for master data: identify, create, review, publish, update, and retire.
8. Define project roles and responsibilities, and assign team members to specific roles.
9. Select the tools for each team member.

The technical feature which is absolutely required for the strategy implementation is, first of all, the coordination of these three projects. In general, this is the subject matter of program management. The specific details (who, what, when, where, how, why) of inter-project communication depend on the project environment described above.

Conclusion

At the moment, IBM is the only company that offers an almost complete product set for the triple strategy implementation: ETL tools for data extraction from heterogeneous data sources, metadata glossary tools, data architecture instruments, master data management tools, sophisticated tools for designing the BI environment, industrial data models, and middleware allowing these components to be integrated into a unified environment for information delivery to business users. The idea of the triple strategy could have arisen 10 or 15 years ago; its practical implementation was impossible at that time due to the huge costs of developing the required tools, which are available now. Ready-made software tools for data, metadata and master data integration support the triple strategy and together can mitigate project risks, reduce data warehouse development time and provide companies with new capabilities to improve corporate performance.

The author thanks M. Barinstein, R. Ivanov, D. Makoed, A. Karpov, A. Spirin, and O. Tretyak for helpful discussions.

Literature

1. Asadullaev S. "Vendors' data warehouse architectures", PC Week / RE, 1998, № 32-33, p. 156-157.
2. Leong-Hong B.W., Plagman B.K. "Data Dictionary / Directory Systems", John Wiley & Sons, 1982.
3. Inmon W.H., Zachman J.A., Geiger J.G. "Data Stores, Data Warehousing and the Zachman Framework", McGraw-Hill, 1997.
4. Hackathorn R. "Data Warehousing Energizes Your Enterprise", Datamation, Feb. 1, 1995, p. 39.
5. Asadullaev S. "Metadata management using IBM Information Server", 2008, http://www.ibm.com/developerworks/ru/library/sabir/meta/index.html
6. Financial Service Technology. "Mastering financial systems success", 2009, http://www.usfst.com/article/Issue-2/Business-Process/Mastering-financial-systems-success/

