Master data management with practical examples
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
Alexander Karpov, Solution Architect, SWG IBM EE/A
09.11.2010
http://www.ibm.com/developerworks/ru/library/sabir/nsi/index.html
Abstract

The article provides examples of how insufficient attention to master data management (MDM) leads to inefficient use of information systems, because the results of queries and reports do not fit the task and do not reflect the real situation. The article also describes the difficulties faced by a company that decides to implement a home-grown MDM system, and illustrates them with practical examples of common errors. The benefits of enterprise MDM are stressed, and the basic requirements for an MDM system are formulated.
Basic concepts and terminology

Master data (MD) includes information about customers, employees, products and suppliers of goods, which is typically not transactional in nature. Reference data comprises codifiers, dictionaries, classifiers, identifiers, indices and glossaries [1]. This is the basic level of transactional systems, and in many cases it is supplied by external, specially designated organizations.

A classifier is managed centrally by an external entity, contains the rules of code generation and has a three- or four-level hierarchical structure. A classifier may determine the coding rules, but it does not always contain rules for calculating a check digit or algorithms for code validation. An example of a classifier is the bank identification code (BIC), which is managed by the Bank of Russia, contains no check digit, and has a four-level hierarchical structure: the code of the Russian Federation, the code of the region of the Russian Federation, the identification number of the division of the settlement network of the Bank of Russia, and the identification number of the credit institution. The Russian Classifier of Enterprises and Organizations is managed centrally by the Russian Statistics Committee. In contrast to BIC, it contains a method for calculating the check digit of an enterprise or organization code.

An identifier (e.g., ISBN) is managed by authorized organizations in a non-centralized manner. Unlike classifier codes, identifier codes must follow the rules of check digit calculation. The rules for compiling an identifier are developed centrally and are maintained through standards or other regulatory documents. The main difference from a classifier is that a complete list of identifiers is either unavailable or not needed at the system design phase; the working list is updated with individual codes during system operation.

A dictionary (e.g., the Yellow Pages) is managed by a third party. The numbering code (a telephone number) is not subject to any rules.

A codifier is designed by developers for the internal purposes of a specific database. As a rule, neither checksum calculation algorithms nor coding rules are defined for a codifier. The encoding of the months of the year is a simple example of a codifier.

An index may simply be a numeric value (for example, a tax rate) that is derived from an unstructured document (an order, law or act). A flat tax rate of 13% is an example of an index.

Glossaries contain abbreviations, terms and other string values that are needed during the generation of forms and reports. The presence of these glossaries in the system provides a common terminology for all input and output documents. Glossaries are so close in nature to metadata that it is sometimes difficult to distinguish them.
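As an illustration of the check-digit rules that distinguish an identifier such as ISBN from a simple codifier, here is a minimal sketch of ISBN-10 validation: the weighted sum of the ten characters, with weights 10 down to 1 and "X" standing for 10 in the last position, must be divisible by 11. A codifier such as month numbers needs no such rule.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Validate the ISBN-10 check digit: the weighted sum of the ten
    characters (weights 10..1, 'X' meaning 10 in the last position)
    must be divisible by 11."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), chars):
        if ch == "X" and weight == 1:
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += weight * value
    return total % 11 == 0

print(isbn10_is_valid("0-306-40615-2"))  # True: a well-formed ISBN-10
print(isbn10_is_valid("0-306-40615-3"))  # False: check digit does not match
```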
Reference data (RD) and master data (MD)

In Russian literature there is a long-established concept of "normative-reference information" (reference data), which appeared in the disciplines related to management of the economy back in the pre-computer days [2]. The term "master data" comes from English-language documentation and, unfortunately, has been used as a synonym for reference data. In fact, there is a significant difference between reference data and master data.
Pic. 1. Data, reference data and master data
Pic. 1 illustrates the difference between reference data, master data and transactional data in a simplified form. In a notional e-ticketing system, the codifier of airports plays the role of reference data. This codifier could be created by the developers of the system to meet specific requirements, but the airport code should also be understandable to other international information systems so that they can interact flawlessly. This is achieved by the unique three-letter airport codes assigned to airports by the International Air Transport Association (IATA).

Passengers' data are not as stable as airport codes. At the same time, once introduced into the system, a passenger's data can be reused for various marketing activities, such as discounts when a certain total flight distance is reached. Such information usually belongs to master data. Master data may also include information about the crew, the company's fleet, freight and passenger terminals, and many other entities involved in air transportation but not considered in our simplified example.

The top row in Pic. 1 schematically depicts a transaction related to a ticket sale. There are relatively few airports in the world, there are many more passengers, and passengers can repeatedly use the services of the company, but a ticket cannot and must not be reused. Thus, ticket sales data are the most frequently changing transactional data for an airline.

To sum up, reference data constitutes the base level of automated information systems, while master data stores information about customers, employees, suppliers of products, equipment, materials and other business entities. As reference data and master data have much in common, in cases where the considered factors relate to both, we will refer to them as "RD & MD", for example, "RD & MD management system".
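The layering in Pic. 1 can be illustrated with a few lines of Python; the entities and values below are purely illustrative and assume nothing about a real airline system.

```python
from dataclasses import dataclass
from datetime import date

# Reference data: a small excerpt of the IATA airport codifier
# (stable, managed by an external organization)
AIRPORTS = {"SVO": "Sheremetyevo", "LED": "Pulkovo"}

@dataclass
class Passenger:
    """Master data: long-lived, reused across many transactions."""
    passenger_id: int
    full_name: str
    total_miles: int = 0

@dataclass
class TicketSale:
    """Transactional data: created once and never reused."""
    ticket_no: str
    passenger_id: int
    origin: str        # expected to be a valid code from the reference data
    destination: str
    sale_date: date
```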
Enterprise RD & MD management

The most common and obvious issue of traditional RD & MD management is the lack of support for data that changes over time. An address, as a rule, is one of the most important components of RD & MD. Unfortunately, addresses change. A client can move to another street, but a whole building and even a street can "move" as well. In 2009 the address of the "Tower on the waterfront" group of buildings changed from "18, Krasnopresnenskaya embankment" to "10, Presnenskaya embankment". Thus, the query "How much mail was delivered in 2009 to the office of the company renting premises in the "Tower on the waterfront"?" should correctly handle delivery records for two different addresses.

However, RD & MD management tools (hardware and software) by themselves are not enough to reflect real-world changes in an IT system. Someone or something is needed to track changes. That is, organizational measures are required, for example, qualified staff with responsibilities that correspond to the adopted methodology of RD & MD management. Thus, enterprise RD & MD management includes three categories of activities:

1. Methodological activities that set guidelines, regulations, standards, processes and roles which support the entire life cycle of the RD & MD.
2. Organizational arrangements that determine the organizational structure, the functional units and their tasks, and the roles and duties of employees in accordance with the methodological requirements.
3. Technological measures, which lie at the IT level and ensure the execution of the methodological activities and organizational arrangements.

In this article we primarily discuss technological measures, which include the creation of a unified data model for RD & MD, management and archiving of historical RD & MD, identification of RD & MD objects, elimination of duplicates, identification of conflicts between RD & MD objects, enforcement of referential integrity, support of the RD & MD object life cycle, formulation of cleansing rules, creation of an RD & MD management system, and its integration with enterprise production information systems.
Technological shortcomings of RD & MD management

Let us consider in more detail the technological area of RD & MD infrastructure development and the shortcomings of traditional RD & MD management associated with it.
No unified data model for RD & MD

A unified data model for RD & MD is missing or not formalized, which prevents the efficient use of RD & MD objects and obstructs any automation of data processing. The data model is the basic and most important part of RD & MD management, answering, for example, the following questions (a minimal sketch of such a model follows the list):

• What should be included in the identifying attribute set of an RD & MD object?
• Which attributes of an RD & MD object should be treated as RD & MD and stored in the data model, and which should be treated as operational data and left in the production system?
• How should the model be integrated with external identifiers and classifiers?
• Does a combination of two attributes from different IT systems yield a third unique attribute that is important from a business perspective?
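As a minimal sketch of what such a model has to decide, the illustrative Python structures below separate an identifying attribute set from attributes kept in the master model and from purely operational data. All names are hypothetical, and, as the later examples show, the choice of identifying attributes itself can turn out to be wrong.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class CustomerKey:
    """Identifying attribute set: the combination expected to single out
    one customer across all source systems (a design decision that the
    passport and address examples below show can fail in practice)."""
    full_name: str
    birth_date: str       # ISO date, e.g. "1980-05-17"
    document_number: str  # passport or other identity document

@dataclass
class CustomerMasterRecord:
    key: CustomerKey
    # Attributes kept in the master model because several systems need them
    postal_address: Optional[str] = None
    loyalty_card_no: Optional[str] = None
    # Purely operational attributes (e.g. last login time in one web portal)
    # stay in the owning production system and are not copied here; only the
    # cross-references to source systems are kept.
    source_system_ids: dict = field(default_factory=dict)  # {"CRM": "42", "Billing": "A-17"}
```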
There is no single regulation of history and archive management

Historical information in existing enterprise IT systems is often handled according to each system's own regulations, with its own life cycles and its own responsibilities for processing, aggregating and archiving RD & MD objects. Synchronizing and archiving historical data and bringing them to a common view is a nontrivial task even with a common RD & MD data model. An example of the problems caused by the lack of historical reference data is provided in the section "Law compliance and risk reduction".
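A common way to keep RD & MD history is to store each value with a validity interval and to resolve queries "as of" a date. The sketch below is a simplified illustration; the dates used for the address change are assumed, not taken from the actual example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class AddressVersion:
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"

def address_as_of(history: List[AddressVersion], on: date) -> Optional[str]:
    """Return the address that was in force on the given date."""
    for v in history:
        if v.valid_from <= on and (v.valid_to is None or on < v.valid_to):
            return v.value
    return None

# Hypothetical history for the "Tower on the waterfront" example above
history = [
    AddressVersion("18, Krasnopresnenskaya embankment", date(2005, 1, 1), date(2009, 6, 1)),
    AddressVersion("10, Presnenskaya embankment", date(2009, 6, 1)),
]
print(address_as_of(history, date(2008, 12, 31)))  # old address
print(address_as_of(history, date(2010, 1, 1)))    # new address
```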
The complexity of identifying RD & MD objects

RD & MD objects in various IT systems have their own identifiers: sets of attributes. Together, these attributes can uniquely identify an RD & MD object within an information system, and such a set of attributes can be treated as an analog of a composite primary key in a database. The situation becomes more complicated when it is impossible to allocate a common set of attributes for the same objects in different systems. In this case, the problem of identifying and matching objects across different IT systems changes from deterministic to probabilistic, and high-quality identification of RD & MD objects becomes difficult without specialized data analysis and processing tools.
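A minimal sketch of such probabilistic matching is shown below; the attributes, weights and threshold are illustrative assumptions, and in practice candidate pairs above the threshold would typically be sent to a data steward rather than merged automatically.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1] after basic normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def probably_same_customer(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Weighted score over a few attributes; a pair above the threshold is
    treated as a candidate duplicate for manual confirmation."""
    score = (0.5 * similarity(rec_a["name"], rec_b["name"])
             + 0.3 * similarity(rec_a["address"], rec_b["address"])
             + 0.2 * similarity(rec_a["phone"], rec_b["phone"]))
    return score >= threshold

a = {"name": "Ivanov Ivan", "address": "10, Presnenskaya emb.", "phone": "+7 495 123-45-67"}
b = {"name": "IVANOV IVAN", "address": "10 Presnenskaya embankment", "phone": "+74951234567"}
print(probably_same_customer(a, b))
```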
The emergence of duplicate RD & MD objects

The complexity of object identification leads to the potential emergence of duplicates (or possible duplicates) of the same RD & MD object in different systems, which is the main and most significant problem for business. Duplication of information leads to duplicated object-processing costs, duplicated "entry points", and increased costs of maintaining the objects' life cycles. To this we must add the cost of manual reconciliation of duplicates, which is high from the start, as it often goes beyond the boundaries of IT systems and requires human intervention. It should be stressed that the occurrence of duplicates is a systemic error that appears at the earliest steps of the business processes involving RD & MD objects. At later stages of business process execution the duplicate acquires bindings and attributes, and the situation becomes even more complicated.
Metadata inconsistency of RD & MD

Each information system that supports a line of business generates RD & MD objects specific to that business. Such an IT system defines its own set of business rules and constraints applied both to the composition of attributes (metadata) and to the values of attributes. As a result, the rules and constraints imposed by various information systems conflict with each other, nullifying even theoretical attempts to bring all RD & MD objects to a single view. The situation is exacerbated when, with outwardly matching data models, the data have the same semantic meaning but different presentations: varying spellings, permutations in addresses, shortened names, different character sets, contractions and abbreviations.
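A first step toward a single view is usually a normalization function applied to attribute values before comparison. The sketch below is only an illustration; the abbreviation table and the set of transformations would have to come from the actual cleansing rules of the enterprise.

```python
import re
import unicodedata

# Illustrative abbreviation table; a real one is derived from the cleansing rules
ABBREVIATIONS = {"emb.": "embankment", "str.": "street"}

def normalize(value: str) -> str:
    """Bring one attribute value to a canonical form before comparison:
    unify the character set, case, whitespace and known abbreviations."""
    value = unicodedata.normalize("NFKC", value).lower()
    value = re.sub(r"\s+", " ", value).strip()
    for short, full in ABBREVIATIONS.items():
        value = value.replace(short, full)
    return value

print(normalize("10,  Presnenskaya emb."))  # "10, presnenskaya embankment"
```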
Referential integrity and synchronization of the RD & MD model

In real life, RD & MD objects located in the space of their own IT systems contain not only values but also references to other RD & MD objects, which may be stored and managed in separate external systems. Here the problem of synchronization and integrity maintenance of the enterprise-wide RD & MD model arises in full. One common way of dealing with such problems is to switch to RD & MD that are maintained by, and imported from, sources outside the organization.
Discrepancy of the RD & MD object life cycle

Because the same RD & MD object is present in a variety of enterprise systems, object entry and changes in these systems are inconsistent and are often stretched over time. It is possible that the same object in different systems is in mutually exclusive statuses (active in one system, archived in another, deleted in a third), making it difficult to maintain the integrity of RD & MD objects. Unbound objects "spread" over time are difficult to use both in transactional and in analytical processing.
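One way to keep life cycles consistent is to define the allowed statuses and transitions centrally and reject anything else. The statuses below are those named later in this article; the transition table itself is an illustrative assumption.

```python
from enum import Enum

class Status(Enum):
    CREATED = "created"
    AGREED = "agreed"
    ACTIVE = "active"
    FROZEN = "frozen"
    ARCHIVED = "archived"
    DESTROYED = "destroyed"

# Allowed transitions of an RD & MD object; anything else is rejected,
# so two systems cannot drift into mutually exclusive statuses unnoticed.
TRANSITIONS = {
    Status.CREATED: {Status.AGREED},
    Status.AGREED: {Status.ACTIVE},
    Status.ACTIVE: {Status.FROZEN, Status.ARCHIVED},
    Status.FROZEN: {Status.ACTIVE, Status.ARCHIVED},
    Status.ARCHIVED: {Status.DESTROYED},
    Status.DESTROYED: set(),
}

def can_move(current: Status, new: Status) -> bool:
    return new in TRANSITIONS[current]

print(can_move(Status.ACTIVE, Status.ARCHIVED))   # True
print(can_move(Status.ARCHIVED, Status.ACTIVE))   # False: would desynchronize systems
```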
Development of cleansing rules

RD & MD cleansing rules are often, quite fairly, attributed to methodological aspects. Of course, IT professionals need a problem statement from business users, for example, when the codes of airports should be updated, or which of two payment orders has the correct data encoding. But business specialists are not familiar with the intricacies of the implementation of the IT systems they use. Moreover, the documentation on these systems is often incomplete or missing. Therefore an analysis of the information systems is required in order to clarify existing cleansing rules and to identify new rules where required.
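A cleansing rule clarified in this way can then be formalized. The sketch below checks airport codes against an excerpt of a current IATA codifier; the code list and the record format are illustrative assumptions.

```python
# Hypothetical excerpt of the current IATA codifier loaded from the reference data store
VALID_IATA_CODES = {"SVO", "DME", "VKO", "LED"}

def check_airport_code(record: dict) -> list:
    """One formalized cleansing rule: the airport code must be a known
    three-letter IATA code; otherwise the record is flagged for review."""
    issues = []
    code = record.get("airport_code", "")
    if len(code) != 3 or not code.isalpha():
        issues.append("airport_code is not a three-letter code")
    elif code.upper() not in VALID_IATA_CODES:
        issues.append(f"airport_code {code!r} is not in the current IATA codifier")
    return issues

print(check_airport_code({"airport_code": "SVO"}))   # []
print(check_airport_code({"airport_code": "SV0"}))   # flagged: digit instead of letter
```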
Wrong core system selection for RD & MD management

Most often, the most significant sources and consumers of RD & MD are large legacy enterprise information systems that form the core of the company's business. In real life, such a system is often chosen as the "master system" for RD & MD management instead of creating a specialized RD & MD repository. The fact that this "master" role is alien to the system's initial design is usually ignored. As a result, any revision of such systems related to RD & MD turns into large and unnecessary spending. The situation is exacerbated when qualitatively new features must be introduced along with the development of RD & MD management subsystems: batch data processing, data formatting and cleanup, and the assignment of data stewards.
IT systems are not ready for RD & MD integration

In order to fully implement RD & MD management in existing enterprise IT systems, it is necessary to integrate these systems. More often than not, this integration is needed not as a one-time local event but as a change to the processes living within the IT systems. Integration intended only to support the operational mode is not enough: integration has to be provided for the initial batch data loading (ETL) as well as for the procedures of manual data verification (reconciliation). Not all automated information systems are ready for such changes, and not all systems provide such interfaces; for most of them this is completely new functionality. During system implementation, architectural issues arise that are related to the choice among different approaches to developing the RD & MD management system and integrating it with the technological landscape of the enterprise. To confirm the importance of this point, we note that there are designed and proven architectural patterns and approaches aimed at the proper deployment and integration of RD & MD.
Examples of traditional RD & MD management issues

Thus, the main issues of RD & MD management arise from the decentralization and fragmentation of RD & MD across the company's IT systems, and they manifest themselves in practice in concrete examples.
Passport data as a unique identifier

In a major bank, as a result of creating a customer data model, it was decided to use passport data among the identifying attributes, assuming their maximum selectivity. During the execution of client data merge procedures it was revealed that a customer's passport is not unique. For example, customers who had dealt with the bank first using an old passport and then using a new passport were registered as different clients. Analysis of client records revealed instances where one passport had been reported by thousands of customers. On top of that, one data source was a banking information system in which the passport data were optional and the corresponding fields were stuffed with "garbage" during data entry. It should be noted that the detected problems with customer data quality were not expected and were found only at the data cleansing stage, which required additional time and resources to finalize the data cleansing rules and improve the customer data model.
Address as a unique identifier

In another case, an insurance company performed a merge of customers' personal data in which the address was used as an identifying attribute. It was found that most clients were registered at the address "same" or "ibid." The poor-quality data were supplied by the application system that supports the activities of insurance agents. The system allowed agents to interpret the fields of the client questionnaire freely; moreover, it lacked any logical or format validation of input data.
The need for mass contract renewal

In the third case, when an existing enterprise CRM system was connected to the RD & MD management system, it became clear only at the testing phase that the CRM system could not automatically accept updates from the RD & MD management system. Accepting them required procedural actions, in this case inviting the customer and renewing the paper contract documents that mention critical information related to RD & MD. Because of the large amount of work involved, both the technological and the organizational aspects of RD & MD integration and usage had to be reconsidered.
Divergence of initially consistent data

The fourth example describes a situation typical of many organizations. As a result of the rapid development of the company's business, it was decided to open a new line of business supporting work with clients in the B2C/B2B style over the Internet. To do this, a new IT system automating the new business was acquired. During deployment, integration with the existing enterprise RD & MD was required, and the existing master data had to be extended with attributes specific to the new IT system. The lack of a dedicated RD & MD management system made this task difficult, so RD & MD were loaded into the new system once, without any feedback to the existing enterprise IT landscape. Some time later this led to two independent versions of the client directories. Initially the problem was solved by manual handling of customer data in spreadsheets, but after a while the number of customers increased considerably, the customer directories "diverged", and manual processing proved ineffective and expensive. As a result, the situation led to a serious escalation of the problem to the level of business users, who no longer had an overall picture of their customers for marketing campaigns.
Benefits of corporate RD & MD

Enterprise RD & MD management has the following advantages:

• Law compliance and risk reduction
• Profit increase and customer retention
• Cost reduction
• Increased flexibility to support new business strategies.

It sounds too good to be true, so we consider each of the benefits using practical examples.
Law compliance and risk reduction

Prosecuting authorities demanded that a big company provide data for the past 10 years. The task seemed simple and doable: the company had introduced procedures for regular archiving and backup of data and applications long before, the storage media were kept in a secure room, and the equipment needed to read the media had not yet become obsolete. However, after the historical data were restored from the archives it turned out that they made no practical sense. The RD & MD had changed repeatedly during this time, and it was impossible to determine what the data referred to. Nobody had foreseen RD & MD archiving, because this part of the information seemed stable at the time. Major penalties were imposed on the company, and the managers responsible for these decisions were replaced. In addition, a unit responsible for RD & MD management was established to avoid a repetition of such an unpleasant situation.
Profit increase and customer retention

A large flower shop was one of the first to realize the effectiveness of e-mail marketing. A web site was created where marketing campaigns were run and customers could subscribe to mailings for Valentine's Day, the birth of a first child, the birthday of a loved one, and so on. Subsequently, clients received greetings with proposed flower arrangements. However, the advertising campaigns were conducted with the assistance of various developers who created disparate applications unrelated to each other. As a result, customers could receive up to ten letters for the same occasion, which annoyed them and caused their outflow. Each successive advertising campaign was therefore not only unprofitable but also reduced the number of existing customers. The flower shop had to spend considerable resources to process and integrate the applications. The high cost was related to the heterogeneity of the customer information, with multiple formats of names, addresses and telephone numbers, which made it very hard to identify customers in order to eliminate multiple entries.
Cost reduction

One of the main requirements for a company's products is the need to respond quickly to changes in demand, to launch a new product to the market in a short time, and to communicate with consumers. We see yesterday's undisputed leaders turn into laggards, while newcomers who have just brought their product to market greatly increase their profits and capitalization. Under these conditions, the various corporate information systems responsible for developing the product, its supply and sales, its service and its evolution should be based on a unified information base covering all lines of the company's business. Then the launch of a new product to market requires less time and lower financial costs thanks to seamless interaction between the supporting information systems.
Increased flexibility to support new business strategies

Elimination of the fragmentation and decentralization of RD & MD makes it possible to provide information as a service. This means that any IT system that follows the established communication protocols and access rights can query the enterprise RD & MD management system and obtain the necessary data. A service-oriented approach allows flexible data services to be built in accordance with changing business processes, thus providing a timely response of IT systems and services to changing requirements.
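As a sketch of what "information as a service" means for a consuming system, the snippet below queries a hypothetical MDM service over HTTP; the URL, path and authorization scheme are placeholders, not a reference to any particular product API.

```python
import json
import urllib.request

def get_customer_master_record(customer_id: str) -> dict:
    """Query a hypothetical enterprise MDM service; endpoint and token are placeholders."""
    req = urllib.request.Request(
        f"https://mdm.example.internal/api/customers/{customer_id}",
        headers={"Authorization": "Bearer <access-token>"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# A consuming IT system uses the returned master record instead of its own local copy
record = get_customer_master_record("42")
```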
Architectural principles of RD & MD management

The basic architectural principles of master data management are published in [3]. Let us list them briefly:

• The MDM solution should provide the ability to decouple information from enterprise applications and processes to make it available as a strategic asset of the enterprise.
• The MDM solution should provide the enterprise with an authoritative source for master data that manages information integrity and controls the distribution of master data across the enterprise in a standardized way that enables reuse.
• The MDM solution should provide the flexibility to accommodate changes to master data schema, business requirements and regulations, and support the addition of new master data.
• The MDM solution should be designed with the highest regard to preserving the ownership, integrity and security of the data from the time it is entered into the system until retention of the data is no longer required.
• The MDM solution should be based upon industry-accepted open computing standards to support the use of multiple technologies and techniques for interoperability with external systems and systems within the enterprise.
• The MDM solution should be based upon an architectural framework and reusable services that can leverage existing technologies within the enterprise.
• The MDM solution should provide the ability to be implemented incrementally so that it can demonstrate immediate value.
Based on the practical examples considered above, we can expand this list of architectural principles with additional requirements for an RD & MD management system:

• The master data system must be based on a unified RD & MD model. Without a unified data model it is not possible to create and operate an RD & MD system as a single enterprise source of master data.
• Unified rules and regulations for managing master data history and archiving are needed. The purpose is to provide the ability to work with historical data in order to improve the accuracy of analytical processing, ensure law compliance and reduce risk.
• An MDM solution must be capable of identifying RD & MD objects and eliminating duplicates. Without identification it is impossible to build a unified RD & MD model and to detect duplicates, which cause multiple "entry points" and increase the cost of object processing and of maintaining the objects' life cycles.
• RD & MD metadata must be consistent. A metadata mismatch means that even if a unified RD & MD model can formally be created, it is of low quality, because different objects may in fact be duplicates owing to differing definitions and presentations.
• An MDM solution must support referential integrity and synchronization of RD & MD models. Depending on the solution architecture, the RD & MD model may contain both objects and links, so synchronization and integrity are necessary to support a unified RD & MD model.
• A consistent life cycle of RD & MD objects must be supported. An RD & MD object stored in different IT systems at different stages of its life cycle (e.g., created, agreed, active, frozen, archived, destroyed) essentially destroys the unified RD & MD model. The life cycle of RD & MD objects must be expressed as a set of procedures and methodological and regulatory documents approved by the organization.
• The development of cleansing rules for RD & MD objects and their correction must be supported. This ensures the relevance of the unified RD & MD model, which may otherwise be disrupted by changing business requirements and legislation.
• A specialized RD & MD repository must be created instead of using existing information systems as an RD & MD "master system". The result is flexibility and performance of the RD & MD management system, data security and protection, and improved availability.
• The RD & MD management system must take into account that IT systems may not be ready for RD & MD integration. Systems integration requires a counter-effort: the existing systems should be further developed to meet the requirements of centralized RD & MD.
Conclusion

The practice of creating RD & MD systems discussed in this paper shows that a company that attempts to develop and implement such an enterprise-level system independently faces a number of problems that lead to significant material, labor and time costs. As follows from the case studies, the main RD & MD technological challenges are caused by the decentralization and fragmentation of RD & MD across the enterprise. To address these challenges, requirements for an RD & MD management system have been proposed and formulated. The following articles will discuss tools that can facilitate the creation of an enterprise RD & MD management system, the main stages of implementing an RD & MD management system, and the roles at various phases of the RD & MD life cycle.
Literature

1. Asadullaev S., "Data, metadata and master data: the triple strategy for data warehouse projects", 09.07.2009, http://www.ibm.com/developerworks/ru/library/r-nci/index.html
2. Kolesov A., "Technology of enterprise master data management", PC Week/RE, No. 18(480), 24.05.2005, http://www.pcweek.ru/themes/detail.php?ID=70392
3. Oberhofer M., Dreibelbis A., "An introduction to the Master Data Management Reference Architecture", 24.04.2008, http://www.ibm.com/developerworks/data/library/techarticle/dm0804oberhofer/