Data quality management using IBM Information Server


Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
08.12.2010
http://www.ibm.com/developerworks/ru/library/sabir/inf_s/index.html

Abstract
Data integration projects often fail to provide users with data of the required quality. The reasons are the lack of established rules and processes for improving data quality, a poor choice of software, and insufficient attention to how the work is organized. The belief that data quality can be improved after the project is completed is also widespread. The aim of this study is to define the required process of data quality assurance, to identify the necessary roles and qualifications, and to analyze the tools for interaction between the participants of a data quality improvement project.

Introduction
Data quality has a crucial impact on the correctness of decision making. Inaccurate geological data can lead to the collapse of high-rise buildings; low-quality oil and gas exploration data cause significant losses due to incorrectly assessed effects of well drilling; incomplete data on a bank's customers is a source of errors and losses. Other examples of serious consequences of inadequate data quality are published in [1].

Despite an apparent agreement on the need to improve data quality, the intangibility of data quality as an end product raises doubts about the advisability of spending on this work. Typically a customer, especially one from financial management, asks what profit the organization will gain on completion of the work and how the result can be measured. Some researchers identify up to 200 data quality characteristics [2], so the absence of precise quality criteria also prevents the deployment of data quality improvement efforts.

An analogy with water supply may clarify the situation. Every end user knows that he needs water suitable for drinking. He does not necessarily understand the chemical, organoleptic, epidemiological and other requirements for water. Similarly, an end user does not have to understand which technology, engineering constructions and equipment are needed for water purification. The goal is to take water from designated sources, to treat it in accordance with the requirements and to deliver it to consumers.

To sum up, achieving the required data quality calls for an adequate infrastructure and properly arranged procedures. In other words, the customer does not receive "high-quality data in a box" (the equivalent of a bottle of drinking water), but the processes, tools and methods for their preparation (the equivalent of a town water supply). The aim of this study is to define the process of data quality improvement, to identify the needed roles and qualifications, and to analyze the tools for data quality improvement.

Metadata and project success
I first faced the metadata problem in an explicit form in 1988, when I was the software development manager for one of the largest petrochemical companies. In a simplified form, the task was to enter a large amount of raw data manually, to apply complicated and convoluted algorithms to process the input data, and to present the results on screen and on paper. The complexity of the task and the large amount of work required that various parts of the task be performed by several parallel workgroups of customers and developers.

The project ran fast, and customer representatives regularly received working prototypes of the future system. Discussion of the prototypes took the form of lengthy debates on the correctness of calculations, because no one could substantiate their doubts or identify the cause of the experts' rejection of the results. That is, the results did not correspond to the customers' intuitive understanding of the expected values. We therefore reviewed the data and the code for consistency between the input and output forms and the data processing algorithms.

Imagine our surprise when we discovered that the same data had different names in the input and output forms. This "discovery" compelled us to change the architecture of the system under development (first to carry all the names out into a separate file, and later into a dedicated database table) and to reexamine the developed data processing algorithms. Indeed, different names for the same data led to different understandings of their meaning and to different algorithms for processing them. The applied corrections made it possible to substantiate the correctness of the calculations and to simplify the support of the developed system: to change an indicator's name, the customer had to change one text field in one table, and the change was reflected in all forms.

This was the first, but not the last, time that metadata had a critical influence on project success. My further practice of data warehouse development reaffirmed the importance of metadata more than once.

Metadata and master data paradox
The need to maintain metadata was stressed in the earliest publications on data warehouse architecture [3]. At the same time, master data management as a part of the DW development process was not considered until recently. Paradoxically, master data management was quite satisfactory, while metadata management was simply ignored. Perhaps the paradox can be explained by the fact that a DW is usually implemented on relational databases, where the third normal form automatically leads to the need for master data management. The lack of off-the-shelf metadata management products on the software market also meant that companies experienced difficulties in implementing enterprise metadata management. Metadata are still out of the focus of developers and customers, and ignoring them is often the cause of DW project delays, cost overruns, and even project failure.

Metadata impact on data quality
Many years ago, reading Dijkstra, I found his statement: "I know one -very successful- software firm in which it is a rule of the house that for a one year project coding is not allowed to start before the ninth month! In this organization they know that the eventual code is no more than the deposit of your understanding." [4]. At that moment I could not understand what one could do for eight months without programming, without demonstrating working prototypes to the customer, without improving the existing code based on discussions with the customer. Now, hopefully, I can assume what the developers were occupied with for those eight months. I believe that the understanding of a solution is best formalized through metadata: a data model, a glossary of technical terms, source descriptions, data processing algorithms, an application launch schedule, identification of responsible personnel, access requirements... All this, and much more, is metadata.
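To make this concrete, the following is a minimal sketch of how such metadata might be recorded for a single business term. The GlossaryEntry structure and all field names and values are illustrative assumptions made for this article, not an IBM Information Server API.

from dataclasses import dataclass
from typing import List

@dataclass
class GlossaryEntry:
    """Illustrative record tying a business term to its technical metadata."""
    business_term: str          # name used in the business glossary
    definition: str             # agreed business definition
    source_systems: List[str]   # where the raw data comes from
    target_column: str          # table.column that stores the value
    algorithm: str              # plain-language description of the calculation
    owner: str                  # person responsible for the definition

# Example: one term described once, so every form and report refers to the same meaning.
profit = GlossaryEntry(
    business_term="Profit",
    definition="Revenue minus cost of goods sold and operating expenses",
    source_systems=["GeneralLedger"],
    target_column="FINANCE.MONTHLY_RESULTS.PROFIT",
    algorithm="SUM(revenue) - SUM(cogs) - SUM(opex) per calendar month",
    owner="finance.data.steward@example.com",
)
print(profit.business_term, "->", profit.target_column)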



In my opinion, one of the best definitions of a specification of a system under development is given in [5]: "A specification is a statement of how a system - a set of planned responses - will react to events in the world immediately outside its borders". This definition shows how closely related metadata and a system specification are. In turn, there are close links between metadata, data, and master data [6]. This gives a reason to believe that the better the metadata are worked out, the higher the quality of the system specification is and, under certain circumstances, the higher the data quality is.

Data quality and project stages
Data quality must be ensured at all stages of an information system's life: problem statement, design, implementation and operation.

The problem statement is eventually expressed in formulated business rules, adopted definitions, industry terminology, a glossary, identification of data origins, and data processing algorithms described in business language. This is business metadata. Thus, the problem statement is a definition of business metadata. The better the problem statement and the definition of business metadata are performed, the better the data quality that the designed IT system must provide.

IT system development is associated with naming entities (such as tables and columns in a database), identifying the links between them, and programming data processing algorithms in accordance with the business rules. Thus the following statements are equally true:
1. Technical metadata appear at the development phase;
2. Development of the system is the definition of technical metadata.
Documenting the design process establishes each team member's personal responsibility for the results of his work, which leads to improved data quality due to the existence of project metadata.

Deviations from established regulations may happen during system operation. Operational metadata, such as user activity logs, computing resource usage and application statistics (e.g., execution frequency, record counts, component analysis), allow not only identifying and preventing incidents that lead to data quality deterioration, but also improving the quality of user service through optimal utilization of resources.
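As an illustration of how operational metadata such as application statistics could be used to catch quality deterioration, here is a minimal sketch that flags runs whose record count deviates sharply from the historical median. The job names, the figures and the 50% tolerance are assumptions made for the example only.

import datetime
import statistics

# Operational metadata: one entry per application run (illustrative structure).
run_log = [
    {"job": "load_customers", "run_at": datetime.datetime(2010, 12, 1), "records": 10250},
    {"job": "load_customers", "run_at": datetime.datetime(2010, 12, 2), "records": 10310},
    {"job": "load_customers", "run_at": datetime.datetime(2010, 12, 3), "records": 1020},  # suspicious drop
]

def flag_anomalous_runs(entries, tolerance=0.5):
    """Flag runs whose record count deviates from the median by more than `tolerance`."""
    median = statistics.median(e["records"] for e in entries)
    return [e for e in entries if abs(e["records"] - median) > tolerance * median]

for run in flag_anomalous_runs(run_log):
    print(f"Check {run['job']} at {run['run_at']:%Y-%m-%d}: only {run['records']} records loaded")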

Quality management in the metadata life cycle
The extended metadata life cycle [7] consists of the following phases: analysis and understanding, modeling, development, transformation, publishing, consuming, reporting and auditing, management, quality management, and ownership (Pic. 1). The quality management stage addresses the lineage of heterogeneous data in data integration processes, the quality improvement of information assets and the monitoring of input data quality, and allows data structure and processability issues to be eliminated before they affect the project.
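A minimal sketch of the data lineage idea mentioned above: recording which assets each target is derived from and tracing a warehouse column back to its original sources. The asset names and the dictionary-based graph are illustrative assumptions, not the lineage model of Metadata Workbench.

# Lineage edges: target asset -> assets it was derived from (illustrative).
lineage = {
    "DWH.FACT_SALES.PROFIT": ["STAGE.SALES.REVENUE", "STAGE.SALES.COST"],
    "STAGE.SALES.REVENUE": ["CRM.ORDERS.AMOUNT"],
    "STAGE.SALES.COST": ["ERP.INVOICES.COST"],
}

def trace_sources(asset: str, graph: dict) -> set:
    """Walk the lineage graph back to the original source assets."""
    parents = graph.get(asset, [])
    if not parents:
        return {asset}            # no parents recorded: treat as an original source
    sources = set()
    for parent in parents:
        sources |= trace_sources(parent, graph)
    return sources

print(trace_sources("DWH.FACT_SALES.PROFIT", lineage))
# -> {'CRM.ORDERS.AMOUNT', 'ERP.INVOICES.COST'}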



Pic. 1. Extended metadata management life cycle

Data flows and quality assurance
At first glance, the role of the quality management stage is unremarkable. However, if we use the role descriptions [7, 8] and draw Table 1, which shows the task description for each role at each stage of the metadata management life cycle, it becomes evident that all project tasks can be divided into two streams. The first stream, directed along the table's diagonal, contains the activities aimed at creating the functionality of the metadata management system. The second stream consists of tasks to improve data quality. It should be noted that all project participants contribute to data quality improvement if the project team is selected properly. Let us consider the flow of data quality improvement tasks in more detail.

In practice, four indicators of data quality are usually discussed: comprehensiveness, accuracy, consistency and relevance [9].

Comprehensiveness implies that all required data are collected and presented. For example, a client address may omit a supplementary house number, or a patient's medical history may miss one record of a disease.

Accuracy of data indicates that the presented values (e.g., a passport number, a loan period, or a date of departure) do not contain errors.

Consistency is closely related to metadata and influences data understanding. Examples are dates in different formats, or a term such as "profit", which is calculated differently in different countries.

Relevance of data is associated with timely data updates. A client can change his name or get a new passport; a well's flow rate might change over time. In the absence of timely updates the data may be complete, accurate and consistent, but out of date.
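The following is a minimal sketch of how these four indicators could be checked for a single customer record. The required fields, the passport number pattern and the 90-day freshness threshold are assumptions made for the illustration, not rules from the article or from IBM Information Server.

import re
from datetime import date, timedelta

REQUIRED_FIELDS = ["name", "address", "passport_no", "updated_on"]  # comprehensiveness
PASSPORT_RE = re.compile(r"^\d{4} \d{6}$")                          # accuracy: assumed format
MAX_AGE = timedelta(days=90)                                        # relevance: assumed threshold

def check_record(rec: dict) -> list:
    """Return a list of data quality issues found in one customer record."""
    issues = []
    # Comprehensiveness: every required field is present and non-empty.
    issues += [f"missing field: {f}" for f in REQUIRED_FIELDS if not rec.get(f)]
    # Accuracy: the value conforms to the expected pattern.
    if rec.get("passport_no") and not PASSPORT_RE.match(rec["passport_no"]):
        issues.append("passport_no has an unexpected format")
    # Consistency: dates are stored as date objects, not free-form strings.
    updated = rec.get("updated_on")
    if updated and not isinstance(updated, date):
        issues.append("updated_on is not a proper date")
    # Relevance: the record has been refreshed recently enough.
    elif updated and date.today() - updated > MAX_AGE:
        issues.append("record is out of date")
    return issues

print(check_record({"name": "Ivanov", "address": "", "passport_no": "45 0612345",
                    "updated_on": date(2010, 1, 15)}))
# -> ['missing field: address', 'passport_no has an unexpected format', 'record is out of date']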



Changes of requirements, which are inevitably associated with IT system development, can, like any changes, lead to a result opposite to the desired one:

• Comprehensiveness of data may suffer from an inaccurate problem statement.

• Accuracy of data can be reduced as a result of an increased load on the employee responsible for manual data entry.

• Consistency can be impaired by the integration of a new system with a different understanding of the data (metadata).

• Relevance of data can be compromised by the inability to update data in a timely manner due to insufficient throughput of the IT system.

Therefore, IT professionals responsible for change management (for example, the project manager) should analyze the impact of changes on the IT environment.

Discrepancies between the glossary and database columns lead to a violation of data consistency, which is essentially a metadata contradiction. Since identifying these conflicts requires an understanding of both the subject area and IT technologies, at this step it is necessary to involve a business analyst, who should reach complete visibility of the actual state of the data. The revealed discrepancies may require updates of the business classification, which must be performed by a subject matter expert. Consistency as a data quality indicator requires the elimination of discrepancies in metadata; this work should be performed by a data analyst.

Enterprise data used in the company's business are its most important information assets, or data resources. The quality of these resources has a direct impact on business performance and is a concern of, among others, IT developers, who can use the design tools for managing and understanding information assets that are created and made available through IBM Information Server. In this way an IT developer contributes to data comprehensiveness, accuracy, consistency and relevance. Business users have the instrumental ability to track data lineage, which makes it possible to identify missing data and thus to ensure comprehensiveness. Stewards maintain data consistency by managing metadata so that all users and project participants share a common understanding of data meaning, and they monitor the comprehensiveness, accuracy and relevance of the data.
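The cross-check between the glossary and the database columns can be pictured with a minimal sketch such as the one below. In IBM Information Server this work is supported by Business Glossary and Metadata Workbench; here only the idea is shown, and the naming convention that maps terms to column names is an assumption.

# Glossary terms agreed with the business (illustrative).
glossary_terms = {"customer name", "passport number", "loan period", "departure date"}

# Columns actually present in the database (illustrative).
db_columns = {"CUSTOMER_NAME", "PASSPORT_NO", "LOAN_PERIOD", "DEPARTURE_DATE"}

def to_column_name(term: str) -> str:
    """Assumed naming convention: 'passport number' -> 'PASSPORT_NUMBER'."""
    return term.upper().replace(" ", "_")

expected = {to_column_name(t) for t in glossary_terms}

# Discrepancies in both directions are candidate metadata contradictions.
print("Terms without a matching column:", expected - db_columns)
print("Columns without a glossary term:", db_columns - expected)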



Table 2. Data flows and quality assurance



Roles, interactions and quality management tools
Picture 2 shows the interaction pattern between the roles and the tools they use [8]. The tasks related to data quality improvement discussed in the previous section are highlighted. Groups of tasks related to one role are enclosed in a dotted rectangle. Interactions between the roles are assigned to the workflow, whose direction is marked by arcs with arrows. Let us consider in more detail the tools and the tasks performed by each role.

The project manager, who is responsible for the change management process, analyzes the impact of changes on the IT environment with the help of Metadata Workbench.

The business analyst reveals contradictions between the glossary and database columns and notifies the metadata authors using the functionality of Business Glossary and FastTrack. The data analysis tools built into QualityStage help the business analyst reach full visibility of the actual state of the data.

The subject matter expert (the metadata author) uses Business Glossary to update the business classification (taxonomy), which supports a hierarchical structure of terms. A term is a word or phrase that can be used to classify and group objects in the metadata repository. If joint work of experts is necessary, Business Glossary provides subject matter experts with collaboration tools for annotating data definitions, editing descriptions and categorizing them.

Using Business Glossary and Rational Data Architect, the data analyst eliminates the conflicts between the glossary and the tables and columns in databases that were identified by the business analyst.

Metadata Workbench provides the IT developer with tools for metadata review, analysis, design and enrichment, and allows him to manage and understand the information assets that were created and are available through IBM Information Server.

Business users, who are responsible for compliance with legislative requirements, are able to trace data lineage using the appropriate Metadata Workbench tools. A common understanding of data meaning by all users and project participants is supported by stewards with the help of Information Analyzer.

Necessary and sufficient tools
As follows from the analysis, the IBM Information Server product family provides all participants with the necessary tools to ensure data quality. Information is extracted from a source system and then evaluated, cleansed, enriched, consolidated and loaded into the target system. Data quality improvement is carried out in four stages:
1. The research stage is performed in order to fully understand the information.
2. The standardization stage reformats data from different systems and converts them to the required content and format.
3. The matching stage ensures data consistency by linking records from one or more data sources that relate to the same entity. This stage is performed in order to create semantic keys for identifying information relationships.
4. The survival stage ensures that the best available data survive and that the data are prepared correctly for transfer to the target system. This stage is required to obtain the best representation of interrelated information.
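Below is a minimal sketch of the standardization, matching and survival ideas applied to two duplicate customer records. The matching rule (equal standardized name and birth date) and the survival rule (keep the most recent non-empty value for each field) are deliberately simplified assumptions and do not reproduce QualityStage's actual algorithms.

from datetime import date

records = [
    {"name": " Ivanov, Ivan ", "birth": date(1975, 3, 2), "phone": "",
     "address": "12 Lenina St", "updated": date(2010, 5, 1)},
    {"name": "IVAN IVANOV",    "birth": date(1975, 3, 2), "phone": "+7 495 123-45-67",
     "address": "",            "updated": date(2010, 11, 20)},
]

def standardize(rec):
    """Standardization: bring names to one format so records become comparable."""
    parts = rec["name"].replace(",", " ").split()
    rec["name_std"] = " ".join(sorted(p.upper() for p in parts))
    return rec

def match_key(rec):
    """Matching: simplified semantic key built from the standardized name and birth date."""
    return (rec["name_std"], rec["birth"])

def survive(group):
    """Survival: for each field keep the most recent non-empty value."""
    best = {}
    for rec in sorted(group, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if value not in ("", None):
                best[field] = value
    return best

groups = {}
for rec in map(standardize, records):
    groups.setdefault(match_key(rec), []).append(rec)

for group in groups.values():
    print(survive(group))  # one consolidated record per matched group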



Pic. 2. Roles, interactions and quality management tools


Thus, the IBM Information Server family is a necessary, but not always sufficient, tool for ensuring data quality, since in some cases additional instruments are needed for master data quality assurance. The issues of master data quality assurance will be discussed in future articles.

Conclusion
Data quality assurance is a complex process which requires the involvement of all project participants. The impact of metadata on data quality is extremely high, so it is important to ensure quality management within the metadata life cycle. The analysis showed that, when used properly, the IBM Information Server family creates a workflow that ensures data quality. IBM Information Server's tools provide each employee involved in a data integration project with quality management instruments and ensure effective interaction of the project team.

Literature
1. Redman T.C. "Data: An Unfolding Quality Disaster". Information Management Magazine, August 2004. http://www.information-management.com/issues/20040801/1007211-1.html
2. Wang R., Kon H., Madnick S. "Data Quality Requirements Analysis and Modeling". Ninth International Conference on Data Engineering, 1993, Vienna, Austria.
3. Hackathorn R. "Data Warehousing Energizes Your Enterprise". Datamation, Feb. 1, 1995, p. 39.
4. Dijkstra E.W. "Why is software so expensive?". In "Selected Writings on Computing: A Personal Perspective", Springer-Verlag, 1982, pp. 338-348.
5. DeMarco T. "The Deadline: A Novel About Project Management". Dorset House Publishing Company, Incorporated, 1997.
6. Asadullaev S. "Data, metadata and master data: the triple strategy for data warehouse projects", 09.07.2009. http://www.ibm.com/developerworks/ru/library/r-nci/index.html
7. Asadullaev S. "Metadata Management Using IBM Information Server", 30.09.2008. http://www.ibm.com/developerworks/ru/library/sabir/meta/index.html
8. Asadullaev S. "Incremental implementation of IBM Information Server's metadata management tools", 21.09.2009. http://www.ibm.com/developerworks/ru/library/sabir/Information_Server/index.html
9. Giovinazzo W. "BI: Only as Good as its Data Quality". Information Management Special Reports, August 18, 2009. http://www.informationmanagement.com/specialreports/2009_157/business_intelligence_bi_data_quality_governance_decision_making-10015888-1.html


