Incremental implementation of IBM Information Server’s metadata management tools Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A 21.09.2009 http://www.ibm.com/developerworks/ru/library/sabir/Information_Server/index.html
Abstract
Just 15 years ago a data warehouse (DW) implementation team had to develop custom DW tools from scratch. Today integrated DW development tools are numerous, and their implementation is a challenging task. This article proposes an incremental implementation of IBM Information Server’s metadata management tools in DW projects, using the example of a typical oil & gas company.
Scenario, current situation and business goals
After having spent significant amounts of money on hundreds of SAP applications, our client suddenly realized that a seemingly homogeneous IT environment does not automatically provide a unified understanding of business terms. The customer, one of the world’s leading companies, incorporates four groups of subsidiary units, which operate in oil & gas exploration and production, and in refining and marketing of petroleum products. The subsidiary units are spread around the world; they operate in various countries with different legislation, languages and terminologies. Each unit has its own information accounting system. Branch data warehouses integrate information from the units’ accounting systems. The reports produced by the branch data warehouses are not aligned with each other due to disparate treatment of report fields (attributes). The company decided to build an attribute-based reporting system, and realized that the lack of a common business language made enterprise data integration impossible. In this scenario, the company decided to establish a unified understanding of business terms, which makes it possible to eliminate contradictions in the understanding of report fields. Business goals were formulated in accordance with the identified issues:
• Improve the quality of information, enhance the security of information, and provide transparency of its origin;
• Increase the efficiency of business process integration and minimize the time and effort of its implementation;
• Remove the hindrances to corporate data warehouse development.
Logical Topology – As Is
The existing IT environment incorporates the information accounting systems of the units, branch information systems, branch data warehouses, information systems of the headquarters, data marts and the planned enterprise data warehouse. The information accounting systems of the subsidiaries were realized on various platforms and are out of the scope of this consideration. Branch information systems are mainly based on SAP R/3. Branch data warehouses were developed on SAP BW. Headquarters’ information systems are realized using Oracle technologies. Data marts currently work on top of the headquarters’ information systems and the branch data warehouses running SAP BW. The platform for the enterprise data warehouse is DB2.
Pic. 1. Logical Topology – As Is
On the left side of Pic. 1 we can see the information systems of four branches: Exploration, Production, Refinery and Marketing. These hundreds of systems include HR, financial, material management and other modules, and are currently out of our scope because they will not be connected to the metadata management system at this stage. The center of Pic. 1 presents the centralized part of the client’s IT infrastructure. It includes several branch data warehouses on the SAP BW platform and the headquarters’ information system on an Oracle database. Historically these two groups use various independent data gathering tools and methods, so stored data are not consistent across the information systems. The information is grouped in several regional, thematic and department data marts. These data marts were built independently over the years. That is why the reports generated by OLAP systems do not provide a unified understanding of report fields. Since metadata management eliminates the data mismatch, improves the integration of business processes and removes obstacles to developing an enterprise data warehouse, it was decided to implement an enterprise metadata management system.
Architecture of metadata management system
Basically, there are three main approaches to metadata integration: point-to-point architecture, point-to-point architecture with a model-based approach, and central repository-based hub-and-spoke metadata architecture [1]. The first one is the point-to-point architecture, a traditional metadata bridging approach, in which pair-wise metadata bridges are built for every pair of product types that are to be integrated. The relative ease and simplicity of pair integration leads to uncontrolled growth of the number of connections between systems. This uncontrolled growth results in considerable expense for maintaining a unified semantic space when changes are made in even one system. The second is the point-to-point architecture with a model-based approach, which significantly reduces the cost and complexity associated with the traditional point-to-point metadata integration architectures based on metadata bridges. A common meta-model eliminates the need to construct pair-wise metadata bridges and establishes complete semantic equivalence at the metadata level between the different systems and tools that are included in the information supply chain to users. The third is the central repository-based hub-and-spoke metadata architecture. In this case the repository takes on a new meaning as the central store for both the common meta-model definition and all of its various instances (models) used within the overall environment. The centralized structure of the information systems of the oil and gas company dictates the choice of the central repository-based hub-and-spoke metadata architecture as the one that most adequately implements the necessary connections between the systems to be integrated.
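The scaling difference between the first and the third approach is easy to quantify: with pair-wise bridges the number of connections grows quadratically with the number of integrated tools, while a hub-and-spoke repository needs only one connector per tool. A minimal sketch of this arithmetic (the system counts are hypothetical):

```python
# Number of pairwise metadata bridges vs. hub connectors as the
# number of integrated tools grows. Illustrative arithmetic only.

def point_to_point_bridges(n: int) -> int:
    """Every pair of tools needs its own metadata bridge: n*(n-1)/2."""
    return n * (n - 1) // 2

def hub_and_spoke_connectors(n: int) -> int:
    """Each tool needs a single connector to the central repository."""
    return n

for n in (5, 10, 20):
    print(f"{n} tools: {point_to_point_bridges(n)} bridges "
          f"vs. {hub_and_spoke_connectors(n)} connectors")
```

With 20 integrated tools the point-to-point approach already requires 190 bridges, each of which must be revisited when any one system changes, against 20 repository connectors.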
Architecture of metadata management environment
The Metadata Management Environment (MME) [2] includes the sources of metadata, metadata integration tools, a metadata repository, and tools for metadata management, delivery, access and publication. In some cases the metadata management environment includes metadata data marts, but they are not needed in this task, since the functionality of metadata data marts is not required. Metadata sources are all information systems and other data sources that are included in the enterprise metadata management system. Metadata integration tools are designed to extract metadata from sources, to integrate it, and to load it into the metadata repository. The metadata repository stores business rules, definitions, terminology, the glossary, data lineage and data processing algorithms described in the business language; descriptions of tables and columns (attributes), including statistics of the applications’ execution and the data for project audit. Metadata management tools provide the definition of access rights, responsibilities and manageability. Tools for metadata delivery, access and publication allow users and information systems to work with metadata in the most convenient way.
Architecture of metadata repository
The metadata repository can be implemented using a centralized, a decentralized, or a distributed architecture approach. The centralized architecture implies a global repository, which is designed on a single metadata model and serves all enterprise systems. There are no local repositories. The system has a single, unified and coherent metadata model. The need to access a single central repository of metadata can lead to performance degradation of remote metadata-consuming systems due to possible communication problems. In the distributed architecture the global repository contains enterprise metadata for the core information systems. Local repositories, containing a subset of the metadata, serve the peripheral systems. The metadata model is uniform and consistent. All metadata are processed and agreed in the central repository, but are accessed through the local repositories. The advantages of local repositories are offset by the requirement that they be synchronized with the central metadata repository. The distributed architecture is preferable for geographically distributed enterprises.
Table 1. Comparison of metadata repository architectures
The decentralized architecture assumes that the central repository contains only metadata references, which are maintained independently in local repositories. The lack of coordination efforts on terms and concepts significantly reduces development costs, but leads to multiple and varied models that are mutually incompatible. The applicability of this architecture is limited to the case when the integrated systems fall within non-overlapping areas of the company’s operations.
As one of the Company’s most important objectives is to establish a single business language, the decentralized architecture is not applicable. The choice between the centralized and the distributed architecture is based on the fact that all the systems to be integrated are located in the headquarters, and there is no problem with stable communication lines. Thus, the most applicable to this scenario is the centralized architecture of the metadata repository. In various publications one can find statements that a metadata repository is a transactional system and should be managed differently than a data warehouse. From our point of view, the recommendation to organize the metadata repository as a data warehouse is more justified. Metadata should accompany the data throughout their lifecycle. That is, if the data warehouse contains historical data, the metadata repository should also contain the relevant historical metadata.
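To illustrate what historical metadata means in practice, the sketch below keeps every version of a business-term definition together with its validity interval, so the definition in force on any past date can be retrieved alongside the historical data it describes. This is an illustrative data structure only, not the repository’s actual schema; the term and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class TermVersion:
    """One historical version of a business-term definition."""
    term: str
    definition: str
    valid_from: date
    valid_to: Optional[date] = None   # None means the current version

class HistoricalGlossary:
    """Keeps every version of a term, as a warehouse keeps historical data."""
    def __init__(self) -> None:
        self.versions: list[TermVersion] = []

    def update(self, term: str, definition: str, effective: date) -> None:
        # Close the currently open version of this term, then add the new one.
        for v in self.versions:
            if v.term == term and v.valid_to is None:
                v.valid_to = effective
        self.versions.append(TermVersion(term, definition, effective))

    def as_of(self, term: str, when: date) -> Optional[str]:
        """Return the definition that was in force on the given date."""
        for v in self.versions:
            if (v.term == term and v.valid_from <= when
                    and (v.valid_to is None or when < v.valid_to)):
                return v.definition
        return None

g = HistoricalGlossary()
g.update("net production", "gross output minus losses", date(2008, 1, 1))
g.update("net production", "gross output minus losses and royalties", date(2009, 1, 1))
print(g.as_of("net production", date(2008, 6, 1)))  # the 2008 definition
```

A report built over 2008 data can thus be interpreted with the 2008 definition of the term, even after the definition has changed.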
Logical Topology – To Be
The selected architectures of the metadata management environment, the metadata management system and the metadata repository lead to the target logical topology shown in Pic. 2. One can see two major changes compared to the current logical topology. 1. We plan to create an enterprise data warehouse and to use IBM Information Server as an ETL (Extract, Transform and Load) tool. This task is beyond the scope of the current work. 2. The second, most important change is the centralized metadata management, which allows the Company to establish a common business language for all systems operating in the headquarters. So, on the client side only the metadata client is required.
Two phases of extended metadata management lifecycle
The extended metadata management lifecycle (Pic. 3), as proposed in [3], consists of the following stages: analysis and understanding, modeling, development, transformation, publication, consuming, ownership, quality management, metadata management, reporting and auditing. In terms of incremental implementation the extended metadata management lifecycle can be divided into two phases: 1. “Metadata elaboration” phase: analysis and understanding, modeling, development, transformation, publication. 2. “Metadata Production” phase: consuming, ownership, quality management, metadata management, reporting and audit. As the phases’ names imply, in the first phase analysis, modeling and development of metadata are mainly carried out, while the second phase is more closely related to the operation of the metadata management system. For clarity, the stages of the “Metadata elaboration” phase are grouped on the left hand side of Pic. 3, whereas the stages of the “Metadata Production” phase are placed on the right hand side.
Pic. 2. Logical Topology – To Be
“Metadata elaboration” phase
Analysis and understanding includes data profiling and analysis, quality assessment of data sets and structures, understanding the sense and content of the input data, identification of connections between columns of database tables, analysis of dependencies and information relations, and investigation of data for their integration.
• Business analyst performs data flow mapping and prepares the initial classification.
• Subject matter expert develops the business classification.
• Data analyst accomplishes the analysis of systems.
Modeling means revealing data aggregation schemes, detection and mapping of metadata interrelations, impact analysis and synchronization of models.
• Data analyst develops the logical and physical models and provides synchronization of models.
Development covers team glossary elaboration and maintenance, business context support for IT assets, and elaboration of flows of data extraction, transformation and delivery.
• IT developer creates the logic of data processing, transformation and delivery.
Transformation consists of automated generation of complex data transformation tasks and of linking source and target systems by means of data transformation rules.
• IT developer prepares the tasks to transform and move data, which are executed by the system.
Publication provides a unified mechanism for metadata deployment and for upgrade notification.
• IT developer provides deployment of integration services, which help the metadata steward publish metadata.
“Metadata Production” phase
Consuming means visual navigation and mapping of metadata and their relations; metadata access, integration, import and export; change impact analysis; search and queries (Pic. 3).
• Business users are able to use metadata.
Ownership determines metadata access rights.
• Metadata steward maintains the metadata access rights.
Metadata quality management addresses the lineage of heterogeneous data in data integration processes, quality improvement of information assets, and input data quality monitoring, and makes it possible to eliminate issues with data structure and processability before they affect the project.
• Project manager analyzes the impact of changes.
• Business analyst identifies inconsistencies in the metadata.
• Subject matter expert updates the business classification.
• Data analyst removes contradictions between metadata and classification.
• IT developer manages information assets.
• Metadata steward supports a unified understanding of metadata meaning.
• Business users use metadata and inevitably reveal metadata contradictions.
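The lineage and change-impact activities listed above can both be viewed as transitive closures over a graph of data flows: lineage walks the graph upstream (where did a report field come from?), while impact analysis walks it downstream (what does a change to a source column affect?). A minimal sketch under this assumption; all asset names and flows are hypothetical:

```python
from collections import defaultdict

# Hypothetical data flows: unit systems -> branch DW -> report fields.
flows = [
    ("sap_r3.finance.amount", "sap_bw.branch_dw.revenue"),
    ("sap_bw.branch_dw.revenue", "report.q1.total_revenue"),
    ("oracle.hq.adjustments", "report.q1.total_revenue"),
]

downstream = defaultdict(set)   # source -> directly derived assets
upstream = defaultdict(set)     # target -> direct sources
for src, dst in flows:
    downstream[src].add(dst)
    upstream[dst].add(src)

def closure(start: str, edges: dict) -> set:
    """All assets reachable from `start` by following `edges` transitively."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Lineage of a report field: everything it was derived from.
print(closure("report.q1.total_revenue", upstream))
# Impact of changing a source column: everything derived from it.
print(closure("sap_r3.finance.amount", downstream))
```

The same graph, stored once in the metadata repository, thus answers both the business analyst’s lineage questions and the project manager’s impact questions.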
Pic. 3. Extended metadata management lifecycle. The diagram shows, for each stage, the responsible roles and their activities:
• Analysis and understanding — Business analyst: data flows mapping, initial classification; Subject matter expert: business classification; Data analyst: system analysis.
• Modeling — Data analyst: logical & physical data models, synchronization of models.
• Development — IT developer: logic of end-to-end information processing.
• Transformation — IT developer: data standardization and transformation procedures.
• Publication — IT developer: integration services deployment; Metadata steward: metadata publication.
• Consuming — Business user: read-only access to metadata.
• Ownership — Metadata steward: define metadata access rights.
• Quality management — Project manager: analyze the change impact; Business analyst: discover metadata contradictions; Subject matter expert: update the business classification; Data analyst: eliminate metadata contradictions; IT developer: manage the information assets; Metadata steward: maintain the common understanding of the data sense; Business user: report metadata issues.
• Metadata management — Project manager: assign a steward, define responsibilities and manageability.
• Reporting and audit — Metadata steward: metadata audit, report metadata state.
During the Metadata management stage, access to templates, reports and results is managed; metadata, navigation and queries in the meta-model are controlled; access rights, responsibilities and manageability are defined.
• The project manager appoints stewards and allocates responsibilities among team members.
Reporting and audit imply setting formatting options for report results, report generation for the connections between business terms and IT assets, scheduled report execution, and saving and reviewing report versions.
• Metadata steward provides auditing and reporting.
Audit results can be used to analyze and understand metadata at the next stage of the lifecycle.
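The bookkeeping behind these stages — who stewards which metadata object, and who may access it — can be sketched with two simple mappings. This is an illustrative sketch only, not the actual model of Business Glossary or the Web console; all names are hypothetical.

```python
# Minimal stewardship and access-rights bookkeeping, of the kind the
# project manager sets up and the metadata steward maintains.

stewards: dict[str, str] = {}              # metadata object -> steward
access_rights: dict[tuple, str] = {}       # (user, object) -> "read" | "write"

def assign_steward(obj: str, steward: str) -> None:
    """Project manager assigns a steward responsible for a metadata object."""
    stewards[obj] = steward

def grant(user: str, obj: str, right: str) -> None:
    """Steward grants a user read or write access to a metadata object."""
    access_rights[(user, obj)] = right

def can_read(user: str, obj: str) -> bool:
    """Check access before serving metadata; write access implies read."""
    return access_rights.get((user, obj)) in ("read", "write")

assign_steward("glossary.production_terms", "i.petrov")
grant("business_user_1", "glossary.production_terms", "read")
print(can_read("business_user_1", "glossary.production_terms"))
```

An audit report then reduces to iterating over these mappings: which objects have no steward, and who holds which rights.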
Roles and interactions on metadata elaboration phase
The business analyst, using the mapping editor, a component of FastTrack, creates mapping specifications for data flows from sources to target systems (Pic. 4). Each mapping can contain several sources and targets. Mapping specifications are used to document business requirements. A mapping can be adjusted by applying business rules. End-to-end mapping can involve data transformation rules, which are part of the functional requirements and define how an application should be developed. The business analyst uses Information Analyzer to make decisions on integration design on the basis of the investigation of database tables, columns, keys and their relations. Data analysis helps to understand the content and structure of data before a project starts, and at later project stages allows making conclusions useful for the integration process. The subject matter expert (metadata author) uses Business Glossary to create the business classification (taxonomy), which maintains a hierarchical structure of terms. A term is a word or phrase which can be used for object classification and grouping in the metadata repository. Business Glossary supplies subject matter experts with a collaborative tool to annotate existing data definitions, to edit descriptions, and to assign data objects to categories. The data analyst uses Information Analyzer as an instrument for a complete analysis of data source systems and target systems; for evaluation of the structure, content and quality of data at the level of single and multiple columns, tables, files, cross-table relations, and multiple sources. Data analysts and architects can invoke Rational Data Architect for database design, including federated databases that can interact with DataStage and other components of Information Server.
Rational Data Architect provides data analysts with tools for metadata research and analysis: data analysts can discover, model, visualize and link heterogeneous data assets, and can create physical data models from scratch, derive them from logical models by means of transformation, or obtain them through reverse engineering of production databases. The IT developer uses FastTrack during the development of the program logic of end-to-end information processing. FastTrack converts the artifacts received from various sources into understandable descriptions. This information has internal relations and allows the developer to get the descriptions from the metadata repository and to concentrate on developing the complex logic, without losing time searching through multiple documents and files. FastTrack is integrated into IBM Information Server. That is why the specifications, metadata, and jobs become available to all project participants who use Information Server, Information Analyzer, DataStage Server and Business Glossary. The IT developer runs QualityStage to automate data standardization, to transform data into verified standard formats, to design and test match passes, and to set up data cleansing operations. Information is extracted from the source system, measured, cleansed, enriched, consolidated, and loaded into the target system. The IT developer leverages DataStage for data transformation and movement from source systems to target systems in accordance with business rules, subject-matter and integrity requirements, and/or in compliance with other data of the target environment. Using metadata for analysis and maintenance, and embedded data validation rules, the IT developer can design and implement integration tasks for data received from a broad set of internal and external sources, and can handle very large data manipulation and transformation workloads using scalable parallel processing technologies. The IT developer can choose to implement these processes as DataStage batch jobs, as real-time tasks, or as Web services. The IT developer uses Information Services Director as a foundation for deploying integration tasks as consistent and reusable information services. Thus the IT developer can combine service-oriented metadata management tasks with enterprise application integration, business-process management, an enterprise service bus and application servers. Business users need read-only access to metadata. Their demands can be met by two instruments: Business Glossary Browser and Business Glossary Anywhere.
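A mapping specification of the kind the business analyst builds in FastTrack can be thought of as a named collection of source-to-target column mappings, each optionally carrying a transformation rule. The sketch below is a simplified illustration of that idea, not FastTrack’s actual data model; all column names and rules are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnMapping:
    """One source-to-target column mapping with an optional business rule."""
    source: str
    target: str
    rule: str = "copy"   # transformation rule in plain text

@dataclass
class MappingSpecification:
    """A specification with several mappings, as documentation for developers."""
    name: str
    mappings: list[ColumnMapping] = field(default_factory=list)

    def targets_of(self, source_column: str) -> list[str]:
        """All target columns fed by the given source column."""
        return [m.target for m in self.mappings if m.source == source_column]

spec = MappingSpecification("branch_dw_to_edw")
spec.mappings.append(ColumnMapping("sap_bw.sales.amount", "edw.fact_sales.amount",
                                   "convert local currency to USD"))
spec.mappings.append(ColumnMapping("sap_bw.sales.amount", "edw.fact_sales.amount_local"))
print(spec.targets_of("sap_bw.sales.amount"))
```

Because one source column may feed several targets under different rules, the specification records the rule per mapping rather than per column.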
Pic. 4. Roles & Interactions on Elaboration phases of metadata management lifecycle
Roles and interactions on metadata production phase
IT specialists who are responsible for change management, say, a project manager, can analyze the impact of a change on the information environment with the help of Metadata Workbench (Pic. 5). Business Glossary allows assigning the role of steward, responsible for the metadata, to a user or a group, and linking the steward role with one or more metadata objects. The stewards’ responsibility includes effective metadata management and integration with related data, and providing authorized users with access to relevant data. Stewards must ensure that all data are correctly described and that all data users understand the meaning of the data. If the business analyst discovers contradictions between the glossary and database columns, he can notify metadata authors by means of Business Glossary features. The business analyst investigates the data status to reach complete visibility of the actual data condition using QualityStage’s embedded analysis tools. The data analyst eliminates contradictions between the glossary and database tables and columns by means of Business Glossary and Rational Data Architect. Metadata Workbench provides IT developers with metadata viewing, analysis and enrichment tools. Thus IT developers can use Metadata Workbench’s embedded design tools to manage and understand the information assets created and shared by IBM Information Server. Business users responsible for compliance with regulations such as Sarbanes-Oxley and Basel II have the possibility to trace the data lineage in reports using the appropriate tools of Metadata Workbench. Stewards can use Information Analyzer to maintain a common understanding of the data sense by all users and project participants. Stewards can invoke Metadata Workbench to maintain metadata stored in IBM Information Server. Administrators can use the capabilities of the Web console for global administration, which is based on the common framework of Information Server. For example, a user needs only one set of credentials to access all the components of Information Server. A set of credentials is stored for each user to provide single sign-on to all registered assets.
Pic. 5. Roles & Interactions on Production phases of metadata management lifecycle
Adoption route 1: metadata elaboration
So, we have two metadata adoption routes: Route 1, Metadata Elaboration, and Route 2, Metadata Production. Both routes begin at a single starting point. Picture 6 represents Route 1, which deals mainly with the first part of the metadata management lifecycle, namely Analysis and understanding, Modeling, Development, Transformation, Publication, and Consuming.
1. As the first step we install Metadata Server, which maintains the metadata repository and supports metadata services.
2. On the second step one should add Information Analyzer to perform automated analysis of production systems and to define the initial classification.
3. Step three is adding FastTrack, which makes it possible to create mapping specifications for data flows from sources to targets.
4. We can add Business Glossary as the fourth step in order to create the business classification.
5. To create logical & physical data models, Rational Data Architect can be added on the fifth step.
6. The sixth step is the extended usage of Information Analyzer to create rules for data evaluation.
7. On the seventh step we plan the extended usage of FastTrack to program the logic of end-to-end information processing.
8. As step eight one can install QualityStage and DataStage to design and execute data transformation procedures.
9. To deploy integration tasks as services, we add Information Services Director on the ninth step.
10. On the last step one grants users read-only access to metadata by adding Business Glossary Browser and Business Glossary Anywhere.
Pic. 6. Metadata adoption route on Elaboration phases of metadata management lifecycle
Adoption route 2: metadata production
This adoption route covers the production part of the metadata management lifecycle, and includes Reporting and audit, Ownership, Quality management, and Metadata Management. The second route begins at the same starting point as route 1. Almost all products were installed during the first route, so this route in general deals with extended usage of the software added previously. The Web console is one of the two products which should be added during this route. It enables management of users’ credentials, and hence it is required at the very beginning. The next step, “Extended use of Business Glossary”, should be performed as soon as possible to assign a steward. To perform the change impact analysis one should add Metadata Workbench. The extended usage of FastTrack and QualityStage makes it possible to discover contradictions between the glossary and database columns. Extended usage of Rational Data Architect can eliminate the revealed contradictions between the glossary and database tables and columns. Metadata Workbench can help in understanding and managing the information assets. By means of Business Glossary users can update the business classification according to new requirements. Again, Metadata Workbench helps in reporting the revealed metadata issues. Information Analyzer can be used to maintain a common understanding of the data sense. Both Metadata Workbench and the Web Console can be used to maintain metadata and to report the metadata state.
Pic. 7. Metadata adoption route on Production phases of metadata management lifecycle
Conclusion
The proposed routes cover the extended metadata management lifecycle in data integration projects. The participants of a metadata management project are provided incrementally with a consistent set of IBM Information Server metadata management tools. Software implemented following the proposed routes realizes the pre-selected architectures of the metadata management environment, the metadata management system and the metadata repository in accordance with the target logical topology. Incremental implementation of the metadata management tools of IBM Information Server reduces the time and complexity of the project, enables business users to get the benefits of metadata management at earlier stages, and increases the probability of a successful implementation of the metadata management system. This work was performed as part of the plusOne initiative. The author would like to express his gratitude to Anshu Kak for the invitation to the plusOne project.
Literature
1. Poole J., Chang D., Tolbert D., Mellor D. Common Warehouse Metamodel: An Introduction to the Standard for Data Warehouse Integration, Wiley, 2003.
2. Marco D., Jennings M. Universal Meta Data Models, Wiley, 2004.
3. Asadullaev S. “Metadata management using IBM Information Server”, 2008, http://www.ibm.com/developerworks/ru/library/sabir/meta/index.html