Metadata Management Using IBM Information Server
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
06.10.2008
Abstract
Selecting a strategy for implementing a BI metadata management system requires answers to several critical questions. Which metadata needs to be managed? What does the metadata lifecycle look like? Which specialists are needed to complete the project successfully? Which tools can support these specialists throughout the lifecycle of the required metadata set? This paper examines a metadata management system for data integration projects from these four points of view.
Glossary
Glossary is a simple dictionary that lists terms and definitions on a specific subject. It contains terms and their textual definitions in natural language, like this glossary.
Thesaurus (from the Greek for "treasure") is a kind of dictionary in which lexical relations are established between lexical units (e.g., synonyms, antonyms, homonyms, paronyms, hyponyms, hyperonyms).
Controlled vocabulary requires the use of predefined, authorized terms that have been selected by the authors of the vocabulary.
Taxonomy models subtype-supertype relationships, also called parent-child relationships, on the basis of a controlled vocabulary with hierarchical relationships between the terms.
Ontology expands on taxonomy by modeling other relationships, constraints, and functions, and comprises the modeled specification of the concepts embodied by a controlled vocabulary.
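The distinctions drawn in this glossary can be illustrated with a small data structure. The following is a minimal sketch only: the class names `Term` and `Vocabulary`, their methods, and the sample terms are hypothetical and are not part of IBM Information Server. A term with a definition corresponds to a glossary entry, the `synonyms` field to a thesaurus relation, the registry of allowed names to a controlled vocabulary, and the `parent` link to a taxonomy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    """A glossary entry: a term plus its natural-language definition."""
    name: str
    definition: str
    synonyms: set = field(default_factory=set)  # thesaurus-style relation
    parent: Optional[str] = None                # taxonomy: supertype term


class Vocabulary:
    """Controlled vocabulary: only registered (authorized) terms may be used."""

    def __init__(self) -> None:
        self.terms = {}

    def add(self, term: Term) -> None:
        # A subtype may only be attached to a supertype that already exists.
        if term.parent is not None and term.parent not in self.terms:
            raise KeyError(f"unknown parent term: {term.parent}")
        self.terms[term.name] = term

    def is_authorized(self, name: str) -> bool:
        return name in self.terms

    def ancestors(self, name: str) -> list:
        """Walk the subtype -> supertype chain (the taxonomy)."""
        chain = []
        parent = self.terms[name].parent
        while parent is not None:
            chain.append(parent)
            parent = self.terms[parent].parent
        return chain


# Hypothetical sample terms, for illustration only.
vocab = Vocabulary()
vocab.add(Term("Party", "A person or organization the company deals with."))
vocab.add(Term("Customer", "A party that buys goods or services.",
               synonyms={"Client"}, parent="Party"))
vocab.add(Term("Corporate customer", "A customer that is a legal entity.",
               parent="Customer"))
```

In this sketch "Client" is recorded as a synonym but is not itself an authorized term, which is the behavior a controlled vocabulary is meant to enforce. An ontology would add further relation types and constraints beyond the single parent link.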
Metadata types
IBM Information Server handles four metadata types, all of which serve the data integration tasks that Information Server solves.
Business Metadata are intended for business users and include business rules, definitions, terminology, glossaries, algorithms, and lineage, all expressed in business language.
Technical Metadata are required by users of specific BI, ETL, profiling, and modeling tools and define source and target systems, their table and field structures and attributes, derivations, and dependencies.
Operational Metadata are intended for operations, management, and business users who need information about application runs: their frequency, record counts, component-by-component analysis, and other statistics.
Project Metadata are used by operations staff, stewards, tool users, and management to document and audit the development process, to assign stewards, and to handle change management.
Success criteria of metadata project
Not all companies have recognized the necessity of metadata management in data integration projects (for instance, in data warehouse development). Those that have started implementing a metadata management system have faced a number of challenges. The requirements for a metadata management system in a BI environment can be defined precisely and in good time, for example:
• The metadata management system must provide relevant, accessible, centralized information about all information systems and their relations.
• The metadata management system must establish consistent usage of business terminology across the organization.
• The impact of change must be discovered and planned.
• Problems must be traced from the point of detection down to their origin.
• New development must be supplied with information about existing systems.
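Two of the requirements above, discovering the impact of a change and tracing a problem back to its origin, amount to traversing a lineage graph in opposite directions. The sketch below illustrates the idea under stated assumptions: the class `LineageGraph`, its method names, and the asset names are hypothetical and do not describe how IBM Information Server stores or queries lineage.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph of data flows between assets (tables, jobs, reports)."""

    def __init__(self):
        self.downstream = defaultdict(set)  # asset -> assets fed by it
        self.upstream = defaultdict(set)    # asset -> assets feeding it

    def add_flow(self, source, target):
        self.downstream[source].add(target)
        self.upstream[target].add(source)

    @staticmethod
    def _reach(start, edges):
        """Breadth-first search: every asset reachable from start."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def impact_of_change(self, asset):
        """Everything downstream that a change to the asset may affect."""
        return self._reach(asset, self.downstream)

    def origins_of(self, asset):
        """Upstream assets with no further inputs: candidate problem origins."""
        reached = self._reach(asset, self.upstream)
        return {a for a in reached if not self.upstream.get(a)}


# Hypothetical data flows, for illustration only.
lineage = LineageGraph()
lineage.add_flow("crm.customer", "staging.customer")
lineage.add_flow("billing.account", "staging.customer")
lineage.add_flow("staging.customer", "dwh.dim_customer")
lineage.add_flow("dwh.dim_customer", "report.revenue_by_customer")
```

With these flows, a change to `staging.customer` affects the warehouse dimension and the report, while a problem detected in the report traces back to the two source tables.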
In reality, an unwieldy repository stores a pile of useless records; each system uses its own isolated metadata; uncoordinated policies conflict with one another; and obsolete or poor-quality metadata fail to meet rapidly changing business requirements. Metadata management projects fail even when the goals seem to be defined, the budget allocated, and a competent team assembled. The main reasons are the following:
• Insufficient participation of business users in creating a consolidated glossary, which may result from inconvenient and complex glossary and metadata repository management tools.
• Support for only a couple of metadata types because of a shortage of time and/or financial resources.
• Missing or incomplete documentation for production systems, which could be mitigated by tools that analyze the data structures of existing systems during the initial investigation.
• Lack of support for the full metadata management lifecycle because of fragmented metadata management tools.
Sidebar
Strictly speaking, these statements concern the success or failure of the product that results from project execution. As a rule, project success criteria are timely execution within budget with the required quality and scope. Project success does not guarantee product success. The history of technology offers many examples of technically perfect products that were developed on time and within budget but found no demand or no ready market.
The success criterion for a metadata implementation project is demand for the developed metadata management system from subject matter experts, business users, IT personnel, and other information systems, both in production and in development.
Metadata management lifecycle
The simplified lifecycle comprises five stages and five roles (Pic.1).
Development is the creation of new metadata by an author (subject matter expert).
Publishing, performed by a publisher, notifies participants and users of the existing and available metadata and their locations.
Ownership defines and assigns metadata usage rights.
Consuming of metadata is performed by the development team, by users, or by information systems.
Metadata management, carried out by a manager or stewards, includes modification, enrichment, extension, and access control.
[Pic.1. Simplified metadata management lifecycle. Stages shown: Development, Publishing, Ownership, Consuming, Metadata Management.]
The extended metadata management lifecycle consists of the following stages (Pic.2).
Analysis and understanding includes data profiling and analysis, determining the quality of data sets and structures, understanding the meaning and content of the input data, revealing connections between the columns of database tables, analyzing dependencies and information relations, and investigating data for integration.
[Pic.2. Extended metadata management lifecycle. Stages shown: Analysis and understanding, Modeling, Development, Transformation, Publishing, Consuming, Ownership, Quality management, Reporting and audit, Metadata Management.]
Modeling means revealing data aggregation schemas, detecting and mapping metadata interrelations, analyzing impact, and synchronizing models.
Development covers collaborative building and management of the glossary, business context support for IT assets, and the elaboration of data extraction, transformation, and delivery flows.
Transformation consists of the automated generation of complex data transformation tasks and the linking of source and target systems by means of data transformation rules.
Publishing provides a unified mechanism for metadata deployment and for upgrade notification.
Consuming covers visual navigation and mapping of metadata and their relations; metadata access, integration, import, and export; change impact analysis; and search and queries.
Metadata quality management addresses heterogeneous data lineage in data integration processes, quality improvement of information assets, and input data quality monitoring, and makes it possible to eliminate problems with data structure and processability before they affect the project.
Reporting and audit involve setting formatting options for report results, generating reports on the lineage between business terms and IT assets, scheduling report execution, and saving and reviewing report versions. Audit results can be used for analysis and understanding in the next loop of the lifecycle.
Metadata management means managing access to templates, reports, and results; controlling metadata; navigating and querying the metamodel; and defining access rights, responsibilities, and manageability.
Ownership determines metadata usage rights.
Sidebar
Support for the full metadata management lifecycle is critically important for metadata management goals, especially in large enterprise information systems. A break in the lifecycle violates the consistency of corporate metadata, and isolated islands of contradictory metadata arise. Implementing a consistent set of metadata management tools considerably increases the probability that a metadata management system implementation project will succeed.
IBM Information Server metadata management tools
The IBM Information Server platform includes the following metadata management tools.
Business Glossary is a Web-based application that supports the collaborative authoring and collective management of a business dictionary (glossary). It makes it possible to maintain metadata categories, to build their relations, and to link them to physical sources. Business Glossary supports metadata management, alignment, and browsing, as well as the assignment of responsible stewards.
Business Glossary Anywhere is a small program that provides read-only access to the content of the business glossary through the operating system's clipboard. A user can highlight a term on the screen of any application, and the business definition of the term appears in a pop-up window.
Business Glossary Browser provides read-only access to the business glossary's content in a separate web browser window.
Information Analyzer automatically scans data sets to determine their structure and quality. This analysis helps in understanding the data inputs to the integration process, ranging from individual fields to high-level data entities. Information analysis also makes it possible to correct problems with structure or validity before they affect the metadata project. Information Analyzer maintains profiling and analysis as an ongoing process of improving data reliability.
QualityStage provides tools for the investigation, consolidation, standardization, and validation of heterogeneous data in integration processes and improves the quality of information assets.
DataStage supports the development of data flows that extract information from multiple sources, transform it according to specified rules, and deliver it to target databases or applications.
Information Analyzer performs source systems analysis and passes its results to QualityStage, which in turn supports DataStage, which is responsible for data transformation. Used together, Information Analyzer, QualityStage, and DataStage make it possible to automate data quality assurance processes and to eliminate painstaking or even infeasible manual data integration work.
FastTrack reveals relations between columns of database tables, links columns to business terms, automatically creates complex data transformation tasks in DataStage and QualityStage Designer, and binds data sources and target systems by data transformation rules, reducing application development time.
Metadata Workbench provides metadata visualization and navigation tools; maintains a visual representation of metadata interdependencies; enables analysis of information dependencies and relations between various tools; allows metadata binding; generates reports on the relations between business terms and IT assets; supports metadata management, navigation, and metamodel queries; and makes it possible to investigate key integration data: tasks, reports, databases, models, terms, stewards, and systems.
Web Console provides administrators with role-based access management tools; supports scheduling report execution, storing query results in the common repository, and viewing multiple versions of a report; and supports creating directories for report storage and specifying the directories in which reports will be stored. Web Console also makes it possible to define formatting options for query results.
Information Services Director resides in the domain layer of IBM Information Server and provides a unified mechanism for publishing and managing data quality services, allowing IT specialists to deploy and control services for any data integration task. Common services include metadata services, which supply standard service-oriented end-to-end access to metadata and their analysis.
Rational Data Architect is an enterprise data modeling and integration design tool that combines data structure modeling capabilities with metadata discovery, relationship mapping, and analysis. Rational Data Architect helps to understand data assets and their relationships to each other and makes it possible to reveal data integration schemas, to visualize metadata relations, to analyze the impact of changes, and to synchronize models.
Metadata Server maintains the metadata repository and the interaction with other components, and supports metadata services: metadata access and integration, impact analysis, metadata import and export, search, and queries. The repository is a J2EE application. For persistent storage it uses a standard relational database such as IBM DB2, Oracle, or SQL Server. Backup, administration, scalability, parallel access, transactions, and concurrent access are provided by the underlying database.
As we can see, IBM Information Server metadata management tools cover the extended metadata management lifecycle.
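The coverage claim above can be checked mechanically once each tool is associated with the lifecycle stages it supports. The sketch below is one possible reading of the tool descriptions in this section, not an official IBM mapping: the assignment in `TOOL_STAGES` is an assumption made for illustration, and the function name `uncovered_stages` is hypothetical.

```python
# Stages of the extended metadata management lifecycle (Pic.2).
LIFECYCLE_STAGES = [
    "Analysis and understanding", "Modeling", "Development", "Transformation",
    "Publishing", "Consuming", "Quality management", "Reporting and audit",
    "Metadata management", "Ownership",
]

# ASSUMPTION: stage assignments inferred from the prose tool descriptions.
TOOL_STAGES = {
    "Information Analyzer": {"Analysis and understanding", "Quality management"},
    "Rational Data Architect": {"Modeling"},
    "Business Glossary": {"Development", "Metadata management", "Ownership"},
    "DataStage": {"Development"},
    "FastTrack": {"Transformation"},
    "Information Services Director": {"Publishing"},
    "Business Glossary Browser": {"Consuming"},
    "Business Glossary Anywhere": {"Consuming"},
    "Metadata Server": {"Consuming"},
    "QualityStage": {"Quality management"},
    "Web Console": {"Reporting and audit", "Metadata management"},
    "Metadata Workbench": {"Consuming", "Reporting and audit",
                           "Metadata management"},
}


def uncovered_stages(stages, tool_stages):
    """Return the lifecycle stages that no tool in tool_stages supports."""
    covered = set().union(*tool_stages.values())
    return [stage for stage in stages if stage not in covered]


# Full tool set: every stage is covered under the assumed mapping.
gaps_full = uncovered_stages(LIFECYCLE_STAGES, TOOL_STAGES)

# Removing one tool shows where the lifecycle would break.
without_fasttrack = {t: s for t, s in TOOL_STAGES.items() if t != "FastTrack"}
gaps_partial = uncovered_stages(LIFECYCLE_STAGES, without_fasttrack)
```

The same check can be used to see which stage would be left unsupported if a tool were dropped from a deployment, which is the lifecycle discontinuity the earlier sidebar warns about.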
Roles in metadata management project
The set of team roles in a metadata management project depends on many factors and can include, for example, infrastructure engineers, information security specialists, and middleware developers. Restricted to the roles directly relevant to metadata system development, the role list can look as follows.
Project manager needs both project documentation and information on the product deliverables, namely on the metadata management system under development, in order to manage the project effectively. The project manager should therefore be granted access to tools that produce reports on jobs, queries, databases, models, terms, stewards, systems, and servers.
Subject matter expert has to participate in the collaborative creation and management of the business glossary. The expert must define the terms and their categories and establish their relations.
Business analyst should know the subject matter, understand the terminology and the meaning of entities, and have previous experience in formulating rules for data processing and transformation from sources to target systems and consumers. The participation of the business analyst in business glossary creation is also very important.
Data analyst reveals inconsistencies and contradictions in data and terms before application development starts.
IT developer should be able to become familiar with the business terminology, to develop data processing jobs, to implement transformation rules, and to code data quality rules.
Application administrator is responsible for maintaining and versioning the configuration of applications in production; for installing updates and patch sets; for maintaining and monitoring the current state of program components; for executing the general policies of the protection profiles; for conducting performance analysis; and for optimizing application execution.
Database administrator should tune the database and control its growth; detect and fix performance problems; generate the required database configurations; change the structure of the database; and add and remove users and change their access rights.
Business users in a metadata project need simple and effective access to the metadata dictionary. As with an ordinary paper dictionary, users need to be able to read the lexical entry together with its explicit description and brief dictionary definition, preferably without any loss of context or focus.
Role support by IBM Information Server tools
Business Glossary makes it possible to assign the steward role (responsibility for metadata) to a user or a group of users and to make a steward accountable for one or more metadata objects. A steward's responsibilities include efficient management of the data, its integration with related data, and making the data available to authorized users. The steward should ensure that data is properly defined and that all users of the data clearly understand its meaning.
Subject matter expert (metadata author) uses Business Glossary to create the business classification (taxonomy), which maintains the hierarchical structure of terms. A term is a word or phrase that can be used to classify and group objects in the metadata repository. Business Glossary supplies subject matter experts with a collaborative tool to annotate existing data definitions, to edit descriptions, and to assign data objects to categories.
If a business analyst or data analyst discovers contradictions between the glossary and database columns, the analyst can notify the metadata authors by means of Business Glossary features.
Other project participants need read-only access to metadata. Their needs can be covered by two tools: Business Glossary Browser and Business Glossary Anywhere.
Information Analyzer plays an important role at the integration analysis stage, which is required to assess which data exist and what state they are in. The result of this stage is an understanding of the source systems and, consequently, an adequate target system design. Instead of time-consuming manual analysis of outdated or missing documentation, Information Analyzer gives the business analyst and the data analyst automated analysis of production systems.
Business analyst uses Information Analyzer to make integration design decisions based on the investigation of database tables, columns, keys, and their relations. Data analysis helps to understand the content and structure of data before the project starts and yields conclusions useful for the integration process at later project stages.
Data analyst uses Information Analyzer as a tool for complete analysis of source information systems and target systems, and for evaluating the structure, content, and quality of data at the level of single and multiple columns, tables, files, cross-table relations, and multiple sources.
Stewards can use Information Analyzer to maintain a common understanding of the meaning of the data among all users and project participants.
Using Information Analyzer, a business analyst or data analyst can create additional rules for evaluating and measuring data and their quality over time. These rules are either simple column evaluation criteria based on data profiling results or complex conditions that evaluate several fields. Evaluation rules make it possible to create indices whose deviation can be tracked over time.
QualityStage can be invoked at the preparation stage of enterprise data integration (often referred to as data cleansing). IT developer runs QualityStage to automate data standardization, to transform data into verified standard formats, to design and test match passes, and to set up data cleansing operations. Information is extracted from the source system, measured, cleansed, enriched, consolidated, and loaded into the target system.
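The evaluation rules described above for Information Analyzer, a simple single-column criterion and a complex condition over several fields, together with an index whose deviation is tracked over time, can be sketched as follows. This is an illustration only: the `DataRule` class, the `deviation` helper, the field names, and the sample records are hypothetical and do not reproduce the Information Analyzer rule syntax.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataRule:
    """A named data quality rule evaluated record by record."""
    name: str
    check: Callable[[dict], bool]

    def pass_rate(self, records):
        """Quality index: the share of records that satisfy the rule."""
        if not records:
            return 1.0
        passed = sum(1 for record in records if self.check(record))
        return passed / len(records)


def deviation(current, baseline):
    """Change of the index relative to a baseline measurement."""
    return current - baseline


# Simple rule: a criterion on one column.
postcode_present = DataRule(
    "postcode present", lambda r: bool(r.get("postcode")))

# Complex rule: a condition that evaluates several fields together.
# ISO-formatted dates compare correctly as strings.
ship_after_order = DataRule(
    "ship date not before order date",
    lambda r: r["shipped"] >= r["ordered"])

# Hypothetical sample batches measured at two points in time.
baseline_batch = [
    {"postcode": "101000", "ordered": "2008-09-01", "shipped": "2008-09-03"},
    {"postcode": "190000", "ordered": "2008-09-02", "shipped": "2008-09-02"},
    {"postcode": "", "ordered": "2008-09-02", "shipped": "2008-09-05"},
    {"postcode": "630000", "ordered": "2008-09-03", "shipped": "2008-09-04"},
]
current_batch = [
    {"postcode": "101000", "ordered": "2008-10-01", "shipped": "2008-10-02"},
    {"postcode": "", "ordered": "2008-10-01", "shipped": "2008-09-30"},
    {"postcode": "", "ordered": "2008-10-02", "shipped": "2008-10-03"},
    {"postcode": "420000", "ordered": "2008-10-02", "shipped": "2008-10-06"},
]

postcode_drift = deviation(postcode_present.pass_rate(current_batch),
                           postcode_present.pass_rate(baseline_batch))
```

A negative `postcode_drift` signals that the share of records with a postcode has fallen since the baseline measurement, which is the kind of deviation a steward or analyst would want to be alerted to.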
Data cleansing jobs consist of the following sequence of stages.
Investigation stage is performed by the business analyst to gain complete visibility of the actual condition of the data and can be carried out using both Information Analyzer and QualityStage's embedded analysis tools.
Standardization stage reformats data from multiple systems to ensure that each data type has the correct content and format.
Match stages ensure data integrity by linking records from one or more data sources that correspond to the same entity. The goal of the Match stages is to create semantic keys that identify information relationships.
Survive stage ensures that the best available data survives and is correctly prepared for the target. In other words, the Survive stage is executed to build the best available view of related information.
Based on the understanding of the data gained at the investigation stage, the IT developer can apply QualityStage's ready-to-run rules to reformat data from several sources at the standardization stage.
IT developer leverages DataStage to transform and move data from source systems to target systems in accordance with business rules, subject matter and integrity requirements, and/or in compliance with other data in the target environment. Using metadata for analysis and maintenance, together with embedded data validation rules, the IT developer can design and implement integration processes for data received from a broad set of corporate and external sources, as well as processes for mass data manipulation and transformation that leverage scalable parallel technologies. The IT developer can implement these processes as DataStage batch jobs, as real-time tasks, or as Web services.
FastTrack is predominantly a tool for the business analyst and the IT developer.
Business analyst uses the mapping editor, a component of FastTrack, to create mapping specifications for data flows from sources to target systems. Each mapping can contain several sources and targets. Mapping specifications are used to document business requirements. A mapping can be adjusted by applying business rules. End-to-end mapping can involve data transformation rules, which are part of the functional requirements and define how the application should be developed.
IT developer uses FastTrack while developing the program logic for end-to-end information processing. FastTrack converts artifacts received from various sources into understandable descriptions. This information is internally linked, which lets the developer obtain descriptions from the metadata repository and concentrate on developing complex logic instead of losing time searching through multiple documents and files.
FastTrack is integrated into IBM Information Server, so specifications, metadata, and jobs become available to all project participants who use Information Server, Information Analyzer, DataStage Server, and Business Glossary.
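The idea of a mapping specification that both documents requirements and drives the transformation can be sketched in a few lines. The example is a simplification under stated assumptions: the `Mapping` class, the helper functions, and the column names are hypothetical, each mapping here has several sources but only one target, and nothing in it reflects the actual FastTrack specification format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Mapping:
    """One source-to-target mapping with its transformation rule."""
    sources: list          # source column names
    target: str            # target column name
    rule: Callable         # transformation applied to the source values
    description: str       # business-readable statement of the rule


def apply_mappings(mappings, row):
    """Build a target row by applying every mapping to a source row."""
    return {m.target: m.rule(*(row[s] for s in m.sources)) for m in mappings}


def document(mappings):
    """Render the specification as requirement-style text lines."""
    return [f"{', '.join(m.sources)} -> {m.target}: {m.description}"
            for m in mappings]


# Hypothetical specification, for illustration only.
SPEC = [
    Mapping(["first_name", "last_name"], "full_name",
            lambda first, last: f"{first.strip()} {last.strip()}",
            "concatenate trimmed names"),
    Mapping(["amount_cents"], "amount",
            lambda cents: cents / 100,
            "convert cents to currency units"),
]

source_row = {"first_name": " Ada ", "last_name": "Lovelace",
              "amount_cents": 2500}
target_row = apply_mappings(SPEC, source_row)
spec_lines = document(SPEC)
```

Keeping the description next to the executable rule is the design point: the same specification serves the business analyst as documentation and the IT developer as a starting point for the job logic.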
Table 1. Roles in a metadata management project and IBM Information Server tools
Roles: Project manager, Subject matter expert, Steward, Business analyst, Data analyst, IT developer, Application administrator, DB administrator, Business users.
Tools: Business Glossary, Business Glossary Browser, Business Glossary Anywhere, Information Analyzer, QualityStage, DataStage, FastTrack, Metadata Workbench, Web Console, Information Services Director, Rational Data Architect.
[Table body: "x" marks relating roles to tools; cell positions not recoverable.]
Metadata Workbench provides IT developers with tools for viewing, analyzing, and enriching metadata. IT developers can thus use the embedded design tools of Metadata Workbench to manage and understand the information assets created and shared in IBM Information Server.
Business analysts and subject matter experts can leverage Metadata Workbench to manage metadata stored in IBM Information Server.
Specialists responsible for compliance with regulations such as Sarbanes-Oxley and Basel II can trace the data lineage of business intelligence reports using the appropriate tools of Metadata Workbench.
IT specialists responsible for change management, for example the project manager, can use Metadata Workbench to analyze the impact of a change on the information environment.
Administrators can use the capabilities of Web Console for global administration based on the common framework of Information Server. For example, a user needs only one credential to access all the components of Information Server. A set of credentials is stored for each user to provide single sign-on to the products registered with the domain.
IT developer uses Information Services Director as a foundation for deploying integration tasks as consistent and reusable information services. The IT developer can thus use service-oriented metadata management tasks together with enterprise application integration, business process management, an enterprise service bus, and application servers.
Data analysts and architects can invoke Rational Data Architect for database design, including federated databases, that can interact with DataStage and other components of Information Server. Rational Data Architect provides data analysts with metadata research and analysis capabilities; data analysts can discover, model, visualize, and relate heterogeneous data assets and can create physical data models from scratch, from logical models by using transformation, or from the database by using reverse engineering.
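The role and tool assignments described in this section can be collected into a simple lookup structure. The sketch below is built only from the prose of this section, not from the cells of Table 1, so the exact assignments are an assumption; the names `TOOL_ROLES` and `tools_for` are hypothetical.

```python
# ASSUMPTION: assignments inferred from the prose descriptions of each tool.
TOOL_ROLES = {
    "Business Glossary": {"Steward", "Subject matter expert",
                          "Business analyst", "Data analyst"},
    "Business Glossary Browser": {"Business users"},
    "Business Glossary Anywhere": {"Business users"},
    "Information Analyzer": {"Business analyst", "Data analyst", "Steward"},
    "QualityStage": {"IT developer", "Business analyst"},
    "DataStage": {"IT developer"},
    "FastTrack": {"Business analyst", "IT developer"},
    "Metadata Workbench": {"IT developer", "Business analyst",
                           "Subject matter expert", "Project manager"},
    "Web Console": {"Application administrator"},
    "Information Services Director": {"IT developer"},
    "Rational Data Architect": {"Data analyst"},
}


def tools_for(role):
    """Return the tools a given role works with, in alphabetical order."""
    return sorted(tool for tool, roles in TOOL_ROLES.items() if role in roles)


developer_tools = tools_for("IT developer")
analyst_tools = tools_for("Data analyst")
```

Inverting the mapping in this way gives each team member a short list of the tools relevant to the role, which is the practical use of a role-by-tool matrix.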
Conclusion
The analysis performed from several points of view, covering the types of metadata, the metadata lifecycle, the roles in a metadata project, and the metadata management tools, leads to the following conclusions.
IBM Information Server metadata management tools cover the extended metadata management lifecycle in data integration projects.
The participants in a metadata management project are provided with a consistent set of IBM Information Server metadata management tools, which considerably increases the probability of success for a corporate metadata management system implementation.
The process flows of IBM Information Server components and their interaction will be considered in further papers.
The author thanks S. Likharev for useful discussion.