Data Warehousing: Triple Strategy in Practice
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
"Program Engineering", 2011, v4, pp. 26-33
www.ibm.com/developerworks/ru/library/sabir/strategy/index.html
Abstract
This paper uses a practical example of a system for collecting and analyzing primary data to show how the triple strategy and the recommended architecture of an enterprise data warehouse (EDW) can improve the quality of information analysis services while reducing the cost and time of EDW development.
Introduction
Many successful companies have found that managing each line of business separately does not give a complete picture of the company's market situation and business. To make accurate and timely decisions, experts, analysts and company management need unified, consistent information, which should be provided by an enterprise data warehouse (EDW). In practice, however, enterprise data warehouse projects as a rule fail to meet their time, cost and quality targets, and in many cases the analytical reports produced by the data warehouse still contain conflicting information.

This article shows that adherence to recommended architectural solutions, the use of proven strategies for creating an EDW, and the right choice of software tools can reduce EDW development costs and improve the quality of EDW services. Based on the triple strategy, the recommended architecture, the proposed principles and best practices of EDW construction, a project management plan for the software development of an enterprise data warehouse is proposed.

IBM offers a complete toolset for data, metadata and master data integration at all stages of the life cycle of an EDW development project. The purpose of this paper is to analyze a simplified solution based on IBM Forms, IBM InfoSphere Warehouse and IBM Cognos BI software. The solution must be scalable and functionally expandable, should integrate easily into the enterprise IT infrastructure, and should be able to become a foundation for an enterprise data warehouse.
Architecture of primary data gathering and analysis system
A typical solution for the collection, storage and analysis of manually entered primary data was proposed in articles [1, 2]. We recall the essential system requirements:
• Collection of primary, statistical, or reported data from remote locations is required.
• It is necessary to check the data at the workplace prior to sending it to the center. The collected data should be cleansed and stored for a period defined by regulations.
• It is necessary to produce management statements to assess the state of affairs in the regions.
• It is necessary to perform analysis based on the collected statistics to identify regular patterns and support management decisions.
• The solution must be extensible and scalable. For example, it must anticipate the subsequent integration of the primary data gathering and analysis system with a document management system.
Pic. 1 shows the centralized architecture of a primary data gathering and analysis system, which assumes that the data input can be performed remotely, and all IBM Forms [3] data collection servers, InfoSphere Warehouse [4] data storage servers, and IBM Cognos [5, 6] data analysis and interpretation servers are installed in a single data center.
Pic. 1. Solution architecture using Lotus Forms, InfoSphere Warehouse and Cognos BI
Analysts can work both locally and remotely, thanks to the Web interface provided by Cognos for the preparation and execution of analytical calculations. The proposed architecture is based on the simplest software configuration for a typical task, without taking into account the specific requirements of various projects. Despite the obvious limitations, the proposed approach can be used as a basis for a variety of solutions for different industries and enterprises.

Separation of the subsystems for data entry, collection, storage and analysis allows us to construct different architectures depending on the needs of the task and the requirements of the enterprise infrastructure. Another advantage of this modular design is that its functionality can be expanded. Since all modules communicate over standard, generally accepted protocols, the system can be integrated with various IT systems such as document management, metadata and master data management, enterprise resource planning, analytical and statistical packages and many others. The system for collecting and analyzing raw data can thus be easily integrated into an existing corporate IT infrastructure. In other circumstances it may be a first step toward the implementation of a corporate system for collecting, storing and analyzing data.

As you can see, the architecture of the primary data gathering and analysis system (Pic. 1) contains no metadata or master data management tools. On the face of it, this contradicts the proposed triple strategy of data warehouse development [7], which involves the integration of data, metadata and master data. It is also not clear how this solution relates to the recommended EDW architecture [8], and how the proposed approach differs from the numerous projects whose primary purpose is to demonstrate a quick, insignificant success.
Role of metadata and master data management projects
The task of primary data input, collection, storage and analysis has several distinctive features. First of all, primary data is entered manually into the fields of approved on-screen forms (e-forms). The e-form fields are therefore aligned with each other, both within an individual e-form and across e-forms, which means that different entities have different names and different fields. For this reason, at the early project stages the customer planned to store not forms or reports, but the individual field data from which forms and reports can be constructed later. A consistent set of data is a great first step toward metadata management, even if this requirement was not formulated explicitly.

Under these conditions, the first phase of metadata management does not require specific software and can be done with a pencil, an eraser and paper. The main difficulty of this stage is to reach agreement among all the experts on the terms, entities, their names and methods of calculation. Sometimes users have to abandon familiar but ambiguous names, and reaching agreement may require considerable effort and time. Fortunately, this work had been performed by the customer before turning to the external project team.

The names of e-form fields and the methods of input data checking and calculation are essential business metadata. The solution architecture, including the data warehouse model, is developed on the basis of the collected metadata, and the EDW table columns are created and named from it. Thus, implicit technical metadata management begins. Changes are inevitable during the maintenance of the developed system, and at this stage the developers need to manage a glossary of terms. If one was not created earlier, it is time to think about implementing a glossary, since the system maintenance process forces centralized metadata management to start in an explicit form.
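As an illustration, the glossary kernel described above can start as a simple table that registers, for every e-form field, the agreed business term, the warehouse column it maps to, and the workplace validation rule. The following Python sketch uses invented field and column names and is not tied to IBM Forms or InfoSphere Warehouse:

# A minimal sketch of the glossary "kernel": every e-form field is
# registered once, with its agreed business name, its validation rule,
# and the EDW column it maps to. All names here are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class GlossaryEntry:
    business_term: str               # the name all experts agreed on
    eform_field: str                 # field identifier in the e-form
    edw_column: str                  # column name in the warehouse model
    validate: Callable[[str], bool]  # workplace check before sending

GLOSSARY = {
    "region_code": GlossaryEntry(
        business_term="Reporting region",
        eform_field="frm_region",
        edw_column="REGION_CD",
        validate=lambda v: len(v) == 2 and v.isalpha(),
    ),
    "headcount": GlossaryEntry(
        business_term="Staff headcount at period end",
        eform_field="frm_staff_total",
        edw_column="HEADCOUNT_QTY",
        validate=lambda v: v.isdigit() and int(v) >= 0,
    ),
}

def check_form(values: dict[str, str]) -> list[str]:
    """Return the business terms whose values fail the agreed checks."""
    return [
        GLOSSARY[key].business_term
        for key, value in values.items()
        if key in GLOSSARY and not GLOSSARY[key].validate(value)
    ]

Even this small structure already serves two masters: the validation rules are business metadata, while the column mapping is the beginning of technical metadata.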
This scenario implies minimal overhead for the implementation of a centralized metadata management system, as the kernel of the future system has already been created. This core, though small and short on features, has the major asset of consistent metadata.

Centralized master data management should be started simultaneously with metadata management. The reason is simple: master data and metadata are closely connected [7], and master data implementation without a metadata project, as a rule, does not lead to success. The basis for a master data management system can be the set of codifiers, dictionaries, classifiers, identifiers, indices and glossaries maintained for the data warehouse. In this case well-conceived metadata performs a master data quality assurance role, eliminating data encoding conflicts, provided the DW design is skilled.

Thus, the systematization of business metadata, based on e-form fields and performed at the pre-project stage, made it possible to create trouble-free metadata and master data management systems and reduced the budget of the primary data gathering and analysis system implementation project. At the same time the project team was aware that the metadata and master data projects were being performed implicitly. At this stage only the designers' strategic vision and the developers' accuracy are required.
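For instance, the simplest master data structure of this kind is a cross-reference that maps each source system's local code to a single golden code, so that the same entity is encoded identically in the warehouse. The sketch below is purely illustrative (the system names and codes are invented) and only demonstrates the quality assurance role described above:

# A sketch of a master data cross-reference: each source system's local
# code resolves to one golden code. System names and codes are invented.
XREF = {
    ("forms_app", "MSK"): "RU-MOW",  # e-form encodes Moscow as "MSK"
    ("ledger",    "77"):  "RU-MOW",  # general ledger uses numeric "77"
    ("forms_app", "SPB"): "RU-SPE",
    ("ledger",    "78"):  "RU-SPE",
}

def to_golden_code(source_system: str, local_code: str) -> str:
    """Resolve a local code to the golden record; fail loudly on a gap
    so encoding conflicts surface during loading, not in reports."""
    try:
        return XREF[(source_system, local_code)]
    except KeyError:
        raise ValueError(
            f"No master data mapping for {local_code!r} from {source_system!r}"
        )

# Two systems, two local codes, one entity:
assert to_golden_code("ledger", "77") == to_golden_code("forms_app", "MSK")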
Recommended DW Architecture
The recommended enterprise data warehouse architecture, proposed in [8], is constructed in accordance with the following basic principles:
• EDW is the only source of noncontradictory data and should provide users with consistent, high-quality data gathered from different information systems.
• Data should be available to employees to the extent necessary and sufficient to carry out their duties.
• Users should have a common understanding of the data, i.e., there should be a common semantic space.
• It is necessary to eliminate data encoding conflicts in the source systems.
• Analytical calculations must be separated from operational data processing.
• Multilevel data organization should be ensured and maintained.
• It is necessary to follow the evolutionary approach, allowing business continuity and saving IT investments.
• The information content of the future data warehouse, the stages of EDW development and the commissioning of functional modules are determined, first of all, by the requirements of the business users.
• Data protection and secure storage must be ensured, and data protection activities should be adequate to the value of the information.

An architecture designed in accordance with these principles follows the examined modular design principle of "unsinkable compartments". By separating the architecture into modules, we also concentrate certain functionality in them (Pic. 2). ETL tools provide complete, reliable and accurate information gathering from data sources by means of algorithms, concentrated in ETL, for the collection, processing and conversion of data and for interaction with the metadata and master data management systems.
The metadata management system is the primary source of information about the data in the EDW; it keeps business, technical, operational and project metadata up to date. The master data management system eliminates conflicts in data encoding in the source systems. The Central Data Warehouse (CDW) has a single workload: reliable and secure data storage. The data structure in the CDW is optimized solely for effective data storage. The data sampling, restructuring and delivery (SRD) tools in this architecture are the only users of the CDW; they take on the whole job of filling the data marts and thereby reduce the user query workload on the CDW. Data marts contain data in formats and structures that are optimized for the tasks of each data mart's users.
Pic. 2. Recommended DW Architecture
As a result, users can work comfortably with the necessary amount of data even when the connection to the CDW is lost. The ability to quickly restore a data mart's content from the CDW after a data mart failure is also provided. The advantage of this architecture is the ability to separate the design, development, operation and refinement of individual EDW components without an overhaul of the whole system. This means that starting work on an EDW does not require hyper effort or hyper investment. To start, it is enough to implement a data warehouse with limited capabilities and, following the proposed principles, to develop a prototype that works and is truly useful for users. Then you need to identify the bottlenecks and evolve the required components.
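To make this division of labor concrete, the following sketch shows what an SRD step might look like: it is the only consumer of the CDW, it restructures storage-optimized rows into a query-friendly shape, and a full mart rebuild doubles as the recovery procedure after a data mart failure. The table names and the DB-API-style connections (e.g., sqlite3) are assumptions for illustration, not part of the actual products:

# A sketch of the SRD layer: sample from the CDW, restructure, deliver
# to the mart. All table and column names are hypothetical.
def refresh_mart(cdw_conn, mart_conn) -> None:
    # Sample and restructure: the CDW stores narrow, normalized rows;
    # the mart wants a pre-aggregated, query-friendly table.
    rows = cdw_conn.execute(
        """SELECT region_cd, period_id, SUM(headcount_qty)
           FROM staff_facts
           GROUP BY region_cd, period_id"""
    ).fetchall()

    # Deliver: rebuilding the mart in full means that recovery after a
    # data mart failure is a simple re-run of this function.
    mart_conn.execute("DELETE FROM headcount_by_region")
    mart_conn.executemany(
        "INSERT INTO headcount_by_region VALUES (?, ?, ?)", rows
    )
    mart_conn.commit()

Because only the SRD layer touches the CDW, user queries never compete with the storage workload, which is exactly the separation the architecture prescribes.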
Relation between the recommended architecture and the solution
The solution architecture for the primary data collection, storage and analysis system (Pic. 3) can be translated into the terms of the recommended EDW architecture and aligned with it.
Data are collected with the help of IBM Forms, which uses e-forms for manual data entry and allows the collected data to be transferred to other systems. The IBM Forms application server can later be integrated with repositories of structured and unstructured data. The only data source in this project is e-forms filled in according to strict rules, so at this stage there is no need for extraction, transformation and loading tools (e.g., DataStage). However, the project will evolve, and one can expect the need to connect other sources in the future. The option of adding ETL tools provides functional extensibility of the system without the need for a radical redesign. Data storage is implemented using IBM InfoSphere Warehouse. Data analysis can be performed by means of IBM InfoSphere Warehouse and IBM Cognos Business Intelligence (BI).
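Because individual field values, not whole forms, are stored, the load step can for now remain a thin loader, which is why a full ETL tool is not yet required. The sketch below illustrates this idea with invented table and field names, using sqlite3 as a stand-in for the warehouse database:

# A sketch of field-level loading: each e-form field value is persisted
# separately, so future forms and reports can be assembled from the same
# field data. Names are illustrative; sqlite3 stands in for the warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE form_field_values (
           form_id    TEXT,   -- which submitted e-form
           field_name TEXT,   -- agreed field name from the glossary
           value      TEXT,   -- field-level storage, not form-level
           loaded_at  TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

def load_form(form_id: str, fields: dict[str, str]) -> None:
    """Write every field of a validated e-form as a separate row."""
    conn.executemany(
        "INSERT INTO form_field_values (form_id, field_name, value)"
        " VALUES (?, ?, ?)",
        [(form_id, name, value) for name, value in fields.items()],
    )
    conn.commit()

load_form("F-2011-001", {"frm_region": "MSK", "frm_staff_total": "42"})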
Pic. 3. Recommended and solution architectures’ relations
IBM InfoSphere Warehouse provides the following data analysis tools: online analytical processing using Cubing Services-based OLAP tools and Alphablox, data mining using Miningblox and Alphablox, and data mining with Intelligent Miner. IBM Cognos 10 Business Intelligence (BI) is an integrated software suite for managing enterprise performance and is designed to aid in interpreting the data arising during the operation of the organization. Any employee can use IBM Cognos 10 BI to create business reports, analyze data and monitor events and metrics in order to make effective business decisions. Cognos BI 10 allows users to draw graphs comparing plan and fact, to create different types of reports, to embed reports in a convenient portal and to create custom dashboards. The analytical tools can be extended with IBM Cognos TM1 enterprise planning software, which provides a complete, robust and dynamic planning environment for the timely preparation of personalized budgets and forecasts.
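As a generic illustration of what such multidimensional analysis does (this is not the Cubing Services or Alphablox API), the following sketch rolls a small fact set up by any subset of dimensions chosen at query time:

# A generic OLAP-style rollup: aggregate a measure over every combination
# of the chosen dimensions. Data and dimension names are invented.
from collections import defaultdict

def rollup(facts, dims, measure):
    """Aggregate `measure` over each combination of the chosen `dims`."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        totals[key] += fact[measure]
    return dict(totals)

facts = [
    {"region": "RU-MOW", "year": 2010, "headcount": 120},
    {"region": "RU-MOW", "year": 2011, "headcount": 135},
    {"region": "RU-SPE", "year": 2011, "headcount": 80},
]
print(rollup(facts, ["year"], "headcount"))            # slice by year
print(rollup(facts, ["region", "year"], "headcount"))  # full cross-tab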
Metadata, obtained as a byproduct of matching e-forms, and master data, the result of reducing the data to normal form in the relational database of the EDW, are the prototypes of future enterprise metadata and master data management systems (Pic. 4). The first publication on the need to establish Data Dictionary / Directory Systems appeared in the early 1980s [9]. An article [10] published in 1995 stated that successful data integration requires establishing and maintaining a metadata flow. Current practice shows that this requirement needs to be clarified, since metadata are generated at all stages of the development and operation of information systems. The relationship between data, metadata and master data was discussed in detail in [8], where it was shown that master data contain business metadata and technical metadata. Data loading to the EDW cannot be performed properly without metadata and master data, which are heavily used at this stage, explicitly or implicitly. Cleansed and consistent data are stored, but the metadata and master data are usually ignored.
Pic. 4. Step to enterprise metadata & master data management

Creation of metadata and master data repositories significantly reduces EDW implementation costs, allows us to move from the storage of inconsistent forms to the storage of consistent data, and improves the quality of information services for business users [11].
Comparison of proposed and existing approaches
To answer the question of how the proposed approach differs from existing practice, consider a typical example: a financial analysis system development project in a bank [12]. The project team took the position that the creation of enterprise master data is a long, difficult and risky job. Therefore, the project was limited to the local task of reengineering the planning and forecasting processes, which was to pave the way for a bank reporting system, based on the integration of core banking systems, that uses more detailed data consistent with the general ledger. In the project team's eyes, development of an EDW was tantamount to the "Big Bang" that created the universe.

The project team, avoiding enterprise-wide solutions, introduced metadata and master data for one particular area of activity. As a result, the financial data repository is a highly specific data mart for financial statements (Pic. 5). In contrast, an EDW provides consistent corporate data for a wide range of analytical applications. Practice shows that only an EDW integrated with metadata and master data management systems can provide a single version of the data.

As you can see, the main objective of this project was a small demonstration of a quick win. Many of us have been put in the same situation, when there was an urgent need to demonstrate even a small but working system. An experienced developer knows that he will have to follow the advice of Brooks [13] and throw this first version away. The reason is that the cost of redesigning the applications and integrating them into the enterprise infrastructure would be prohibitive because of the lack of agreed metadata and master data.
Pic. 5. Example of existing approach
The final architecture of existing approaches
Let us briefly summarize the results of the analysis.
1. Existing approaches effectively implement disparate application data marts. The data in these marts may be valuable inside the units, but not for the company as a whole, because reconciliation is impossible due to differences in data semantics and coding.
2. The belief that creating an EDW is like a deadly trick with unpredictable consequences is widespread, so the decision is often made to create local data marts without EDW development.
3. The demand for instant results leads to the development and implementation of limited solutions with no relation to enterprise-level tasks.
Following these principles, the company initially introduces separate, independent data marts. The information in each data mart is not consistent with the data in the others, so management receives contradictory reports. Indicators with the same name in these reports may hide different entities; conversely, the same entity may have different names and may be calculated by different algorithms, based on different data sets, for different periods of time. As a result, the users of independent application data marts speak different business languages, and each data mart has its own metadata.

Another problem is the difference between the master data used in the independent data marts. The differences in the data encoding used in codifiers, dictionaries, classifiers, identifiers, indices and glossaries make it impossible to combine these data without serious analysis, design and development of master data management tools.

So the company creates several inconsistent data warehouses, which fundamentally contradicts the very idea of establishing an EDW as the one and only source of cleansed, consistent and noncontradictory historical data. The lack of enterprise metadata and master data management (shaded in Pic. 6) makes the possibility of reconciling data between them even less probable. Obviously, neither management nor users are inclined to trust the information contained in such a repository. So at the next step the need arises for a radical redesign, in fact for the creation of a new data warehouse that stores not reports but agreed indicators, from which the reports will be assembled.

Thus, the pursuit of short-term results and the need to demonstrate quick wins lead to the rejection of unified end-to-end metadata and master data management. The result of this approach is the presence of semantic islands, where the staff speak a variety of business languages. The enterprise data integration architecture must then be redesigned completely, which leads to repeated expenditures of time and money to create a full-scale EDW (Pic. 6).
Pic. 6. Result of existing approach: DW with intermediate Application DMs
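A tiny, deliberately contrived example shows why such reports contradict each other: two independent data marts can both publish an indicator named "headcount" yet compute it over different populations, so the numbers on management's table disagree. The data here are invented for illustration:

# Two marts, one indicator name, two definitions.
staff = [
    {"name": "A", "type": "permanent"},
    {"name": "B", "type": "permanent"},
    {"name": "C", "type": "contractor"},
]

# Mart 1 counts everyone; Mart 2 counts permanent staff only.
mart1_headcount = len(staff)
mart2_headcount = sum(1 for s in staff if s["type"] == "permanent")

assert mart1_headcount != mart2_headcount  # same name, different numbers
# Only shared metadata (one agreed definition of "headcount") and shared
# master data (one encoding of staff types) make such indicators comparable.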
Triple strategy and EDW development planning
The proposed approach is based on the triple strategy, the recommended architecture, the formulated principles and the best practices of EDW development. As a rule, developers need to quickly demonstrate at least a small success in data integration. In some companies, by contrast, one must develop and implement a corporate strategy for the EDW. However the task is formulated, in both cases you must keep the global goal before your eyes and reach it in short steps. The role of the compass that keeps these steps aligned with the strategic goals is given to the coordinated integration of data, metadata, and master data (Pic. 7):
1. master data integration, to eliminate data redundancies and inconsistencies;
2. metadata integration, to ensure a common understanding of data and metadata;
3. data integration, to provide end users with a single version of the truth on the basis of agreed metadata and master data.

As you know, a great journey starts and ends with a small step. The creation of a centralized data, metadata, and master data management environment is a priority task. But business users do not see immediate benefits for themselves from that environment, and management prefers to avoid long-term projects with no tangible results for the company's core business.
Pic. 7. DW development plan

Therefore, two or three pilot projects should be chosen in the first phase. The main selection criteria for these projects are management support and the willingness of users and experts to participate in formulating the task. The projects should provide the minimum acceptable functionality of the future EDW. As a tentative example, the following pilot projects are selected for the first phase (Pic. 7):
1. Data mining on the basis of Intelligent Miner (IM);
2. Multidimensional analysis (OLAP) using Cubing Services and Alphablox;
3. Unstructured text analysis using Unstructured Text Analysis Tools (UTAT).

All the tools deployed in the first phase of pilot projects are part of IBM InfoSphere Warehouse. It is important that users feel the real benefits of the EDW as a result of these short projects. The project team, together with the users, needs to analyze the results of the pilot projects and, if necessary, determine the actions needed to change the EDW environment and to adjust the tasks of data, metadata, and master data integration.

The next step is to choose three or four new pilot projects that could move the company toward the creation of the basic functionality of the future EDW. It is desirable that the selection process involve all concerned parties: company management, users, business experts, the project team and the EDW maintenance and support team. The centralized data, metadata and master data management environment must be developed enough to meet the requirements of the EDW's basic functionality. Assume that the following projects and tools are chosen for the second phase:
1. Report generation and data analysis with Cognos Business Insight Advanced and Report Studio;
2. Creation of complex interactive dashboards, based on Cognos Business Insight;
3. Scenario analysis using Cognos TM1;
4. Corporate planning with Cognos TM1.

The project results should be reexamined after completion of the second-phase pilot projects. The next step should be the development of a fully functional EDW, which is impossible without comprehensive support from the environment of centralized data, metadata and master data management. Thus, a rough plan for EDW development may look as follows:
• Strategic objectives:
  o Coordinated integration of data, metadata, and master data
• Tactical objectives:
  o Selection of two or three projects to demonstrate the benefits
  o Creation of a centralized data, metadata and master data management environment
  o Project results analysis and alteration of the EDW environment, if necessary
  o Implementation of three or four pilot projects, relying on the experience gained
  o In case of success, EDW development with company-wide functionality
  o EDW operation and modernization to fit new tasks, the formulation and solution of which became possible due to the accumulated experience of EDW operation
Thus, the EDW development project is not completed when the EDW is accepted as commissioned and fully operational. The EDW must evolve together with the company. Life goes on, new problems arise, and new information systems are required. If these systems can provide information important for data analysis across the enterprise, they must be connected to the EDW. To avoid integration issues, it is desirable to create each new system based on the capabilities of the centralized data, metadata and master data management environment.
In turn, the centralized data, metadata and master data management environment should be changed and improved to meet the needs of the new systems. Therefore, the centralized data, metadata and master data management environment must evolve for as long as the company and its IT systems exist, which is conventionally indicated in Pic. 7 by the arrows that extend beyond the chart.
Conclusion
An enterprise data warehouse, built through the coordinated integration of data, metadata, and master data, provides higher-quality information and analytical services at lower cost, reduces development time and enables decision making based on more accurate information. The proposed approach ensures the effective operation of the data, metadata, and master data management systems, eliminates the coexistence of modules with similar functionality, lowers the total cost of ownership and increases user confidence in the EDW's data. The integration of data, metadata, and master data, performed simultaneously with the development of EDW functionality, makes it possible to implement an agreed architecture, environment, life cycle, and key capabilities for the data warehouse and the metadata and master data management systems.
Literature
1. Asadullaev S., "Primary data gathering and analysis system – I. Problem formulation, data collecting and storing", 2011, http://www.ibm.com/developerworks/ru/library/sabir/warehouse-1/index.html
2. Asadullaev S., "Primary data gathering and analysis system – II. Primary data analysis", 2011, http://www.ibm.com/developerworks/ru/library/sabir/warehouse-2/index.html
3. IBM, "IBM Forms documentation", 2010, https://www.ibm.com/developerworks/lotus/documentation/forms/
4. IBM, "InfoSphere Warehouse overview 9.7", 2010, http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp?topic=/com.ibm.isw.release.doc/helpindex_isw.html
5. IBM, "Cognos Business Intelligence", 2010, http://publib.boulder.ibm.com/infocenter/cfpm/v10r1m0/index.jsp?topic=/com.ibm.swg.im.cognos.wig_cr.10.1.0.doc/wig_cr_id111gtstd_c8_bi.html
6. IBM, "Cognos TM1", 2010, http://www-01.ibm.com/software/data/cognos/products/tm1/
7. Asadullaev S., "Data, metadata and master data: triple strategy for data warehouse project", 2009, http://www.ibm.com/developerworks/ru/library/r-nci/index.html
8. Asadullaev S., "Data warehouse architecture – III", 2009, http://www.ibm.com/developerworks/ru/library/sabir/axd_3/index.html
9. Leong-Hong B.W., Plagman B.K., "Data Dictionary / Directory Systems", Wiley & Sons, 1982.
10. Hackathorn R., "Data Warehousing Energizes Your Enterprise", Datamation, Feb. 1, 1995, p. 39.
11. Asadullaev S., "Data quality management using IBM Information Server", 2010, http://www.ibm.com/developerworks/ru/library/sabir/inf_s/index.html
12. Financial Service Technology, "Mastering financial systems success", 2009, http://www.usfst.com/article/Issue-2/Business-Process/Mastering-financial-systems-success/
13. Brooks F.P., "The Mythical Man-Month: Essays on Software Engineering", Addison-Wesley Professional, Anniversary edition, 1995.