Data Warehouse Architectures - III Sabir Asadullaev, Execitive IT Architect, SWG IBM EE/A 03.11.2009 http://www.ibm.com/developerworks/ru/library/sabir/axd_3/index.html
Abstract The series of three articles is devoted to architectures of data warehouse (DW) and their predecessors. The abundance of various approaches, methods and recommendations makes a mess of concepts, advantages and drawbacks, limitations and applicability of specific architecture solutions. The first article [1] is concerned with the evolution of OLAP role understanding, of DW architecture components, of virtual DW and independent data marts. The second article [2] considers the Centralized DW (CDW) with ETL (Extract, Transform, Load), CDW with ELT (Extract, Load, Transform), CDW with operational data store, and extended model with data marts. This article discusses centralized ETL with parallel DW and data marts; DW with intermediate application data marts, DW with integration bus, and the recommended DW architecture.
Centralized ETL with parallel DW and data marts In this case the whole architecture of EDW is built around the ETL (Extract, Transform and Load) system. Information from disparate sources goes to ETL which purifies and harmonizes data and loads it to a central data warehouse (CDW), to operational data store (ODS), if any, and, if necessary, to temporary storage area. This is a common practice in EDW development. But downloading data from the ETL to the data marts directly is unusual. In practice, this architecture is a result of users’ requirement to access analytical data as soon as possible, without time delay. Operational data store does not solve the problem, as users may be located in distant regions, and they require territorial data marts. The security limitations on the deployment of heterogeneous information in ODS may be another rationale for this architecture. This architecture has a trouble spot: one of the problems of its operation is data recovery difficulty after a crash of data marts, directly supplied from ETL. The point is that ETL tools are not designed for long term storage of extracted and cleaned data. Transactional systems tend to focus on ongoing operations. Therefore, in case of data losses in data marts directly associated with ETL, one has to either extract information from the transactional systems’ backup, or organize historical archives of data sources systems. These archives require funds for their development and operational support, and they are redundant, with a corporate standpoint, since they duplicate functions of EDW, but they are designed only to support a limited number of data marts. As another approach, sometimes these data marts are connected both to ETL directly and to data warehouse, which leads to confusion and misalignment of the results of analytical work. The reason is that data coming in EDW, as a rule, pass additional checks for consistency with the already loaded data. For example, financial document can be loaded with the requisites, almost coinciding with the document received by the EDW before. The ETL system, not having the information about all downloaded data, can not reveal whether the new document is a mistake or a result of legitimate correction. Running inside the data warehouse, data verification procedures could reveal such uncertainty. The new data will be discarded in case of errors. Contrary, if it is a required correction, the changes will affect both these numbers, and the corresponding aggregate figures.
19
Pic.1. Centralized ETL with parallel DW and data marts
Thus, the information, loaded into data mart from ETL directly, may contradict the data received from the EDW. Sometimes, to solve this contradiction, the identical data verification algorithms are implemented in the data marts and in EDW. The disadvantage is the need to support and to synchronize the same algorithms in the EDW and in data marts, fed from the ETL directly. To sum up, we can say that parallel data marts lead to additional data processing, to organization and maintenance of excess operating archives, to support of duplicate applications and to decentralized data processing, which causes information mismatch. Nevertheless, the parallel data marts can be implemented in cases where rapid access to analytical information is more important than disadvantages of this architecture.
DW with intermediate application data marts The following assumptions were the rationales for this architecture’s invention. 1. Some companies still deploy and operate independent disparate application data marts. Data quality in these data marts can meet the requirements of analysts who are working with DM. 2. Project stakeholders are confident that enterprise data warehouse implementation is a deadly technical trick with unpredictable consequences. As a matter of fact, the difficulties of EDW development and implementation are not technical, but are associated with poor project organization and with the lack of involvement of experts - future EDW users. However project team tries to avoid nonsignificant technology issues and to simplify up-to-the-minute tasks, instead of improving project organization. 3. The requirement for quick results. The necessity to report on a quarterly basis causes a need for quick tangible results. That’s why project team is not immune to the temptation to develop and implement a restricted solution with no relation to other tasks.
20
Following these principles either accidentally or deliberately, companies start data integration with introducing the separate independent data marts, in the hope that the data they contain will be easily, simply and quickly integrated when required. The reality is much more complicated. Although the quality of data in data marts can satisfy their users, this information is not consistent with data from other DMs. So reports, prepared for the top management and decision makers, can not be reduced to an uncontroversial view. The same indicators can be calculated by different algorithms based on different data sets for various periods of time. Figures with the same name may conceal different entities, and vice versa, the same entity may have different names in various DMs and reports.
Pic. 2. DW with intermediate application data marts
Diagnosis is a lack of common data sense. Users of independent data marts speak different business languages, and each DM contains its own metadata. Another problem lies in the difference of master data, used in the independent data marts. The differences in the data encoding, used codifier, dictionary, classifiers, identifiers, indices, glossaries make it impossible to combine these data without serious analysis, design and development of master data management tools. However, the organization already has approved plans, budget and timeline for EDW which is based on independent data marts. Management expects to get results quickly and inexpensively. Developers provided with a scarce budget, are forced to implement cheapest solutions. This is a proven recipe for creation a repository of inconsistent reports. Such repository contradicts the idea of data warehousing as a single and sole source of purified, coherent and consistent historical data. Obviously neither the company management nor the repository users are inclined to trust the information contained therein. Therefore, the total rebuilding of DW is required that usually implies
21
that new EDW should be created, which stores report figures indexes, rather than full reports. This allows to aggregate figures indexes into consistent reports. Successful EDW rebuilding is impossible without metadata and master data management systems. Both systems will impact only the central data warehouse (CDW), as independent data marts contain their own metadata and master data. As a result, management and experts can get coherent and consistent records, but they can not trace the data origin, due to discontinuity in the metadata data management between independent data marts and CDW. Thus, the desire to achieve immediate results and to demonstrate rapid progress leads to denial of unified, end-to-end management of metadata and master data. The result of this approach is the semantic islands, where users speak a variety of business languages. Nevertheless, this architecture can be implemented, where a single data model is not necessary, or is impossible, and where a relatively small amount of data must be transferred to CDW without knowledge of their origin and initial components. For example, an international company, operating in different countries, has already implemented several national data warehouses that follow local legal requirements, business constrain and financial accounting rules. CDW can require only piece of information from the national DWs for corporate reporting. There is no need to develop a unified data model, because it would not be demanded at the national level. Certainly, similar scheme requires a high degree of confidence in national data, and can be used, if intentional or unintentional distortion of the data will not lead to serious financial consequences for the entire organization.
Data Warehouse with Integration Bus Widespread acceptance of service - oriented architecture (SOA) [3] has led to an idea to use SOA in solutions for enterprise data warehousing instead of ETL tools to extract, transform, load to a central data warehouse, and instead of SRD tools to sample, restructure and deliver data to the data marts. Integration bus, which underpins the SOA, is designed for web-services and applications integration, and provides intellectual message routing, protocol mediation and message transformation between service consumer and service provider applications. At first glance, the functionality of service bus allows us to replace the ETL and SRD by integration bus. Indeed, ETL performs mediation between the central data warehouse (CDW) and data sources, and SRD is the mediator between the CDW and data marts. It would seem that the replacement of the ETL and SRD by the integration bus can benefit from the flexibility provided by the bus for application integration. Imagine that the CDW, the operational data store (ODS), the temporary storage area, metadata and master data management systems call the bus as independent applications with queries to update the data from data sources. First of all, the load on the data sources will increase by many times, since the same information will repeatedly transmitted by the request of the CDW, ODS, the temporary storage area and metadata and master data management systems. The obvious solution is to create for the integration bus its own data store to cache queries.
22
Pic. 5. Data Warehouse with Integration Bus
Secondly, the data gathering procedures, previously centralized in the ETL, now scattered over the application requesting the data. The discrepancy in various data gathering procedures for the CDW, ODS, metadata and master data management systems will arise Sooner or later. Data collected by different methods at different time intervals, processed by different algorithms contradict each other. Thereby the main goal of creating the CDW as the single source of consistent non-contradictory data will be destroyed. The consequences of SRD replacement by the integration bus are not so dramatic. CDW must be transformed into service in order to respond to data marts requests for data, directed through the integration bus. This means that data warehouse must conform to the most common style of web services and support HTTP / HTTPS protocols and SOAP / XML message format. This approach works well for short messages, but usually data marts require a large amount of data to pass through integration bus. The task can be solved by using the binary objects transmission. The necessary data restructuring can not be performed by the integration bus, and must be carried out either in the CDW, or in the data marts. Data restructuring inside CDW is unusual functionality for CDW, as it must be aware of all data marts, and has to carry the additional workload, irrelevant to its main goal: reliable data storage. Data restructuring inside data marts requires the direct access from DM to CDW. In many cases it’s unacceptable for security reasons. This function can be realized by some proxy service that receives data and transmits them to the data marts after the restructuring. So, we return to the idea of SRD tool just supplied with bus interface. Thus, integration bus can be used in the EDW architecture as a transport medium between the data sources and the ETL and between SRD and data marts in those cases where the components of EDW are separated geographically and are behind firewalls in accordance with strict requirements for data protection. In this case, for interoperability it is sufficient that the exchange was enabled 23
over HTTP / HTTPS protocols. All data collection, transformation and dissemination logic should still be concentrated in ETL and SRD.
Recommended EDW Architecture Architecture of an enterprise data warehouse (EDW) should satisfy many functional and nonfunctional requirements that depend on the specific tasks solved by the EDW. As there is no generic bank, airline, or oil company, so there is no single solution for the EDW to fit all occasions. But the basic principles that EDW must follow can still be formulated. First and foremost it is the data quality that can be understood as complete, accurate and reproducible data, delivered in time where they are needed. Data quality is difficult to measure directly, but it can be judged by the decisions made. That is, data quality requires investment, and it can generate profits in turn. Secondly, it is the security and reliability of data storage. The value of information stored in EDW can be compared to the market value of the company. Unauthorized access to EDW is a threat with serious consequences, and therefore adequate protection measures must be taken. Thirdly, the data must be available to the employees to the extent necessary and sufficient to carry out their duties. Fourthly, employees should have a unified understanding of the data, so a single semantic space is required. Fifthly, it is necessary, if possible, to resolve conflicts in data encoding in the source systems.
Pic. 4. Recommended EDW Architecture
The proposed architecture follows the examined principles of modular design - "unsinkable compartments�. The strategy of "divide and rule" is applicable not only in politics. By separating the
24
architecture into modules, we also concentrate in them certain functionality to give power over the unruly IT elements. ETL tools provide complete, reliable and accurate information gathering from data sources by means of algorithms concentrated in ETL for the collection, processing, data conversion and interaction with metadata and master data management systems. Metadata management system is the principal "keeper of wisdom" which you can ask for advice. Metadata management system supports the relevance of business metadata, technical, operational and project metadata. The master data system is an arbitrator for conflict resolution of data encoding. Central Data Warehouse (CDW) has only the workload of reliable and secure data storage. Depending on the tasks, the reliability of CDW can be up to 99,999%, to ensure smooth functioning with no more than 5 minutes of downtime per year. CDW’s software and hardware tools can protect data from unauthorized access, sabotage and natural disasters. Data structure in the CDW is optimized solely for the purpose of ensuring effective data storage. Data sample, restructuring, and delivery tools (SRD) in this architecture are the only users of the CDW, taking on the whole job of data marts filling and, thereby, reducing the user queries workload on the CDW. Data marts contain data in formats and structures that are optimized for tasks of specific data mart users. At present, when even a laptop can be equipped with a terabyte disk drive, the problems associated with multiple data duplication in the data mart do not matter. The main advantages of this architecture are: •
comfortable user’s operation with the necessary amount of data,
•
the possibility to restore quickly the contents from the CDW in case of data marts failover,
•
off-line data access when connection with the CDW is lost.
This architecture allows a separate design, development, operation and refinement of individual EDW components without a radical overhaul of the whole system. This means that the beginning of work on the establishment of EDW does not require hyper effort or hyper investments. To start it is enough to implement a data warehouse with limited capabilities, and following the proposed principles, to develop a prototype that is working and truly useful for users. Then you need to identify the bottlenecks and to evolve the required components. Implementation of this architecture along with the triple strategy for data integration, metadata, and master data [4], allows to reduce time and budgeting needed for EDW implementation and to develop it in accordance with changing business requirements.
Conclusion The article discusses the advantages and limitations of the following architectures: centralized ETL with parallel DW and data marts, DW with intermediate application data marts, data warehouse with integration bus and recommended EDW architecture. Recommended corporate data warehouse architecture allows creating in a short time and with minimal investment the workable prototype that is useful for business users. The key to this architecture, providing an evolutionary development of EDW, is the introduction of metadata and master data management systems in early stages of development.
25
Literature 1. Asadullaev S. «Data Warehouse Architectures – I», 19.10.2009, http://www.ibm.com/developerworks/ru/library/sabir/axd_1/index.html 2. Asadullaev S. «Data Warehouse Architectures – II», 23.10.2009. http://www.ibm.com/developerworks/ru/library/sabir/axd_2/index.html 3. Bieberstein N., Bose S., Fiammante M, Jones K., Shah R. “Service-Oriented Architecture Compass: Business Value, Planning, and Enterprise Roadmap”, IBM Press, 2005. 4. Asadullaev S. «Data, metadata, master data: the triple strategy for data warehouse project», 09.07.2009, http://www.ibm.com/developerworks/ru/library/r-nci/index.html
26