Data Warehouse Architectures - II
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
23.10.2009
http://www.ibm.com/developerworks/ru/library/sabir/axd_2/index.html
Abstract

This second paper continues a series of three articles on data warehouse (DW) architectures and their predecessors. The abundance of various approaches, methods and recommendations leads to a muddle of concepts, advantages and drawbacks, limitations and applicability of specific architectural solutions. The first article [1] was concerned with the evolution in understanding of OLAP's role, with DW architecture components, and with virtual DWs and independent data marts. This publication considers the centralized DW (CDW) with ETL (Extract, Transform, Load), the CDW with ELT (Extract, Load, Transform), the CDW with an operational data store, and the extended model with data marts. The third article discusses centralized ETL with parallel DWs and data marts, the DW with intermediate application data marts, the DW with an integration bus, and the recommended DW architecture.
Centralized Data Warehouse with ETL

Virtual data warehouses and independent data marts showed that a unified data repository is required for the effective operation of analytical systems. To fill this repository, we need to extract and reconcile disparate data from various data sources and load the data into the repository.

ETL tools should be aware of all information about the data sources: the structure of stored data and their formats, differences in data processing methods, the meaning of the stored data, and the data processing schedule in transaction systems. Ignoring this information about data (metadata) inevitably degrades the quality of the information loaded into the repository. As a result, users lose confidence in the data warehouse and try to get information directly from the sources, which leads to unnecessary time expenditures by the specialists who maintain the source systems. Thus, ETL tools must use the information about data sources, and therefore should work in close conjunction with metadata management tools.

Extracted data should be converted into a unified form. Since the data are stored mainly in relational databases, we need to take into account differences in coded data values: dates can be encoded in different formats, addresses may use different abbreviations, and product encoding may follow different nomenclatures. Initially, information about master data was included in the data conversion algorithms of ETL tools. As the number of data sources and the volume of processed data increased (the first can reach thousands of systems, and the second can exceed ten terabytes per day), it became clear that master data management (MDM) has to be separated from ETL, with effective interaction ensured between MDM and ETL.

Thus, ETL tools, in conjunction with metadata and master data management tools, extract data from sources, transform them to a required format, and load them into a data repository. Usually a data warehouse repository is used to store the data, but the target can also be an operational data store (ODS), a staging area, or even a data mart. Therefore, one of the key requirements for ETL tools is the ability to interact with various systems. The growing volume of processed data and the need to deliver analytical information more quickly impose increased requirements on the performance and scalability of ETL tools. Therefore, ETL tools should use various parallel computing schemes and be able to run on high-performance systems of different architectures.
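As an illustration of the transformation step just described, here is a minimal sketch in Python, assuming hard-coded lookup tables; in a real system the date formats and code mappings would come from the metadata and MDM tools rather than from the ETL job itself:

```python
from datetime import datetime

# Hypothetical master data: in a real system these mappings would be served
# by the MDM and metadata tools, not hard-coded in the ETL job.
PRODUCT_CODES = {"SKU-001": "P1001", "ART.1": "P1001"}  # two source nomenclatures, one target code
DATE_FORMATS = ["%d.%m.%Y", "%Y-%m-%d", "%m/%d/%Y"]     # formats observed in the sources

def normalize_date(raw):
    """Try each known source format and emit the single warehouse format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError("unknown date format: %r" % raw)

def transform(record):
    """Convert one extracted record to the unified warehouse form."""
    return {
        "product": PRODUCT_CODES[record["product"]],  # master data code mapping
        "sold_on": normalize_date(record["date"]),
        "qty": int(record["qty"]),
    }

# The same product arriving from two source systems maps to one code and one date format
print(transform({"product": "SKU-001", "date": "23.10.2009", "qty": "5"}))
print(transform({"product": "ART.1", "date": "2009-10-23", "qty": "7"}))
```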
Pic. 1. Centralized Data Warehouse with ETL
As can be seen, ETL tools must meet different requirements:

• Data from various source systems should be collected even if one or more systems failed to complete data processing in time, and at least some of the required data should be provided.

• Collected information must be recognized and converted in accordance with transformation rules, with the help of the metadata and master data management systems.

• Transformed information must be loaded into a staging zone, a data warehouse, an ODS, or a data mart, as required by business and production processes.

• ETL tools must have a high throughput to collect and load ever-increasing data volumes into various repositories.

• ETL tools must possess high performance and scalability to reduce data processing time and to shorten lags in providing data for analytical tasks.

• ETL tools should provide various data extraction instruments in different operating environments: from batch data collection systems, which are not critical to time delays, to practically real-time incremental data processing (a minimal sketch of the latter follows this list).

Given these often mutually exclusive requirements, the design and development of ETL tools becomes a difficult task, especially when ready-made solutions are not used.
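The last requirement, near real-time extraction, is often met with a high-water-mark scheme. A minimal sketch, assuming a source table with a monotonically increasing change timestamp (the table and column names are illustrative; SQLite stands in for the source system):

```python
import sqlite3

def extract_increment(conn, last_mark):
    """Pull only rows changed since the previous run (high-water mark)."""
    rows = conn.execute(
        "SELECT id, amount, changed_at FROM sales "
        "WHERE changed_at > ? ORDER BY changed_at",
        (last_mark,)).fetchall()
    new_mark = rows[-1][2] if rows else last_mark  # advance the mark only if data arrived
    return rows, new_mark

# Demo with an in-memory stand-in for a source system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, changed_at TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 10.0, "2009-10-23 10:00"), (2, 20.0, "2009-10-23 10:05")])

rows, mark = extract_increment(conn, "2009-10-23 09:59")
print(len(rows), "new rows, next mark:", mark)  # 2 new rows
rows, mark = extract_increment(conn, mark)
print(len(rows), "new rows")                    # 0: nothing changed since the last run
```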
Centralized Data Warehouse with ELT

Traditional ETL systems are often blamed for poor efficiency and high cost due to the dedicated hardware and software they require. As an alternative to ETL, ELT (Extract, Load, Transform) tools were proposed, to which high productivity and efficient use of equipment are attributed. In order to understand the comparative advantages and disadvantages of ETL and ELT systems, let us turn to the three main functions of an enterprise data warehouse (EDW):

1. Full and timely collection and processing of information from data sources;
2. Safe and secure data storage;
3. Provision of data for analytical tasks.

The input to ETL / ELT systems is disparate data which have to be compared, cleaned, transformed to a common format, and processed according to calculation algorithms. On the one hand, data practically do not stay in ETL / ELT systems; on the other, the main information stream flows through these systems to the data repositories. Therefore, the requirements for information security can be moderate.
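For concreteness, a minimal ELT sketch in Python, using SQLite as a stand-in for the warehouse DBMS (all table and column names are illustrative): raw rows are loaded untransformed into a staging table, and the transformation runs inside the database as SQL:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for the CDW database
db.execute("CREATE TABLE staging (product TEXT, qty TEXT)")
db.execute("CREATE TABLE fact_sales (product TEXT, qty INTEGER)")

# E and L: extract from the source and load untransformed, duplicates included
db.executemany("INSERT INTO staging VALUES (?, ?)",
               [("p1", "5"), ("p1", "5"), ("p2", "3")])

# T: deduplication and type casting are pushed into the database engine
db.execute("INSERT INTO fact_sales "
           "SELECT DISTINCT product, CAST(qty AS INTEGER) FROM staging")
print(db.execute("SELECT * FROM fact_sales").fetchall())  # [('p1', 5), ('p2', 3)]
```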
Pic. 2. Centralized Data Warehouse with ELT
As a rule, a central data warehouse (CDW) contains a wealth of information, and its full disclosure could lead to serious losses for the company. In this case, a reliable information security perimeter is required around the CDW. The data structure in the CDW should best fit the requirements of long-term, reliable and secure storage. Using the ELT approach means that the CDW must also perform the data transformation.

Data delivery for analytical tasks requires a specific reorganization of data structures for each analytical application. Multidimensional analysis requires data cubes; statistical analysis, as a rule, uses data series; scenario and model analysis can use MS Excel files. In this architecture, business applications use data from the CDW directly, so the CDW has to store data structures that are optimized both for current and for future business applications. Moreover, such direct access increases the risk of unauthorized access to all data in the CDW.

Thus, we see that this architecture entrusts the CDW with the data transformation function and with information services for analytic tasks. Both of these functions are unusual for a CDW, which in this form becomes an "all in one" unit, where the functional components generally have lower quality than if they were implemented separately (e.g., a camera in a mobile phone). We will discuss later how data storage functions and data delivery functions for analytical applications can be separated.

Implementation of the ETL scheme makes it possible to separate data processing and data storage functions. The ELT scheme burdens the CDW with data conversion functions that do not belong to it. The migration of ETL functionality inside the CDW forces us not only to provide the same processing power, but also to design a universal platform that can both process data efficiently and store them. This approach could be applied to the SOHO segment, but an enterprise-wide system (like an EDW) requires an adequate solution. Despite the stated performance advantages of the ELT scheme, in practice it turns out that:

1. Data quality affects data load time. For example, ETL may discard up to 90% of duplicate data during data cleaning and transformation. In the same case, ELT will load all the data into the CDW, where data cleaning will then occur.

2. The data transformation rate in the CDW depends strongly on processing algorithms and data structures. In some cases SQL processing within the CDW's database is more efficient; in other cases external programs that extract the data to be processed and load the processing results back into the CDW will run much faster.

3. Some algorithms are very difficult to implement using SQL statements and stored procedures. This imposes restrictions on the use of the ELT scheme, while ETL can use appropriate and more effective tools for data processing.

4. ETL is a unified area where data extraction, processing and loading rules reside, which simplifies testing, modification and operation of algorithms. ELT, by contrast, separates the data collection and loading algorithms from the transformation algorithms. That is, to test a new transformation rule we have to either risk the integrity of the production data warehouse or create a test copy of the repository, which is very costly.

Thus, comparing ETL and ELT, we see that the advantages in data loading and transformation are not clear-cut, that ELT faces SQL constraints in data conversion, and that the savings on ELT software and hardware turn into financial costs for creating a software and hardware test copy of the CDW. The use of ELT may be justified if:

1. There are no stringent requirements for DW reliability, performance, and security.
2. Budget constraints force one to accept a risk of data loss.
3. The data warehouse and data sources interact via a service bus (SOA).

The latter case is the most exotic, but it has a right to exist under certain conditions. In this case the service bus is responsible for the integration between data sources and the DW at the messaging level, and for a minimal (by DW standards) data conversion and loading into the DW.
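To contrast with the in-database transformation shown earlier, here is a minimal sketch of point 1 from the ETL side, assuming an illustrative business key: duplicates are discarded in the pipeline and never reach the warehouse, whereas under ELT all rows would be loaded first and cleaned inside the CDW:

```python
def etl_dedup(records):
    """ETL-style cleaning: drop duplicates in the pipeline, before any load,
    so only unique records travel to the CDW."""
    seen = set()
    for rec in records:
        key = (rec["product"], rec["date"])  # illustrative business key
        if key not in seen:
            seen.add(key)
            yield rec

source = [{"product": "p1", "date": "2009-10-23"}] * 9 \
       + [{"product": "p2", "date": "2009-10-23"}]
cleaned = list(etl_dedup(source))
print("extracted %d rows, loading only %d" % (len(source), len(cleaned)))  # 10 -> 2
```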
Centralized DW with Operational Data Store

Data extraction, transformation and loading processes naturally take some time to complete. An additional delay is caused by the need to check the data loaded into the DW for consistency with data already available in the DW, to consolidate the data, and to recalculate totals based on the new data. The operational data store (ODS) was proposed in 1998 [2] in order to reduce the time delay between the receipt of information from ETL and the analytical systems. The ODS holds less accurate information due to the lack of internal checks, and more detailed data due to the skipped data consolidation phase. Therefore, data from the ODS are intended for tactical decisions, while information from the central data warehouse (CDW) is better suited for strategic missions [3].

Imagine a company that sells drinks and snacks from vending machines throughout the country. Fifteen minutes of downtime for an empty machine means a potential profit loss, so it is crucial to monitor the status of each machine and refill it with the missing goods. Collection and processing of all information across the country may take several hours, whereas product delivery is done locally: in every city there is a warehouse from which drinks and snacks are delivered to the nearest outlets. Warehouses are replenished through centralized procurement. Thus, there are two different types of tasks: tactical (filling vending machines) and strategic planning (filling warehouses).
Pic. 3. Centralized DW with Operational Data Store
Indeed, if an extra bottle of water is delivered as a result of incomplete and inaccurate data in the ODS, it will not lead to serious losses. However, a planning error caused by low data quality in the ODS may adversely affect a decision on the types and volumes of bulk purchases.

Information security requirements for the CDW and the ODS are also different. In our example, the ODS stores recent data for no more than a couple of hours. The CDW stores historical information, which can cover a period of several years, for a better prediction of the required purchase volumes. This historical information can be of considerable commercial interest to competitors. So tactical analysts can work directly with the ODS, while strategic analysts must work with the CDW through a data mart for the delineation of responsibility. Tactical analysts can access data nearly online due to the absence of a data mart. The data mart does not impede strategic analysis, since such analysis is carried out on a monthly or even quarterly basis.

The architecture shown in Pic. 3 involves direct interaction between the CDW and business applications. Analysis of the strengths and limitations of this approach is given in the section "Extended Model with Data Marts". For now it should be noted that the ODS actually plays the additional role of a staging zone when data move sequentially from the ODS to the CDW. Tactical analysts, working with data from the ODS, wittingly or unwittingly reveal errors and contradictions in the data, thereby improving their quality. In this scheme the corrected data from the ODS are transferred to the CDW. However, there are other schemes, for example, when data from the ETL arrive both in the ODS and in the CDW in parallel; unnecessary data are simply erased from the ODS after use. This scheme is applicable in cases where human intervention in the data can only distort them, voluntarily or involuntarily (a sketch of this parallel scheme follows below).
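A minimal sketch of this parallel scheme, with plain Python lists standing in for the two repositories (all names and checks are illustrative): the same ETL output feeds the ODS as-is and the CDW only after validation and consolidation, and ODS data are erased after use:

```python
ods, cdw = [], []  # in-memory stand-ins for the two repositories

def load_parallel(batch):
    """Feed the ODS and the CDW from the same ETL output, at different quality levels."""
    ods.extend(batch)                                # fast path: no internal checks
    validated = [r for r in batch if r["qty"] >= 0]  # CDW path: validation ...
    totals = {}
    for r in validated:                              # ... and consolidation of totals
        totals[r["product"]] = totals.get(r["product"], 0) + r["qty"]
    cdw.append(totals)

load_parallel([{"product": "water", "qty": 3},
               {"product": "water", "qty": -1},      # bad record: visible in the ODS only
               {"product": "snack", "qty": 2}])
print("ODS rows:", len(ods), "CDW totals:", cdw)     # ODS keeps detail, CDW keeps checked totals
ods.clear()                                          # unnecessary ODS data are simply erased after use
```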
Extended Model with Data Marts

Direct access of business applications to the CDW is admissible if users' requests do not interfere with the normal functioning of the CDW, if users communicate with the CDW through high-speed lines, and if accidental access to all data in the CDW does not lead to serious losses.

Administration of direct user access to the CDW is an extremely difficult task. For example, a user from one department may be authorized to access data from another unit only 10 days after the data become available. Another user may see only aggregates, but no detailed data. There are other, more complicated access rules. Their management, accounting, and change lead to inevitable errors caused by combinations of intricate access conditions (see the sketch below). Data marts, which contain information intended for a specific group of users, significantly reduce the risk of information security breaches.

To this day, the quality of communication lines is a serious problem for geographically distributed organizations. In the event of a broken line or insufficient bandwidth, remote users are denied access to the information contained in the CDW. The solution is remote data marts, which are filled either after working hours or incrementally, as information becomes available, using assured data transfer.

Various business applications require different data formats: multidimensional cubes, data series, two-dimensional arrays, relational tables, MS Excel files, comma-separated values, XML files, etc. No single data structure in the CDW can meet all these requirements. The solution is the creation of data marts whose data structures are optimized for the specific requirements of individual applications.

Another reason for creating data marts is the reliability requirement for the CDW, which is often defined as four or five nines. This means that the downtime of the CDW may not exceed roughly 5 minutes per year (99.999%) or 1 hour per year (99.99%). Creation of a hardware and software system with such characteristics is a complex and expensive engineering task. Requirements for protection against terrorist attacks, sabotage and natural disasters further complicate the construction of the software and hardware system and the implementation of appropriate organizational arrangements. The more complex such a system is and the more data it stores, the higher are the cost and complexity of its support. Data marts dramatically reduce the CDW load, both in the number of users and in the volume of data in the repository, as these data can be optimized for storage rather than for query facilities.
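As a sketch of why such administration is error-prone, here are just the two access rules quoted above encoded in Python (the user and row attributes are illustrative); every additional rule multiplies the combinations that must be managed and tested:

```python
from datetime import date, timedelta

def may_access(user, row, today):
    """Evaluate the two sample access rules against one CDW row."""
    if row["dept"] != user["dept"]:
        # Rule 1: another unit's data open only 10 days after they become available
        if today < row["available_on"] + timedelta(days=10):
            return False
    if user["aggregates_only"] and row["level"] == "detail":
        return False  # Rule 2: this user may see aggregates but no detailed data
    return True

analyst = {"dept": "sales", "aggregates_only": True}
row = {"dept": "finance", "available_on": date(2009, 10, 20), "level": "detail"}
print(may_access(analyst, row, today=date(2009, 10, 23)))  # False: the embargo has not expired
```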
Pic. 4. Extended Model with Data Marts
If the data marts are filled directly from the CDW, the actual number of users is reduced from hundreds and thousands to the tens of data marts, which become the CDW's users. Implementation of SRD (Sample, Restructure, Delivery) tools reduces the number of users to exactly one. In this case, the logic of supplying information to the data marts is concentrated in the SRD, so the data marts can be optimized for serving user requests, while the CDW hardware and software can be optimized exclusively for reliable, secure data storage. SRD tools also soften the CDW workload because different data marts can request the same data, whereas the SRD retrieves the data once, converts them to various formats, and delivers them to the different data marts.
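A minimal sketch of the SRD idea, with in-memory stand-ins for the CDW and the marts (names and formats are illustrative): the data are sampled from the CDW once, restructured into each mart's format, and delivered:

```python
import csv, io, json

def to_csv(rows):
    """Restructure rows into a flat file for spreadsheet-oriented marts."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def srd_deliver(cdw_rows, marts):
    """Sample once from the CDW, Restructure per mart, Deliver to each target."""
    for name, restructure in marts.items():
        payload = restructure(cdw_rows)  # same rows, a different shape per mart
        print("delivered to %s:" % name, repr(payload)[:70])

rows = [{"product": "water", "qty": 3}, {"product": "snack", "qty": 2}]  # sampled once
srd_deliver(rows, {
    "reporting_mart": to_csv,                   # flat file for reporting users
    "analytics_mart": lambda r: json.dumps(r),  # serialized structure for an application
})
```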
Conclusion

This paper considered the following architectures: the centralized DW with an ETL system, the DW with an ELT system, the central data warehouse with an operational data store (ODS), and the extended model with data marts. The next paper will discuss the advantages and limitations of centralized ETL with parallel DWs and data marts, the DW with intermediate application data marts, the data warehouse with an integration bus, and the recommended DW architecture.
Literature

1. Asadullaev S. "Data Warehouse Architectures - I", 19.10.2009, http://www.ibm.com/developerworks/ru/library/sabir/axd_1/index.html
2. Inmon W. "The Operational Data Store. Designing the Operational Data Store", Information Management Magazine, July 1998.
3. "Building the Operational Data Store on DB2 UDB Using IBM Data Replication, WebSphere MQ Family, and DB2 Warehouse Manager", SG24-6513-00, IBM Redbooks, 19 December 2001, http://www.redbooks.ibm.com/abstracts/sg246513.html?Open