Data Warehouse Architectures - I
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
19.10.2009
http://www.ibm.com/developerworks/ru/library/sabir/axd_1/index.html
Abstract

This paper starts a series of three articles on data warehousing (DW) architectures and their predecessors. The abundance of approaches, methods and recommendations blurs the concepts, advantages, drawbacks, limitations and applicability of specific architectural solutions. The first article is concerned with the evolution of the understanding of the OLAP role, with the components of DW architecture, and with virtual DWs and independent data marts. The second article considers the centralized DW (CDW) with ETL (Extract, Transform, Load), the CDW with ELT (Extract, Load, Transform), the CDW with an operational data store, and the extended model with data marts. The third article discusses centralized ETL with parallel DWs and data marts; the DW with intermediate application data marts; the DW with an integration bus; and the recommended DW architecture.
OLAP and OLTP

Any transactional system usually contains two types of tables. One of them is responsible for quick transactions. For example, a ticket sales system has to ensure reliable exchange of short messages between the system and a large number of ticket agents. Indeed, the entered and printed information dealing with a passenger's name, flight date, flight number, seat and destination can be estimated at around 1000 bytes. Thus, passenger service requires fast processing of short records.

The other type of tables contains summaries of sales for a specified period, by destination and by category of passenger. These tables are used by analysts and financial specialists at the end of a month, quarter or year, when the company's financial results are needed. Although the number of analysts is ten times smaller than the number of ticket agents, the volume of data required for analysis exceeds the average transaction size by several orders of magnitude. Not surprisingly, the execution of analytical queries increases the system's response time to ticket availability requests. Creating a system with a reserve of computational power can mitigate the negative impact of the analytical load on transactional activity, but it significantly increases the cost of the required software and hardware, because the excess processing capacity remains unused most of the time. The second factor that led to the separation of analytical and transactional systems is the different requirements that analytical and transactional workloads place on computing systems.

The OLAP story begins in 1993, when the article "Providing OLAP (On-line Analytical Processing) to User-Analysts" was published [1]. Initially, it appeared that the separation of transactional and analytical systems (OLTP - OLAP) was sufficient. However, it soon became clear that OLAP systems are poorly suited to the role of mediator between various data sources (including transactional systems) and client applications, and that a dedicated storage environment for analytical data is required. Initially, shared databases were proposed for this role, which implied the need to copy and store the original data from the data sources. This idea proved unviable, as transactional systems were commonly developed without a unified plan and thus contained conflicting and inconsistent information.
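To make the workload contrast concrete, here is a minimal Python sketch using SQLite; the sales schema and both queries are hypothetical, chosen only to illustrate the two access patterns.

import sqlite3

# Hypothetical ticket-sales schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        ticket_id INTEGER PRIMARY KEY,
        passenger TEXT, flight_no TEXT, seat TEXT,
        destination TEXT, flight_date TEXT, price REAL
    )
""")

# OLTP-style query: a point lookup touching a single short record,
# of the kind a ticket agent issues thousands of times a day.
conn.execute(
    "SELECT seat, flight_date FROM sales WHERE ticket_id = ?", (12345,)
)

# OLAP-style query: an aggregation that scans the whole table,
# of the kind an analyst runs at the end of a quarter.
conn.execute("""
    SELECT destination, strftime('%Y-%m', flight_date) AS month,
           COUNT(*) AS tickets_sold, SUM(price) AS revenue
    FROM sales
    GROUP BY destination, month
""")

The first query touches a record of about a kilobyte via an index; the second must read the whole table, which is why running both workloads on one machine makes the agents wait behind the analysts.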
Pic. 1. Evolution of OLTP and OLAP understandings

These considerations led to the idea of a data warehouse designed for secure data storage, together with systems for data extraction, transformation and loading (ETL). OLAP systems were now to draw their information from the data warehouse. It was soon revealed that a data warehouse accumulates very important enterprise information, so any unauthorized access to the DW is fraught with serious financial losses. Moreover, data formats oriented towards reliable storage are hard to combine with the requirements of fast information service. The geographical distribution and organizational structure of an enterprise also call for an information service approach specific to each of the company's units. The solution is the data mart, which contains the required subset of information from the DW. Data can be loaded from the DW into a data mart during periods of low user activity, and in case of a data mart failure the data can easily be retrieved from the DW with minimal losses. Data marts can handle reporting, statistical analysis, planning, scenario calculations (what-if analysis) and, in particular, multidimensional analysis (OLAP). Thus, OLAP systems, which initially claimed almost half of the computing world (leaving the other half to OLTP systems), now rank among the analytical tools at the workgroup level.
Six levels of data warehouse architectures

Data warehouse architecture at times resembles a child's toy blocks: almost any combination of blocks represents something that can be met in real life. Sometimes a company turns out to have several enterprise data warehouses, each positioned as the only unified source of consistent information. Multilevel data marts in the presence of a unified data warehouse bring even more fun. Why not build a new DM on top of another DM? After all, users want to combine some data from two DMs into a third one. Perhaps it would make sense if the DMs contained information that is not in the DW, for example, if users had enriched a DM with their own calculations and data. Even so, what is the value of these enriched data compared with data that have passed through the cleaning sieve of enterprise policies? Who is responsible for the quality of the data? How did they appear in the system? Nobody knows, but everyone wants access to information that is not in the DW.

Data warehouses are somewhat similar to a water purification system. Water with different chemical composition is collected from various sources, so specific cleaning and disinfection methods are applied to each source. The water delivered to consumers meets strict quality standards, and however much we complain about its quality, this approach prevents the spread of epidemics in the city. And it occurs to no one (I hope) to enrich purified water with water from a nearby pond. However, IT has its own laws.

Various data warehousing architectures will be considered below, though the most exotic approaches will not be examined. We will discuss enterprise data warehouse architecture in terms of six layers because, even where particular components are absent, the layers exist in some form.
Pic. 2. Six layers of DW architecture

The first layer consists of data sources, such as transactional and legacy systems, archives, separate files of known formats, MS Office documents, and any other structured data sources. The second layer hosts the ETL (Extract, Transform and Load) system. The main objective of ETL is to extract data from multiple sources, bring them to a consistent form and load them into the DW. The hardware and software system where ETL runs must have high throughput, but high computing performance is even more important. Therefore, the best ETL systems provide a high degree of task parallelism and can even run on clusters and computational grids.
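As a minimal sketch of this task parallelism, the following Python fragment fans extraction and transformation out across worker threads; the source names and the stub functions are hypothetical, standing in for real connectors.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical source registry; in a real ETL system each entry would
# carry connection details, credentials and an extraction query.
SOURCES = ["billing_db", "crm_db", "legacy_archive", "office_files"]

def extract(source: str) -> list[dict]:
    """Pull raw records from one source (stub for illustration)."""
    return [{"source": source, "payload": "..."}]

def transform(records: list[dict]) -> list[dict]:
    """Bring one source's records to the common target format (stub)."""
    return records

# Extraction and transformation are independent per source, so they
# parallelize naturally across threads, processes or cluster nodes.
with ThreadPoolExecutor(max_workers=4) as pool:
    batches = list(pool.map(lambda s: transform(extract(s)), SOURCES))

# Loading into the central DW is typically batched and serialized.
for batch in batches:
    pass  # a real load(batch) would write the batch into the warehouse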
The role of the next layer is reliable data storage, protected from unauthorized access. Under the proposed triple strategy [2], we believe that metadata and master data management systems should also be placed at this level. An operational data store (ODS) is needed when quick access is required to data that may be incomplete and not fully consistent but is available with the least possible time lag. A staging area is needed to implement a specific business process, for example, when a data steward must review data and approve its loading into the DW. Sometimes the term refers to an internal database buffer needed for internal processing operations: for example, ETL retrieves data from a source, writes it into the internal database, cleans it, and loads it into the DW. In this paper, the term "staging zone" is used for storage areas designed for operations performed by external users or systems in accordance with business requirements for data processing. Separating staging zones into a distinct DW component is necessary because these zones require additional administration, monitoring, security and audit processes.

The information systems at the data distribution layer still do not have a common name. They may simply be called ETL, like the extraction, transformation and loading systems of the second layer, or, to emphasize the differences, ETL-2. Data distribution systems at the fourth layer perform tasks that differ significantly from those of ETL, namely sampling, restructuring and data delivery (SRD: Sample, Restructure, Deliver). ETL extracts data from a variety of external systems; SRD selects data from a single DW. ETL receives inconsistent data that must be converted to a common format; SRD deals with purified data whose structure must be brought into compliance with the requirements of different applications. ETL loads data into a central DW; SRD delivers data to different data marts in accordance with access rights, delivery schedules and requirements for the information set.

The information access layer is intended to separate data storage functions from the information support of various applications. Data marts must have a data structure that best suits the needs of their information support tasks. Since there is no universal data structure that is optimal for all applications, data marts should be grouped by geographical, thematic, organizational, functional and other characteristics. The business applications layer is represented by scenario and statistical analysis, multidimensional analysis, planning and reporting tools, and other business applications.
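The SRD side of this contrast can be sketched in a few lines of Python; the mart specification and the warehouse rows below are hypothetical, meant only to show sampling (row filters) and restructuring (column subsets) applied to clean DW data.

from dataclasses import dataclass
from typing import Callable

# Hypothetical per-mart delivery specification: which rows a mart may
# receive (access rights) and which columns it needs (information set).
@dataclass
class MartSpec:
    name: str
    row_filter: Callable[[dict], bool]   # sampling rule
    columns: tuple[str, ...]             # restructured column subset

def deliver(dw_rows: list[dict], spec: MartSpec) -> list[dict]:
    """Sample and restructure rows from the single DW for one data mart."""
    return [
        {col: row[col] for col in spec.columns}
        for row in dw_rows
        if spec.row_filter(row)
    ]

# Toy warehouse content and one mart definition, for illustration.
dw_rows = [
    {"region": "EU", "product": "A", "revenue": 100.0},
    {"region": "US", "product": "B", "revenue": 250.0},
]
eu_sales = MartSpec("eu_sales", lambda r: r["region"] == "EU",
                    ("product", "revenue"))
print(deliver(dw_rows, eu_sales))   # [{'product': 'A', 'revenue': 100.0}]

Unlike ETL, nothing here cleans or reconciles formats: the single source is the already consistent DW, and the work consists of selecting and reshaping its data for each mart.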
Virtual Data Warehouse

The virtual data warehouse belongs to a romantic era, when it seemed that everything the human mind can imagine can be implemented. No one remembers virtual DWs any more, and so they are invented again and again, each time at a new level. So we have to start with something that is long gone but keeps trying to be reborn in a new guise. The concept of the virtual data warehouse was based on a few sublimely beautiful ideas.

The first great idea is cost reduction. There is no need to spend money on expensive equipment for a central data warehouse, no need for qualified personnel to maintain this repository, no need for server rooms with expensive cooling, fire control and monitoring equipment.

The second idea: we should work with the most recent data. The analytical system must work directly with the data sources, bypassing all middlemen. The intermediary is evil, as everyone knows. Our experts do not trust mediator applications; they have always worked directly with the source systems.
The third idea: we will write everything we need ourselves. All that is required is a workstation, access to the data sources, and a compiler. Our programmers are sitting idle anyway. They will develop an application that will itself query all the sources on a user's request, deliver the data to the user's computer, convert the divergent formats, perform the data analysis, and show everything on the screen.
Pic. 3. Virtual Data Warehouse

Does the company have many users with different needs? Do not worry, we will adapt our universal application into as many variants as you want. Is there a new data source? Wonderful: we will rewrite all of our applications to take into account the peculiarities of this source. Did a data format change? Fine: we will rewrite all of our applications to reflect the new format. Everything is going well; everybody is busy; we need more staff; the software development department should be expanded. Oh, and the users of the data sources are complaining that their systems have become very slow, because our universal client application queries the data source again and again, even when the same request has been answered before. So new, more powerful servers must be purchased.

What about the spending cuts? There are none; on the contrary, the costs only grow. We need more developers, more servers, more power and more space for server rooms. Are there still any benefits in this architecture? We got a tight coupling between data sources and analytical applications: any change in the source data must be agreed with the developers of the universal client in order to avoid passing distorted and misinterpreted data to the analysis applications. And a set of interfaces to all the different data source systems must be maintained on every workplace.
There is an opinion that all this is obvious and that it is not worth wasting time explaining things everyone understands. But then why, on a user's request such as "I need data from data marts A and B", do the same developers write a client application that accesses multiple data marts, reproducing over and over the dead architecture of the virtual data warehouse?
Independent data marts

Independent data marts emerged as a physical realization of the understanding that transactional and analytical data processing do not get along well on a single computer. The reasons for the incompatibility are as follows:

• Transactional processing is characterized by a large number of reads and writes to the database. Analytical processing may consist of only a few queries.
• A record in OLTP is typically shorter than 1000 characters, while a single analytical query may require megabytes of data for analysis.
• The number of transactional system users can reach several thousand employees; the number of analysts is usually within a few tens.
• A typical requirement for transactional systems is round-the-clock non-stop operation 365 days a year (24 x 365). Analytical systems have no such well-defined availability requirements, but a report not prepared on time can lead to serious trouble for the analysts as well as for the company.
• A transactional system's load is distributed more or less evenly over the year. An analytical system's load usually peaks at the end of accounting periods (month, quarter, year).
• Transactional processing is carried out mainly on current data. Analytical calculations address historical data.
• Data in transactional systems can be updated, whereas in analytical systems data should only be appended; any attempt to change data retroactively should at least attract attention.
Thus, transactional and analytical systems place different requirements on both software and hardware in terms of performance, capacity, availability, data models, data storage organization, data access methods, peak loads, data volumes and processing methods.

The creation of independent data marts was the first response to the need to separate analytical and transactional systems. In those days it was a big step forward, simplifying the design and operation of software and hardware, which no longer had to satisfy the mutually exclusive requirements of analytical and transactional systems. The advantage of independent data marts is the ease and simplicity of their organization: each of them operates on the data of one specific application, so there are no problems with metadata and master data, and no need for complex extraction, transformation and loading (ETL) systems. Data are simply copied from a transactional system to a data mart on a regular basis. One application, one data mart. This is why independent data marts are often called application data marts.

But what if users need information from multiple data marts simultaneously? The development of complex client applications that query many data marts at a time and convert the data on the fly has already been compromised by the virtual data warehouse approach.
Pic. 4. Independent data marts

So a single repository is needed: a data warehouse. But the information in the data marts is not consistent. Each data mart has inherited from its transactional system its terminology, data model and master data, including data encodings. For example, in one system the date of an operation may be encoded in the Russian format dd.mm.yyyy (day, month, year), and in another in the American format mm.dd.yyyy (month, day, year). So, when merging data, it is necessary to understand what 06.05.2009 means: June 5 or May 6. That is why an ETL (extract, transform and load) system is needed. Thus, the benefits of independent data marts disappear with the first user request to work with data from several data marts.
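The date ambiguity above is easy to resolve once the ETL system knows each source's convention; here is a minimal Python sketch, in which the source names and the format table are hypothetical.

from datetime import datetime

# Hypothetical mapping from source system to its date convention;
# in a real ETL system this knowledge lives in the metadata layer.
SOURCE_DATE_FORMATS = {
    "ru_billing": "%d.%m.%Y",   # Russian convention: day first
    "us_sales":   "%m.%d.%Y",   # American convention: month first
}

def normalize_date(raw: str, source: str) -> str:
    """Resolve an ambiguous date string using the source's known format."""
    fmt = SOURCE_DATE_FORMATS[source]
    return datetime.strptime(raw, fmt).date().isoformat()

# The same string means different dates depending on its origin.
print(normalize_date("06.05.2009", "ru_billing"))  # 2009-05-06 (May 6)
print(normalize_date("06.05.2009", "us_sales"))    # 2009-06-05 (June 5)

The point is that the disambiguation cannot be inferred from the data itself; it must come from per-source metadata, which is exactly the kind of knowledge an ETL system centralizes.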
Conclusion

This article has dealt with the evolution of the understanding of the OLAP role, with the components of DW architecture, and with virtual DWs and independent data marts. The next papers will discuss the advantages and limitations of the following architectures: the centralized DW with an ETL system, the DW with an ELT system, the central data warehouse with an operational data store (ODS), the extended model with data marts, centralized ETL with parallel DWs and data marts, the DW with intermediate application data marts, the data warehouse with an integration bus, and the recommended DW architecture.
Literature

1. Codd E.F., Codd S.B., Salley C.T. "Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate". Codd & Date, Inc., 1993.
2. Asadullaev S. "Data, metadata and master data: the triple strategy for data warehouse projects", http://www.ibm.com/developerworks/ru/library/r-nci/index.html, 2009.