Data Warehouse Basic Guide

Page 1

Data Warehouse Definition Importance of Data Warehouse Its Components Two Data Warehousing Strategies ETL Processes For a Successful Warehouse Data Warehouse Pitfalls


Data Warehouse  A subject oriented, integrated, time-variant, non-volatile collection of data in support of management decisions (Bill Inmon)  Subject oriented -- data are organized around sales, products, etc.  Integrated -- data are integrated to provide a comprehensive view  Time variant -- historical data are maintained  Nonvolatile -- data are not updated by users


Limitations of Traditional Databases  lack of on-line historical data  residing in different operational systems  extremely poor query performance  operational database designs not suited for decision support


The Importance of Data Warehousing  More cost – effective decision making  Increase quality and flexibility of enterprise analysis as data warehouse contain accurate and reliable data  Ability to maintain better customer relationships  Unlimited analyses of enterprise information


Components of Data warehouse  Summarized data  Basically of two type: 1) Lightly (departmental information) 2) Highly (enterprise wide decision)  Current detail  Comes directly from operational system  But stored by subject area and represent entire organization not a department  System of record  Maintaining the source of record  Integration and transformation Programs  Programs that convert an application – specific data to enterprise data


Cont.. 

Performs many function like  Reformatting, recalculating  Adding time element  Identifying the default value  Summarizing and merging the data  Filling up the blank fields

 Archives  Contain old data which hold some amount of significance to the organization  Used for trend analysis  Metadata  Control access and analysis of the data warehouse contents 

To manage and control data warehouse creation and maintenance


Two Data Warehousing Strategies  Enterprise-wide warehouse, top down, the Inmon methodology  Data mart, bottom up, the Kimball methodology  When properly executed, both result in an enterprise-wide data warehouse


The Data Mart Strategy  The most common approach  Begins with a single mart and are added over time for more subject areas  Relatively inexpensive and easy to implement  Can be used as a proof of concept for data warehousing  Requires an overall integration plan


The Enterprise-wide Strategy  A comprehensive warehouse is built initially  An initial dependent data mart is built using a subset of the data in the warehouse  Additional data marts are built using subsets of the data in the warehouse  Like all complex projects, it is expensive, time consuming, and prone to failure  When successful, it results in an integrated, scalable warehouse


ETL Processes  Extraction, Transformation, and Loading Process

 The “plumbing” work of data warehousing  Data are moved from source to target data bases  A very costly, time consuming part of data warehousing


Sample ETL Tools  Teradata Warehouse Builder from Teradata  DataStage from Ascential Software  SAS System from SAS Institute  Power Mart/Power Center from Informatica  Sagent Solution from Sagent Software


Reasons for “Dirty” Data • • • • • • •

Dummy Values Absence of Data Multipurpose Fields Inappropriate Use of Address Lines Violation of Business Rules Non-Unique Identifiers Data Integration Problems


I. Data Cleansing and Extracting  Source systems contain “dirty data” that must be cleansed  ETL software contains rudimentary data cleansing capabilities  Specialized data cleansing software is often used. Important for performing name and address correction and householding functions  Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)


Steps in Data Cleansing  Parsing  Correcting  Standardizing  Matching  Consolidating


Parsing ď ˇ Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. ď ˇ Examples include parsing the first, middle, and last name; street number and street name; and city and state.


Correcting ď ˇ Corrects parsed individual data components using sophisticated data algorithms and secondary data sources. ď ˇ Example include replacing a vanity address and adding a zip code.


Standardizing ď ˇ Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules. ď ˇ Examples include adding a pre name, replacing a nickname, and using a preferred street name


Matching ď ˇ Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications. ď ˇ Examples include identifying similar names and addresses.


Consolidating • Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.


II. Data Transformation ď ˇ Transforms the data in accordance with the business rules and standards that have been established ď ˇ Example include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates


III. Data Loading  Data are physically moved to the data warehouse  The loading takes place within a “load window”  The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications


For a Successful Warehouse  From day one establish that warehousing is a joint user/builder project  Establish that maintaining data quality will be an ONGOING joint user/builder responsibility  Train the users one step at a time  Consider doing a high level corporate data model in no more than three weeks  Look closely at the data extracting, cleaning, and loading tools


Cont..  Determine a plan to test the integrity of the data in the warehouse  From the start get warehouse users in the habit of 'testing' complex queries  Coordinate system roll-out with network administration personnel  Implement a user accessible automated directory to information stored in the warehouse


Data Warehouse Pitfalls  Many warehouse end users will be trained and never or seldom apply their training  Large scale data warehousing can become an exercise in data homogenizing  Loading information only because it is available  Providing no maintenance to the data warehouse


Contact Us Business Name: Skyline Business School Address: Hauz Khas Enclave, New Delhi ­ 110 016, India. Phone: 91­11­26864848,:91­11­26866968 E­mail: info@skylinecollege.com

Resource: www.skylinecollege.com/our­programmes/pgp­data­warehousing


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.