Data Warehouse Definition Importance of Data Warehouse Its Components Two Data Warehousing Strategies ETL Processes For a Successful Warehouse Data Warehouse Pitfalls
Data Warehouse A subject oriented, integrated, time-variant, non-volatile collection of data in support of management decisions (Bill Inmon) Subject oriented -- data are organized around sales, products, etc. Integrated -- data are integrated to provide a comprehensive view Time variant -- historical data are maintained Nonvolatile -- data are not updated by users
Limitations of Traditional Databases lack of on-line historical data residing in different operational systems extremely poor query performance operational database designs not suited for decision support
The Importance of Data Warehousing More cost – effective decision making Increase quality and flexibility of enterprise analysis as data warehouse contain accurate and reliable data Ability to maintain better customer relationships Unlimited analyses of enterprise information
Components of Data warehouse Summarized data Basically of two type: 1) Lightly (departmental information) 2) Highly (enterprise wide decision) Current detail Comes directly from operational system But stored by subject area and represent entire organization not a department System of record Maintaining the source of record Integration and transformation Programs Programs that convert an application – specific data to enterprise data
Cont..
Performs many function like Reformatting, recalculating Adding time element Identifying the default value Summarizing and merging the data Filling up the blank fields
Archives Contain old data which hold some amount of significance to the organization Used for trend analysis Metadata Control access and analysis of the data warehouse contents
To manage and control data warehouse creation and maintenance
Two Data Warehousing Strategies Enterprise-wide warehouse, top down, the Inmon methodology Data mart, bottom up, the Kimball methodology When properly executed, both result in an enterprise-wide data warehouse
The Data Mart Strategy The most common approach Begins with a single mart and are added over time for more subject areas Relatively inexpensive and easy to implement Can be used as a proof of concept for data warehousing Requires an overall integration plan
The Enterprise-wide Strategy A comprehensive warehouse is built initially An initial dependent data mart is built using a subset of the data in the warehouse Additional data marts are built using subsets of the data in the warehouse Like all complex projects, it is expensive, time consuming, and prone to failure When successful, it results in an integrated, scalable warehouse
ETL Processes Extraction, Transformation, and Loading Process
The “plumbing” work of data warehousing Data are moved from source to target data bases A very costly, time consuming part of data warehousing
Sample ETL Tools Teradata Warehouse Builder from Teradata DataStage from Ascential Software SAS System from SAS Institute Power Mart/Power Center from Informatica Sagent Solution from Sagent Software
Reasons for “Dirty” Data • • • • • • •
Dummy Values Absence of Data Multipurpose Fields Inappropriate Use of Address Lines Violation of Business Rules Non-Unique Identifiers Data Integration Problems
I. Data Cleansing and Extracting Source systems contain “dirty data” that must be cleansed ETL software contains rudimentary data cleansing capabilities Specialized data cleansing software is often used. Important for performing name and address correction and householding functions Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)
Steps in Data Cleansing Parsing Correcting Standardizing Matching Consolidating
Parsing ď ˇ Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. ď ˇ Examples include parsing the first, middle, and last name; street number and street name; and city and state.
Correcting ď ˇ Corrects parsed individual data components using sophisticated data algorithms and secondary data sources. ď ˇ Example include replacing a vanity address and adding a zip code.
Standardizing ď ˇ Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules. ď ˇ Examples include adding a pre name, replacing a nickname, and using a preferred street name
Matching ď ˇ Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications. ď ˇ Examples include identifying similar names and addresses.
Consolidating • Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
II. Data Transformation ď ˇ Transforms the data in accordance with the business rules and standards that have been established ď ˇ Example include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
III. Data Loading Data are physically moved to the data warehouse The loading takes place within a “load window” The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications
For a Successful Warehouse From day one establish that warehousing is a joint user/builder project Establish that maintaining data quality will be an ONGOING joint user/builder responsibility Train the users one step at a time Consider doing a high level corporate data model in no more than three weeks Look closely at the data extracting, cleaning, and loading tools
Cont.. Determine a plan to test the integrity of the data in the warehouse From the start get warehouse users in the habit of 'testing' complex queries Coordinate system roll-out with network administration personnel Implement a user accessible automated directory to information stored in the warehouse
Data Warehouse Pitfalls Many warehouse end users will be trained and never or seldom apply their training Large scale data warehousing can become an exercise in data homogenizing Loading information only because it is available Providing no maintenance to the data warehouse
Contact Us Business Name: Skyline Business School Address: Hauz Khas Enclave, New Delhi 110 016, India. Phone: 911126864848,:911126866968 Email: info@skylinecollege.com
Resource: www.skylinecollege.com/ourprogrammes/pgpdatawarehousing