Dataware house Introduction by QuontraSolutions

Page 1

Introduction to Data Warehousing BY Quontra Solutions

phone :Â (404)-900-9988 email

: info@quontrasolutions.com

website : www.quontrasolutions.com


Data Warehouse 

Maintain historic data

Analysis to get better understanding of business

Better Decision making

Definition: A data warehouse is a –

subject-oriented

integrated

time-varying

non-volatile

collection of data that is used primarily in organizational decision making. -- Bill Inmon, Building the Data Warehouse 1996


Subject Oriented •

Data warehouse is organized around subjects such as sales, product, customer.

It focuses on modeling and analysis of data for decision makers.

Excludes data not useful in decision support process.


Integrated •

Data Warehouse is constructed by integrating multiple heterogeneous sources.

•

Data Preprocessing are applied to ensure consistency.

RDBMS Data Processing Data Transformation Legacy System

Flat File

Data Warehouse Data Processing Data Transformation


Non-volatile •

Mostly, data once recorded will not be updated.

•

Data warehouse requires two operations in data accessing -

Incremental loading of data

-

Access of data

load

access


Time Variant •

Provides information from historical perspective e.g. past 5-10 years

•

Every key structure contains either implicitly or explicitly an element of time


Why Data Warehouse? Problem Statement: •

ABC Pvt Ltd is a company with branches at USA, UK,CANADA,INDIA

The Sales Manager wants quarterly sales report across the branches.

Each branch has a separate operational system where sales transactions are recorded.


Why Data Warehouse? USA

UK

Get quarterly sales figure for each branch and manually calculate sales figure across branches.

Sales Manager

CANADA

INDIA

What if he need daily sales report across the branches?


Why Data Warehouse? Solution: •

Extract sales information from each database.

•

Store the information in a common repository at a single site.


Why Data Warehouse? USA

UK Data Warehouse CANADA

INDIA

Query & Analysis tools

Sales Manager


Characteristics of Data Warehouse 

Relational / Multidimensional database

Query and Analysis rather than transaction

Historical data from transactions

Consolidates Multiple data sources

Separates query load from transactions

Mostly non volatile

Large amount of data in order of TBs


When we say large - we mean it! • Terabytes -- 10^12 bytes: Yahoo! – 300 Terabytes and growing • Petabytes -- 10^15 bytes: Geographic Information Systems • Exabytes -- 10^18 bytes: • Zettabytes -- 10^21 bytes: National Medical Records • Zottabytes -- 10^24 bytes: Weather images Intelligence Agency Videos


OLTP Vs Data Warehouse (OLAP) OLTP

Data Warehouse (OLAP)

Indexes

Few

Many

Data

Normalized

Generally De-normalized

Joins

Many

Some

Derived data and aggregates

Rare

Common


Data Warehouse Architecture Operational System

Sales Data Mart Analysis

Operational System ETL (Extract Transform and Load)

Data Warehouse

Generic Data Mart

Flat Files

Flat Files

Data Mining

Inventory Data Mart

Reporting


ETL

ETL stands for Extract, Transform and Load

Data is distributed across different sources –

Data can be in different format –

Flat files, Streaming Data, DB Systems, XML, JSON CSV, Key Value Pairs

Different units and representation –

Country: IN or India

Date: 20 Nov 2010 or 20101020


ETL Functions 

Extract –

Collect data from different sources

Parse data

Remove unwanted data

Transform –

Project

Generate Surrogate keys

Encode data

Join data from different sources

Aggregate

Load


ETL Steps •

The first step in ETL process is mapping the data between source systems and target database.

The second step is cleansing of source data in staging area.

The third step is transforming cleansed source data.

Fourth step is loading into the target system.

Data before ETL Processing:

Data after ETL Processing:


ETL Glossary

Mapping: Defining relationship between source and target objects. Cleansing: The process of resolving inconsistencies in source data. Transformation: The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include aggregating, and integrating data from multiple sources. Staging Area: A place where data is processed before entering the warehouse.


Dimension 

Categorizes the data. For example - time, location, etc. A dimension can have one or more attributes. For example - day, week and month are attributes of time dimension. Role of dimensions in data warehousing. -

Slice and dice

-

Filter by dimensions


Types of dimensions •

Conformed Dimension - A dimension that is shared across fact tables.

Junk Dimension - A junk dimension is a convenient grouping of flags and indicators. For example, payment method, shipping method.

De-generated Dimension - A dimension key, that has no attributes and hence does not have its own dimension table. For example, transaction number, invoice number. Value of these dimension is mostly unique within a fact table.

Role Playing Dimensions - Role Playing dimension refers to a dimension that play different roles in fact tables depending on the context. For example, the Date dimension can be used for the ordered date, shipment date, and invoice date.

Slowly Changing Dimensions - Dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.


Types of Slowly Changing Dimension •

Type1 - The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all.

Type 2 - The Type 2 method tracks historical data by creating multiple records for a given value in dimension table with separate surrogate keys.

Type 3 - The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data.

Type 4 - The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of all changes.

Type 1, 2 and 3 are commonly used. Some books talks about Type 0 and 6 also. http://en.wikipedia.org/wiki/Slowly_changing_dimension


Facts 

Facts are values that can be examined and analyzed.

For Example - Page Views, Unique Users, Pieces Sold, Profit.

Fact and measure are synonymous.

Types of facts: –

Additive - Measures that can be added across all dimensions.

Non Additive - Measures that cannot be added across all dimensions.

Semi Additive - Measures that can be added across few dimensions and not with others.


How to store data? Facts and Dimensions: 1.

Select the business process to model

2.

Declare the grain of the business process

3.

Choose the dimensions that apply to each fact table row

4.

Identify the numeric facts that will populate each fact table row


Dimension Table 

Contains attributes of dimensions e.g. month is an attribute of Time dimension. Can also have foreign keys to another dimension table Usually identified by a unique integer primary key called surrogate key


Fact Table 

Contains Facts

Foreign keys to dimension tables

Primary Key: usually composite key of all FKs


Types of schema used in data warehouse

Star Schema

Snowflake Schema

Fact Constellation Schema


Star Schema 

Multi-dimensional Data

Dimension and Fact Tables

A fact table with pointers to Dimension tables


Star Schema


Snowflake Schema ď Ź

ď Ź

An extension of star schema in which the dimension tables are partly or fully normalized. Dimension table hierarchies broken down into simpler tables.


Snowflake Schema


Fact Constellation Schema •

A fact constellation schema allows dimension tables to be shared between fact tables.

This Schema is used mainly for the aggregate fact tables, OR where we want to split a fact table for better comprehension. 

For example, a separate fact table for daily, weekly and monthly reporting requirement.


Fact Constellation Schema

In this example, the dimensions tables for time, item, and location are shared between both the sales and shipping fact tables.


Operations on Data Warehouse 

Drill Down

Roll up

Slice & Dice

Pivoting


Drill Down Product Category e.g Home Appliances Sub Category e.g Kitchen Appliances

Region

Product e.g Toaster

Time


Roll Up Year

Fiscal Year

Quarter

Fiscal Quarter

Month

Fiscal Month

Fiscal Week

Day


Slice & Dice Product = Toaster

Region

Region

Product

Time Time


Pivoting Product

Time

Region

Product

Time •

Also called rotation

Rotate on an axis

Interchange Rows and Columns

Region


Advantages of Data Warehouse •

One consistent data store for reporting, forecasting, and analysis

Easier and timely access to data

Scalability

Trend analysis and detection

Drill down analysis


Disadvantages of Data Warehouse •

Preparation may be time consuming.

•

High associated cost


Case Study: Why Data Warehouse •

G2G Courier Pvt. Ltd. is an established brand in courier industry which has its own network in main cities and also have sub contracted in rural areas across the country to various partners.

•

The President of the company wants to look deep into the financial health of the company and different performance aspects.


Challenges •

Apart from G2G’s own transaction system, each partner has their own system which make the data very heterogeneous.

Granularity of data in various systems is also different. For eg: minute accuracy and day accuracy.

To do analysis on metrics like Revenue and Timely delivery across various geographical locations and partner, we need to have a unified system.


Data warehouse model Time

Product

Sales Fact

Region

Product Category


Thank You


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.