Data Warehousing Back to Basics: Dimensional Modeling

Page 1

Jose Hernandez Director of Business Intelligence Dunn Solutions Group


Agenda Introduction What is a Data Warehouse? Dimensional Modeling


Overview

Chicago

Minneapolis

Raleigh

Full-service IT consulting firm Founded in 1988 Offices

 Chicago  Minneapolis  Raleigh  Bangalore, India

Bangalore


Practice Areas Application Development

Business Intelligence

Packaged Solutions

Custom App Dev

End-to-End BI

Legal Dashboard

Mobile App Dev

Data Warehouse

Visible Visitors

Portals

Dashboards

Web Design

DI + EIM/Quality

E-Commerce

Budgeting & Planning

Training Open-Enrollment

Map Intelligence On-Site + Custom Managed Services Jumpstart/Mentoring Predictive Analytics


Selected Clients

City of Chicago


Partnerships


Introduction: The New Series Focus on Data Warehousing Tool Agnostic Kimball Focus


Introduction: This Presentation Data Warehouse Back to Basics  We start with 50,000 foot view  Assuming you are new to data warehousing

 Keep it fundamental  Kimball point of view  What, Why and How


Why Build a Data Warehouse  We have mountains of data in this company but we can’t access it!  We need to slice and dice the data in a variety of ways.  You have to make it easy for business people to get at the data.

 Two people present the same business metrics and the numbers are different!  We want people to make decisions based on facts.


Why Build a Data Warehouse  Operational systems are not integrated

• • •

IDs and Codes not conformed Inconsistent format Data quality issues

 Operational systems generally not ideal for reporting

• • • •

Lack history Complex data structure Moving target

Poor query performance


Goals of a Data Warehouse  Make an organization’s data easy to access  Present the organization’s data consistently  Be adaptive and resilient to change

 Trusted and secure  Serve as the foundation for informed decisions  Business community must accept the warehouse if it is to be successful


Agenda Introduction What is a Data Warehouse? Dimensional Modeling


What is a Data Warehouse? • A simple question - does not seem to have simple answer! • Many definitions

• Two that you should consider

• Ralph Kimball • Bill Inmon


What is a Data Warehouse “A data warehouse is a system that extracts, cleans, conforms and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making...” …“It’s the place where users go to get their data” Ralph Kimball


What a Data Warehouse is NOT It is NOT…

Jose

 A product  A language  A project

 A data model  A copy of your transactional systems *Note: There are bundled products that come close to covering many aspects of a data warehouse!


The BI Stack

Extract Clean Conform Deliver ETL Management Services ETL Data Stores

Presentation Server

ETL System

Source Systems

Legacy mainframe systems Production databases Transactional systems Subscription data ‌

Data Marts Stars & Snowflakes Conformed Dimensions Conformed Facts

BI Applications

Front Room

Back Room

Metadata Infrastructure and Security

Reporting systems Ad hoc systems Dashboards Analytics systems


Our Focus Today

Extract Clean Conform Deliver ETL Management Services ETL Data Stores

Presentation Server

ETL System

Source Systems

Legacy mainframe systems Production databases Transactional systems Subscription data ‌

Data Marts Stars & Snowflakes Conformed Dimensions Conformed Facts

BI Applications

Front Room

Back Room

Metadata Infrastructure and Security

Reporting systems Ad hoc systems Dashboards Analytics systems


Agenda Introduction What is a Data Warehouse? Dimensional Modeling


Dimensional Modeling Dimensional modeling is a technique which allows you to design a database that meets the goals of a data warehouse.

Steps  Identify Business Process

 Identify Grain (level of detail)  Identify Dimensions  Identify Facts  Build Star


Identify the Business Process Requirements + Data Availability Determine discrete business processes (e.g.)  Sales

 Inventory  Student Registration


Identify the Grain  Grain is the level of detail stored in the data warehouse.

 Typically implement the lowest possible dimension grain:

Do we store all products, or just product categories?

not because users need individual records

Each month, week, day, hour?

Has a big impact on size of database.

because they want to aggregate in many different ways

 Can be a different grain for each fact


Identify Dimensions  Selection Criteria (where Gender=“Female”)  Row Headers (“College Name”, “Region”, …)  How do you want to slice the data?  What are the artifacts of your business?

 Time Dimension - Always present  Conforming Dimensions – very important aspect of a successful data warehouse!* *More on this later


Identify the Facts Facts are the storage place for the measurements we take... Flavors of Facts  Counts, Sums

 Transaction Grain

 Additive

 Periodic Snapshot Grain

 Non-Additive

 Accumulating Snapshot Grain

 Semi-Additive  Fact-less Facts


Dimensional Modeling - Stars Why is it called a star?


Dimensional Modeling - Stars Because it looks like a star! (kinda)  Fact Table in the center  Dimension Tables surrounding it


Dimensional Modeling - Constellation


Dimensional Modeling – Fact Tables Fact Tables  The center of the star schema  Based on a business process  Contains the business process measures  All measures in the fact are of the same grain

 Fact tables are narrow but deep


Dimensional Modeling – Dim Tables Dimension Tables  Business entities used to slice up (determine the grain) of the Facts  Verbose and textual  Should be conformed across the organization  Wide but shallow  Always use surrogate keys* *exception for the Date Dimension


Star Schema – Physical Model


Date Dimension (my favorite dimension) The Basic Date Dimension


Date Dimension Special Date Dimension Attributes  In another language  Semester (First Semester, Second Semester, …)

 High Season (Y/N), Low Season (Y/N)  Season (Winter, Spring, Summer, Fall)  Reporting Day (CurrDay, CurrDay-1D, CurrDay-2d)  Reporting Month (CurrMonth, CurrMont-1M, …)  Last Day of Quarter (Y/N)  Last Day of Week (Y/N)  American Holiday (Independence Day, Christmas, …)

 Canadian Holiday  And so many more!


Slowly Changing Dimensions Known as SCDs Dimensions change, how do you handle this? Three Basic Types •Type 1 •Type 2 •Type 3

Hmmm.... these are very descriptive names.


Slowly Changing Dimensions (SCDs)  Type 1:

• •

Do not preserve history Overwrite the record

 Type 2:

• •

Preserve all history

Add a new record, indicate current version

 Type 3:

Preserve a point-in-time history

Add additional column(s)

Type 2


Slowly Changing Dimensions: Type 2  SCD workhorse approach  When a dimension attribute changes, add a new row and update effective dates  Old fact rows point to the previous dimension row  New fact rows point to the current dimension row  You can use a flag too


Other types of Dimensions  Rapidly Changing Dimensions  Mini-dimensions  Degenerate Dimension  Junk Dimension  Outrigger


Rapidly Changing Dimensions AKA: Rapidly Changing Monster Dimensions  A dimension with attributes that change frequently is considered a rapidly changing dimension  Produces very large dimension tables  Cannot be handled with Type 2 approach (gets too big)


Mini-dimensions Technique for Rapidly Changing Monster Dimension

Customer Dimension PK

Customer ID Name Address DoB Date of First Order ------Age Gender Annual Income Number of Children Marital Status

 Use mini-dimensions

Split up the rapidly changing attributes to a mini-dimension

Join the mini-dimension to the fact table

PK

Customer Key More Foreign Keys Facts...

Customer Key Customer ID Name Address DoB Date of First Order

Minimizes rows (no discrete values) A significant compromise

FK1

New Customer Dimension

 Use banded ranges

• •

Fact Table

Customer Key

Customer Demographics Dim PK

Customer Demo Key Age Band Gender Annual Income Band Num of Children Band Marital Status

Fact Table 2

FK2 FK3

Customer Key Customer Demo Key More Foreign Keys Facts...


Other Dimensions  Rapidly Changing Dimensions  Mini-dimensions  Degenerate Dimension  Junk Dimension  Outrigger


Other Dimensions  Rapidly Changing Dimensions  Mini-dimensions  Degenerate Dimension

A dimension key that has no attributes. A dimensional attribute stored in the fact table

 Junk Dimension  Outrigger

Examples:  Transaction Number  Invoice Number  Line Item Number  Ticket Number


Other Dimensions  Rapidly Changing Dimensions  Mini-dimensions  Degenerate Dimension  Junk Dimension  Outrigger

Do you have a drawer in your kitchen that is a catch all for stuff that you might need...the junk drawer? A collection of low cardinality flags and indicators that you might need. Examples: Payment Type, Inbound/Outbound, Order Type


Other Dimensions  Rapidly Changing Dimensions

Exception, not the rule! The start of snow-flaking

 Mini-dimensions  Degenerate Dimension  Junk Dimension

A secondary dimension table is connected to a dimension table (not via a fact).

 Outrigger

Human Resource Fact Employee Dimension FK1

Employee Key More FK HR Fact 1 HR Fact 2

PK

Employee Key

FK1

Employee Attributes ...... Emp Skill Key

Emplyee Skill Group (Outrigger) PK

Emp Skill Key Emp Skill Description Emp Skill Category


Just the Facts Tables Home for the numerical measures Typically Additive Three types of Fact Tables  Transactional Grain  Periodic Snapshot Grain  Accumulating Snapshot Grain


Comparison of Fact Table Types Characteristic

Transaction Grain

Periodic Snapshot Grain

Accumulating Snapshot Grain

Time period represented

Point in time

Regular, predictable intervals

Indeterminate time span, typically short-lived

Grain

One row per transaction event

One row per period One row per life

Fact table loads

Insert

Insert

Insert and update

Fact row updates

Not revisited

Not revisited

Revisited whenever activity

Date dimensions

Transaction date

End of period date

Multiple dates for standard milestones

Facts

Transaction activity Performance for predefined time interval

Performance over finite lifetime


What makes it Enterprise? Conformed Dimensions & Facts  Common fields across the enterprise domains  Common definition across the enterprise domains

The Bus Architecture  Allows traversing across business processes  Promotes conformity


Conformed Dimensions / Bus Architecture


Dimensional Modeling Embellishments Snowflaking

Bridges

 Normalizing a dimension table  OLTP modeler tendency

 Many to many relationships not resolved in fact tables

 Not optimal for query performance

 Sits between a dimension and a fact

Outriggers  A dimension table is referenced in another dimension (i.e. hire date example)

 Ragged and variable depth hierarchies


Snowflaking What is Snowflaking?  Normalizing in a star schema  Should be avoided

Adds complexity to presentation layer

• • • •

SQL is more complex Adds burden to database optimizers Very little space savings Impacts Bitmap indexes*

 Sometimes OK (Outriggers for low cardinality attributes) *good for low cardinality fields


Snowflaking What is Snowflaking?  Normalizing in a star schema  Should be avoided

Adds complexity to presentation layer

• • • •

SQL is more complex Adds burden to database optimizers Very little space savings Impacts Bitmap indexes*

 Sometimes OK (Outriggers for low cardinality attributes) *good for low cardinality fields


DW Tips: Dimensional Modeling Myths  Dimensional data warehouses are appropriate for summary level data only

 Dimensional models presuppose the business questions and therefore are inflexible  Dimensional models are departmental  Brining a new data source into a dimensional data warehouse breaks existing schemas and requires new fact tables

Kimball University White Paper

 A good way to narrow the scope and manage risk is to focus on delivering the report most often requested  Dimensional models are fully de-normalized  Ralph Kimball invented the fact and dimension terminology


DW Tips: 10 Essential Dim Mod Rules  Load detailed atomic data into dimensional structures  Structure dimensional models around business processes  Ensure every fact table has a date dimension table  Ensure all facts in a Fact table are the same grain  Resolve many-to-many relationships in fact tables  Resolve many to one relationships in dimension tables

Kimball University Article, Margy Ross, InformationWeek

 Store report lables and filter domain values in dimension tables

 Dimension tables should use surrogate keys  Create conformed dimensions to integrate data across the enterprise

 Continuously balance requirements and realities to deliver a DW/BI solution that’s accepted by business users and that supports their decision making


Thank You Future Webinars  The ETL Process  Stars in Motion  Columnar and In-memory databases  Modeling Business Process

• Retail Sales • Inventory • CRM • HR


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.