Jose Hernandez Director of Business Intelligence Dunn Solutions Group
Agenda Introduction What is a Data Warehouse? Dimensional Modeling
Overview
Chicago
Minneapolis
Raleigh
Full-service IT consulting firm Founded in 1988 Offices
Chicago Minneapolis Raleigh Bangalore, India
Bangalore
Practice Areas Application Development
Business Intelligence
Packaged Solutions
Custom App Dev
End-to-End BI
Legal Dashboard
Mobile App Dev
Data Warehouse
Visible Visitors
Portals
Dashboards
Web Design
DI + EIM/Quality
E-Commerce
Budgeting & Planning
Training Open-Enrollment
Map Intelligence On-Site + Custom Managed Services Jumpstart/Mentoring Predictive Analytics
Selected Clients
City of Chicago
Partnerships
Introduction: The New Series Focus on Data Warehousing Tool Agnostic Kimball Focus
Introduction: This Presentation Data Warehouse Back to Basics We start with 50,000 foot view Assuming you are new to data warehousing
Keep it fundamental Kimball point of view What, Why and How
Why Build a Data Warehouse We have mountains of data in this company but we can’t access it! We need to slice and dice the data in a variety of ways. You have to make it easy for business people to get at the data.
Two people present the same business metrics and the numbers are different! We want people to make decisions based on facts.
Why Build a Data Warehouse Operational systems are not integrated
• • •
IDs and Codes not conformed Inconsistent format Data quality issues
Operational systems generally not ideal for reporting
• • • •
Lack history Complex data structure Moving target
Poor query performance
Goals of a Data Warehouse Make an organization’s data easy to access Present the organization’s data consistently Be adaptive and resilient to change
Trusted and secure Serve as the foundation for informed decisions Business community must accept the warehouse if it is to be successful
Agenda Introduction What is a Data Warehouse? Dimensional Modeling
What is a Data Warehouse? • A simple question - does not seem to have simple answer! • Many definitions
• Two that you should consider
• Ralph Kimball • Bill Inmon
What is a Data Warehouse “A data warehouse is a system that extracts, cleans, conforms and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making...” …“It’s the place where users go to get their data” Ralph Kimball
What a Data Warehouse is NOT It is NOT…
Jose
A product A language A project
A data model A copy of your transactional systems *Note: There are bundled products that come close to covering many aspects of a data warehouse!
The BI Stack
Extract Clean Conform Deliver ETL Management Services ETL Data Stores
Presentation Server
ETL System
Source Systems
Legacy mainframe systems Production databases Transactional systems Subscription data ‌
Data Marts Stars & Snowflakes Conformed Dimensions Conformed Facts
BI Applications
Front Room
Back Room
Metadata Infrastructure and Security
Reporting systems Ad hoc systems Dashboards Analytics systems
Our Focus Today
Extract Clean Conform Deliver ETL Management Services ETL Data Stores
Presentation Server
ETL System
Source Systems
Legacy mainframe systems Production databases Transactional systems Subscription data ‌
Data Marts Stars & Snowflakes Conformed Dimensions Conformed Facts
BI Applications
Front Room
Back Room
Metadata Infrastructure and Security
Reporting systems Ad hoc systems Dashboards Analytics systems
Agenda Introduction What is a Data Warehouse? Dimensional Modeling
Dimensional Modeling Dimensional modeling is a technique which allows you to design a database that meets the goals of a data warehouse.
Steps Identify Business Process
Identify Grain (level of detail) Identify Dimensions Identify Facts Build Star
Identify the Business Process Requirements + Data Availability Determine discrete business processes (e.g.) Sales
Inventory Student Registration
Identify the Grain Grain is the level of detail stored in the data warehouse.
Typically implement the lowest possible dimension grain:
•
Do we store all products, or just product categories?
•
not because users need individual records
•
Each month, week, day, hour?
•
•
Has a big impact on size of database.
because they want to aggregate in many different ways
Can be a different grain for each fact
Identify Dimensions Selection Criteria (where Gender=“Female”) Row Headers (“College Name”, “Region”, …) How do you want to slice the data? What are the artifacts of your business?
Time Dimension - Always present Conforming Dimensions – very important aspect of a successful data warehouse!* *More on this later
Identify the Facts Facts are the storage place for the measurements we take... Flavors of Facts Counts, Sums
Transaction Grain
Additive
Periodic Snapshot Grain
Non-Additive
Accumulating Snapshot Grain
Semi-Additive Fact-less Facts
Dimensional Modeling - Stars Why is it called a star?
Dimensional Modeling - Stars Because it looks like a star! (kinda)  Fact Table in the center  Dimension Tables surrounding it
Dimensional Modeling - Constellation
Dimensional Modeling – Fact Tables Fact Tables The center of the star schema Based on a business process Contains the business process measures All measures in the fact are of the same grain
Fact tables are narrow but deep
Dimensional Modeling – Dim Tables Dimension Tables Business entities used to slice up (determine the grain) of the Facts Verbose and textual Should be conformed across the organization Wide but shallow Always use surrogate keys* *exception for the Date Dimension
Star Schema – Physical Model
Date Dimension (my favorite dimension) The Basic Date Dimension
Date Dimension Special Date Dimension Attributes In another language Semester (First Semester, Second Semester, …)
High Season (Y/N), Low Season (Y/N) Season (Winter, Spring, Summer, Fall) Reporting Day (CurrDay, CurrDay-1D, CurrDay-2d) Reporting Month (CurrMonth, CurrMont-1M, …) Last Day of Quarter (Y/N) Last Day of Week (Y/N) American Holiday (Independence Day, Christmas, …)
Canadian Holiday And so many more!
Slowly Changing Dimensions Known as SCDs Dimensions change, how do you handle this? Three Basic Types •Type 1 •Type 2 •Type 3
Hmmm.... these are very descriptive names.
Slowly Changing Dimensions (SCDs) Type 1:
• •
Do not preserve history Overwrite the record
Type 2:
• •
Preserve all history
Add a new record, indicate current version
Type 3:
•
Preserve a point-in-time history
•
Add additional column(s)
Type 2
Slowly Changing Dimensions: Type 2 SCD workhorse approach When a dimension attribute changes, add a new row and update effective dates Old fact rows point to the previous dimension row New fact rows point to the current dimension row You can use a flag too
Other types of Dimensions Rapidly Changing Dimensions Mini-dimensions Degenerate Dimension Junk Dimension Outrigger
Rapidly Changing Dimensions AKA: Rapidly Changing Monster Dimensions A dimension with attributes that change frequently is considered a rapidly changing dimension Produces very large dimension tables Cannot be handled with Type 2 approach (gets too big)
Mini-dimensions Technique for Rapidly Changing Monster Dimension
Customer Dimension PK
Customer ID Name Address DoB Date of First Order ------Age Gender Annual Income Number of Children Marital Status
Use mini-dimensions
•
Split up the rapidly changing attributes to a mini-dimension
•
Join the mini-dimension to the fact table
PK
Customer Key More Foreign Keys Facts...
Customer Key Customer ID Name Address DoB Date of First Order
Minimizes rows (no discrete values) A significant compromise
FK1
New Customer Dimension
Use banded ranges
• •
Fact Table
Customer Key
Customer Demographics Dim PK
Customer Demo Key Age Band Gender Annual Income Band Num of Children Band Marital Status
Fact Table 2
FK2 FK3
Customer Key Customer Demo Key More Foreign Keys Facts...
Other Dimensions Rapidly Changing Dimensions Mini-dimensions Degenerate Dimension Junk Dimension Outrigger
Other Dimensions Rapidly Changing Dimensions Mini-dimensions Degenerate Dimension
A dimension key that has no attributes. A dimensional attribute stored in the fact table
Junk Dimension Outrigger
Examples: Transaction Number Invoice Number Line Item Number Ticket Number
Other Dimensions Rapidly Changing Dimensions Mini-dimensions Degenerate Dimension Junk Dimension Outrigger
Do you have a drawer in your kitchen that is a catch all for stuff that you might need...the junk drawer? A collection of low cardinality flags and indicators that you might need. Examples: Payment Type, Inbound/Outbound, Order Type
Other Dimensions Rapidly Changing Dimensions
Exception, not the rule! The start of snow-flaking
Mini-dimensions Degenerate Dimension Junk Dimension
A secondary dimension table is connected to a dimension table (not via a fact).
Outrigger
Human Resource Fact Employee Dimension FK1
Employee Key More FK HR Fact 1 HR Fact 2
PK
Employee Key
FK1
Employee Attributes ...... Emp Skill Key
Emplyee Skill Group (Outrigger) PK
Emp Skill Key Emp Skill Description Emp Skill Category
Just the Facts Tables Home for the numerical measures Typically Additive Three types of Fact Tables Transactional Grain Periodic Snapshot Grain Accumulating Snapshot Grain
Comparison of Fact Table Types Characteristic
Transaction Grain
Periodic Snapshot Grain
Accumulating Snapshot Grain
Time period represented
Point in time
Regular, predictable intervals
Indeterminate time span, typically short-lived
Grain
One row per transaction event
One row per period One row per life
Fact table loads
Insert
Insert
Insert and update
Fact row updates
Not revisited
Not revisited
Revisited whenever activity
Date dimensions
Transaction date
End of period date
Multiple dates for standard milestones
Facts
Transaction activity Performance for predefined time interval
Performance over finite lifetime
What makes it Enterprise? Conformed Dimensions & Facts Common fields across the enterprise domains Common definition across the enterprise domains
The Bus Architecture Allows traversing across business processes Promotes conformity
Conformed Dimensions / Bus Architecture
Dimensional Modeling Embellishments Snowflaking
Bridges
Normalizing a dimension table OLTP modeler tendency
Many to many relationships not resolved in fact tables
Not optimal for query performance
Sits between a dimension and a fact
Outriggers A dimension table is referenced in another dimension (i.e. hire date example)
Ragged and variable depth hierarchies
Snowflaking What is Snowflaking? Normalizing in a star schema Should be avoided
•
Adds complexity to presentation layer
• • • •
SQL is more complex Adds burden to database optimizers Very little space savings Impacts Bitmap indexes*
Sometimes OK (Outriggers for low cardinality attributes) *good for low cardinality fields
Snowflaking What is Snowflaking? Normalizing in a star schema Should be avoided
•
Adds complexity to presentation layer
• • • •
SQL is more complex Adds burden to database optimizers Very little space savings Impacts Bitmap indexes*
Sometimes OK (Outriggers for low cardinality attributes) *good for low cardinality fields
DW Tips: Dimensional Modeling Myths Dimensional data warehouses are appropriate for summary level data only
Dimensional models presuppose the business questions and therefore are inflexible Dimensional models are departmental Brining a new data source into a dimensional data warehouse breaks existing schemas and requires new fact tables
Kimball University White Paper
A good way to narrow the scope and manage risk is to focus on delivering the report most often requested Dimensional models are fully de-normalized Ralph Kimball invented the fact and dimension terminology
DW Tips: 10 Essential Dim Mod Rules Load detailed atomic data into dimensional structures Structure dimensional models around business processes Ensure every fact table has a date dimension table Ensure all facts in a Fact table are the same grain Resolve many-to-many relationships in fact tables Resolve many to one relationships in dimension tables
Kimball University Article, Margy Ross, InformationWeek
Store report lables and filter domain values in dimension tables
Dimension tables should use surrogate keys Create conformed dimensions to integrate data across the enterprise
Continuously balance requirements and realities to deliver a DW/BI solution that’s accepted by business users and that supports their decision making
Thank You Future Webinars The ETL Process Stars in Motion Columnar and In-memory databases Modeling Business Process
• Retail Sales • Inventory • CRM • HR