Data Lakes and Your EDW
Manish Motiramani · Sr. Manager, Analytics · Dunn Solutions
03/02/2017
Today’s Agenda
• Introduction • Data Lake Overview • Data Lake vs. Data Warehouse • Data Lake Use Cases • Data Lake in the Cloud • Data Lake Case Studies
Dunn Solutions Delivers Velocity to Businesses
Dunn Solutions is a digital commerce and business transformation consultancy focused on delivering velocity to our clients. Velocity is achieved by the combination of both speed and direction. Dunn Solutions helps our clients achieve speed by automating business processes and direction using advanced analytics. Our teams align with organizations to optimize their unique processes and help them discover the most profitable routes to business success.
Dunn Solutions is a full-service IT consulting firm founded in 1988
Minneapolis Delivery · Training
Chicago Delivery
Raleigh, NC Delivery · Training
Bangalore, India Delivery
Practice Areas
Solutions
Application Development
Analytics
Data Lakes
Training
Portals
IoT
Certified SAP, Liferay, Microsoft
e-Commerce & Content Managed Websites
Predictive Analytics
Accountable Care Orgs (ACO’s)
Corporate Legal
Machine Learning
Classroom, Onsite, Computer Based & Virtual
Higher Education
Mobile App Development
e-Commerce
Optical Shop
Custom App Development
Analytics
Search Engine Optimization
Cloud - BI Platforms
DW & Data Integration
Mentoring & Custom Training
Frameworks
Selected Clients
Partnerships
Application Development Practice
Portals
• Innovation • Collaboration • Customer
Custom Application Development
• Custom Software • Custom Off-the-Shelf • Assurity™ Methodology
eCommerce & Content Managed Websites
• Responsive Design • Enterprise eCommerce Solutions
Mobile Application Development
• Business • Consumer • iOS, Android, Windows
Analytics Practice
Business Intelligence
• KPI’s and Metrics • Dashboards • Data Exploration and Visualization • Ad Hoc Analysis & Reporting
Big Data
• Hadoop, MapReduce • AWS and Azure • Hive, Sqoop, Spark • NoSQL
Business Analytics Data Integration
• Data Mining • Predictive Analytics • Prescriptive Analytics • R, AzureML
Data Warehousing
• Data Lake • Columnar • In-memory • EIM (Data Integration and Data Quality) • Dimensional Modeling
About Manish
Data Lake Overview
What is a Data Lake? A Data Lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The Data Lake supports the democratization of data. It gives your organization a cost-effective way to store all its data for later processing, so that your information consumers and researchers can focus on finding the next big thing instead of wasting time finding the data.
The Data Lake Ecosystem
• Inflow • Storage • Consumption
Does a Data Lake replace a Data Warehouse? Isn’t a data lake just the data warehouse revisited?
No. A data lake is not a data warehouse. The two are optimized for different purposes, and the goal is to use each one for what it was designed to do. In other words, use the best tool for the job.
Data Lake vs. Data Warehouse
Data Warehouse vs. Data Lake
Data Warehouse
• Focuses on Business Processes • Highly processed • Tabular & structured • Lots of effort on design & build • Optimized for data retrieval • Highly governed
Data Lake
• Stores everything • Unprocessed (raw) • Unstructured, semi-structured, structured • Democratization of data • Shared data stewardship
Data Warehouse vs. Data Lake – Data Structure
Data Warehouse
• Very, very, very structured • Pre-defined schema
(Figure: a Fact table surrounded by Dim tables)
Data Lake
• Structured, semi-structured, and unstructured • Schema on read
(Figure: Vendor Data, Clickstream, Email Campaigns, Active Campaigns, Historic Campaigns, Weather)
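To make "schema on read" concrete, here is a minimal sketch in Python. It is an illustration only, not the tooling used in the case studies: the sample records, the field names, and the `read_with_schema` helper are all hypothetical. Raw records are stored unchanged, and a schema is applied only when a consumer reads them.

```python
import json

# Raw records land in the lake unchanged; nothing enforces a schema on write.
# The sample records and field names are invented for illustration.
RAW_LINES = [
    '{"source": "clickstream", "user": "u1", "page": "/home", "ts": "2017-03-01T10:00:00"}',
    '{"source": "weather", "city": "Chicago", "temp_f": 41}',
    '{"source": "clickstream", "user": "u2", "page": "/cart"}',
]


def read_with_schema(lines, source, schema):
    """Apply a schema at read time: filter by source, then project and cast fields.

    `schema` maps a field name to (cast function, default used when missing).
    """
    for line in lines:
        record = json.loads(line)
        if record.get("source") != source:
            continue
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# Each consumer brings its own schema; the stored data never changes.
clickstream_schema = {"user": (str, None), "page": (str, None), "ts": (str, None)}
clicks = list(read_with_schema(RAW_LINES, "clickstream", clickstream_schema))
print(clicks)
```

A different consumer could read the same raw lines with a weather schema, which is the flexibility a pre-defined warehouse schema trades away for governance and retrieval speed.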
Data Warehouse vs. Data Lake – Approach
• Data Warehouses use the top-down approach
• Data Lakes use the bottom-up approach
Data Warehouse vs. Data Lake – Development
Development Lifecycle
Data Warehouse: 1. Requirements 2. Design 3. Source 4. Transform 5. Load 6. Consume 7. Refine
Data Lake: 1. Source 2. Load 3. Requirements 4. Design 5. Transform 6. Consume 7. Refine
Data Warehouse vs. Data Lake – Data Loading
Data Warehouses use a traditional ETL process: Source → Extract → Transform → Load → Data Warehouse
Data is transformed when it enters the data warehouse
Data Lakes make use of the ELT process: Source → Extract / Load → Transform
Data is transformed when it is retrieved from the data lake
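The difference between the two loading styles can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: plain lists stand in for the warehouse and the lake, and the `transform` and `query_lake` functions and the sample rows are hypothetical.

```python
def transform(row):
    """Normalize one raw row: tidy the name and convert the amount to a float."""
    return {
        "customer": row["customer"].strip().title(),
        "amount": float(row["amount"]),
    }


# Invented sample rows, as they might arrive from a source system.
source_rows = [
    {"customer": " alice ", "amount": "19.99"},
    {"customer": "BOB", "amount": "5"},
]

# ETL: transform first, then load. The warehouse only ever holds cleaned rows.
warehouse = [transform(row) for row in source_rows]

# ELT: load the raw rows as they are, and transform only when they are read.
lake = list(source_rows)


def query_lake(raw_rows):
    """Transform on retrieval; the stored raw rows stay untouched."""
    return [transform(row) for row in raw_rows]


print(warehouse)
print(query_lake(lake))
```

Both paths return the same cleaned rows here, but only the lake still holds the original values, so a later consumer can apply a different transformation without going back to the source.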
The Data Lake Difference
• Economical file storage
  • Allows for storing more data
  • Longer retention
• Supports more data types
  • Traditional structured data
  • Semi-structured data (sensor data, external data)
  • Unstructured data (logs, social data, images, audio/video)
• Can answer a wider range of questions
  • Adaptable to changes
  • Can provide new insight, faster
Modern Data Platform Does a Data Lake replace a Data Warehouse? No. They co-exist!
Data Lake Use Cases • Store massive data sets • Mix disparate data • Ingest high-velocity data • Apply structure to semi-structured / unstructured data • Improve machine learning and predictive analytics
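As one illustration of applying structure to unstructured data, the sketch below parses free-text log lines into named fields with a regular expression. The log format, the field names, and the `parse_log_line` helper are assumptions made for the example; real logs would need their own pattern.

```python
import re

# Hypothetical log format: "<date> <time> <LEVEL> <free-text message>".
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)


def parse_log_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Invented sample lines, including one that does not fit the pattern.
raw_log = [
    "2017-03-02 09:15:00 INFO checkout completed order=1001",
    "2017-03-02 09:15:02 ERROR payment declined order=1002",
    "not a log line",
]

parsed = [parse_log_line(line) for line in raw_log]
structured = [record for record in parsed if record is not None]
print(structured)
```

Lines that do not match are skipped rather than dropped from storage; the raw file stays in the lake, so a better pattern can be applied later.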
Data Lake In The Cloud
Data Lake In The Cloud – IaaS Leading Cloud Providers have Infrastructure as a Service (IaaS) offerings for Data Lakes
Data Lake In The Cloud – PaaS • Data Lake Platform as a Service (PaaS) is becoming increasingly popular • Also known as serverless architecture • Zero management at the O/S level • No monitoring of disk usage, CPU and memory consumption
Microsoft Azure Data Lake Products
Platform as a Service
Information as a Service
Data Lake Case Study 1
An Online Retailer A digital retailer offering exciting brands and unique products to customers around the country via television networks, mobile, online and social channels
Business Challenge
• Generating Traffic and Leads
  • Are we creating the type of content customers will pay for?
  • Do we know what type of content customers want?
  • Do we know how customers like to consume content?
• Providing ROI on Marketing Activities (ROMI)
  • Can you link marketing activities to sales results?
  • What is the measurable impact on sales as a result of marketing?
  • Evidence of ROI = Business case to spend more on marketing
It’s a simple question, “Did our campaign impact sales?”
It’s a Simple Question …
• Did our campaign impact sales?
• Doesn’t our data warehouse take care of these challenges?
• The question may be simple, but getting an answer is not. The answer lies in the data: • Web Clickstream • Weather • Email campaigns • Social media • Competitive pricing data • Viewership • Promotional data • National events
• Why is it so hard to answer? • The data is not in the EDW (and possibly should not be there) • Data is in multiple places • The data is hard to get to • There are massive amounts of data, and it is difficult to combine
The Data Lake Phase 1: Success Criteria
• Cloud-based deployment • Data capture from multiple disparate systems (on-premise data warehouse, email campaigns, clickstream) • Data storage in diverse formats (CSV and TSV raw files, database tables) • Ability to combine data from different systems and different formats • Performance comparable to or better than on-premise systems
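The criterion of combining data from different systems and formats can be sketched with Python's standard `csv` module. This is a toy example, not the Phase 1 implementation: the inline CSV and TSV text stands in for raw files in the lake, and the column names, values, and helper functions are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Inline stand-ins for raw files in the lake; columns and values are invented.
EMAIL_CSV = "campaign_id,emails_sent\nC1,1000\nC2,500\n"
CLICKS_TSV = "campaign_id\tclicks\nC1\t120\nC1\t30\nC2\t10\n"


def read_delimited(text, delimiter):
    """Read delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def clicks_per_campaign(click_rows):
    """Sum clickstream rows into one total per campaign."""
    totals = defaultdict(int)
    for row in click_rows:
        totals[row["campaign_id"]] += int(row["clicks"])
    return totals


def combine(email_rows, click_totals):
    """Join the two sources on campaign_id and compute a click rate."""
    combined = []
    for row in email_rows:
        sent = int(row["emails_sent"])
        clicks = click_totals.get(row["campaign_id"], 0)
        combined.append({
            "campaign_id": row["campaign_id"],
            "emails_sent": sent,
            "clicks": clicks,
            "click_rate": clicks / sent if sent else 0.0,
        })
    return combined


report = combine(
    read_delimited(EMAIL_CSV, ","),
    clicks_per_campaign(read_delimited(CLICKS_TSV, "\t")),
)
print(report)
```

The only format-specific step is the delimiter passed to the reader; once both sources are lists of dicts, the join does not care where each one came from.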
Data Lake Phase 1 – Architecture
Stages: Sources • Capture • Store • Visualize
Components: • On-premise Data Warehouse • On-premise Staging DB • Vendor Files (1. CheetahMail 2. Adobe Site Catalyst) • Azure Data Factory • PolyBase • Azure Data Lake Store • Azure SQL Data Warehouse • Microsoft Power BI
Data Lake Phase 1 – Outcome Dashboards and Reports
(Samples shown; not real client output)
Data Lake Case Study 2
An IoT Company
• Expecting the number of devices in the field to grow by 300% in the next three years
• As a data company, it must…
  • Provide business analytics to its customers around safety, compliance and maintenance
  • Provide real-time and historical reporting and dashboards
Preparing for Expansion
We want to be ahead of the curve and have our systems ready to adapt to increasing data volumes and to changing trends in the world of analytics.
• We need the ability to perform near real-time analytics
  • Capture of streaming data
  • Enrichment of streaming data
  • Storage of streaming data
Details in Design Document: Section 1.1 Preparing for expansion
Stream Processing – Requirements “Capture – Enrich – Store” in real time • Capture streaming data from its devices as it arrives • Enrich data in-stream with reference/lookup information in real time • Store data in a data lake and in a high-performance analytics cloud database
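The "Capture – Enrich – Store" flow can be sketched as a chain of Python generators. This is a minimal sketch under stated assumptions, not the client's pipeline: in-memory lists stand in for the streaming service, the data lake bucket, and the analytics database, and the device lookup table, field names, and function names are hypothetical. No cloud APIs are called.

```python
import json

# Hypothetical reference data: device id -> customer and vehicle type.
DEVICE_LOOKUP = {
    "d-1": {"customer": "Acme", "vehicle": "truck"},
    "d-2": {"customer": "Globex", "vehicle": "van"},
}


def capture(raw_messages):
    """Capture: decode each raw JSON message as it arrives."""
    for message in raw_messages:
        yield json.loads(message)


def enrich(events, lookup):
    """Enrich: attach reference data to each event while it is in the stream."""
    for event in events:
        reference = lookup.get(event["device_id"], {})
        yield {**event, **reference}


def store(events, lake, warehouse):
    """Store: write every enriched event to both destinations."""
    for event in events:
        lake.append(event)
        warehouse.append(event)


# Invented sample messages; the second device is missing from the lookup.
incoming = [
    '{"device_id": "d-1", "speed_mph": 62}',
    '{"device_id": "d-9", "speed_mph": 40}',
]
lake_bucket, analytics_db = [], []
store(enrich(capture(incoming), DEVICE_LOOKUP), lake_bucket, analytics_db)
print(lake_bucket)
```

Because each stage is a generator, an event is enriched and stored as soon as it is captured instead of waiting for a batch, which is the property the requirements ask for. Events with no lookup match pass through without the extra fields.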
Stream Processing – Architecture
Stages: Capture • Process & Store
Components: • Kinesis Agent • Kinesis Firehose • Process Data • Process Bucket • Copy to DB • High-Performance DW • Data Lake Process • Data Lake Bucket • Data Lake
Users: • Data Analyst • Data Scientist • Marketing
Summary • A data lake is a vast storage repository • It stores structured, semi-structured, and unstructured data in its raw format • A data lake does not replace a data warehouse; it augments it • A data lake can provide new insight, faster
Call to Action Let us build your data lake in the cloud
Questions & Answers
Manish Motiramani · Senior Manager, Analytics · Dunn Solutions