Data Lakes and Your Enterprise Data Warehouse


Data Lakes and Your EDW

Manish Motiramani · Sr. Manager, Analytics · Dunn Solutions

03/02/2017


Today’s Agenda
• Introduction
• Data Lake Overview
• Data Lake vs. Data Warehouse
• Data Lake Use Cases
• Data Lake in the Cloud
• Data Lake Case Studies


Dunn Solutions Delivers Velocity to Businesses

Dunn Solutions is a digital commerce and business transformation consultancy focused on delivering velocity to our clients. Velocity is achieved by the combination of both speed and direction. Dunn Solutions helps our clients achieve speed by automating business processes and direction using advanced analytics. Our teams align with organizations to optimize their unique processes and help them discover the most profitable routes to business success.


Dunn Solutions is a full-service IT consulting firm founded in 1988

Minneapolis Delivery · Training

Chicago Delivery

Raleigh, NC Delivery · Training

Bangalore, India Delivery


Practice Areas

Solutions

Application Development

Analytics

Data Lakes

Training

Portals

IoT

Certified SAP, Liferay, Microsoft

e-Commerce & Content Managed Websites

Predictive Analytics

Accountable Care Orgs (ACOs)

Corporate Legal

Machine Learning

Classroom, Onsite, Computer Based & Virtual

Higher Education

Mobile App Development

e-Commerce

Optical Shop

Custom App Development

Analytics

Search Engine Optimization

Cloud - BI Platforms

DW & Data Integration

Mentoring & Custom Training

Frameworks


Selected Clients


Partnerships


Application Development Practice

Portals
• Innovation
• Collaboration
• Customer

Custom Application Development
• Custom Software
• Custom Off-the-Shelf
• Assurity™ Methodology

eCommerce & Content Managed Websites
• Responsive Design
• Enterprise eCommerce Solutions

Mobile Application Development
• Business
• Consumer
• iOS, Android, Windows


Analytics Practice

Business Intelligence
• KPIs and Metrics
• Dashboards
• Data Exploration and Visualization
• Ad Hoc Analysis & Reporting

Big Data
• Hadoop, MapReduce
• AWS and Azure
• Hive, Sqoop, Spark
• NoSQL

Business Analytics
• Data Mining
• Predictive Analytics
• Prescriptive Analytics
• R, AzureML

Data Integration and Data Warehousing
• Data Lake
• Columnar
• In-memory
• EIM (Data Integration and Data Quality)
• Dimensional Modeling




About Manish


Data Lake Overview


What is a Data Lake?

A Data Lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The Data Lake supports the democratization of data. It gives your organization a cost-effective way to store all its data for later processing, so that your information consumers and researchers can focus on finding the next big thing instead of wasting time finding the data.


The Data Lake Ecosystem


The Data Lake Ecosystem – Inflow


The Data Lake Ecosystem – Storage


The Data Lake Ecosystem – Consumption


Does a Data Lake replace a Data Warehouse?

Isn’t a data lake just the data warehouse revisited?

No. A data lake is not a data warehouse. Each is optimized for a different purpose, and the goal is to use each one for what it was designed to do. In other words, use the best tool for the job.


Data Lake vs. Data Warehouse


Data Warehouse vs. Data Lake

Data Warehouse
• Focuses on Business Processes
• Highly processed
• Tabular & structured
• Lots of effort on design & build
• Optimized for data retrieval
• Highly governed

Data Lake
• Stores everything
• Unprocessed (raw)
• Unstructured, semi-structured, structured
• Democratization of data
• Shared data stewardship


Data Warehouse vs. Data Lake – Data Structure

Data Warehouse
• Very, very, very structured
• Pre-defined schema
[Figure: star schema, a central Fact table surrounded by Dim tables]

Data Lake
• Structured, semi-structured, and unstructured
• Schema on read
[Figure: raw data sets stored side by side: Vendor Data, Clickstream, Email Campaigns, Active Campaigns, Weather, Historic Campaigns]
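To make "schema on read" concrete, here is a minimal Python sketch. It assumes raw records are stored as JSON lines; the field names (`campaign`, `clicks`, `region`, `device`) and the helper names are hypothetical and were chosen only for illustration. The point is that nothing is validated or reshaped when the data lands; the reader declares which fields it wants, and their types, at query time.

```python
import json

# Raw records land in the lake exactly as the source produced them.
# Field names are hypothetical, not taken from any client system.
raw_lines = [
    '{"campaign": "spring_sale", "clicks": "120", "region": "MN"}',
    '{"campaign": "summer_sale", "clicks": "95"}',
    '{"campaign": "fall_sale", "clicks": "n/a", "device": "mobile"}',
]


def to_int(value):
    """Return value as an int, or None if it is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# The schema belongs to the reader and is applied at query time.
click_schema = {"campaign": str, "clicks": to_int}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the reader's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({name: cast(record.get(name)) for name, cast in schema.items()})
    return rows


rows = read_with_schema(raw_lines, click_schema)
total_clicks = sum(r["clicks"] for r in rows if r["clicks"] is not None)
```

A different consumer could read the same raw lines with a different schema (for example one that keeps `region` or `device`) without anything being reloaded.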


Data Warehouse vs. Data Lake – Approach
• Data Warehouses use the top-down approach
• Data Lakes use the bottom-up approach


Data Warehouse vs. Data Lake – Development

Development Lifecycle

Data Warehouse
1. Requirements
2. Design
3. Source
4. Transform
5. Load
6. Consume
7. Refine

Data Lake
1. Source
2. Load
3. Requirements
4. Design
5. Transform
6. Consume
7. Refine


Data Warehouse vs. Data Lake – Data Loading

Data Warehouses use a traditional ETL process:
Source → Extract → Transform → Load → Data Warehouse
Data is transformed when it enters the data warehouse.

Data Lakes make use of the ELT process:
Source → Extract / Load → Transform
Data is transformed when it is retrieved from the data lake.
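The difference in where the transform step runs can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular tool: the rows, the column names, and the functions `etl`, `elt_load`, and `elt_read` are all hypothetical, and plain lists stand in for the warehouse and the lake.

```python
# Hypothetical source rows, as messy as a real extract might be.
source_rows = [
    {"order_id": "1", "amount": " 19.99 ", "channel": "WEB"},
    {"order_id": "2", "amount": "5.00", "channel": "tv"},
]


def transform(row):
    """Clean one raw row into the shape an analyst wants."""
    return {
        "order_id": int(row["order_id"]),
        "amount": float(row["amount"].strip()),
        "channel": row["channel"].lower(),
    }


def etl(rows):
    """ETL: transform on the way in, so the warehouse only holds clean rows."""
    return [transform(r) for r in rows]


def elt_load(rows):
    """ELT, step 1: load raw rows into the lake untouched."""
    return [dict(r) for r in rows]


def elt_read(lake):
    """ELT, step 2: transform when the data is retrieved."""
    return [transform(r) for r in lake]


warehouse = etl(source_rows)
lake = elt_load(source_rows)
```

Both paths can produce the same cleaned rows, but only the lake still holds the original values, so a later consumer can apply a different transform to the same raw data.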


The Data Lake Difference
• Economical file storage
  • Allows for storing more data
  • Longer retention
• Supports more data types
  • Traditional structured data
  • Semi-structured data (sensor data, external data)
  • Unstructured data (logs, social data, images, audio/video)
• Can answer a wider range of questions
  • Adaptable to changes
  • Can provide new insight, faster


Modern Data Platform

Does a Data Lake replace a Data Warehouse? No. They co-exist!


Data Lake Use Cases
• Store massive data sets
• Mix disparate data
• Ingest high-velocity data
• Apply structure to semi-structured / unstructured data
• Improve machine learning and predictive analytics
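As one illustration of "apply structure to semi-structured / unstructured data", the Python sketch below turns raw log lines into typed records. The log format, the sample lines, and the names `LINE_PATTERN` and `parse_logs` are assumptions made for this example only. Lines that do not match are set aside, not discarded, which fits the lake's habit of keeping everything.

```python
import re

# Hypothetical web-server log lines; the format is an assumption for this sketch.
log_lines = [
    "2017-03-02 10:15:01 GET /products/42 200",
    "2017-03-02 10:15:03 POST /cart 302",
    "corrupted line",
]

LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
)


def parse_logs(lines):
    """Split raw lines into structured records and lines that did not match."""
    parsed, rejected = [], []
    for line in lines:
        match = LINE_PATTERN.fullmatch(line)
        if match is None:
            rejected.append(line)  # keep the raw line for later inspection
            continue
        record = match.groupdict()
        record["status"] = int(record["status"])
        parsed.append(record)
    return parsed, rejected


parsed, rejected = parse_logs(log_lines)
```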


Data Lake In The Cloud


Data Lake In The Cloud – IaaS

Leading cloud providers have Infrastructure as a Service (IaaS) offerings for Data Lakes


Data Lake In The Cloud – PaaS
• Data Lake Platform as a Service (PaaS) is becoming increasingly popular
• Also known as serverless architecture
• Zero management at the O/S level
• No monitoring of disk usage, CPU and memory consumption


Microsoft Azure Data Lake Products

Platform as a Service

Information as a Service


Data Lake Case Study 1


An Online Retailer

A digital retailer offering exciting brands and unique products to customers around the country via television networks, mobile, online and social channels


Business Challenge

Generating Traffic and Leads
• Are we creating the type of content customers will pay for?
• Do we know what type of content customers want?
• Do we know how customers like to consume content?

Providing ROI on Marketing Activities (ROMI)
• Can you link marketing activities to sales results?
• What is the measurable impact on sales as a result of marketing?
• Evidence of ROI = Business case to spend more on marketing

It’s a simple question: “Did our campaign impact sales?”


It’s a Simple Question …

“Did our campaign impact sales?”
“Doesn’t our data warehouse take care of these challenges?”

• The question may be simple, but getting an answer is not. The answer lies in the data:
  • Web clickstream
  • Weather
  • Email campaigns
  • Social media
  • Competitive pricing data
  • Viewership
  • Promotional data
  • National events
• Why is it so hard to answer?
  • The data is not in the EDW (and possibly should not be there)
  • Data is in multiple places
  • The data is hard to get to
  • There are massive amounts of data, and it is difficult to combine


The Data Lake Phase 1: Success Criteria
• Cloud-based deployment
• Data capture from multiple disparate systems (on-premise Data Warehouse, Email Campaign, Clickstream)
• Data storage in diverse formats (CSV and TSV raw files, database tables)
• Ability to combine data from different systems and different formats
• Performance comparable to or better than on-premise systems
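The criterion "combine data from different systems and different formats" can be pictured with a small, self-contained Python sketch. Everything in it is hypothetical: the column names, the sample values, and the use of an in-memory SQLite table as a stand-in for a warehouse table. It is not the client's implementation, only a picture of CSV, TSV, and database rows being joined on a shared key.

```python
import csv
import io
import sqlite3

# Hypothetical extracts; column names and values are invented for this sketch.
email_csv = "customer_id,campaign\n1,spring_sale\n2,spring_sale\n"
clicks_tsv = "customer_id\tpage_views\n1\t7\n2\t3\n"

# An in-memory SQLite table stands in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 50.0), (2, 0.0)])

# Read each raw file in its own format and index it by customer.
emails = {
    int(row["customer_id"]): row["campaign"]
    for row in csv.DictReader(io.StringIO(email_csv))
}
views = {
    int(row["customer_id"]): int(row["page_views"])
    for row in csv.DictReader(io.StringIO(clicks_tsv), delimiter="\t")
}

# Join the three sources on the shared customer_id key.
combined = []
query = "SELECT customer_id, revenue FROM sales ORDER BY customer_id"
for customer_id, revenue in conn.execute(query):
    combined.append({
        "customer_id": customer_id,
        "campaign": emails.get(customer_id),
        "page_views": views.get(customer_id, 0),
        "revenue": revenue,
    })
conn.close()
```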


Data Lake Phase 1 – Architecture

Sources
• On-premise Data Warehouse
• On-premise Staging DB
• Vendor Files: 1. CheetahMail 2. Adobe Site Catalyst

Capture
• Azure Data Factory

Store
• Azure Data Lake Store
• Polybase
• Azure SQL Data Warehouse

Visualize
• Microsoft Power BI


Data Lake Phase 1 – Outcome

Dashboards and Reports

(Samples shown; not real client output)


Data Lake Case Study 2


An IoT Company

Expecting the number of devices in the field to grow by 300% in the next three years

As a data company, it must…
• Provide business analytics to its customers around safety, compliance and maintenance
• Real-time and historical reporting and dashboards


Preparing for Expansion

We want to be ahead of the curve and have our systems ready to adapt to the increasing data volumes and the changing trends in the world of analytics.
• We need the ability to perform near real-time analytics
  • Capture of streaming data
  • Enrichment of streaming data
  • Storage of streaming data

Details in Design Document: Section 1.1 Preparing for expansion



Stream Processing – Requirements

“Capture – Enrich – Store” in real time
• Capture streaming data from its devices as it streams through
• Enrich data “in-stream” with reference/lookup information in real time
• Store data in a data lake and in a high-performance analytics cloud database
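The "Capture – Enrich – Store" flow can be sketched as a chain of Python generators. This is a toy model under stated assumptions: the event fields, the lookup table, and the function names are hypothetical, a list stands in for the data lake, and a dict keyed by device stands in for the analytics database (keeping only the latest reading per device is an assumption of this sketch, not a stated requirement).

```python
def capture(events):
    """Capture: yield device events one at a time, as a stream would."""
    for event in events:
        yield event


def enrich(stream, lookup):
    """Enrich: merge reference data into each event while it is in flight."""
    for event in stream:
        reference = lookup.get(event["device_id"], {})
        yield {**event, **reference}


def store(stream, data_lake, analytics_db):
    """Store: append every event to the lake, keep the latest per device in the DB."""
    for event in stream:
        data_lake.append(event)
        analytics_db[event["device_id"]] = event


# Hypothetical reference data and incoming events.
device_lookup = {"d1": {"site": "Plant A"}, "d2": {"site": "Plant B"}}
incoming = [
    {"device_id": "d1", "temp_c": 71},
    {"device_id": "d3", "temp_c": 64},  # no reference data for this device
    {"device_id": "d1", "temp_c": 73},
]

data_lake, analytics_db = [], {}
store(enrich(capture(incoming), device_lookup), data_lake, analytics_db)
```

Because each stage is a generator, an event is enriched and stored as soon as it arrives, without waiting for a batch to fill.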


Stream Processing – Architecture

Capture
• Kinesis Agent
• Kinesis Firehose

Process & Store
• Process Data
• Process Bucket
• Data Lake Process
• Data Lake Bucket
• Data Lake
• Copy to DB
• High-Performance DW

[Figure labels: Data Analyst, Data Scientist, Marketing]


Summary
• A data lake is a vast storage repository
• It stores structured, semi-structured, and unstructured data in its raw format
• A data lake does not replace a data warehouse; it augments it
• A data lake can provide new insight, faster


Call to Action

Let us build your data lake in the cloud


Questions & Answers

Manish Motiramani · Senior Manager, Analytics · Dunn Solutions

