Big Data for Enterprise: Managing Data and Values
Tarun Sukhani NetCom Learning www.netcomlearning.com | info@netcomlearning.com | (888) 563 8266
©1998-2018 NetCom Learning
Agenda
• Data & Information
• What is a Database Management System?
• File Management Systems
• Distribution Strategies for Databases
• Data Management Framework
• Key Supporting Data Management Components to Big Data
• Data Governance Council Roles and Responsibilities
• ETL
• Data Cleansing
• Overview of Big Data and Analytics
• Data Lake
• Hadoop & Its Role
• IoT and real-time data
• Modern Data Warehouse
Data and Information
DATA: Facts concerning people, objects, events or other entities. Databases store data.
INFORMATION: Data presented in a form suitable for interpretation. Data is converted into information by programs and queries. Data may be stored in files or in databases. Neither one stores information.
KNOWLEDGE: Insights into appropriate actions based on interpreted data.
Using a DBMS
Data
• Database Design
• Metadata
DBMS Engine
Access
• Direct access
• Host language
Data Management
Basic Principles
DATABASE: A shared collection of interrelated data designed to meet the varied information needs of an organization.
DATABASE MANAGEMENT SYSTEM: A collection of programs to create and maintain a database.
• Define
• Construct
• Manipulate
Advantages of Database Processing
• More information from same data
• Shared data
• Balancing conflicts among users
• Controlled redundancy
• Consistency
• Integrity
• Security
• Increased productivity
• Data independence
Disadvantages of Database Processing
• Increased size
• Increased complexity
  • More expensive personnel
• Increased impact of failure
• Difficulty of recovery
• Cost
  • Especially server and mainframe systems
Objectives of the DBMS Approach
• SELF-DESCRIBING
• DATA INDEPENDENCE
• MULTIPLE VIEWS
• MULTIPLE USERS
What is a Database Management System?
• Data Files
• Directory
• Access Engine
• Utility Programs
Database
• DATA
• METADATA
• ACCESS ENGINE
• UTILITIES
Files and Databases
Metadata: “Data about data”
• Description of fields
• Display and format instructions
• Structure of files and tables
• Security and access rules
• Triggers and operational rules
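As a rough illustration of metadata living alongside the data, the sketch below uses Python's built-in sqlite3 module. The student table and its columns are hypothetical and chosen only for this example; other DBMS products expose their metadata through different catalogs.

```python
import sqlite3

# Hypothetical table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student ("
    " student_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " college TEXT)"
)

# The DBMS keeps the table description itself; PRAGMA table_info reads it back.
# Each row is (cid, name, type, notnull, dflt_value, pk).
columns = conn.execute("PRAGMA table_info(student)").fetchall()
field_descriptions = {name: col_type for _, name, col_type, _, _, _ in columns}

# sqlite_master records the structure of every table as its original DDL text.
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'student'"
).fetchone()[0]
```

No application program had to declare the field descriptions a second time; any program connecting to the database can read them from the same place.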
History of Database Management
• File Management Systems
• Hierarchical Model: IBM “Information Management System (IMS)”, 1966
• Network Model: Charles Bachman’s “Integrated Data Store (IDS)”, 1965; Conference on Data Systems Languages / Data Base Task Group (CODASYL/DBTG), 1971
• Relational Model: E.F. Codd, 1970
File Management Systems
File management systems provided facilities to extract data and share files, but did not implement any way to connect records in one file to those in another. Relationships had to be implemented in application code.
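To make "relationships implemented in application code" concrete, here is a minimal sketch. The two in-memory CSV "files", their field names, and the helper functions are all hypothetical; the point is only that the program, not the storage layer, has to know that student_id links the files.

```python
import csv
import io

# Two independent "files"; nothing in the file system links them.
STUDENTS_FILE = "student_id,name\n1,Ana\n2,Ben\n"
ENROLLMENTS_FILE = "student_id,section\n1,DB101\n2,DB101\n1,BI200\n"


def read_records(text):
    """Read a CSV 'file' into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def sections_by_student(students, enrollments):
    """Connect the two files; the matching rule lives only in this program."""
    names = {s["student_id"]: s["name"] for s in students}
    result = {s["name"]: [] for s in students}
    for row in enrollments:
        result[names[row["student_id"]]].append(row["section"])
    return result


report = sections_by_student(
    read_records(STUDENTS_FILE), read_records(ENROLLMENTS_FILE)
)
```

Every other program that needs the same relationship would have to repeat this matching logic.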
Database vs File Systems
[Figure: FILE SYSTEM vs DATABASE. In the file system, Program 1, Program 2, and Program 3 each carry their own metadata and access the data directly. In the database, Program 1, Program 2, and Program 3 share a single set of metadata that sits with the data.]
Structured Databases
Relationships were implemented by physical pointers (called “sets”) which allowed records to be connected in different files. Hierarchical databases allow only one parent set; networks allow several. These permit efficient processing but the sets must be constructed on data entry and cannot be rearranged later.
Relational Models
Relational models implement relationships with matched data values in related files (called primary and foreign keys). Any attributes can be matched. The connection is established at retrieval so interconnections can be developed as needed.
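The idea of connecting records at retrieval time through matched key values can be sketched with Python's built-in sqlite3 module. The college and student tables and their rows are hypothetical examples, not part of the original material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE college (college_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name TEXT,
    college_id INTEGER REFERENCES college(college_id)  -- foreign key
);
INSERT INTO college VALUES (1, 'Engineering'), (2, 'Business');
INSERT INTO student VALUES (10, 'Ana', 1), (11, 'Ben', 2), (12, 'Cy', 1);
""")

# No pointer between the tables is stored; the JOIN matches key values
# at query time.
rows = conn.execute(
    "SELECT s.name, c.name FROM student AS s "
    "JOIN college AS c ON s.college_id = c.college_id "
    "ORDER BY s.student_id"
).fetchall()
```

Because the match happens in the query, a different relationship (for example, on another shared attribute) could be expressed later without reloading the data.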
Hierarchy
[Figure: hierarchy diagram of the files SECTION, STUDENT, COLLEGE, and INSTRUCTOR, with COLLEGE appearing twice.]
Each file can have only one parent. To implement a second “parent” (COLLEGE) we have to implement a shadow copy.
Network
[Figure: network diagram of the files SECTION, STUDENT, INSTRUCTOR, and COLLEGE.]
Each file can have several parents. Both SECTION and COLLEGE are “parent” files.
Relational
Relational Terminology
• Entity
  • Person, place, thing or event about which we wish to keep data
• Attribute
  • Property of an entity
• Relationship
  • An association among entities (entity records)
Distribution Strategies for Databases
Centralized Data and Processing: Dumb terminal with "screen scraping".
Intelligent Terminal: Data and processing centralized; data preparation and display on remote devices.
Distributed Logic: Data storage distributed; processed at the optimal location. A version of parallel processing.
Client Server: Data (usually departmental) maintained on a server. Subsetting occurs on the server, processing on client machines.
Distributed Database: Data distributed among different locations; processing accesses data wherever it is located. Data may be replicated or partitioned.
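The difference between replicated and partitioned data can be sketched as follows. This is a toy illustration only: the site names, the customer keys, and the hash-based placement rule are assumptions, and real distributed databases use their own placement schemes.

```python
import zlib

SITES = ["east", "west", "central"]  # hypothetical site names


def partition_site(key):
    """Pick one site per key with a stable hash (an assumed placement rule)."""
    return SITES[zlib.crc32(key.encode("utf-8")) % len(SITES)]


def store_partitioned(keys):
    """Partitioning: every record lives at exactly one site."""
    placement = {site: [] for site in SITES}
    for key in keys:
        placement[partition_site(key)].append(key)
    return placement


def store_replicated(keys):
    """Replication: every site holds a full copy."""
    return {site: list(keys) for site in SITES}


customers = ["C001", "C002", "C003", "C004"]
partitioned = store_partitioned(customers)
replicated = store_replicated(customers)
```

Partitioning spreads storage and load but requires knowing where a record lives; replication makes reads local everywhere at the cost of keeping copies in sync.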
Data Management
Designing and managing information in a database environment requires:
• Understanding the principles of data modeling in system design.
• Using SQL for data manipulation.
• Understanding the concepts of managing data in a database environment.
Information System Modeling Approaches
PROCESS MODELING: The traditional method of designing systems by following the changes to data flows.
DATA MODELING: An approach to system development that specifies the file structure that conforms to the things important to the organization.
PROTOTYPING: An iterative approach that focuses on building small operating
OBJECT MODELING (Event driven design): Defines objects that contain data and associated processing rules encapsulated together.
Data Management Framework
• Holistic approach to understand the information needs of the enterprise & its stakeholders
• Consistency for planning & process development
• 10 major functional areas, including governance
• Aligns data with business strategy (above) and technology (below)
• Takes into account the data lifecycle, from creation through destruction
• Internationally recognized through Data Management Association International (DAMA)
Key Supporting Data Management Components to Big Data
• Data Governance: Exercise of authority and controls over the management of data assets. Policies, processes, standards, definitions, metrics.
  • Councils, stewards, trustees roles and responsibilities defined
• Data Architecture: Defines data requirements, guides integration and control of data assets, aligns data investments with business strategy. Part of an overall enterprise architecture framework.
  • Enterprise data models, definitions, and taxonomies
  • Enterprise data delivery
• Master Data Management: Control over master data values to enable consistent, contextual use across systems of the most accurate, timely and relevant version of truth about essential business entities.
• Meta Data Management: Descriptive tags about data, concepts, and connections between data and concepts.
  • Business, technical, process, and stewardship
• Data Security: Planning, development, and execution of security policies and procedures to provide proper authentication, authorization
Key Questions to Drive Business Value from Data
• What business opportunity/problem are we trying to solve?
• How do we integrate the right data together?
• What questions do we need to answer to solve the problem?
• How do we manage the quality of the data?
• What data do we need to answer the questions?
• What data do we have?
• How can data help differentiate us in the market?
• What data is IP for us? Revenue generating for us?
• What data does this relate to (master data)?
• Do we have all the data about this (person, event, thing, etc.)?
• What are the permissible purposes of the data? (compliance, regulatory environment)
• Who is allowed to access the data? Use this data?
Data Management Maturity in a Social Business
Partial Source: Social Business by Design, Dion Hinchcliffe
Data Governance Council Roles and Responsibilities
[Figure: governance structure with the labels DGC, DGP, Task Forces & Tiger Teams, and Lines-of-Business.]
Data Governance Operational View
ETL
The process of updating the data warehouse.
Two Data Warehousing Strategies
• Enterprise-wide warehouse, top down, the Inmon methodology
• Data mart, bottom up, the Kimball methodology
• When properly executed, both result in an enterprise-wide data warehouse
The Data Mart Strategy
• The most common approach
• Begins with a single mart and architected marts are added over time for more subject areas
• Relatively inexpensive and easy to implement
• Can be used as a proof of concept for data warehousing
• Can perpetuate the “silos of information” problem
• Can postpone difficult decisions and activities
• Requires an overall integration plan
The Enterprise-wide Strategy
• A comprehensive warehouse is built initially
• An initial dependent data mart is built using a subset of the data in the warehouse
• Additional data marts are built using subsets of the data in the warehouse
• Like all complex projects, it is expensive, time consuming, and prone to failure
• When successful, it results in an integrated, scalable warehouse
Data Sources and Types
• Primarily from legacy, operational systems
• Almost exclusively numerical data at the present time
• External data may be included, often purchased from third-party sources
• Technology exists for storing unstructured data, and this is expected to become more important over time
Extraction, Transformation, and Loading (ETL) Processes
• The “plumbing” work of data warehousing
• Data are moved from source to target databases
• A very costly, time-consuming part of data warehousing
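The three ETL stages can be sketched end to end in a few lines. Everything here is an assumption for illustration: the CSV source, the orders table, and the cleaning rules are hypothetical, and an in-memory SQLite database merely stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract with untidy values and one missing amount.
SOURCE_CSV = "order_id,amount,region\n1, 19.50 ,north\n2,5.00,SOUTH\n3,,north\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim values, normalize case, and drop rows with no amount."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        clean.append(
            (int(row["order_id"]), float(amount), row["region"].strip().lower())
        )
    return clean


def load(rows, conn):
    """Load: write the cleaned rows to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, region TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), warehouse)
loaded = warehouse.execute(
    "SELECT order_id, amount, region FROM orders ORDER BY order_id"
).fetchall()
```

Dedicated ETL tools add scheduling, metadata capture, and error handling on top of this basic extract, transform, load flow.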
Recent Development: More Frequent Updates
• Updates can be done in bulk and trickle modes
• Business requirements, such as trading partner access to a Web site, require current data
• For international firms, there is no good time to load the warehouse
Recent Development: Clickstream Data
• Results from clicks at web sites
• A dialog manager handles user interactions. An ODS (operational data store in the data staging area) helps to custom tailor the dialog
• The clickstream data is filtered and parsed and sent to a data warehouse where it is analyzed
• Software is available to analyze the clickstream data
Data Extraction
• Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated meta data)
• Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of “dirty data” in the source systems)
• Increasingly performed by specialized ETL software
Sample ETL Tools
• Teradata Warehouse Builder from Teradata
• DataStage from Ascential Software
• SAS System from SAS Institute
• PowerMart/PowerCenter from Informatica
• Sagent Solution from Sagent Software
• Hummingbird Genio Suite from Hummingbird Communications
Reasons for “Dirty” Data
• Dummy Values
• Absence of Data
• Multipurpose Fields
• Cryptic Data
• Contradicting Data
• Inappropriate Use of Address Lines
• Violation of Business Rules
• Reused Primary Keys
• Non-Unique Identifiers
• Data Integration Problems
Data Cleansing
• Source systems contain “dirty data” that must be cleansed
• ETL software contains rudimentary data cleansing capabilities
• Specialized data cleansing software is often used. Important for performing name and address correction and householding functions
• Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)
Steps in Data Cleansing
• Parsing
• Correcting
• Standardizing
• Matching
• Consolidating
Parsing
• Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
• Examples include parsing the first, middle, and last name; street number and street name; and city and state.
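A minimal sketch of the parsing step, using the examples named above. The function names and the simple splitting rules are hypothetical; commercial cleansing tools rely on much richer grammars and reference data.

```python
import re

# Assumed pattern: a leading house number followed by the street name.
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<street_number>\d+)\s+(?P<street_name>.+?)\s*$"
)


def parse_name(full_name):
    """Split a name into first, middle, and last (assumes at least two words)."""
    parts = full_name.split()
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),
        "last": parts[-1],
    }


def parse_street(line):
    """Isolate street number and street name from one address line."""
    match = ADDRESS_PATTERN.match(line)
    if match is None:
        return {"street_number": "", "street_name": line.strip()}
    return match.groupdict()


def parse_city_state(text):
    """Split 'City, ST' at the first comma."""
    city, _, state = text.partition(",")
    return {"city": city.strip(), "state": state.strip()}
```

For example, parse_street("123 Main Street") isolates "123" and "Main Street" into separate target fields.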
Correcting
• Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
• Examples include replacing a vanity address and adding a zip code.
Standardizing
• Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
• Examples include adding a pre name, replacing a nickname, and using a preferred street name.
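Standardizing can be pictured as applying lookup-based conversion rules. The nickname and street-suffix tables below are tiny, made-up rule sets for illustration; a production rule base would be far larger and organization-specific.

```python
# Illustrative rule tables (assumptions, not a real reference data set).
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}
STREET_SUFFIXES = {
    "st": "Street", "st.": "Street", "ave": "Avenue", "ave.": "Avenue",
}


def standardize_first_name(name):
    """Replace a known nickname; otherwise just normalize capitalization."""
    cleaned = name.strip()
    return NICKNAMES.get(cleaned.lower(), cleaned.title())


def standardize_street(street):
    """Expand a trailing abbreviation to the preferred street suffix."""
    words = street.split()
    if not words:
        return ""
    last = STREET_SUFFIXES.get(words[-1].lower(), words[-1].title())
    return " ".join([word.title() for word in words[:-1]] + [last])
```

With these rules, " BOB " becomes "Robert" and "123 main st." becomes "123 Main Street".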
Matching
• Matching searches within and across the parsed, corrected and standardized data and pairs up records based on predefined business rules to eliminate duplications.
• Examples include identifying similar names and addresses.
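One simple way to find similar names and addresses is a string-similarity score with a cutoff. The sketch below uses difflib from the Python standard library; the 0.85 threshold and the sample records are assumptions, and real matching engines typically combine several field-level rules.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(records, threshold=0.85):
    """Compare every pair of records and keep those above the threshold."""
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                matches.append((i, j))
    return matches


records = [
    "Robert Smith, 123 Main Street",
    "Robert Smyth, 123 Main Street",
    "Elizabeth Jones, 9 Oak Avenue",
]
pairs = find_matches(records)
```

The pairwise loop is quadratic in the number of records, so large data sets usually need a blocking step (for example, comparing only records that share a zip code) before scoring.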
Consolidating
• Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
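Merging matched records into one representation needs a survivorship rule. The sketch below assumes the simplest possible one, "the first non-empty value wins"; the field names and records are hypothetical, and real tools let you choose rules per field (most recent, most trusted source, and so on).

```python
def consolidate(matched_records):
    """Merge matched records; for each field the first non-empty value wins."""
    merged = {}
    for record in matched_records:
        for field, value in record.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged


duplicates = [
    {"name": "Robert Smith", "phone": "", "zip": "78701"},
    {"name": "Robert Smith", "phone": "555-0100", "zip": ""},
]
golden = consolidate(duplicates)
```

Neither source record was complete, but the consolidated record carries the name, phone, and zip together.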
Data Staging
• Often used as an interim step between data extraction and later steps
• Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes
• At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse
• There is usually no end user access to the staging file
• An operational data store may be used for data staging
Data Transformation
• Transforms the data in accordance with the business rules and standards that have been established
• Examples include format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
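Each transformation type listed above can be shown on a couple of rows. The source records, field names, and gender code table in this sketch are invented for illustration.

```python
from datetime import datetime

GENDER_CODES = {"M": "Male", "F": "Female"}  # hypothetical code table


def transform(rows):
    """Apply the example business rules to each source row."""
    seen = set()
    output = []
    for row in rows:
        if row["id"] in seen:
            continue  # deduplication on the id field
        seen.add(row["id"])
        first, _, last = row["name"].partition(" ")  # splitting up a field
        order_date = datetime.strptime(row["order_date"], "%m/%d/%Y")
        output.append({
            "id": row["id"],
            "first_name": first,
            "last_name": last,
            "gender": GENDER_CODES.get(row["gender"], "Unknown"),  # code replacement
            "order_date": order_date.date().isoformat(),  # format change
            "total": row["quantity"] * row["unit_price"],  # derived value
        })
    return output


source = [
    {"id": 1, "name": "Ana Lopez", "gender": "F",
     "order_date": "03/05/2018", "quantity": 2, "unit_price": 4.5},
    {"id": 1, "name": "Ana Lopez", "gender": "F",
     "order_date": "03/05/2018", "quantity": 2, "unit_price": 4.5},
    {"id": 2, "name": "Ben Ode", "gender": "M",
     "order_date": "12/31/2017", "quantity": 1, "unit_price": 10.0},
]
result = transform(source)
grand_total = sum(row["total"] for row in result)  # aggregate
```

The duplicate row for id 1 is dropped, dates are rewritten in ISO format, and the aggregate is computed only over the surviving rows.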
Data Loading
• Data are physically moved to the data warehouse
• The loading takes place within a “load window”
• The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications
Meta Data
• Data about data
• Needed by both information technology personnel and users
• IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.
• Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.
Recent Development: Meta Data Integration
• A growing realization that meta data is critical to data warehousing success
• Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their tools
• Vendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings
Overview of Big Data and Analytics
What differentiates today’s thriving organizations?
Data.
What is Big Data, really?
Data in all forms & sizes is being generated faster than ever before
Capture & combine it for new insights & better, faster decisions
Collect any data
Harness the growing and changing nature of data
• Structured
• Unstructured
• Streaming
Challenge is combining transactional data stored in relational databases with less structured data
Big Data = All Data
Get the right information to the right people at the right time in the right format
An illustration of the velocity of data created
Kalakota, R. (2012, October 22). Sizing “Mobile + Social” Big Data Stats. Retrieved from http://practicalanalytics.wordpress.com/
The three V’s
Technology innovation accelerates value
[Figure: chart with the axes Value and Innovation. Labels: Siloed data, Spreadmarts, Transactional systems, Complex implementations, ETL, OLAP, Operational reporting, Ad hoc analysis, Dashboards, Enterprise data warehouse, Hadoop, In-memory, Machine learning, Internet of Things, Any data.]
Discover and connect
Answering new questions
Value
Put data to work for everyone in your organization
Inspire innovation
Accelerate decision-making Learn from & share insights
Embrace Big Data across your business
• Marketing: Build deeper customer relationships
• Finance: Impact your company’s bottom line
• Sales: Improve revenue performance
• HR: Maximize employee engagement
[Figure: sample dashboards, including “Units Sold, Discounts, and Profit before Tax”, “XT2000 Status List”, “Revenue and Target by Region”, and “Departments Headcount”.]
The Data Divide
70% of data generated by customers
80% of data stored
3% prepared for analysis
0.5% being analyzed
<0.5% being operationalized
IDC says that right now, about 22% of data is useful. By 2020 that number will climb to 37%.
Major Fail
Gartner: “Through 2017, 60% of big-data projects will fail to go beyond piloting and experimentation”
Paradigm4: 76% of those who have used Hadoop or Apache Spark complained of significant limitations
Analytics Solution
• Capture and integrate data from multiple internal and external sources
• Derive insight from data with rich, interactive dashboards and reports using the tools you know
• Put insight into action to increase efficiency and constituent satisfaction
Advanced Analytics Defined
Analytics Example
The end result of Big Data - Icing on the cake
Use Cases
Data analytics is needed everywhere
Legal discovery and document archiving
Social network analysis
Traffic flow optimization
Recommendation engines
Churn analysis
Location-based tracking & services
Oil & Gas exploration
Weather forecasting for business planning
Healthcare outcomes
Personalized Insurance
Fraud detection
Life sciences research
Advertising analysis
Equipment monitoring
Pricing Analysis
Smart meter monitoring
Intelligence Gathering
IT infrastructure & Web App optimization
Personalized Insurance
Insurance companies can help (and some have already started helping) their customers with truly personalized insurance plans tailored to their needs and risks.
Insurance companies can collect real-time data from in-car sensors and combine it with geolocation and in-house systems. With information such as distance and speed, they can provide personalized insurance offers based on driving amount, risk, and other factors, for a truly personalized plan that may often save drivers money.
Personalized policies can reduce costs & better meet customer needs
$1,600/yr.: US national avg. car insurance premium
Recommendation Engines
Retailers can use customer purchase & rating information to serve recommendations to current customers, based on similarities across many dimensions.
The vast and ever-growing amount of customer purchase, rating and click data can all be collected and managed with a Hadoop-based solution, to pinpoint preferences based on purchase history and demographics and to serve useful and compelling cross-sell and up-sell recommendations.
Significantly improve up-sell and cross-sell opportunities
158: Items sold/second by Amazon.com on 11/29/2010 (Cyber Monday)
Pricing Analysis
Retailers can use customer past purchase, preference, and demographic information to serve real-time custom pricing and instant discounts when customers are near the store.
Retailers, whether large, small, online or in-store, can improve margins with more detailed pricing analysis. When a customer is in range of a transaction (either in the store, online or perhaps passing by), retailers can extend personalized offers, real-time price quotes, or other frequent-buyer perks to help bring more customers to the store and improve repeat business.
Significantly improve sales and customer satisfaction
up to 30%: Additional price Mac users accepted for travel from Orbitz
Using Big data to determine the best train schedules
Data Lake
What is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
• A place to store unlimited amounts of data in any format inexpensively
• Allows collection of data that you may or may not use later: “just in case”
• A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read”
• Complements EDW and can be seen as a data source for the EDW, capturing all data but only passing relevant data to the EDW
• Frees up expensive EDW resources (storage and processing), especially for data refinement
• Allows for data exploration to be performed without waiting for the EDW team to model and load the data
• Some processing is better done on Hadoop than in ETL tools like SSIS
• Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)
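“Schema on read” can be illustrated without any Hadoop tooling. In this sketch the raw events, field names, and the read_with_schema helper are all hypothetical: records are stored as-is, and a schema is applied only when a particular question is asked.

```python
import json

# Raw events land in the lake in their native form; no schema is enforced.
RAW_EVENTS = [
    '{"device": "t-1", "temp_c": 21.5, "ts": "2018-01-01T00:00:00"}',
    '{"user": "ana", "page": "/home"}',
    '{"device": "t-2", "temp_c": 19.0, "extra": {"fw": "1.2"}}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at query time: keep records that have every field."""
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# The schema is chosen by the reader, for this question only.
temperature_schema = {"device": str, "temp_c": float}
readings = list(read_with_schema(RAW_EVENTS, temperature_schema))
```

A different reader could apply a click schema (user, page) to the same raw data without any reload, which is the “just in time” property described above.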
Traditional Approaches
Current state of a data warehouse
[Figure: DATA SOURCES (OLTP, ERP, CRM, LOB) feed ETL, which loads the DATA WAREHOUSE, which serves BI AND ANALYTICS; MONITORING AND TELEMETRY spans the pipeline.]
• Well manicured, often relational sources
• Known and expected data volume and formats
• Little to no change
• Complex, rigid transformations
• Required extensive monitoring
• Transformed historical data into read structures
• Star schemas, views, other read-optimized structures
• Emailed, centrally stored Excel reports and dashboards
• Flat, canned or multi-dimensional access to historical data
• Many reports, multiple versions of the truth
• 24 to 48h delay
Traditional Approaches
Current state of a data warehouse
[Figure: the same pipeline (DATA SOURCES: OLTP, ERP, CRM, LOB; ETL; DATA WAREHOUSE; BI AND ANALYTICS; MONITORING AND TELEMETRY) under pressure from INCREASING DATA VOLUME, NON-RELATIONAL DATA, INCREASE IN TIME, and STALE REPORTING.]
• Increase in variety of data sources
• Increase in data volume
• Increase in types of data
• Pressure on the ingestion engine
• Complex, rigid transformations can no longer keep pace
• Monitoring is abandoned
• Delay in data, inability to transform volumes, or react to new sources
• Repair, adjust and redesign ETL
• Star schemas, views, other read-optimized structures
• Emailed, centrally stored Excel reports and dashboards
• Reports become invalid or unusable
• Delay in preserved reports increases
• Users begin to “innovate” to relieve starvation
New Approaches
[Figure: DATA SOURCES (OLTP, ERP, CRM, LOB, NON-RELATIONAL DATA, FUTURE DATA SOURCES) pass through EXTRACT AND LOAD into the DATA LAKE; a DATA REFINERY PROCESS (TRANSFORM ON READ) and OTHER REFINERY PROCESSES feed the DATA WAREHOUSE and BI AND ANALYTICS.]
• Data Lake Transformation (ELT not ETL)
• All data sources are considered
• Leverages the power of on-prem technologies and the cloud for storage and capture
• Native formats, streaming data, big data
• Extract and load, no/minimal transform
• Storage of data in near-native format
• Orchestration becomes possible
• Streaming data accommodation becomes possible
• Transform relevant data into data sets
• Refineries transform data on read
• Produce curated data sets to integrate with traditional warehouses
• Star schemas, views, other read-optimized structures
• Discover and consume predictive analytics, data sets and other reports
• Users discover published data sets/services using familiar tools
Hadoop and its role
What is Hadoop?
• Distributed, scalable system on commodity HW
• Composed of a few parts:
  • HDFS: Distributed file system
  • MapReduce: Programming model
  • Other tools: Hive, Pig, SQOOP, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm
• Main players are Hortonworks, Cloudera, MapR
• WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)
[Figure: Hadoop stack diagram with the labels DATA SERVICES, OPERATIONAL SERVICES, and Core Services, showing Ambari, Flume, Pig, Oozie, Sqoop, Falcon, HBase, Hive & HCatalog, Load & Extract, NFS, WebHDFS, MapReduce, YARN, and HDFS on a Hadoop cluster of compute & storage nodes.]
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
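The MapReduce programming model can be sketched in plain Python with the classic word count. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's API; on a real cluster the framework distributes these phases across nodes, and the function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big value", "data lake"])
```

Because each map call looks at one document and each reduce call at one key, both phases can run in parallel on different machines, which is what makes the model suit batch processing of very large inputs.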
Hortonworks Data Platform 2.2
Simply put, Hortonworks ties all the open source products together (20)
The real cost of Hadoop
http://www.wintercorp.com/tcod-report/
Use cases using Hadoop and a DW in combination
Bringing islands of Hadoop data together
• Archiving data warehouse data to Hadoop (move) (Hadoop as cold storage)
• Exporting relational data to Hadoop (copy) (Hadoop as backup/DR, analysis, cloud use)
• Importing Hadoop data into data warehouse (copy) (Hadoop as staging area, sandbox, Data Lake)
IoT and real-time data
What is the Internet of Things?
• Things
• Connectivity
• Data
• Analytics
IoT = sensor-acquired data
What is the Internet of Things (IoT)?
Internet-connected devices that can perceive the environment in some way, share their data, and communicate with you. IoT is just a catch-all term for ways of using machine-generated data to create something useful.
- Has its own processor and sensor to collect information
- Examples: heart monitoring implants, biochip transponders on farm animals, automobiles with built-in sensors, field operation devices that assist firefighters in search and rescue
- Excludes computers, tablets, and smart phones
- But it's in the sphere of business intelligence that IoT will really make a difference.
Cool possibilities
- When a milk carton is almost empty, it will ping you when you are near a store
- An alarm clock that signals your coffee maker to start brewing when you wake up
- An embedded chip that monitors your vital signs and notifies a medical provider if a reading exceeds a limit
Gartner: 10 billion devices connected to the internet today, 26 billion by 2020
At some point in the future, nearly every manmade object will contain a device that transmits data!
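The vital-signs example above comes down to comparing each sensor reading against a limit as it arrives. A minimal sketch follows, assuming a hypothetical reading format (`device`, `heart_rate`) and an illustrative limit; none of these names or values come from a real device API.

```python
def check_readings(readings, limit):
    """Return the readings whose heart rate exceeds the limit, in arrival order."""
    return [r for r in readings if r["heart_rate"] > limit]

def build_notifications(alerts):
    """Format one message per alert; a real system would send these to a provider."""
    return [f"ALERT device={a['device']} heart_rate={a['heart_rate']}" for a in alerts]

# Hypothetical sensor stream and limit, for illustration only.
readings = [
    {"device": "chip-1", "heart_rate": 72},
    {"device": "chip-1", "heart_rate": 131},
    {"device": "chip-2", "heart_rate": 88},
]
alerts = check_readings(readings, limit=120)
messages = build_notifications(alerts)
```

Evaluating each reading on arrival, rather than in a later batch job, is what makes this a real-time use case.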
Modern Data Warehouse
Think about future needs:
• Increasing data volumes
• Real-time performance
• New data sources and types
• Cloud-born data
• Multi-platform solution
• Hybrid architecture
Modern Data Warehouse Defined
Modern Data Warehouse The Dream
The Reality
Federated Querying
Other names: Data virtualization, logical data warehouse, data federation, virtual database, and decentralized data warehouse.
A model that lets a single query retrieve and combine data where it sits across multiple data sources, without needing ETL and without learning more than one retrieval technology.
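A toy version of this idea can be sketched with an in-memory SQLite table as the relational source and a list of dictionaries standing in for a document store. The table, field names, and the `federated_clicks` helper are assumptions for illustration; commercial federation engines work differently, but the principle is the same: the data is combined at query time and never copied into a warehouse first.

```python
import sqlite3

def relational_source():
    """A relational source: customers held in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    return conn

def document_source():
    """A non-relational source: click events as schemaless documents."""
    return [
        {"customer_id": 1, "clicks": 12},
        {"customer_id": 2, "clicks": 7},
        {"customer_id": 1, "clicks": 3},
    ]

def federated_clicks(conn, documents):
    """Answer one question from both sources without loading either into the other."""
    totals = {}
    for doc in documents:
        totals[doc["customer_id"]] = totals.get(doc["customer_id"], 0) + doc["clicks"]
    rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    return [(name, totals.get(customer_id, 0)) for customer_id, name in rows]

result = federated_clicks(relational_source(), document_source())
```

The caller sees a single result set and does not need to know that two different retrieval technologies sit behind it.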
Federated Querying
[Figure: a single Select… passes through the query model and returns one result set. Relational data: EDW, SQL Server, DB2, Oracle. Non-relational data: Windows Azure HDInsight, Cloudera CDH (Linux), Hortonworks HDP, MongoDB.]
DW and the Cloud
Can I use the cloud with my DW?
• Public and private cloud
• Cloud-born data vs on-prem born data
• Transfer cost from/to cloud and on-prem
• Sensitive data on-prem, non-sensitive in cloud
• Look at hybrid solutions
TDWI Best Practices Report (2015)
SMP vs MPP
SMP - Symmetric Multiprocessing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
MPP - Massively Parallel Processing
• Uses many separate CPUs running in parallel to execute a single program
• Shared Nothing: Each CPU has its own memory and disk (scale-out)
• Segments communicate using high-speed network between nodes
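The shared-nothing idea can be sketched as follows: rows are distributed across nodes by key, each node aggregates only its own slice, and the partial results are combined at the end. The helper names, the modulo distribution, and the sample rows are illustrative assumptions, and the nodes here are simulated as lists in one process rather than separate machines.

```python
def partition(rows, node_count):
    """Distribute rows by key so each node owns a disjoint slice of the data."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[row["customer_id"] % node_count].append(row)
    return nodes

def local_sum(node_rows):
    """Work done by one node, using only the rows it stores itself."""
    return sum(row["amount"] for row in node_rows)

def mpp_total(rows, node_count):
    """Combine the per-node partial sums into the final answer."""
    partials = [local_sum(node_rows) for node_rows in partition(rows, node_count)]
    return sum(partials)

# Hypothetical fact rows: customer 1 spent 10, customer 2 spent 20, and so on.
rows = [{"customer_id": i, "amount": 10 * i} for i in range(1, 7)]
slices = partition(rows, 3)
total = mpp_total(rows, 3)
```

Only the small partial sums travel between nodes, which is why MPP systems scale out by adding nodes, whereas SMP systems scale up within one shared-memory machine.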
DW SCALABILITY SPIDER CHART
The spiderweb depicts important attributes to consider when evaluating Data Warehousing options.
[Figure: spider chart with axes "Data Volume", "Mixed Workload", "Data Freshness", "Query Concurrency", "Query Complexity", "Query Freedom", "Query Data Volume", and "Schema Sophistication".]
• SMP – Tunable in one dimension at the cost of other dimensions
• MPP – Multidimensional scalability
• Big Data support is the newest dimension.
Recorded Webinar Video
To watch the recorded webinar video for live demos, please access the link: https://goo.gl/rPrjZf
About NetCom Learning
Recommended Courses & Marketing Assets
Courses:
» 20778: Analyzing Data with Power BI - Class scheduled on Jan 7
» 20775: Performing Data Engineering on Microsoft HD Insight - Class scheduled on Jan 14
» Tableau Desktop Level 1: Introduction - Class scheduled on Jan 21
» GL660 - Hadoop For Systems Administrators - Class scheduled on Feb 4
Marketing Assets:
• Blog - Top AI, Big Data, & Analytics Trends to Follow in 2018
• Whitepaper - Curtailing the Talent Gap in Data Science
Top Reasons to Master Agile Scrum and its Benefits
Clean Architecture: Patterns, Practices, and Principles
CEH: Understanding Ethical Hacking
SQL Server 2017: Application Development Best Practices
Promotions
The year 2018 is coming to an end, but learning is a continuous process! Build your own, your team's, or your department's skills with the best training courses of 2018-19. With a range of Cloud, Security, Networking, Data & AI, Design & Multimedia, Business Application, Application Development and Business Process training at limited-time prices, you can gain in-demand skills while saving on the training cost.
THANK YOU !!!