Radiant Advisors Publication
rediscoveringBI
May 2013 | Issue 8
DATA INTEGRATION PHYSICS
REIMAGINING DATA INTEGRATION: Untelling the Just So Story
DATA QUALITY & THERMODYNAMICS: A Predictable Decline into Disorder
FUTURE SHOCK: The Daunting Physics of the Big Data Universe
LAW OF ATTRACTION: Context and Data
rediscoveringBI
May 2013 | Issue 8

SPOTLIGHT
[P5] Just So Stories, DM Style
We need to stop uncritically telling and retelling Just So Stories, data management-style.
[By Michael Whitehead]

FEATURES
[P9] Second Law of Thermodynamics
How understanding entropy can help us attack our data quality problems.
[By Dr. Gian Di Loreto]
[P13] Future of Data Integration
Are the days of the data warehouse numbered? Not necessarily.
[By Stephen Swoyer]
[P17] Data and the Law of Attraction
The data itself has had to be broken down to its most fundamental element: the key-value pair.
[By John O'Brien]

EDITOR'S PICK
[P8] The One Thing
The key to success, according to Keller and Papasan, is to focus on ONE Thing.
[By Lindy Ryan]

SIDEBAR
[P14] BIG DATA PHYSICS [By Stephen Swoyer]
[P22] DATA VIRTUALIZATION EXPLAINED [By Stephen Swoyer]

VENDOR
[P19] GAIN BIG ADVANTAGE [By Robert Eve]
FROM THE EDITOR
Though there is certainly an art behind data management (DM) -- one industry media toolkit even has an entire series of briefings devoted to "the art of managing data" -- DM is nevertheless a science, and a complicated one at that. Increasingly large volumes, high complexity, and heterogeneity in the syntax, structure, and semantics of data are only a few of the problems that persistently plague data management today.
Taking a scientific look at data management can make for some interesting analogies -- and lend some intriguing insights -- into the way we approach data integration. Sure, it's not all as glamorous as time travel and string theory (though Stephen Hawking himself has been quoted as saying, "Science is not only a disciple of reason, but, also, one of romance and passion"), yet the physics of data integration still has plenty of room for excitement in exploration and discovery.
In this month's edition of RediscoveringBI, authors Michael Whitehead, Dr. Gian Di Loreto, Stephen Swoyer, and John O'Brien put on their (data scientist) lab jackets and light up their proverbial Bunsen burners with thoughts on the evolution of the data integration process; how entropy and thermodynamics can influence data quality; the physics of the big data universe itself; and the laws of attraction between data and context.

Lindy Ryan
Editor in Chief

Editor in Chief: Lindy Ryan, lindy.ryan@radiantadvisors.com
Contributor: Michael Whitehead, mikew@wherescape.com
Contributor: Gian Di Loreto, Ph.D., gian.d@loretotech.com
Contributor: John O'Brien, john.obrien@radiantadvisors.com
Distinguished Writer: Stephen Swoyer, stephen.swoyer@gmail.com
Art Director: Brendan Ferguson, brendan.ferguson@radiantadvisors.com
For More Information: info@radiantadvisors.com
Radiant Advisors
OPINION
LETTERS TO THE EDITOR

On: Twilight of the (DM) Idols

Mutual Victory?
So, the French eventually won the Hundred Years' War. I hope our current DM challenges will play out a bit differently. Generally, I'd like to see the current big-data plague/war result in a mutual victory.
To help move us toward this better future, I'd really like to see more discussion about effective DM techniques on HDFS. While the power and appeal of Hadoop is undeniable, so are the DM challenges. Unfortunately, most of the focus I've seen around Hadoop (and the related technologies) has been about power and speed. There isn't much discussion about effectively managing some of the issues you address, such as the fact that different data has different needs.
For instance, I would love to see some DMish discussions on recommended practices for managing raw data files versus computed files in HDFS. Frankly, the "never delete" mantra smells worse than a dead skunk to my two-decade-old data warehousing nose. All data loses some value over time at varying rates — how do we address this in the big-data era? It is a problem that must be solved at some point, and the more direct and thoughtful attention is given to this issue, the faster bridges will be built and our hybrid data ecosystem nirvana can be reached.
- Cj Applequist

On: Time for An Architectural Reckoning

The Facebook DW
In an earlier comment, one reader noted that Facebook's data warehouse is an extension of Hadoop (Hive). As of May 6, 2013, I can officially report that Facebook has implemented a traditional data warehouse alongside (or as a consistent complement to) its Hive/Hadoop store. As Ken Rudin explained in a keynote today, Facebook keeps "core" business information in its relational DW and uses an unspecified non-relational platform (probably Hive/Hadoop?) to support laissez-faire analytic discovery or experimentation. Yes, Virginia: Facebook has itself a venerable data warehouse platform.
- Stephen Swoyer (Editor's Note: "Time for An Architectural Reckoning" was published in RediscoveringBI, March 2013)

Have something to say? Send your letters to the editor at lindy.ryan@radiantadvisors.com
IT'S A WRAP
SPARK! Austin Recap
Last month, Radiant Advisors brought its inaugural launch of SPARK!, a three-day networking symposium, to Austin, Texas. SPARK! events provide a forum for discussing and exploring data management issues -- at the frontier. They focus on the intersection -- and, frequently, the collision -- of business needs with IT priorities, using the disruption triggered by new and emerging forces (such as big data technologies or new analytic practices and methods) as an organizing focus. SPARK! Austin mixed keynote presentations delivered by industry thought-leaders John O'Brien and Dr. Robin Bloor -- even including a "dueling perspectives" keynote in which O'Brien and Bloor, in a lively conversation with the attendee audience, shared their experiences in enabling data scientists through analytic architectures and processes -- and Data Strategy Sessions taught by Geoffrey Malafsky, William McKnight, and Mike Lampa, with presentations from sponsors Dell, Composite, and ParAccel.
Its centerpiece: Radiant Advisors' own Modern Data Platforms (MDP) framework, which CEO John O'Brien outlined over the course of SPARK!'s three days. Day One was devoted to the MDP platform vision, Day Two to MDP integration, and Day Three to analytics in an MDP context. MDP describes a synthetic architecture for addressing common and emerging business intelligence (BI) and business analytic (BA) requirements. It knits together traditional BI, dashboard, and OLAP technologies, which are all focused by the data warehouse (DW); analytic discovery, BA, and predictive analytics, which mix discovery platforms with specialty analytic tools; and big data platforms, which include Hadoop and a retinue of NoSQL repositories. MDP also addresses issues like information lifecycle management (ILM), governance, and business process optimization.
Elsewhere, Geoffrey Malafsky, CEO of data integration specialist Phasic Systems, outlined a methodology for rapidly reconciling data from disparate systems, using Phasic's schema-less "Corporate NoSQL" repository as a focus. Malafsky's mantra? "No data left behind!" William McKnight, founder of McKnight Consulting Group, emphasized the importance of choosing the right information store for the right workload; he urged attendees to compete on the basis of "information excellence." Finally, Mike Lampa, a managing partner with Archipelago Information Strategies, discussed how analytics can be used to complement -- and, ultimately, to drive -- business decision-making, with an emphasis on predictive analytics. Panel events sponsored by Dell, Composite, and ParAccel provided an opportunity for discussion among vendor representatives, industry experts, and attendees.
"The most exciting feedback from our Austin attendees was that this was the first event [they'd attended] where presenters didn't have to explain the basics of what BI or Analytics was," Radiant's John O'Brien said. "Instead we jumped right in to the hard challenges facing BI professionals in their jobs today, and that's what SPARK! and Modern Data Platforms is really all about: rethinking today's BI paradigms and coming together as a network of peers to learn from each other on how to challenge them."
The next SPARK! event is in Redwood City, CA, from June 26-28. Registration is open!
SPOTLIGHT
JUST SO STORIES, DM-STYLE [We need to stop uncritically telling and retelling Just So Stories, data management-style.]
MICHAEL WHITEHEAD
WE IN DATA MANAGEMENT (DM) like to tell ourselves stories about the world in which we work.
I call these "Just So Stories," after the collection of origin stories by Rudyard Kipling. Some of these stories encapsulate timeless truths. One of these – viz., the data warehouse (DW) – is our archetypal Just So Story: why do we have a data warehouse? Well, it's Just So. Were we to perform a bit of DM anthropology, we'd discover that it actually makes sense to persist data into a warehouse-like structure. The warehouse gives us a consistent, manageable, and meaningful basis for historical comparison. For this reason, it's able to address the overwhelming majority of business intelligence (BI) and decision support use cases. As a Just So Story, the warehouse isn't something that "just gets passed down" as part of the oral history of data management: it's a living and vibrant institution. Sometimes our Just So Stories do get passed down: they have a ceremonial purpose, as distinct from a living and vibrant necessity. In other words, we do certain things because we've always done them that way – because it's Just So. A purpose that's ceremonial no longer has any practical or economic use: it's something we're effectively subsidizing. For years now, we in DM have been subsidizing a senseless artifact of our past:
we've been treating data integration (DI) as its own separate category, with – in many cases – its own discrete tier. This long ago ceased to be necessary; as a Just So Story that we uncritically tell and retell, it's taken on an increasingly absurd aspect, especially in the age of big data.
We need to re-imagine DI. This means seeing data integration as a process, instead of as a category unto itself. This means bringing DI back to data: i.e., to where data lives and resides. This means letting go of the concept of DI as its own middleware tier.

How the Data Warehouse Got its ETL Toolset
Don't get me wrong: the creation of a discrete DI tools category was a market-driven phenomenon. A market for discrete DI tools emerged to fulfill functions that weren't being addressed by DBMS vendors. The category of DI tools evolved over time and can justly be called an unalloyed success. It even has its own Magic Quadrant! Now it's time for it to go away.
Think back to the 1970's and 1980's. Remember "Fotomats" -- those little hut-like structures that once bedighted the parking lots of supermarkets and strip-malls? Fotomat, too, was a market-driven phenomenon. As one of the first franchises to offer over-night film development on a massive scale, Fotomat performed a critical service, at least from the perspective of suburban shutterbugs. By 1980, Wikipedia tells us, Fotomat operated more than 4,000 kiosks. That was its peak.
By the mid-1990's, Fotomat was all but relict as a brick-and-mortar franchise. It didn't fail because photoprocessing ceased to be a good idea; nor because people stopped taking pictures. Fotomat got disintermediated; the market selected against it.
The development of the photo mini-lab fundamentally altered (by eliminating) the conditions of the very market Fotomat had emerged to exploit. Firstly, the photo mini-lab was more convenient: the first mini-labs appeared in supermarkets, drug stores, and big-box retailers. Secondly, the mini-lab service model was faster and in a sense more agile: it promised one-hour processing, compared to Fotomat's over-night service. Thirdly, it was cheaper: in the mini-lab model, processing was performed on-location. This eliminated an extra tier of pricing.
So it is with data integration. At one time, to be sure, the DI hub was a necessity. Database engines just weren't powerful enough, so source data first had to be staged in a middle tier, where it could be transformed and conformed before loading it into the warehouse or mart.
Thus was born the ETL platform: a separate staging and processing area for data, complete with its own toolset. As database engines became more powerful, this scheme became unnecessary. Nowadays, it's flatly harmful: it has the potential to break the warehouse. What happens when conditions change, as they must and will? When performance suffers? When – for whatever reason – it becomes necessary to refactor? In such cases, data must be re-extracted from the warehouse, loaded back into the DI middle tier, reconformed, and – finally – pushed back into the warehouse all over again. This is a time-consuming process. Change can and will happen. No large organization can afford to subsidize a process that requires it to consistently re-extract, re-conform, and re-load data back into its warehouse.
Thus was engendered the concept of "ETL push-down." This is the "insight" that, in cases where it doesn't make sense to re-extract data from the warehouse and re-load it back into the DI middle tier – basically, in any case involving large volumes of data – the ETL tool can instead push the required transformations down into the warehouse. In other words, the warehouse itself gets used to perform the ETL workload. This should have torpedoed the viability of the DI middle tier. After all, if we can push transformations down into the database engine, why do we need a separate DI hub? One (insufficient) answer is that the database can't perform the required transformations as quickly as can a dedicated ETL processing server; a rebuttal is that few ETL platforms optimize their push-downs on a per-platform basis: they don't optimize for Oracle and PL/SQL, for SQL Server and T-SQL, and so on.
But with big data, this becomes a moot point. As a practical necessity, when you're dealing with data volumes in the terabyte or petabyte range, you want to move as little of it as you have to. The new way of "doing" DI is to go to where the data lives.
This means using (for example) Hive, HQL, and HCatalog in connection with Hadoop; it means using the data manipulation facilities offered by MongoDB, Cassandra, graph databases, and other nontraditional platforms. It means integrating at both the origin and at the destination. As a practical matter, you just can't move 50 or 100 TB of data in a timely or cost-effective manner. But you don't have to! Just keep the data where it's at – e.g., in Hadoop – and run some Hive SQL on it. The point is that you're only moving a small amount – in most cases, less than one percent – of a much larger volume of data. You're moving the data that's of interest to you, the data that you actually want to deal with.
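To make the "go to where the data lives" point concrete, here is a minimal sketch (not from the article) of running an aggregate in Hive and retrieving only the small result set. The host, table, and column names are hypothetical, and PyHive is just one of several DB-API drivers that can submit HiveQL.

```python
# Minimal sketch: query data where it lives (Hive over Hadoop) and pull back
# only the small, already-aggregated result. Host, table, and column names are
# hypothetical; PyHive (pip install pyhive) is one DB-API driver that can
# submit HiveQL -- any equivalent driver would do.
from pyhive import hive

HQL = """
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM   web_orders              -- billions of rows stay in HDFS
WHERE  order_date >= '2013-01-01'
GROUP  BY region               -- Hive/MapReduce does the heavy lifting
"""

conn = hive.connect(host="hadoop-edge-node", port=10000)
cursor = conn.cursor()
cursor.execute(HQL)

# Only a handful of summary rows -- not 50 or 100 TB -- cross the wire.
for region, orders, revenue in cursor.fetchall():
    print(region, orders, revenue)
```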
At the destination, you're performing whatever transformations you require in the DBMS engine itself: all of the big RDBMS platforms ship with built-in ETL facilities; there are also commercial tools that can generate optimized, self-documenting, platform-specific SQL for this purpose. (As some of you will have guessed, WhereScape knows a thing or two about this approach.) Other tools generate executable code to accomplish the same thing. This is the best way to deal with the practical physics – the math – of big data. It's likewise consonant with what it means to re-imagine data integration as a holistic process, instead of as a discrete tier or toolset: in other words, DI happens where my data is.
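As an illustration of what "generating platform-specific SQL" can look like, here is a minimal, hypothetical sketch (the table, column, and dialect details are assumptions, not any vendor's actual output): the same conform-and-load step rendered as Oracle-flavored and SQL Server-flavored statements that the target engine itself executes.

```python
# Minimal sketch of ELT-style "push-down": render one logical transformation
# as platform-specific SQL and let the warehouse engine do the work.
# Table and column names are hypothetical; the dialect handling is illustrative,
# not a real ETL product's code generator.
TRANSFORM = {
    "oracle": (
        "INSERT INTO dw.customer_dim (customer_key, full_name, loaded_at)\n"
        "SELECT customer_id,\n"
        "       first_name || ' ' || last_name,   -- Oracle-style concatenation\n"
        "       SYSDATE\n"
        "FROM   staging.customer_src"
    ),
    "sqlserver": (
        "INSERT INTO dw.customer_dim (customer_key, full_name, loaded_at)\n"
        "SELECT customer_id,\n"
        "       first_name + ' ' + last_name,     -- T-SQL-style concatenation\n"
        "       GETDATE()\n"
        "FROM   staging.customer_src"
    ),
}

def render_load_sql(platform: str) -> str:
    """Return the conform-and-load statement for the target warehouse engine."""
    return TRANSFORM[platform]

if __name__ == "__main__":
    for platform in ("oracle", "sqlserver"):
        print(f"-- {platform}\n{render_load_sql(platform)}\n")
```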
DI is a Process, Not a Category or Toolset
Whether you're talking about big data or little data, DI is a process; it's about moving and transforming data. On paper – or in a Visio flowchart – you'd use an arrow to depict the DI process: it points from one box or boxes (sources) to another box or boxes (destinations). The traditional DI model breaks that process flow, interposing still another box – the DI hub, or DI tools – in between source and destination boxes. Think of this DI box in another way: as a non-essential cost center. Whenever we draw a box on a flowchart, it represents a specific cost: e.g., hiring people, training people, maintaining skillsets, etc.
To eliminate boxes is to cut costs; to maintain a cost center after it has outlived its usefulness is to subsidize it. We tend not to do a lot of subsidizing in DM. For example, we used to use triggered acquisitions whenever we needed to propagate table changes: this meant (a) putting a "trigger" on the DBMS tables from which we wanted to acquire data, (b) creating a "shadow" table into which this information could be "captured," and (c) writing SQL code to copy this data from the master table into the shadow table.
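As a concrete illustration of steps (a) through (c), here is a minimal, self-contained sketch using SQLite (the table names are hypothetical; a production DBMS would use its own trigger dialect). The trigger combines (a) and (c): it fires on changes to the master table and copies the changed row into the shadow table.

```python
# Minimal sketch of a "triggered acquisition": a trigger on the master table
# captures every change into a shadow table. SQLite keeps the example runnable;
# table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT);

    -- (b) the shadow table that captures propagated changes
    CREATE TABLE customer_shadow (
        id INTEGER, name TEXT, email TEXT,
        change_type TEXT, changed_at TEXT
    );

    -- (a) + (c) the trigger on the master table copies changed rows across
    CREATE TRIGGER trg_customer_update AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_shadow
        VALUES (NEW.id, NEW.name, NEW.email, 'U', datetime('now'));
    END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada Lovelace', 'ada@example.com')")
conn.execute("UPDATE customer SET email = 'ada@analytical.example' WHERE id = 1")

print(conn.execute("SELECT * FROM customer_shadow").fetchall())
```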
Once change data capture (CDC) technology became widely available, we used that instead. Nowadays, CDC technology is commoditized: all major RDBMSes ship with integrated CDC capabilities of some kind. We still use triggered acquisitions for certain tasks, but we don't do so by default. Instead, we ask a question: "Under what circumstances would we use this?"
To re-imagine data integration as a process is to ask this same question about DI concepts, methods, and tools. It's to critically interrogate our assumptions -- instead of continuing to subsidize them. Unless we do so, we're just projecting – we're dragging – our past into our present: we're uncritically telling our own DM-specific set of Just So Stories.
Share your comments >
Michael Whitehead is the founder and CEO of WhereScape Software. He has been involved in data warehousing and business intelligence for more than 15 years.
EDITOR’S PICK
THE ONE THING LINDY RYAN
If Jack Palance's City Slickers character, Curly, didn't quite drive home the message about "one thing," Gary Keller and Jay Papasan certainly will in their newly released The ONE Thing: The Surprisingly Simple Truth Behind Extraordinary Results.
Keller and Papasan's central message is clear: going small (narrowing your focus to ONE Thing) can help you realize extraordinary results. This – focusing on the things we should be doing, rather than what we could be doing – isn't something we don't already inherently know: when we try to do too much and we stretch ourselves too thin, we sacrifice productivity and quality – among other things. By pushing too hard – doing too much – we inevitably end up frustrated, exhausted, and stressed (not to mention bankrupted by the toll of missed opportunities with family, friends, and time spent pursuing other interests). Over time, our expectations lower, until eventually our hopes and dreams – our lives – become small. And it doesn't take Keller and Papasan to tell us that our lives are the one thing not to keep small!
The key to success, then, is to focus on ONE Thing and to do it well, do it best, then do it bigger and better. It's Lorne Whitehead's 1983 domino experiment – wherein he discovered that a single domino could bring down another domino 50 percent larger in size, then another, and another – with a practical application. Reaching one goal topples another domino, building success sequentially, ONE Thing at a time. This success discloses clues in the form of one passion, one skill, and even one person that makes such an impact on our lives that it changes our course indefinitely – reinforcing that ONE Thing as a fundamental truth at the heart of our success.
We may not all remember Curly's speech about "the one thing" in City Slickers, but we likely all can recall Jack Palance's acceptance speech at the 64th Academy Awards (where he took home the Oscar for Best Supporting Actor for said film). After talking roundabout-ly about not being too old to achieve goals, and much to the amusement of everyone, Jack performed a series of one-armed push-ups on stage. Having proved his point that you never give up on working towards your goals – your ONE Thing – Jack referred to his first producer, who predicted, two weeks into filming 1950's Panic in the Streets, that Jack would one day (or, 42 years later) win an Academy Award – and he did! Obviously, that Award was not Jack's first domino, but every step along the way was one more domino knocked down in pursuit of his ONE Thing.
Share your comments >
The ONE Thing is available on Amazon and the Radiant Advisors eBookshelf: www.radiantadvisors.com/ebookshelf. The ONE Thing: The Surprisingly Simple Truth Behind Extraordinary Results, published by Bard Press, ISBN-13: 978-1885167774, $24.95.
Lindy Ryan is Editor in Chief of Radiant Advisors.
FEATURES
[We need to plan around the expectation that data quality will deteriorate.]
DATA QUALITY AND THE SECOND LAW OF THERMODYNAMICS GIAN DI LORETO, PH.D
THIS MORNING I CALLED to make an appointment with my doctor. The woman on the other end of the line was apologetic: she was working with a new system which had already gone live, though the data from the old system hadn't yet been properly migrated. She was under orders to enter all appointments into the new system and the old system, and to write them down on paper, too.
"Shouldn't the new system be better and make things easier than the old system?" I asked. She was not amused: she told me it was a "total nightmare" and she wasn't in the mood to laugh about it. I spared her my sermon, even though I wanted to point out that what she was seeing was not a mistake or an oversight -- it was a law of nature.
Do you remember Dr. Malcolm -- the annoying scientist from Jurassic Park -- rambling on about chaos theory and how the unpredictability of complex systems was an immutable law of the universe and, somehow, because of this the ill-fated theme park was doomed to fail? I didn't like his character in the book (although I did like Jeff Goldblum), but this concept is actually a fact: a law of physics which itself is responsible for poor and worsening data quality -- and the frustration of the assistant I spoke with this morning.
In this article I will explore the concept of entropy, defined below, how it applies to data, and how we can use this connection between the data world and the real world to understand and to attack our data quality problems.
The Second Law and Entropy
There are many different ways to state the Second Law of Thermodynamics but they all boil down to this: the entropy of any system is monotonically increasing. This includes the entropy of the universe as a whole.
Entropy is a quantity used by physicists (and other scientists and engineers) to measure the organization of a system: larger values for entropy are associated with less organization, while minimum entropy equals maximum organization. You can calculate the entropy of any physical system -- as long as you can define boundaries -- or you can calculate the entropy of the entire universe. Don't think of Chaos Theory or any of that cool stuff: focus on how entropy measures organization.
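For readers who want the textbook notation behind this description (these standard formulas are not from the article itself), Boltzmann's statistical definition ties entropy to the number of microscopic arrangements, Clausius's form ties entropy change to heat flow, and the Second Law bounds the total change:

```latex
% Standard textbook forms (not from the article):
%   W = number of microscopic arrangements (microstates), k_B = Boltzmann's constant,
%   \delta Q_{rev} = reversible heat, T = temperature
\begin{align*}
  S  &= k_B \ln W                              &&\text{(more arrangements = more disorder)}\\
  dS &= \frac{\delta Q_{\mathrm{rev}}}{T}      &&\text{(entropy change from heat flow)}\\
  \Delta S_{\mathrm{universe}} &\ge 0          &&\text{(the Second Law)}
\end{align*}
```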
One simple real-life example is temperature. Temperature can be defined as the average kinetic energy of particles in a system. A glass of water at room temperature has lots of water molecules spread out and lined up in no particular order, wiggling back and forth, bouncing off each other. Bring that glass of water to freezing and the water molecules line up in nice patterns -- entropy decreases. Bring that glass of water to absolute zero (the total absence of any heat at all) and you achieve zero entropy. Even after freezing, the water molecules still oscillate back and forth, but at absolute zero they stop moving altogether.
Now, think of a database rather than a glass of water. For a database, absolute zero is achieved in two ways:
1. Make sure all the data in your system is correct and perfect, and
2. Disconnect it from all other systems that may write to or alter the data.
The first is, as we know, very hard to do, and the second is doable but will yield a static, useless database. However, we can use these concepts to help us understand, measure, and improve data quality.
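One way to make "measure" concrete, purely as an illustration (this is my sketch, not the author's method, and it uses Shannon's information entropy rather than thermodynamic entropy as the disorder yardstick): compute the entropy of a column's value distribution and watch it rise as inconsistent variants creep in.

```python
# Illustrative sketch (not from the article): Shannon entropy of a column's
# value distribution as a rough "disorder" score for data quality. Clean,
# standardized values score low; inconsistent variants push the score up.
from collections import Counter
from math import log2

def column_entropy(values):
    """Shannon entropy, in bits, of the distribution of values in a column."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in counts.values())

clean = ["IL", "IL", "NY", "NY", "NY", "CA", "CA", "CA"]
messy = ["IL", "Ill.", "illinois", "NY", "N.Y.", "new york", "CA", "Calif"]

print(f"clean column entropy: {column_entropy(clean):.2f} bits")   # lower
print(f"messy column entropy: {column_entropy(messy):.2f} bits")   # higher
```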
We Can Prepare for It
Though she was in no mood to hear it, what my doctor's assistant was dealing with this morning was completely predictable. Moreover, it's a phenomenon that will continue to plague the healthcare industry as it moves towards electronic medical records (EMRs) and as the industry as a whole grows. Add to that all the new tests and drugs we have today and we have a big data problem in healthcare -- a big, "big data" problem.
We should collectively take some solace in the fact that this problem isn't because someone in IT didn't do his or her job, or because the new system was poorly designed (although either of these could be true).
What is certain is that the project that involved the rollout of the new scheduling system at my doctor's office didn't take into account the Second Law, plan and prepare for the associated fallout, or take proactive measures to minimize the downstream impact.

Energy In, Entropy Down
The trick when rolling out a new system and/or migrating data between systems is to assume that:
1. The source data is bad, and
2. Migrating the data from the old system to the new system will "raise the temperature" of the overall data and allow for larger values of entropy, i.e., a decrease in data quality.
While it is true that the overall entropy of the universe is increasing and there is nothing that we can do about it, we can act on individual closed systems and decrease the associated entropy: by adding energy to a system, we reduce its entropy. It took energy, for example, to run the compressor and remove heat from the glass of water we described above. So, too, does it require energy to reduce the entropy of our data.
Energy -- in this context -- refers to a project plan, time, and resources specifically dedicated to this immutable fact. Do not be satisfied with the standard approach of shoving the old data into the new system and handling the kick-outs one by one. This is still how I see 90% of implementations handled, and it is bound to fail. The approach doesn't pay the proper respect to the problem; it doesn't treat it as a serious issue that will deeply affect the chances of the project's success -- whether you expect it to or not.
Data quality must be a major consideration in any new system rollout, data conversion, data warehousing effort -- virtually any data project. Ignoring it is exactly the same as creating a project plan that calls for lowering data quality. It will happen: it is a law of nature and it cannot be stopped. The only way to minimize the damage is to plan and assign the appropriate resources to data quality in the project plan, or, to put it in the spirit of this discussion: to add energy to the system.

Conclusion
The future of our universe, unfortunately, is bleak. The Hubble telescope confirms that the universe is expanding, which, together with the Second Law, leads us to believe that the universe is retreating from a state when we were all at maximum order, minimum entropy. The Big Bang theory, in fact, tells us that, at the beginning, all of space and time and mass and energy were at a single, very heavy singularity in space-time. For some reason, this singularity exploded and the universe has been moving away from its center ever since. One theory predicts that, at some point, the expansion will stop and the retraction will start (you can see why I didn't think the assistant this morning would want to hear all this). However, if we restrict ourselves to closed systems, we can carefully apply energy and vastly reduce the entropy.
We have the desire, the tools, and the knowledge, but what I continue to observe is the surprise when a data project ends up in rubble -- the go-live is pushed, or we go live with bad data and spend another year (or more) hammering out the issues one by one.
One of the great things about the concept of a data warehouse is that it allows us to create a closed system and then to open that system to the outside, carefully and gradually. It is, by the way, much harder to create closed systems in real life: to achieve absolute zero in the lab is practically (and actually) impossible; we can come very close, but we need exponentially more energy the closer we try to get. In the data world, this is one great advantage we have over the real world.
What we need is the expectation that data quality will deteriorate, and to plan around this fact -- rather than to be surprised after the fact, after timelines and budgets are established. Because ultimately, it's not anyone's fault: it's a law of nature.
Share your comments >
Dr. Gian Di Loreto's training in physics, math, and science, combined with fourteen years in the trenches dealing with real-life data quality issues, puts him among the nation's top experts in the science of data quality.
Redwood City, CA | June 26-28, 2013
Modern Data Platforms: BI and Analytics Network Symposium
Day One | Designing Modern Data Platforms These sessions provide an approach to confidently assess and make architecture changes, beginning with an understanding of how data warehouse architectures evolve and mature over time, balancing technical and strategic value delivery. We break down best practices into principles for creating new data platforms.
Day Two | Modern Data Integration These sessions provide the knowledge needed for understanding and modeling data integration frameworks to make confident decisions to approach, design, and manage evolving data integration blueprints that leverage agile techniques. We recognize data integration patterns for refactoring into optimized engines.
Day Three | Databases for Analytics These sessions review several of the most significant trends in analytic databases challenging BI architects today. Cutting through the definitions and hype of big data in the market, NoSQL databases offer a solution for a variety of data warehouse requirements.
Register now at: http://radiantadvisors.com
Featured Keynotes By: John O'Brien, Founder and CEO, Radiant Advisors; Shawn Rodgers, Vice President Research, BI, Enterprise Management Associates
Sponsored By: (sponsor logos)
FEATURES
[Are the days of the data warehouse numbered?]
DATA INTEGRATION FUTURES: THE DATA WAREHOUSE GETS A REPRIEVE? STEPHEN SWOYER
Are the days of the data warehouse (DW) numbered? Not necessarily.
Take it from Composite Software, which established its reputation as an early champion of data virtualization (DV) software.
DV and data warehousing have had something of a rocky relationship, thanks chiefly to the way in which DV (in a former incarnation) used to be marketed. But this past is just that: past, argues David Besemer, chief technology officer with Composite: "There's always been this tension between which data should go into the warehouse and which data should not -- and how are we going to get it there. I think we can all agree that all of the data does not go into the warehouse."
"This doesn't mean that the data warehouse goes away or becomes unimportant. It doesn't mean that ETL goes away and becomes unimportant," he continued.
"It means that … data lives in different places and [that] you don't want to or can't move it to other places. It [means] that it also has different shapes," Besemer indicated. "Data stored in Hadoop … might be stored in a sparse structure or [as] semi-structured. It's not just about connecting via ODBC or JDBC; [sometimes] you have to use HTTP or maybe JMS."
DV Reconsidered
DV is in the midst of a rehabilitation of sorts. Many experts, both in and out of data management (DM), suggest that a unifying DV layer could provide the substrate for a synthetic architecture that scales to address common and emerging DM use cases. These include reporting, dashboards, and OLAP-driven discovery; business analytics and analytic discovery; specialty databases; and big data analytics. DV figures prominently in several next-gen architectures, including Gartner's Logical Data Warehouse vision, the “Hybrid Data Ecosystem” from
Enterprise Management Associates, and IBM's Big Data Platform. DV plays a key role in Radiant Advisors' Modern Data Platforms framework, too.
DV has come a long way. Half a decade ago, one of its key enabling technologies – viz., data federation – was treated as a kind of red-headed step-child, at least in BI circles.
Nancy Kopp-Hensley, program director for Netezza marketing and strategy with IBM, well remembers those days. Before federation first fell out of favor, Big Blue had bet big on it as a pragmatic way to manage exploding data volumes. But people tried to do too much with federation, Kopp-Hensley indicates, noting that distributed federated queries posed particular problems. "As a concept, [federation] was a fantastic thing, but people went overboard, instead of using it where it was needed and where it made sense," she says, noting that one customer attempted to federate queries across almost all of its distributed data sources. "We said 'federate for 20 percent of your data, for the occasional queries, for [information that you have in] disparate systems.' They were doing it with 80 percent [of their data] instead."
Kopp-Hensley still has something of a federation hangover, although she says that IBM – and, yes, its customers – are giving federation, as an enabling technology for DV, a long second look. "I still remember those days" when IBM was an aggressive promoter of federation, Kopp-Hensley concedes. "And I have the gray hairs to prove it. Where [the market is going is that] you're going to optimize your workloads in the right place and you're only going to move some data. And that's just a fact. That's why virtualization becomes more and more important. As a technology, it helps to minimize this movement [of data]."

Federation: The Agony and the Ecstasy
A decade ago, DV was indeed known as "data federation," or (more imposingly) as "enterprise information integration" -- EII for short. For a brief period, federation -- as EII -- was hyped as a data warehouse replacement technology: the idea was that instead of persisting all of your data into a single, central repository, you could connect to it where it lives -- in multiple, disparate repositories -- and surface it in the form of canonical representations or business views.
The primary claimed benefit of doing so was agility: mappings weren't fixed, data models weren't static, the ETL process -- with its data source mappings, its impact analysis, and its many other time-consuming requirements -- was eliminated altogether.
Not so fast: literally. Adoptees found that federation was slow -- so slow that even with caching, it just wasn't up to the task. The EII tools of old likewise lacked many of the data preparation or data manipulation facilities offered by ETL products.

SIDEBAR: BIG DATA PHYSICS [By Stephen Swoyer]
From a traditional data management perspective, the physics of the big data universe might as well partake of those of a black hole. The key thing about the physics of a black hole is that we don't know anything about them. This isn't strictly true of big data: we do know a lot about it. What we know flies in the face of DM orthodoxy, however. The data warehouse (DW) expects to model the data that it consumes in advance; big data platforms were conceived precisely to avoid this requirement. The DW model strictly separates data processing from data storage; big data tightly couples both. The warehouse is grounded in still another assumption: that conditions or requirements won't significantly change. It isn't quite true that big data platforms don't make assumptions about changing conditions, but it is the case that big data platforms are perceived as more agile or flexible than "inflexible" business intelligence or DW systems.
Physicists speculate that the physics inside the event horizon of a black hole actually take the form of a sort of bizarro inversion of our own low-gravity physics. It isn't a stretch to say that the physics of the big data universe have a similar relationship to those of the classical DM model: inverted.
This isn't just a question of theory, either; it has a practical aspect, too. If you want to move just a subset of a 10 PB Hadoop repository, you must still move several hundred gigabytes of data. A lot of organizations have already broken the petabyte barrier: Teradata, for example, has dozens of customers with a PB or more of relational data. This doesn't include the semi-structured information – harvested from social media; continuously streaming from machines, sensors, etc. – that's being consumed as grist for the big data analytic mill. Most of this semi-structured stuff is landing in Hadoop, Cassandra, or in other (typically NoSQL) repositories – where it will remain.
This last is the primary difference between data federation and data virtualization: like federation, DV enables an abstraction layer – but it's able to do a lot more with data. In a sense, DV effectively encapsulates the "T" of ETL in that it's able to manipulate data -- e.g., transforming, recalculating, and so on -- to conform to business views or canonical representations. As a caching technology, some DV tools effectively perform ETL: e.g., extracting data from source systems and loading it -- prepared, manipulated, and transformed -- into cache. DV can also incorporate data profiling and data cleansing routines.
So DV isn't data federation. Why does this matter? According to advocates, it matters because of the daunting physics of the big data universe: at big data scale, it is physically and economically impossible to sustain the ETL-powered DI status quo, with its emphasis on data movement and (in some incarnations) its insistence on a separate staging area for data. It matters, proponents say, because federation provides a practical means of managing or negotiating the physics of big data, while DV, with its built-in support for in-flight data manipulations, also conforms source data so that it can be consumed by BI, analytic discovery, Web applications, or other front-end tools.

Sons (and Daughters) of Anarchy
Big data scale is just part of the problem, said Composite's Besemer. Even in shops that have an enterprise data warehouse (EDW), rogue or non-centralized platforms are a rule, not an exception. Vendors such as QlikTech (developer of QlikView), Tableau Software (developer of Tableau), and Tibco (developer of Spotfire) have successfully marketed their products directly to the line-of-business. This is the analytic discovery use case, which has done so much to contest the centralized authority of the data warehouse.
So, too, has bring-your-own-device, or – in its DM-specific variant – bring-your-own-DBMS. Even though Teradata is an indefatigable champion of the EDW, many Teradata shops have rogue SQL Server instances hanging off of (i.e., siphoning data out of) their Teradata EDWs.
This isn't an anomalous phenomenon. From the perspective of Jane C. Frustrated-Business-Analyst, the allure of Excel-powered discovery has always proven to be irresistible; this – along with the inflexibility of the DW-driven BI model – helped create the conditions for what industry luminary Wayne Eckerson famously called "spreadmart hell."
This could account for the bring-your-own-SQL-Server trend in Teradata environments, too: Microsoft has bundled SQL Server Analysis Services (SSAS) with its flagship RDBMS for the last 15 years; the line of business can easily buy and deploy SQL Server on its own: why not do so? "The idea of being able to
claw back these rogue data marts and making them virtual data marts, even if they're cached, is quite attractive," Besemer indicated.
This is another reason, he reiterated, why DV isn't going to replace ETL -- or most other DM techniques, for that matter: as a technology, DV provides a virtual abstraction layer; it connects to platforms or repositories where they live. Many of these sources will be ETL-fed data warehouses, others will be operational systems, still others will be Web applications or social media services. "I will emphasize again: this isn't a replacement for the DM techniques you already have; it essentially augments those techniques," he commented.
"You may … want to use [DV] canonicals as data services in your business processes as you move forward to a more loosely-coupled architecture. At the top of the data virtualization layer, you have a choice to deliver data in either a relational model that looks like a database, or as a service, in HTTP or JMS. Most customers end up doing both."

Future Shock
Performance was the Achilles Heel of DV back when it was hyped as a data warehouse replacement technology. All DV vendors tend to address performance issues by using caching; most also focus on query optimization, too. The idea is to push joins and queries down into the source database engine; this has the effect of reducing the amount of data that needs to be moved. Radiant Advisors' O'Brien and several other industry watchers have praised Composite's query optimizer. O'Brien -- invoking the unofficial sobriquet of the Oracle database -- even dubbed it the "Cadillac" of query optimizers.
Nevertheless, Besemer acknowledges that there will be cases in which even aggressive caching and a smart optimizer won't be enough to address service level agreements. "[T]his [i.e., the distributed query optimizer] is the hard part. The key to making this work is to reduce the amount of data that has to be moved; in order to reduce this … you have to optimize where the work gets done to reduce [the movement of] that data. Oftentimes, that means pushing joins and queries down to the data source," he said. "In the end, even with the best optimization … if it doesn't run as fast as you need it to run, then you need to consider another possibility. But the idea is to push as much work as you can down to the leaf nodes." (DV has a number of other potential issues, too.)
Kopp-Hensley thinks that virtualization, with its ability (a) to present a unified view of disparate data sources and (b) to conform source data so that it can be consumed by BI tools, (c) to accommodate non-relational or non-ODBC/JDBC APIs, such as REST – to say nothing of (d) its capacity (with caching and a good query optimizer) to support limited federated query scenarios – is a pragmatic solution for "doing" big data physics.
In this respect, she echoes the assessment of Composite's Besemer: ETL – as a process – isn't going anywhere; the data warehouse – as a conceptual structure – will survive and (perhaps even, if only as a virtual canonical representation) thrive – and DV will knit everything together. Does this include IBM's own Big Data Platforms vision? Kopp-Hensley is coy. Virtualization technology (in the form of both Vivisimo and IBM's InfoSphere-branded data federation technologies) does have a role in its Big Data Platforms vision, she stresses. "It's going to evolve," she says, hinting that IBM may – or may not – be planning a virtualization appliance. "It's not going to be snap-your-fingers-and-all-of-these-things-consolidate; these things are going to take some time. The data warehouse is not going away anytime soon, but it is evolving," Kopp-Hensley continues. "We're going to have a heterogeneous environment, so virtualization is starting to make more sense, [as is] the ability to explore [information] where [it lives] and [to] not have the applications care about where the data is. This is the new priority: ideally you want your applications to not care where the data is."
Share your comments >
Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence and data warehousing for almost a decade.
FEATURES
DATA AND THE LAW OF ATTRACTION
[With great flexibility comes greater responsibility -- this is the next challenge.]
JOHN O'BRIEN
FROM A DATA MANAGEMENT (DM) perspective, one of the more interesting aspects of Hadoop's impact on the world of enterprise data management (EDM) hasn't really been about the "three V's and C" (volume, velocity, variety, or complexity). It's been that, in order to achieve grand scalability, the data itself has had to be broken down to its most fundamental element: the key-value pair. Hadoop and other key-value data stores leverage this simplicity -- coupled with the power of the MapReduce framework -- to work with data however the user needs to. This has also given us the neologism "NoSQL," which we know as "Not-Only" SQL -- and which unshackles the users from the structured world of data.
Here's the twist: what we have gained is the default aspect of separating the persisted data from its semantic context. An application or user that retrieves data from flexible key-value stores (like Hadoop) can now determine the context in which they need to work with the data, whether creating a record by constructing key-value pairs into a database tuple or by loading key-value pairs into a similarly constructed object model: the knowledge and responsibility lies with the developer.
This process of abstracting the semantic context for the data is not completely new to data modelers, though it has taken on a much purer form in recent years. Abstraction has occurred at three different levels within the business intelligence (BI) architecture: inside the database through the use of views, links, and remote databases; above the database with data virtualization technologies; and finally, in the BI tools themselves, through the use of meta catalogs and access layers across multiple databases. Each time, this projection (or view of data) contains the semantic context delivered by the administrator.
In Radiant Advisors' three-tiered modern data platform (MDP) framework for BI, tier one is for flexible data management with scalable data stores, like Hadoop, that rely on some form of abstraction for semantic context to
work with data. This flexibility allows users to perform semantic discovery in a very agile fashion through the benefits of meta-driven data abstraction and no data movement. Hadoop leverages Hive -- or HCatalog -- for the semantic definition of tables, rows, and columns familiar to SQL users. These definitions can be created, tested, and modified (or discarded) quickly and easily without having to migrate data from structure to structure in order to verify semantics.
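As an illustration of this kind of disposable semantic definition (a sketch under assumptions: the HDFS path, table, and column names are hypothetical, and PyHive is just one way to submit HiveQL), a definition can be laid over files already sitting in HDFS and dropped again without touching the data:

```python
# Minimal sketch: define, test, and discard a semantic view over data that
# already lives in HDFS. Paths, names, and the PyHive driver are assumptions.
from pyhive import hive

cursor = hive.connect(host="hadoop-edge-node", port=10000).cursor()

# Lay a tabular, SQL-friendly context over existing delimited key-value files.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clickstream_raw (
        user_id   STRING,
        event     STRING,
        event_ts  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/clickstream'
""")

# Test the definition...
cursor.execute("SELECT event, COUNT(*) FROM clickstream_raw GROUP BY event")
print(cursor.fetchall())

# ...and discard it if the semantics are wrong; because the table is EXTERNAL,
# only the metadata goes away and the underlying files stay put.
cursor.execute("DROP TABLE clickstream_raw")
```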
Tier 2 in the MDP framework is designated for data stores that require highly-optimized analytic workloads, such as cubes, columnar, MPP, and in-memory data, and for highly-specialized analytic workloads, such as text analytics and graph databases.
Tier 3, however, is for reference data management, with schemas for storing data in structures that are based upon semantic context. This data tends to be more critical subject areas or master lists, or business event data that needs to maintain a high-definitional consistency for enterprise consumers (and is used with data from the Hadoop world to provide qualifying context to events). While you can store a master customer list, product list, or their key subject area(s) in Hadoop, you have to ensure consistency by tightly governing its abstraction layer, which inherently allows for other semantic definitions to be easily created on the same data. It simply makes more sense -- and carries less risk -- to embed this context into the schema itself to ensure a derived consistency of use in the enterprise.
Following this three-tier framework, BI architects have a more balanced approach to managing the semantic context of data for the enterprise, while still having to make key decisions regarding their architecture. Data stored in Hadoop relies on Hive or HCatalog to store the proper context, while data stored in data warehouses has the context embedded into the schema for derived consistency. Analytic databases can access data from both through the use of projections, views, and links. Data virtualization provides an abstraction layer for users around all three data tiers. The answer for BI architects is to blend these abstraction approaches and to focus on governing semantic context carefully, rather than taking it for granted.
Radiant Advisors has been working with companies to extend existing (or create new) data governance processes that handle the concept of a new Semantic Context Lifecycle for data and analytics management.
• Step 1 begins with "Semantic Discovery" in flexible environments, at the hands of business analysts or data scientists, to define new context in the abstraction layer or in MapReduce programs. This should be dependent upon a proper definition of new data governance roles (such as data scientists) and corresponding responsibilities, accountability, and delegation rights.
• Step 2, "Semantic Adoption," is for the data governance process to evaluate and decide whether the discovered context needs to be governed and consumed in temporary or permanent -- and local or enterprise -- context.
• Step 3 involves deciding how and where context needs to be governed -- this can remain in the Hadoop abstraction layer for a defined set of users, or BI projects can deliver the context via ETL or data virtualization to the reference data tier.
With great flexibility comes greater responsibility, and this is the next challenge for data governance and big data. Our best practice is a fundamental "law of attraction" as to whether the semantic context should be attracted closest to the data as schema, or whether it should be attracted closer to the end user in one of the layers of abstraction.
The law of attraction is simple: the more need there is for enterprise consistency and mass consumption, the closer the semantic context should be to the data, as schema to be inherited by all consumers. Likewise, when there's more of a localized context or perspective of the data involved, then context should reside in one of the abstraction layers. Strong governance and standards are the same whether the data persists in key-value data stores or structured schema data. Governing semantic context is a function of its role within the enterprise, and this should allow BI architects to blend fixed schemas and abstraction layers while relying on data owners and stewards to make key determinations.
Share your comments >
John O'Brien is the Principal and CEO of Radiant Advisors, a strategic advisory and research firm that delivers innovative thought-leadership, publications, and industry news.
VENDOR
GAIN BIG ADVANTAGE WITH DATA VIRTUALIZATION ROBERT EVE
Business agility requirements and the proliferation of big data and cloud sources outside the enterprise data warehouse are challenging traditional data integration techniques, such as consolidation of summarized data into the EDW with ETL. As a result, more modern data integration approaches, such as data virtualization, are now seeing accelerated adoption. In 2012, Gartner surveys showed that approximately 27% of respondents were actively involved in, or had plans for, deployment of federated or virtualized views of data. This year, the TDWI Best Practices report Achieving Greater Agility with Business Intelligence revealed even higher data virtualization adoption statistics, with 19% currently in use and 31% planning to implement.

What is Data Virtualization?
Data virtualization is an agile data integration approach organizations use to gain more insight from their data. Unlike data consolidation or data replication, data virtualization integrates diverse data without costly extra copies and additional data management complexity.

Why Use Data Virtualization?
With so much data today, the difference between business leaders and also-rans is often how well they leverage their data. Significant leverage equals significant business value, and that's a big advantage over the competition. Data virtualization provides instant access to all the data you want, the way you want it. Enterprise, cloud, big data, and more: no problem! With data virtualization, you benefit in several important ways:
Gain more business insights by leveraging all your data – Empower your people with instant access to all the data they want, the way they want it.
Respond faster to your ever-changing analytics and BI needs – Five- to ten-times faster time-to-solution than traditional data integration.
Fast track your data management evolution – Start quickly and scale successfully with an easy-to-adopt overlay to existing infrastructure.
Save 50-75% over data replication and consolidation – Data virtualization's streamlined approach reduces complexity and saves money.
Who Uses Data Virtualization?
Data virtualization is used across your business and IT organizations.
Business Leaders – Data virtualization helps you drive business advantage from your data.
Information Consumers – From spreadsheet user to data scientist, data virtualization provides instant access to all the data you want, the way you want it.
CIOs and IT Leaders – Data virtualization's agile integration approach lets you respond faster to ever changing analytics and BI needs and do it for less.
CTOs and Architects – Data virtualization adds data integration flexibility so you can successfully evolve and improve your data management strategy and architecture.
Integration Developers – Easy-to-learn data virtualization design and development tools allow you to build business views (also known as data services) faster to deliver more business value sooner.
IT Operations – Data virtualization's management, monitoring, security and governance functions ensure security, reliability and scalability.

When To Use Data Virtualization?
You can use data virtualization to enable a wide range of applications.
Agile BI and Analytics – Improve your insight sooner.
Data Warehouse Extension – Increase your return on data warehouse investments.
Data Virtualization Architecture – Gain information agility and reduce costs.
Data Integration – Integrate your Big Data, Cloud, SAP, Oracle Applications and other sources more easily.
Logical Data Warehouse – Modernize your information management architecture.
Business and Industry Solutions – Support your unique information needs.

Data virtualization is not the answer to every data integration problem. Sometimes data consolidation in a warehouse, along with ETL or ELT, is a better solution for a particular use case. And sometimes a hybrid mix is the right answer.

Robert Eve leads marketing for Composite Software. His experience includes executive level marketing and business development at leading enterprise software companies.
UPCOMING INDUSTRY EVENTS

SIDEBAR: DATA VIRTUALIZATION EXPLAINED [By Stephen Swoyer]
Think of data virtualization (DV) as kind of like iTunes – for distributed data.
Like iTunes, DV has several different aspects: connectivity, preparation, and presentation.
On the connectivity tip, iTunes technically doesn't care where the actual files in your music library reside. It wants to consolidate most of your media files into a single central location – but it doesn't have to. It's possible to store media content on removable devices, on network storage, or – even – across a virtual private network (VPN).
Ditto for DV, which uses federation technology to enable a kind of abstraction layer: a DV tool doesn't have to care if your data's sitting locally on an internal LAN or SAN – or 1,000 miles away, in a remote office or business campus. (It doesn't have to, but it does – that's where DV-specific performance optimizations, such as query optimization and caching, come into the picture.) The emphasis with DV is on minimizing data movement; iTunes, by contrast, moves all of the bits of a source file – whether it's stored locally on the LAN or (in the case of a VPN) sitting on a file server in Cupertino.
When it comes to presentation, both iTunes and DV are all about the view.
This is kind of how DV technology works, too: you use it to build business views, which are canonical representations of the tables and columns in a source database. One key difference is that iTunes is its own front-end application – its own Tableau or QlikView, so to speak.
Unlike iTunes, "presentation" in a DV context is completely virtual. DV technologies are middlemen: you build a view in a DV tool with the expectation that it's going to be consumed by another application – typically one which speaks ODBC, JDBC, or SQL.
In both cases, a lot's happening beneath the covers. When iTunes builds a library view, it basically catalogs your media files, reading and caching metadata about them. It looks for specific metadata, such as "Artist," "Album Title," "Track Title," and so on.
When it connects to a data source, a DV tool does more or less the same thing: it doesn't replicate the contents of a source database table – just information about its structure. You can choose to create a representation of the complete structure of a source table – e.g., all of its rows and columns – or, more commonly, just a subset of this structure. You can (and, in the latter case, must) "tell" a view how you want to prepare or manipulate data: you can concatenate it, change the names in a tablespace, perform statistical functions, etc. You can also "stack" views in a DV tool to build more complex (i.e., "composite") views, or to "blend" data from multiple sources.
Before data can be consumed, it first has to be prepared. DV and iTunes do similar things on the preparation tip, too. In the case of iTunes, your media library is a collection of heterogeneous files: you'll have stuff encoded in AAC (M4A), stuff encoded in AVC (MP4), and so on. Some of these files could be sourced from a CD, which means they'll be encoded at 16-bit resolution with a 44.1-kHz sampling rate. Other files could be sourced from a DVD, which means they'll be encoded at a range of different resolutions and sampling rates.
Your computer's audio hardware might not be able to reproduce these files. It might also expect to consume audio at a certain sampling rate. In other words, iTunes has to be able to transform audio data so that it conforms to a format that's supported by your computer.
DV isn't a silver bullet. Like any other technology, it has costs and benefits. Performance is one common cost – particularly with respect to federated query scenarios. Adopters also need to be mindful of DV's impact on operational systems. And while DV technologies do encapsulate ETL or data quality functionality, their capabilities in this regard tend to be comparatively immature, at least with respect to best-of-breed ETL offerings.
RADIANT ADVISORS
RESEARCH ... ADVISE ... DEVELOP
Radiant Advisors is a strategic advisory and research firm that networks with industry experts to deliver innovative thought-leadership, cutting-edge publications and events, and in-depth industry research.
rediscoveringBI | Follow us on Twitter! @radiantadvisors