Radiant Advisors Publication
rediscoveringBI
May 2013 | Issue 8
DATA INTEGRATION PHYSICS
REIMAGINING DATA INTEGRATION: Untelling the Just So Story
DATA QUALITY & THERMODYNAMICS: A Predictable Decline into Disorder
FUTURE SHOCK: The Daunting Physics of the Big Data Universe
LAW OF ATTRACTION: Context and Data
rediscoveringBI
May 2013 | Issue 8

SPOTLIGHT
[P5] Just So Stories, DM Style
We need to stop uncritically telling and retelling Just So Stories, data management-style.
[By Michael Whitehead]

FEATURES
[P9] Second Law of Thermodynamics
How understanding entropy can help us attack our data quality problems.
[By Dr. Gian Di Loreto]
[P13] Future of Data Integration
Are the days of the data warehouse numbered? Not necessarily.
[By Stephen Swoyer]
[P17] Data and the Law of Attraction
The data itself has had to be broken down to its most fundamental element: the key-value pair.
[By John O'Brien]

EDITOR'S PICK
[P8] The One Thing
The key to success, according to Keller and Papasan, is to focus on ONE Thing.
[By Lindy Ryan]

SIDEBAR
[P14] BIG DATA PHYSICS [By Stephen Swoyer]
[P22] DATA VIRTUALIZATION EXPLAINED [By Stephen Swoyer]

VENDOR
[P19] GAIN BIG ADVANTAGE [By Robert Eve]
FROM THE EDITOR
Though there is certainly an art behind data management (DM) -- one industry media toolkit even has an entire series of briefings devoted to "the art of managing data" -- DM is nevertheless a science, and a complicated one at that. Increasingly large volumes, high complexity, and heterogeneity in the syntax, structure, and semantics of data are only a few of the problems that persistently plague data management today.
Taking a scientific look at data management can make for some interesting analogies -- and lend some intriguing insights -- into the way we approach data integration. Sure, it's not all as glamorous as time travel and string theory (though Stephen Hawking himself has been quoted as saying, "Science is not only a disciple of reason, but, also, one of romance and passion"), yet the physics of data integration still has plenty of room for excitement in exploration and discovery.
In this month's edition of RediscoveringBI, authors Michael Whitehead, Dr. Gian Di Loreto, Stephen Swoyer, and John O'Brien put on their (data scientist) lab jackets and light up their proverbial Bunsen burners with thoughts on the evolution of the data integration process; how entropy and thermodynamics can influence data quality; the physics of the big data universe itself; and the laws of attraction between data and context.

Lindy Ryan
Editor in Chief

Editor in Chief: Lindy Ryan, lindy.ryan@radiantadvisors.com
Contributor: Michael Whitehead, mikew@wherescape.com
Contributor: Gian Di Loreto, Ph.D., gian.d@loretotech.com
Contributor: John O'Brien, john.obrien@radiantadvisors.com
Distinguished Writer: Stephen Swoyer, stephen.swoyer@gmail.com
Art Director: Brendan Ferguson, brendan.ferguson@radiantadvisors.com
For More Information: info@radiantadvisors.com
Radiant Advisors
OPINION
LETTERS TO THE EDITOR

On: Twilight of the (DM) Idols

Mutual Victory?
So, the French eventually won the Hundred Years' War. I hope our current DM challenges will play out a bit differently. Generally, I'd like to see the current big-data plague/war result in a mutual victory.
To help move us toward this better future, I'd really like to see more discussion about effective DM techniques on HDFS. While the power and appeal of Hadoop is undeniable, so are the DM challenges. Unfortunately, most of the focus I've seen around Hadoop (and the related technologies) has been about power and speed. There isn't much discussion about effectively managing some of the issues you address, such as the fact that different data has different needs.
For instance, I would love to see some DMish discussions on recommended practices for managing raw data files versus computed files in HDFS. Frankly, the "never delete" mantra smells worse than a dead skunk to my two-decade-old data warehousing nose. All data loses some value over time at varying rates — how do we address this in the big-data era? It is a problem that must be solved at some point, and the more direct and thoughtful attention is given to this issue, the faster bridges will be built and our hybrid data ecosystem nirvana can be reached.
- Cj Applequist

On: Time for An Architectural Reckoning

The Facebook DW
In an earlier comment, one reader noted that Facebook's data warehouse is an extension of Hadoop (Hive). As of May 6, 2013, I can officially report that Facebook has implemented a traditional data warehouse alongside (or as a consistent complement to) its Hive/Hadoop store. As Ken Rudin explained in a keynote today, Facebook keeps "core" business information in its relational DW and uses an unspecified non-relational platform (probably Hive/Hadoop?) to support laissez-faire analytic discovery or experimentation. Yes, Virginia: Facebook has itself a venerable data warehouse platform.
- Stephen Swoyer (Editor's Note: "Time for An Architectural Reckoning" was published in RediscoveringBI, March 2013)

Have something to say? Send your letters to the editor at lindy.ryan@radiantadvisors.com
IT'S A WRAP
SPARK! Austin Recap
Last month, Radiant Advisors brought its inaugural launch of SPARK!, a three-day networking symposium, to Austin, Texas. SPARK! events provide a forum for discussing and exploring data management issues -- at the frontier. They focus on the intersection -- and, frequently, the collision -- of business needs with IT priorities, using the disruption triggered by new and emerging forces (such as big data technologies or new analytic practices and methods) as an organizing focus. SPARK! Austin mixed keynote presentations delivered by industry thought-leaders John O'Brien and Dr. Robin Bloor -- even including a "dueling perspectives" keynote in which O'Brien and Bloor, in a lively conversation with the attendee audience, shared their experiences in enabling data scientists through analytic architectures and processes -- and Data Strategy Sessions taught by Geoffrey Malafsky, William McKnight, and Mike Lampa, with presentations from sponsors Dell, Composite, and ParAccel.
Its centerpiece: Radiant Advisors' own Modern Data Platforms (MDP) framework, which CEO John O'Brien outlined over the course of SPARK!'s three days. Day One was devoted to the MDP platform vision, Day Two to MDP integration, and Day Three to analytics in an MDP context. MDP describes a synthetic architecture for addressing common and emerging business intelligence (BI) and business analytic (BA) requirements. It knits together traditional BI, dashboard, and OLAP technologies, which are all focused by the data warehouse (DW); analytic discovery, BA, and predictive analytics, which mix discovery platforms with specialty analytic tools; and big data platforms, which include Hadoop and a retinue of NoSQL repositories. MDP also addresses issues like information lifecycle management (ILM), governance, and business process optimization.
Elsewhere, Geoffrey Malafsky, CEO of data integration specialist Phasic Systems, outlined a methodology for rapidly reconciling data from disparate systems, using Phasic's schema-less "Corporate NoSQL" repository as a focus. Malafsky's mantra? "No data left behind!" William McKnight, founder of McKnight Consulting Group, emphasized the importance of choosing the right information store for the right workload; he urged attendees to compete on the basis of "information excellence." Finally, Mike Lampa, a managing partner with Archipelago Information Strategies, discussed how analytics can be used to complement -- and, ultimately, to drive -- business decision-making, with an emphasis on predictive analytics. Panel events sponsored by Dell, Composite, and ParAccel provided an opportunity for discussion among vendor representatives, industry experts, and attendees.
"The most exciting feedback from our Austin attendees was that this was the first event [they'd attended] where presenters didn't have to explain the basics of what BI or Analytics was," Radiant's John O'Brien said. "Instead we jumped right in to the hard challenges facing BI professionals in their jobs today, and that's what SPARK! and Modern Data Platforms is really all about: rethinking today's BI paradigms and coming together as a network of peers to learn from each other on how to challenge them."
The next SPARK! event is in Redwood City, CA, from June 26-28. Registration is open!
SPOTLIGHT
JUST SO STORIES, DM-STYLE [We need to stop uncritically telling and retelling Just So Stories, data management-style.]
MICHAEL WHITEHEAD
WE IN DATA MANAGEMENT (DM) like to tell ourselves stories about the world in which we work.
I call these "Just So Stories," after the collection of origin stories by Rudyard Kipling. Some of these stories encapsulate timeless truths. One of these – viz., the data warehouse (DW) – is our archetypal Just So Story: why do we have a data warehouse? Well, it's Just So. Were we to perform a bit of DM anthropology, we'd discover that it actually makes sense to persist data into a warehouse-like structure. The warehouse gives us a consistent, manageable, and meaningful basis for historical comparison. For this reason, it's able to address the overwhelming majority of business intelligence (BI) and decision support use cases. As a Just So Story, the warehouse isn't something that "just gets passed down" as part of the oral history of data management: it's a living and vibrant institution. Sometimes our Just So Stories do get passed down: they have a ceremonial purpose, as distinct from a living and vibrant necessity. In other words, we do certain things because we've always done them that way – because it's Just So. A purpose that's ceremonial no longer has any practical or economic use: it's something we're effectively subsidizing. For years now, we in DM have been subsidizing a senseless artifact of our past:
we've been treating data integration (DI) as its own separate category, with – in many cases – its own discrete tier. This long ago ceased to be necessary; as a Just So Story that we uncritically tell and retell, it's taken on an increasingly absurd aspect, especially in the age of big data.
We need to re-imagine DI. This means seeing data integration as a process, instead of as a category unto itself. This means bringing DI back to data: i.e., to where data lives and resides. This means letting go of the concept of DI as its own middleware tier.

How the Data Warehouse Got its ETL Toolset
Don't get me wrong: the creation of a discrete DI tools category was a market-driven phenomenon. A market for discrete DI tools emerged to fulfill functions that weren't being addressed by DBMS vendors. The category of DI tools evolved over time and can justly be called an unalloyed success. It even has its own Magic Quadrant! Now it's time for it to go away.
Think back to the 1970's and 1980's. Remember "Fotomats" -- those little hut-like structures that once bedighted the parking lots of supermarkets and strip-malls? Fotomat, too, was a market-driven phenomenon. As one of the first franchises to offer over-night film development on a massive scale, Fotomat performed a critical service, at least from the perspective of suburban shutterbugs. By 1980, Wikipedia tells us, Fotomat operated more than 4,000 kiosks. That was its peak.
By the mid-1990's, Fotomat was all but relict as a brick-and-mortar franchise. It didn't fail because photoprocessing ceased to be a good idea; nor because people stopped taking pictures. Fotomat got disintermediated; the market selected against it.
The development of the photo mini-lab fundamentally altered (by eliminating) the conditions of the very market Fotomat had emerged to exploit. Firstly, the photo mini-lab was more convenient: the first mini-labs appeared in supermarkets, drug stores, and big-box retailers. Secondly, the mini-lab service model was faster and in a sense more agile: it promised one-hour processing, compared to Fotomat's over-night service. Thirdly, it was cheaper: in the mini-lab model, processing was performed on-location. This eliminated an extra tier of pricing.
So it is with data integration. At one time, to be sure, the DI hub was a necessity. Database engines just weren't powerful enough, so source data first had to be staged in a middle tier, where it could be transformed and conformed before loading it into the warehouse or mart.
Thus was born the ETL platform: a separate staging and processing area for data, complete with its own toolset. As database engines became more powerful, this scheme became unnecessary. Nowadays, it's flatly harmful: it has the potential to break the warehouse. What happens when conditions change, as they must and will? When performance suffers? When – for whatever reason – it becomes necessary to refactor? In such cases, data must be re-extracted from the warehouse, loaded back into the DI middle tier, reconformed, and – finally – pushed back into the warehouse all over again. This is a time-consuming process. Change can and will happen. No large organization can afford to subsidize a process that requires it to consistently re-extract, re-conform, and re-load data back into its warehouse.
Thus was engendered the concept of "ETL push-down." This is the "insight" that, in cases where it doesn't make sense to re-extract data from the warehouse and re-load it back into the DI middle tier – basically, in any case involving large volumes of data – the ETL tool can instead push the required transformations down into the warehouse. In other words, the warehouse itself gets used to perform the ETL workload. This should have torpedoed the viability of the DI middle tier. After all, if we can push transformations down into the database engine, why do we need a separate DI hub? One (insufficient) answer is that the database can't perform the required transformations as quickly as can a dedicated ETL processing server; a rebuttal is that few ETL platforms optimize their push-downs on a per-platform basis: they don't optimize for Oracle and PL/SQL, for SQL Server and T-SQL, and so on.
But with big data, this becomes a moot point. As a practical necessity, when you're dealing with data volumes in the terabyte or petabyte range, you want to move as little of it as you have to. The new way of "doing" DI is to go to where the data lives.
This means using (for example) Hive, HQL, and HCatalog in connection with Hadoop; it means using the data manipulation facilities offered by MongoDB, Cassandra, graph databases, and other nontraditional platforms. It means integrating at both the origin and at the destination. As a practical matter, you just can't move 50 or 100 TB of data in a timely or cost-effective manner. But you don't have to! Just keep the data where it's at – e.g., in Hadoop – and run some Hive SQL on it. The point is that you're only moving a small amount – in most cases, less than one percent – of a much larger volume of data. You're moving the data that's of interest to you, the data that you actually want to deal with.
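To make the "go to where the data lives" point concrete, here is a minimal sketch (not from the article) of running an aggregate in Hive and retrieving only the small result set. The host, table, and column names are hypothetical, and PyHive is just one of several DB-API drivers that can submit HiveQL.

```python
# Minimal sketch: query data where it lives (Hive over Hadoop) and pull back
# only the small, already-aggregated result. Host, table, and column names are
# hypothetical; PyHive (pip install pyhive) is one DB-API driver that can
# submit HiveQL -- any equivalent driver would do.
from pyhive import hive

HQL = """
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM   web_orders              -- billions of rows stay in HDFS
WHERE  order_date >= '2013-01-01'
GROUP  BY region               -- Hive/MapReduce does the heavy lifting
"""

conn = hive.connect(host="hadoop-edge-node", port=10000)
cursor = conn.cursor()
cursor.execute(HQL)

# Only a handful of summary rows -- not 50 or 100 TB -- cross the wire.
for region, orders, revenue in cursor.fetchall():
    print(region, orders, revenue)
```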
At the destination, you're performing whatever transformations you require in the DBMS engine itself: all of the big RDBMS platforms ship with built-in ETL facilities; there are also commercial tools that can generate optimized, self-documenting, platform-specific SQL for this purpose. (As some of you will have guessed, WhereScape knows a thing or two about this approach.) Other tools generate executable code to accomplish the same thing. This is the best way to deal with the practical physics – the math – of big data. It's likewise consonant with what it means to re-imagine data integration as a holistic process, instead of as a discrete tier or toolset: in other words, DI happens where my data is.
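As an illustration of what "generating platform-specific SQL" can look like, here is a minimal, hypothetical sketch (the table, column, and dialect details are assumptions, not any vendor's actual output): the same conform-and-load step rendered as Oracle-flavored and SQL Server-flavored statements that the target engine itself executes.

```python
# Minimal sketch of ELT-style "push-down": render one logical transformation
# as platform-specific SQL and let the warehouse engine do the work.
# Table and column names are hypothetical; the dialect handling is illustrative,
# not a real ETL product's code generator.
TRANSFORM = {
    "oracle": (
        "INSERT INTO dw.customer_dim (customer_key, full_name, loaded_at)\n"
        "SELECT customer_id,\n"
        "       first_name || ' ' || last_name,   -- Oracle-style concatenation\n"
        "       SYSDATE\n"
        "FROM   staging.customer_src"
    ),
    "sqlserver": (
        "INSERT INTO dw.customer_dim (customer_key, full_name, loaded_at)\n"
        "SELECT customer_id,\n"
        "       first_name + ' ' + last_name,     -- T-SQL-style concatenation\n"
        "       GETDATE()\n"
        "FROM   staging.customer_src"
    ),
}

def render_load_sql(platform: str) -> str:
    """Return the conform-and-load statement for the target warehouse engine."""
    return TRANSFORM[platform]

if __name__ == "__main__":
    for platform in ("oracle", "sqlserver"):
        print(f"-- {platform}\n{render_load_sql(platform)}\n")
```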
DI is a Process, Not a Category or Toolset
Whether you're talking about big data or little data, DI is a process; it's about moving and transforming data. On paper – or in a Visio flowchart – you'd use an arrow to depict the DI process: it points from one box or boxes (sources) to another box or boxes (destinations). The traditional DI model breaks that process flow, interposing still another box – the DI hub, or DI tools – in between source and destination boxes. Think of this DI box in another way: as a non-essential cost center. Whenever we draw a box on a flowchart, it represents a specific cost: e.g., hiring people, training people, maintaining skillsets, etc.
To eliminate boxes is to cut costs; to maintain a cost center after it has outlived its usefulness is to subsidize it. We tend not to do a lot of subsidizing in DM. For example, we used to use triggered acquisitions whenever we needed to propagate table changes: this meant (a) putting a "trigger" on the DBMS tables from which we wanted to acquire data, (b) creating a "shadow" table into which this information could be "captured," and (c) writing SQL code to copy this data from the master table into the shadow table.
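As a concrete illustration of steps (a) through (c), here is a minimal, self-contained sketch using SQLite (the table names are hypothetical; a production DBMS would use its own trigger dialect). The trigger combines (a) and (c): it fires on changes to the master table and copies the changed row into the shadow table.

```python
# Minimal sketch of a "triggered acquisition": a trigger on the master table
# captures every change into a shadow table. SQLite keeps the example runnable;
# table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT);

    -- (b) the shadow table that captures propagated changes
    CREATE TABLE customer_shadow (
        id INTEGER, name TEXT, email TEXT,
        change_type TEXT, changed_at TEXT
    );

    -- (a) + (c) the trigger on the master table copies changed rows across
    CREATE TRIGGER trg_customer_update AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_shadow
        VALUES (NEW.id, NEW.name, NEW.email, 'U', datetime('now'));
    END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada Lovelace', 'ada@example.com')")
conn.execute("UPDATE customer SET email = 'ada@analytical.example' WHERE id = 1")

print(conn.execute("SELECT * FROM customer_shadow").fetchall())
```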
Once change data capture (CDC) technology became widely available, we used that instead. Nowadays, CDC technology is commoditized: all major RDBMSes ship with integrated CDC capabilities of some kind. We still use triggered acquisitions for certain tasks, but we don't do so by default. Instead, we ask a question: "Under what circumstances would we use this?"
To re-imagine data integration as a process is to ask this same question about DI concepts, methods, and tools. It's to critically interrogate our assumptions -- instead of continuing to subsidize them. Unless we do so, we're just projecting – we're dragging – our past into our present: we're uncritically telling our own DM-specific set of Just So Stories.
Share your comments >
Michael Whitehead is the founder and CEO of WhereScape Software. He has been involved in data warehousing and business intelligence for more than 15 years.
EDITOR’S PICK
THE ONE THING LINDY RYAN
If Jack Palance's City Slickers character, Curly, didn't quite drive home the message about "one thing," Gary Keller and Jay Papasan certainly will in their newly released The ONE Thing: The Surprisingly Simple Truth Behind Extraordinary Results.
Keller and Papasan's central message is clear: going small (narrowing your focus to ONE Thing) can help you realize extraordinary results. This – focusing on the things we should be doing, rather than what we could be doing – isn't something we don't already inherently know: when we try to do too much and we stretch ourselves too thin, we sacrifice productivity and quality – among other things. By pushing too hard – doing too much – we inevitably end up frustrated, exhausted, and stressed (not to mention bankrupted by the toll of missed opportunities with family, friends, and time spent pursuing other interests). Over time, our expectations lower, until eventually our hopes and dreams – our lives – become small. And it doesn't take Keller and Papasan to tell us that our lives are the one thing not to keep small!
The key to success, then, is to focus on ONE Thing and to do it well, do it best, then do it bigger and better. It's Lorne Whitehead's 1983 domino experiment – wherein he discovered that a single domino could bring down another domino 50 percent larger in size, then another, and another – with a practical application. Reaching one goal topples another domino, building success sequentially, ONE Thing at a time. This success discloses clues in the form of one passion, one skill, and even one person that makes such an impact on our lives that it changes our course indefinitely – reinforcing that ONE Thing as a fundamental truth at the heart of our success.
We may not all remember Curly's speech about "the one thing" in City Slickers, but we likely all can recall Jack Palance's acceptance speech at the 64th Academy Awards (where he took home the Oscar for Best Supporting Actor for said film). After talking roundabout-ly about not being too old to achieve goals, and much to the amusement of everyone, Jack performed a series of one-armed push-ups on stage. Having proved his point that you never give up on working towards your goals – your ONE Thing – Jack referred to his first producer, who predicted, two weeks into filming 1950's Panic in the Streets, that Jack would one day (or, 42 years later) win an Academy Award – and he did! Obviously, that Award was not Jack's first domino, but every step along the way was one more domino knocked down in pursuit of his ONE Thing.
Share your comments >
The ONE Thing is available on Amazon and the Radiant Advisors eBookshelf: www.radiantadvisors.com/ebookshelf. The ONE Thing: The Surprisingly Simple Truth Behind Extraordinary Results, published by Bard Press, ISBN-13: 978-1885167774, $24.95.
Lindy Ryan is Editor in Chief of Radiant Advisors.
FEATURES
[We need to plan around the expectation that data quality will deteriorate.]
DATA QUALITY AND THE SECOND LAW OF THERMODYNAMICS GIAN DI LORETO, PH.D
THIS MORNING I CALLED to make an appointment with my doctor. The woman on the other end of the line was apologetic: she was working with a new system which had already gone live, though the data from the old system hadn't yet been properly migrated. She was under orders to enter all appointments into the new system and the old system, and to write them down on paper, too.
"Shouldn't the new system be better and make things easier than the old system?" I asked. She was not amused: she told me it was a "total nightmare" and she wasn't in the mood to laugh about it. I spared her my sermon, even though I wanted to point out that what she was seeing was not a mistake or an oversight -- it was a law of nature.
Do you remember Dr. Malcolm -- the annoying scientist from Jurassic Park -- rambling on about chaos theory and how the unpredictability of complex systems was an immutable law of the universe and, somehow, because of this the ill-fated theme park was doomed to fail? I didn't like his character in the book (although I did like Jeff Goldblum), but this concept is actually a fact: a law of physics which itself is responsible for poor and worsening data quality -- and the frustration of the assistant I spoke with this morning.
In this article I will explore the concept of entropy, defined below, how it applies to data, and how we can use this connection between the data world and the real world to understand and to attack our data quality problems.
The Second Law and Entropy
There are many different ways to state the Second Law of Thermodynamics but they all boil down to this: the entropy of any system is monotonically increasing. This includes the entropy of the universe as a whole.
Entropy is a quantity used by physicists (and other scientists and engineers) to measure the organization of a system: larger values for entropy are associated with less organization, while minimum entropy equals maximum organization. You can calculate the entropy of any physical system -- as long as you can define boundaries -- or you can calculate the entropy of the entire universe. Don't think of Chaos Theory or any of that cool stuff: focus on how entropy measures organization.
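For readers who want the textbook notation behind this description (these standard formulas are not from the article itself), Boltzmann's statistical definition ties entropy to the number of microscopic arrangements, Clausius's form ties entropy change to heat flow, and the Second Law bounds the total change:

```latex
% Standard textbook forms (not from the article):
%   W = number of microscopic arrangements (microstates), k_B = Boltzmann's constant,
%   \delta Q_{rev} = reversible heat, T = temperature
\begin{align*}
  S  &= k_B \ln W                              &&\text{(more arrangements = more disorder)}\\
  dS &= \frac{\delta Q_{\mathrm{rev}}}{T}      &&\text{(entropy change from heat flow)}\\
  \Delta S_{\mathrm{universe}} &\ge 0          &&\text{(the Second Law)}
\end{align*}
```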
One simple real-life example is temperature. Temperature can be defined as the average kinetic energy of particles in a system. A glass of water at room temperature has lots of water molecules spread out and lined up in no particular order, wiggling back and forth, bouncing off each other. Bring that glass of water to freezing and the water molecules line up in nice patterns -- entropy decreases. Bring that glass of water to absolute zero (the total absence of any heat at all) and you achieve zero entropy. Even after freezing, the water molecules still oscillate back and forth, but at absolute zero they stop moving altogether.
Now, think of a database rather than a glass of water. For a database, absolute zero is achieved in two ways:
1. Make sure all the data in your system is correct and perfect, and
2. Disconnect it from all other systems that may write to or alter the data.
The first is, as we know, very hard to do, and the second is doable but will yield a static, useless database. However, we can use these concepts to help us understand, measure, and improve data quality.
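One way to make "measure" concrete, purely as an illustration (this is my sketch, not the author's method, and it uses Shannon's information entropy rather than thermodynamic entropy as the disorder yardstick): compute the entropy of a column's value distribution and watch it rise as inconsistent variants creep in.

```python
# Illustrative sketch (not from the article): Shannon entropy of a column's
# value distribution as a rough "disorder" score for data quality. Clean,
# standardized values score low; inconsistent variants push the score up.
from collections import Counter
from math import log2

def column_entropy(values):
    """Shannon entropy, in bits, of the distribution of values in a column."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in counts.values())

clean = ["IL", "IL", "NY", "NY", "NY", "CA", "CA", "CA"]
messy = ["IL", "Ill.", "illinois", "NY", "N.Y.", "new york", "CA", "Calif"]

print(f"clean column entropy: {column_entropy(clean):.2f} bits")   # lower
print(f"messy column entropy: {column_entropy(messy):.2f} bits")   # higher
```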
We Can Prepare for It
Though she was in no mood to hear it, what my doctor's assistant was dealing with this morning was completely predictable. Moreover, it's a phenomenon that will continue to plague the healthcare industry as it moves towards electronic medical records (EMRs) and as the industry as a whole grows. Add to that all the new tests and drugs we have today and we have a big data problem in healthcare -- a big, "big data" problem.
We should collectively take some solace in the fact that this problem isn't because someone in IT didn't do his or her job, or because the new system was poorly designed (although either of these could be true).
What is certain is that the project that involved the rollout of the new scheduling system at my doctor's office didn't take into account the Second Law, plan and prepare for the associated fallout, or take proactive measures to minimize the downstream impact.

Energy In, Entropy Down
The trick when rolling out a new system and/or migrating data between systems is to assume that:
1. The source data is bad, and
2. Migrating the data from the old system to the new system will "raise the temperature" of the overall data and allow for larger values of entropy, i.e., a decrease in data quality.
While it is true that the overall entropy of the universe is increasing and there is nothing that we can do about it, we can act on individual closed systems and decrease the associated entropy: by adding energy to a system, we reduce its entropy. It took energy, for example, to run the compressor and remove heat from the glass of water we described above. So, too, does it require energy to reduce the entropy of our data.
Energy -- in this context -- refers to a project plan, time, and resources specifically dedicated to this immutable fact. Do not be satisfied with the standard approach of shoving the old data into the new system and handling the kick-outs one by one. This is still how I see 90% of implementations handled, and it is bound to fail. The approach doesn't pay the proper respect to the problem; it doesn't treat it as a serious issue that will deeply affect the chances of the project's success -- whether you expect it to or not.
Data quality must be a major consideration in any new system rollout, data conversion, data warehousing effort -- virtually any data project. Ignoring it is exactly the same as creating a project plan that calls for lowering data quality. It will happen: it is a law of nature and it cannot be stopped. The only way to minimize the damage is to plan and assign the appropriate resources to data quality in the project plan, or, to put it in the spirit of this discussion: to add energy to the system.

Conclusion
The future of our universe, unfortunately, is bleak. The Hubble telescope confirms that the universe is expanding, which, together with the Second Law, leads us to believe that the universe is retreating from a state when we were all at maximum order, minimum entropy. The Big Bang theory, in fact, tells us that, at the beginning, all of space and time and mass and energy were at a single, very heavy singularity in space-time. For some reason, this singularity exploded and the universe has been moving away from its center ever since. One theory predicts that, at some point, the expansion will stop and the retraction will start (you can see why I didn't think the assistant this morning would want to hear all this). However, if we restrict ourselves to closed systems, we can carefully apply energy and vastly reduce the entropy.
We have the desire, the tools, and the knowledge, but what I continue to observe is the surprise when a data project ends up in rubble -- the go-live is pushed, or we go live with bad data and spend another year (or more) hammering out the issues one by one.
One of the great things about the concept of a data warehouse is that it allows us to create a closed system and then to open that system to the outside, carefully and gradually. It is, by the way, much harder to create closed systems in real life: to achieve absolute zero in the lab is practically (and actually) impossible; we can come very close, but we need exponentially more energy the closer we try to get. In the data world, this is one great advantage we have over the real world.
What we need is the expectation that data quality will deteriorate, and to plan around this fact -- rather than to be surprised after the fact, after timelines and budgets are established. Because ultimately, it's not anyone's fault: it's a law of nature.
Share your comments >
Dr. Gian Di Loreto's training in physics, math, and science, combined with fourteen years in the trenches dealing with real-life data quality issues, puts him among the nation's top experts in the science of data quality.
Redwood City, CA | June 26-28, 2013
Modern Data Platforms: BI and Analytics Network Symposium
Day One | Designing Modern Data Platforms These sessions provide an approach to confidently assess and make architecture changes, beginning with an understanding of how data warehouse architectures evolve and mature over time, balancing technical and strategic value delivery. We break down best practices into principles for creating new data platforms.
Day Two | Modern Data Integration These sessions provide the knowledge needed for understanding and modeling data integration frameworks to make confident decisions to approach, design, and manage evolving data integration blueprints that leverage agile techniques. We recognize data integration patterns for refactoring into optimized engines.
Day Three | Databases for Analytics These sessions review several of the most significant trends in analytic databases challenging BI architects today. Cutting through the definitions and hype of big data in the market, NoSQL databases offer a solution for a variety of data warehouse requirements.
Register now at: http://radiantadvisors.com
Featured Keynotes By: John O'Brien, Founder and CEO, Radiant Advisors; Shawn Rodgers, Vice President Research, BI, Enterprise Management Associates
Sponsored By: (sponsor logos)
FEATURES
[Are the days of the data warehouse numbered?]
DATA INTEGRATION FUTURES: THE DATA WAREHOUSE GETS A REPRIEVE? STEPHEN SWOYER
Are the days of the data warehouse (DW) numbered? Not necessarily.
Take it from Composite Software, which established its reputation as an early champion of data virtualization (DV) software.
DV and data warehousing have had something of a rocky relationship, thanks chiefly to the way in which DV (in a former incarnation) used to be marketed. But this past is just that: past, argues David Besemer, chief technology officer with Composite: "There's always been this tension between which data should go into the warehouse and which data should not -- and how are we going to get it there. I think we can all agree that all of the data does not go into the warehouse."
"This doesn't mean that the data warehouse goes away or becomes unimportant. It doesn't mean that ETL goes away and becomes unimportant," he continued.
"It means that … data lives in different places and [that] you don't want to or can't move it to other places. It [means] that it also has different shapes," Besemer indicated. "Data stored in Hadoop … might be stored in a sparse structure or [as] semi-structured. It's not just about connecting via ODBC or JDBC; [sometimes] you have to use HTTP or maybe JMS."
DV Reconsidered
DV is in the midst of a rehabilitation of sorts. Many experts, both in and out of data management (DM), suggest that a unifying DV layer could provide the substrate for a synthetic architecture that scales to address common and emerging DM use cases. These include reporting, dashboards, and OLAP-driven discovery; business analytics and analytic discovery; specialty databases; and big data analytics. DV figures prominently in several next-gen architectures, including Gartner's Logical Data Warehouse vision, the “Hybrid Data Ecosystem” from
Enterprise Management Associates, and IBM's Big Data Platform. DV plays a key role in Radiant Advisors' Modern Data Platforms framework, too.
DV has come a long way. Half a decade ago, one of its key enabling technologies – viz., data federation – was treated as a kind of red-headed step-child, at least in BI circles.
Nancy Kopp-Hensley, program director for Netezza marketing and strategy with IBM, well remembers those days. Before federation first fell out of favor, Big Blue had bet big on it as a pragmatic way to manage exploding data volumes. But people tried to do too much with federation, Kopp-Hensley indicates, noting that distributed federated queries posed particular problems. "As a concept, [federation] was a fantastic thing, but people went overboard, instead of using it where it was needed and where it made sense," she says, noting that one customer attempted to federate queries across almost all of its distributed data sources. "We said 'federate for 20 percent of your data, for the occasional queries, for [information that you have in] disparate systems.' They were doing it with 80 percent [of their data] instead."
Kopp-Hensley still has something of a federation hangover, although she says that IBM – and, yes, its customers – are giving federation, as an enabling technology for DV, a long second look. "I still remember those days" when IBM was an aggressive promoter of federation, Kopp-Hensley concedes. "And I have the gray hairs to prove it. Where [the market is going is that] you're going to optimize your workloads in the right place and you're only going to move some data. And that's just a fact. That's why virtualization becomes more and more important. As a technology, it helps to minimize this movement [of data]."

Federation: The Agony and the Ecstasy
A decade ago, DV was indeed known as "data federation," or (more imposingly) as "enterprise information integration" -- EII for short. For a brief period, federation -- as EII -- was hyped as a data warehouse replacement technology: the idea was that instead of persisting all of your data into a single, central repository, you could connect to it where it lives -- in multiple, disparate repositories -- and surface it in the form of canonical representations or business views.
The primary claimed benefit of doing so was agility: mappings weren't fixed, data models weren't static, the ETL process -- with its data source mappings, its impact analysis, and its many other time-consuming requirements -- was eliminated altogether.
Not so fast: literally. Adoptees found that federation was slow -- so slow that even with caching, it just wasn't up to the task. The EII tools of old likewise lacked many of the data preparation or data manipulation facilities offered by ETL products.

SIDEBAR: BIG DATA PHYSICS [By Stephen Swoyer]
From a traditional data management perspective, the physics of the big data universe might as well partake of those of a black hole. The key thing about the physics of a black hole is that we don't know anything about them. This isn't strictly true of big data: we do know a lot about it. What we know flies in the face of DM orthodoxy, however. The data warehouse (DW) expects to model the data that it consumes in advance; big data platforms were conceived precisely to avoid this requirement. The DW model strictly separates data processing from data storage; big data tightly couples both. The warehouse is grounded in still another assumption: that conditions or requirements won't significantly change. It isn't quite true that big data platforms don't make assumptions about changing conditions, but it is the case that big data platforms are perceived as more agile or flexible than "inflexible" business intelligence or DW systems.
Physicists speculate that the physics inside the event horizon of a black hole actually take the form of a sort of bizarro inversion of our own low-gravity physics. It isn't a stretch to say that the physics of the big data universe have a similar relationship to those of the classical DM model: inverted.
This isn't just a question of theory, either; it has a practical aspect, too. If you want to move just a subset of a 10 PB Hadoop repository, you must still move several hundred gigabytes of data. A lot of organizations have already broken the petabyte barrier: Teradata, for example, has dozens of customers with a PB or more of relational data. This doesn't include the semi-structured information – harvested from social media; continuously streaming from machines, sensors, etc. – that's being consumed as grist for the big data analytic mill. Most of this semi-structured stuff is landing in Hadoop, Cassandra, or in other (typically NoSQL) repositories – where it will remain.
This last is the primary difference between data federation and data virtualization: like federation, DV enables an abstraction layer – but it's able to do a lot more with data. In a sense, DV effectively encapsulates the "T" of ETL in that it's able to manipulate data -- e.g., transforming, recalculating, and so on -- to conform to business views or canonical representations. As a caching technology, some DV tools effectively perform ETL: e.g., extracting data from source systems and loading it -- prepared, manipulated, and transformed -- into cache. DV can also incorporate data profiling and data cleansing routines.
So DV isn't data federation. Why does this matter? According to advocates, it matters because of the daunting physics of the big data universe: at big data scale, it is physically and economically impossible to sustain the ETL-powered DI status quo, with its emphasis on data movement and (in some incarnations) its insistence on a separate staging area for data. It matters, proponents say, because federation provides a practical means of managing or negotiating the physics of big data, while DV, with its built-in support for in-flight data manipulations, also conforms source data so that it can be consumed by BI, analytic discovery, Web applications, or other front-end tools.

Sons (and Daughters) of Anarchy
Big data scale is just part of the problem, said Composite's Besemer. Even in shops that have an enterprise data warehouse (EDW), rogue or non-centralized platforms are a rule, not an exception. Vendors such as QlikTech (developer of QlikView), Tableau Software (developer of Tableau), and Tibco (developer of Spotfire) have successfully marketed their products directly to the line-of-business. This is the analytic discovery use case, which has done so much to contest the centralized authority of the data warehouse.
So, too, has bring-your-own-device, or – in its DM-specific variant – bring-your-own-DBMS. Even though Teradata is an indefatigable champion of the EDW, many Teradata shops have rogue SQL Server instances hanging off of (i.e., siphoning data out of) their Teradata EDWs.
This isn't an anomalous phenomenon. From the perspective of Jane C. Frustrated-Business-Analyst, the allure of Excel-powered discovery has always proven to be irresistible; this – along with the inflexibility of the DW-driven BI model – helped create the conditions for what industry luminary Wayne Eckerson famously called "spreadmart hell."
This could account for the bring-your-own-SQL-Server trend in Teradata environments, too: Microsoft has bundled SQL Server Analysis Services (SSAS) with its flagship RDBMS for the last 15 years; the line of business can easily buy and deploy SQL Server on its own: why not do so? "The idea of being able to
claw back these rogue data marts and making them virtual data marts, even if they're cached, is quite attractive," Besemer indicated.
This is another reason, he reiterated, why DV isn't going to replace ETL -- or most other DM techniques, for that matter: as a technology, DV provides a virtual abstraction layer; it connects to platforms or repositories where they live. Many of these sources will be ETL-fed data warehouses, others will be operational systems, still others will be Web applications or social media services. "I will emphasize again: this isn't a replacement for the DM techniques you already have; it essentially augments those techniques," he commented.
"You may … want to use [DV] canonicals as data services in your business processes as you move forward to a more loosely-coupled architecture. At the top of the data virtualization layer, you have a choice to deliver data in either a relational model that looks like a database, or as a service, in HTTP or JMS. Most customers end up doing both."

Future Shock
Performance was the Achilles Heel of DV back when it was hyped as a data warehouse replacement technology. All DV vendors tend to address performance issues by using caching; most also focus on query optimization, too. The idea is to push joins and queries down into the source database engine; this has the effect of reducing the amount of data that needs to be moved. Radiant Advisors' O'Brien and several other industry watchers have praised Composite's query optimizer. O'Brien -- invoking the unofficial sobriquet of the Oracle database -- even dubbed it the "Cadillac" of query optimizers.
Nevertheless, Besemer acknowledges that there will be cases in which even aggressive caching and a smart optimizer won't be enough to address service level agreements. "[T]his [i.e., the distributed query optimizer] is the hard part. The key to making this work is to reduce the amount of data that has to be moved; in order to reduce this … you have to optimize where the work gets done to reduce [the movement of] that data. Oftentimes, that means pushing joins and queries down to the data source," he said. "In the end, even with the best optimization … if it doesn't run as fast as you need it to run, then you need to consider another possibility. But the idea is to push as much work as you can down to the leaf nodes." (DV has a number of other potential issues, too.)
Kopp-Hensley thinks that virtualization, with its ability (a) to present a unified view of disparate data sources and (b) to conform source data so that it can be consumed by BI tools, (c) to accommodate non-relational or non-ODBC/JDBC APIs, such as REST – to say nothing of (d) its capacity (with caching and a good query optimizer) to support limited federated query scenarios – is a pragmatic solution for "doing" big data physics.
In this respect, she echoes the assessment of Composite's Besemer: ETL – as a process – isn't going anywhere; the data warehouse – as a conceptual structure – will survive and (perhaps even, if only as a virtual canonical representation) thrive – and DV will knit everything together. Does this include IBM's own Big Data Platforms vision? Kopp-Hensley is coy. Virtualization technology (in the form of both Vivisimo and IBM's InfoSphere-branded data federation technologies) does have a role in its Big Data Platforms vision, she stresses. "It's going to evolve," she says, hinting that IBM may – or may not – be planning a virtualization appliance. "It's not going to be snap-your-fingers-and-all-of-these-things-consolidate; these things are going to take some time. The data warehouse is not going away anytime soon, but it is evolving," Kopp-Hensley continues. "We're going to have a heterogeneous environment, so virtualization is starting to make more sense, [as is] the ability to explore [information] where [it lives] and [to] not have the applications care about where the data is. This is the new priority: ideally you want your applications to not care where the data is."
Share your comments >
Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence and data warehousing for almost a decade.
FEATURES
DATA AND THE LAW OF ATTRACTION
[With great flexibility comes greater responsibility -- this is the next challenge.]
JOHN O'BRIEN
FROM A DATA MANAGEMENT (DM) perspective, one of the more interesting aspects of Hadoop's impact on the world of enterprise data management (EDM) hasn't really been about the "three V's and C" (volume, velocity, variety, or complexity). It's been that, in order to achieve grand scalability, the data itself has had to be broken down to its most fundamental element: the key-value pair. Hadoop and other key-value data stores leverage this simplicity -- coupled with the power of the MapReduce framework -- to work with data however the user needs to. This has also given us the neologism "NoSQL," which we know as "Not-Only" SQL -- and which unshackles the users from the structured world of data.
Here's the twist: what we have gained is the default aspect of separating the persisted data from its semantic context. An application or user that retrieves data from flexible key-value stores (like Hadoop) can now determine the context in which they need to work with the data, whether creating a record by constructing key-value pairs into a database tuple or by loading key-value pairs into a similarly constructed object model: the knowledge and responsibility lies with the developer.
This process of abstracting the semantic context for the data is not completely new to data modelers, though it has taken on a much purer form in recent years. Abstraction has occurred at three different levels within the business intelligence (BI) architecture: inside the database through the use of views, links, and remote databases; above the database with data virtualization technologies; and finally, in the BI tools themselves, through the use of meta catalogs and access layers across multiple databases. Each time, this projection (or view of data) contains the semantic context delivered by the administrator.
In Radiant Advisors' three-tiered modern data platform (MDP) framework for BI, tier one is for flexible data management with scalable data stores, like Hadoop, that rely on some form of abstraction for semantic context to
work with data. This flexibility allows users to perform semantic discovery in a very agile fashion through the benefits of meta-driven data abstraction and no data movement. Hadoop leverages Hive -- or HCatalog -- for the semantic definition of tables, rows, and columns familiar to SQL users. These definitions can be created, tested, and modified (or discarded) quickly and easily without having to migrate data from structure to structure in order to verify semantics.
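As an illustration of this kind of disposable semantic definition (a sketch under assumptions: the HDFS path, table, and column names are hypothetical, and PyHive is just one way to submit HiveQL), a definition can be laid over files already sitting in HDFS and dropped again without touching the data:

```python
# Minimal sketch: define, test, and discard a semantic view over data that
# already lives in HDFS. Paths, names, and the PyHive driver are assumptions.
from pyhive import hive

cursor = hive.connect(host="hadoop-edge-node", port=10000).cursor()

# Lay a tabular, SQL-friendly context over existing delimited key-value files.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clickstream_raw (
        user_id   STRING,
        event     STRING,
        event_ts  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/clickstream'
""")

# Test the definition...
cursor.execute("SELECT event, COUNT(*) FROM clickstream_raw GROUP BY event")
print(cursor.fetchall())

# ...and discard it if the semantics are wrong; because the table is EXTERNAL,
# only the metadata goes away and the underlying files stay put.
cursor.execute("DROP TABLE clickstream_raw")
```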
Tier 2 in the MDP framework is designated for data stores that require highly-optimized analytic workloads, such as cubes, columnar, MPP, and in-memory data, and for highly-specialized analytic workloads, such as text analytics and graph databases.
Tier 3, however, is for reference data management, with schemas for storing data in structures that are based upon semantic context. This data tends to be more critical subject areas or master lists, or business event data that needs to maintain a high-definitional consistency for enterprise consumers (and is used with data from the Hadoop world to provide qualifying context to events). While you can store a master customer list, product list, or their key subject area(s) in Hadoop, you have to ensure consistency by tightly governing its abstraction layer, which inherently allows for other semantic definitions to be easily created on the same data. It simply makes more sense -- and carries less risk -- to embed this context into the schema itself to ensure a derived consistency of use in the enterprise.
Following this three-tier framework, BI architects have a more balanced approach to managing the semantic context of data for the enterprise, while still having to make key decisions regarding their architecture. Data stored in Hadoop relies on Hive or HCatalog to store the proper context, while data stored in data warehouses has the context embedded into the schema for derived consistency. Analytic databases can access data from both through the use of projections, views, and links. Data virtualization provides an abstraction layer for users around all three data tiers. The answer for BI architects is to blend these abstraction approaches and to focus on governing semantic context carefully, rather than taking it for granted.
Radiant Advisors has been working with companies to extend existing (or create new) data governance processes that handle the concept of a new Semantic Context Lifecycle for data and analytics management.
• Step 1 begins with "Semantic Discovery" in flexible environments, at the hands of business analysts or data scientists, to define new context in the abstraction layer or in MapReduce programs. This should be dependent upon a proper definition of new data governance roles (such as data scientists) and corresponding responsibilities, accountability, and delegation rights.
• Step 2, "Semantic Adoption," is for the data governance process to evaluate and decide whether the discovered context needs to be governed and consumed in temporary or permanent -- and local or enterprise -- context.
• Step 3 involves deciding how and where context needs to be governed -- this can remain in the Hadoop abstraction layer for a defined set of users, or BI projects can deliver the context via ETL or data virtualization to the reference data tier.
With great flexibility comes greater responsibility, and this is the next challenge for data governance and big data. Our best practice is a fundamental "law of attraction" as to whether the semantic context should be attracted closest to the data as schema, or whether it should be attracted closer to the end user in one of the layers of abstraction.
The law of attraction is simple: the more need there is for enterprise consistency and mass consumption, the closer the semantic context should be to the data, as schema to be inherited by all consumers. Likewise, when there's more of a localized context or perspective of the data involved, then context should reside in one of the abstraction layers. Strong governance and standards are the same whether the data persists in key-value data stores or structured schema data. Governing semantic context is a function of its role within the enterprise, and this should allow BI architects to blend fixed schemas and abstraction layers while relying on data owners and stewards to make key determinations.
Share your comments >
John O'Brien is the Principal and CEO of Radiant Advisors, a strategic advisory and research firm that delivers innovative thought-leadership, publications, and industry news.
VENDOR
GAIN BIG ADVANTAGE WITH DATA VIRTUALIZATION ROBERT EVE
Business agility requirements and the proliferation of big data and cloud sources outside the enterprise data warehouse are challenging traditional data integration techniques, such as consolidation of summarized data into the EDW with ETL. As a result, more modern data integration approaches, such as data virtualization, are now seeing accelerated adoption. In 2012, Gartner surveys showed that approximately 27% of respondents were actively involved in, or had plans for, deployment of federated or virtualized views of data. This year, the TDWI Best Practices report Achieving Greater Agility with Business Intelligence revealed even higher data virtualization adoption statistics, with 19% currently in use and 31% planning to implement.

What is Data Virtualization?
Data virtualization is an agile data integration approach organizations use to gain more insight from their data. Unlike data consolidation or data replication, data virtualization integrates diverse data without costly extra copies and additional data management complexity.

Why Use Data Virtualization?
With so much data today, the difference between business leaders and also-rans is often how well they leverage their data. Significant leverage equals significant business value, and that's a big advantage over the competition. Data virtualization provides instant access to all the data you want, the way you want it. Enterprise, cloud, big data, and more: no problem! With data virtualization, you benefit in several important ways:
Gain more business insights by leveraging all your data – Empower your people with instant access to all the data they want, the way they want it.
Respond faster to your ever-changing analytics and BI needs – Five- to ten-times faster time-to-solution than traditional data integration.
Fast track your data management evolution – Start quickly and scale successfully with an easy-to-adopt overlay to existing infrastructure.
Save 50-75% over data replication and consolidation – Data virtualization's streamlined approach reduces complexity and saves money.
Who Uses Data Virtualization?
Data virtualization is used across your business and IT organizations.
Business Leaders – Data virtualization helps you drive business advantage from your data.
Information Consumers – From spreadsheet user to data scientist, data virtualization provides instant access to all the data you want, the way you want it.
CIOs and IT Leaders – Data virtualization's agile integration approach lets you respond faster to ever changing analytics and BI needs and do it for less.
CTOs and Architects – Data virtualization adds data integration flexibility so you can successfully evolve and improve your data management strategy and architecture.
Integration Developers – Easy-to-learn data virtualization design and development tools allow you to build business views (also known as data services) faster to deliver more business value sooner.
IT Operations – Data virtualization's management, monitoring, security and governance functions ensure security, reliability and scalability.

When To Use Data Virtualization?
You can use data virtualization to enable a wide range of applications.
Agile BI and Analytics – Improve your insight sooner.
Data Warehouse Extension – Increase your return on data warehouse investments.
Data Virtualization Architecture – Gain information agility and reduce costs.
Data Integration – Integrate your Big Data, Cloud, SAP, Oracle Applications and other sources more easily.
Logical Data Warehouse – Modernize your information management architecture.
Business and Industry Solutions – Support your unique information needs.

Data virtualization is not the answer to every data integration problem. Sometimes data consolidation in a warehouse, along with ETL or ELT, is a better solution for a particular use case. And sometimes a hybrid mix is the right answer.

Robert Eve leads marketing for Composite Software. His experience includes executive level marketing and business development at leading enterprise software companies.
UPCOMING INDUSTRY EVENTS

SIDEBAR: DATA VIRTUALIZATION EXPLAINED [By Stephen Swoyer]
Think of data virtualization (DV) as kind of like iTunes – for distributed data.
Like iTunes, DV has several different aspects: connectivity, preparation, and presentation.
On the connectivity tip, iTunes technically doesn't care where the actual files in your music library reside. It wants to consolidate most of your media files into a single central location – but it doesn't have to. It's possible to store media content on removable devices, on network storage, or – even – across a virtual private network (VPN).
Ditto for DV, which uses federation technology to enable a kind of abstraction layer: a DV tool doesn't have to care if your data's sitting locally on an internal LAN or SAN – or 1,000 miles away, in a remote office or business campus. (It doesn't have to, but it does – that's where DV-specific performance optimizations, such as query optimization and caching, come into the picture.) The emphasis with DV is on minimizing data movement; iTunes, by contrast, moves all of the bits of a source file – whether it's stored locally on the LAN or (in the case of a VPN) sitting on a file server in Cupertino.
When it comes to presentation, both iTunes and DV are all about the view.
This is kind of how DV technology works, too: you use it to build business views, which are canonical representations of the tables and columns in a source database. One key difference is that iTunes is its own front-end application – its own Tableau or QlikView, so to speak.
Unlike iTunes, "presentation" in a DV context is completely virtual. DV technologies are middlemen: you build a view in a DV tool with the expectation that it's going to be consumed by another application – typically one which speaks ODBC, JDBC, or SQL.
In both cases, a lot's happening beneath the covers. When iTunes builds a library view, it basically catalogs your media files, reading and caching metadata about them. It looks for specific metadata, such as "Artist," "Album Title," "Track Title," and so on.
When it connects to a data source, a DV tool does more or less the same thing: it doesn't replicate the contents of a source database table – just information about its structure. You can choose to create a representation of the complete structure of a source table – e.g., all of its rows and columns – or, more commonly, just a subset of this structure. You can (and, in the latter case, must) "tell" a view how you want to prepare or manipulate data: you can concatenate it, change the names in a tablespace, perform statistical functions, etc. You can also "stack" views in a DV tool to build more complex (i.e., "composite") views, or to "blend" data from multiple sources.
Before data can be consumed, it first has to be prepared. DV and iTunes do similar things on the preparation tip, too. In the case of iTunes, your media library is a collection of heterogeneous files: you'll have stuff encoded in AAC (M4A), stuff encoded in AVC (MP4), and so on. Some of these files could be sourced from a CD, which means they'll be encoded at 16-bit resolution with a 44.1-kHz sampling rate. Other files could be sourced from a DVD, which means they'll be encoded at a range of different resolutions and sampling rates.
Your computer's audio hardware might not be able to reproduce these files. It might also expect to consume audio at a certain sampling rate. In other words, iTunes has to be able to transform audio data so that it conforms to a format that's supported by your computer.
DV isn't a silver bullet. Like any other technology, it has costs and benefits. Performance is one common cost – particularly with respect to federated query scenarios. Adopters also need to be mindful of DV's impact on operational systems. And while DV technologies do encapsulate ETL or data quality functionality, their capabilities in this regard tend to be comparatively immature, at least with respect to best-of-breed ETL offerings.
RADIANT ADVISORS
RESEARCH ... ADVISE ... DEVELOP
Radiant Advisors is a strategic advisory and research firm that networks with industry experts to deliver innovative thought-leadership, cutting-edge publications and events, and in-depth industry research.
rediscoveringBI | Follow us on Twitter! @radiantadvisors