Business Intelligence Vol. 7, No. 1
Data Architecture, Data Warehousing, and Master Data Management by Ken Orr, Fellow, Cutter Business Technology Council
In large organizations everywhere, there is increasing interest in data architecture and associated issues such as data warehousing strategies and master data management. This Executive Report addresses the nature of data architecture, how “operational data” and “informational data” differ, and where data warehousing fits in. The report asserts that over the last 30-40 years, there have been a number of profound developments in data architecture: the emergence of database management systems; the emergence of relational database systems; and the emergence of data warehousing. Additionally, we discuss how master data management systems and data hubs are becoming mainstream ways of linking vital information.
Print, film, magnetic, and optical storage media produced about five exabytes1 of new information in 2002. Ninety-two percent of the new information was stored on magnetic media, mostly in hard disks. [6]
Data architecture is one of the most important things that any IT organization can do, because data is one of the principal things people want from IT, and ultimately, data is what top management is paying for. Unfortunately, for a variety of reasons, IT has never quite lived up to its commitment to provide the right information to the right people at the right time. Survey after survey shows that top
managers are sure their organizations have the data they want somewhere in the organization, but they don’t know how to locate, manipulate, and extract it. Over the last four decades, this data access problem has been attacked multiple times. In the 1960s and early 1970s, the answer was database management systems (DBMSs). Then in the late 1970s and early 1980s, it was 4GLs, spreadsheets, and executive IS. In the 1990s, it was multidimensional databases, business intelligence (BI), and data warehousing. Each of these approaches has solved part of the overall information access problem, but getting at the right information in near real time
is still tough. Even in old-line companies, closing the books at the end of a month or the end of a quarter can take too much time. For a long period, nonoperational IS was referred to as “end-user computing.” This term was occasioned by the notion that the operational folks who enter the raw transaction information were “information providers” or “information stewards,” while those who used the data for nonoperational functions were “information users.” Historically, end-user computing was built one user community (marketing, product development, R&D) at a time, and, over time, this created an enormous number of problems.
1 How big is five exabytes? If digitized with full formatting, the 17 million books in the Library of Congress contain about 136 terabytes of information; five exabytes of information is equivalent in size to the information contained in 37,000 new libraries the size of the Library of Congress’s book collection.
The biggest problem with the bifurcation of operational and informational data is the difficulty in getting what top management often refers to as a “single view of the truth.” In most cases, large organizations have many different views of the truth. When managers obliquely refer to data quality problems, what they’re saying is that they don’t know which view of the truth to rely on (e.g., the one from the sales department or the one put together by the folks in accounting). The demand for more corporate transparency brought on by Sarbanes-Oxley (SOX) and Basel II has raised the stakes for improving data quality. Because top managers are required to ensure that the public view of the truth (i.e., public financial reports) is correct, they are now willing to invest more in their underlying information infrastructure. The problem is that if you don’t know which version of the truth corresponds with the real world (and it often turns out that none of the various truths is correct), then it is hard to swear by any set of numbers. Moreover, the Internet and electronic communication have vastly complicated our already vexing data problems. In addition to facilitating access to the structured data that exists in enterprise databases,
organizations everywhere now handle increasingly larger amounts of unstructured data, including e-mail, attachments, voice mail, multimedia, Web content, and so on. What started as a trickle has turned into a flood. Today, as organizations become truly paperless, the information that used to be on paper is now in electronic form, and something must be done with it. Most formal communication these days is via e-mail and attachments, so much so that the unstructured data in an average organization is growing much faster than its structured data, with fewer good processes and tools to manage it.
To make matters worse, structured business data (transactions, product updates, customer updates) is managed by one group (database administration) while unstructured data is often managed by another (network administration). Clearly, this situation requires new approaches and new governance.

As organizations become more serious about managing their data — all of their data — new approaches and technologies get invented, tested, and deployed. Tools for accessing, integrating, and rationalizing data from lots of disparate sources are perfected.

Data architecture is one new approach for solving large enterprise data problems and issues. Data architects are concerned with all the data in the organization and how the various data pieces fit together with all the other pieces. The discipline is concerned not just with the massive redundancy that exists in most organizations but, more importantly, with the massive inconsistency brought on by that redundancy. Although data architecture is a new box on the organizational chart, it is not a new function; indeed, various groups and individuals in our organizations have been trying to manage data for decades under difficult circumstances. But data architecture has become such a critical issue that it is now a top management concern, and as a consequence, the function is being pushed toward the top of the org chart. Data is the lifeblood of the electronic real-time enterprise, and data management is the name of the game.

There tends to be a circular pattern in systems development in which developers, working for different users, develop a series of systems. Then, over time, the smart ones notice that certain major data components are constantly reappearing. For example, in manufacturing applications,
there is always a need for “customer data” and “product data,” and in HR applications there is a need for “employee data” and “job data.” In purchasing, there is always a need for “vendor data” and “vendor product (asset) data.” This discovery was first made in the early days of the mainframe and ultimately resulted (in the 1960s) in the development of the first serious DBMSs (e.g., GE’s Integrated Data Store, CINCOM’s Total, and IBM’s Information Management System [IMS]). These databases were built around the idea that, while different systems often did different things, most systems required access to a common set of core data and that it didn’t make sense to have dozens of “customer files” or dozens of “product files.” The first systems that utilized common database software were dramatic improvements over previous applications. Instead of creating new definitions for “customer,” “product,” or “order,” it was possible to reuse definitions in developing new applications. Better yet, DBMSs allowed developers to use data that already existed and, if certain data elements were missing, to simply add what was new (attributes, tables, etc.) to create subsequent applications. At least that was the plan. But in reality, no matter how sensible the idea of common databases with common data element definitions (metadata), the idea
created a new class of project management problems. Adding a second application to a planned central/corporate database that was already in operation wasn’t too difficult. If a field or file was required in order to satisfy the second application, then so be it. Fields, and sometimes completely new data files, were added or modified to meet these new demands. But as more and more applications were added, there was a need for lots of traffic cops to keep all the changes from screwing up one another and, most importantly, to keep from screwing up all the applications already in operation. So what happened was a second cycle of fragmentation. Project managers began to clone existing databases rather than integrate their database with all the others. This meant that the new application could be developed, tested, and installed without worrying about knocking over the existing house of cards. In practice, it made the project manager’s job easier, but in the end, it made many other things worse. Instead of a single set of customer data, there were once again multiple sets of customer data using, at least initially, mostly the same data definitions (e.g., “customer.no,” “customer.name”). Once again, there were multiple copies of the same data, and instead of sharing common databases, data was transferred from one application to another largely through
interface (data exchange) files. The more common the data, the more copies. The more copies, the more views of the truth. By the end of this second cycle of fragmentation, which occurred toward the end of the 1980s, a catalog of common data problems existed in nearly every large group of computer users in the developed world. To solve these problems, a new approach was invented. The result was the move to a new data architecture for informational data: data warehousing. In general, data warehousing projects cost more and took longer than most people expected. But in most organizations, real headway took place, and today, most large organizations have made major strides with their data warehousing initiatives. But there is no resting on past laurels; competitive pressures require even better data, especially data about our major products and services and about our major customers, vendors, etc. The new buzzword for projects that build customer and product hubs is “master data management” (MDM). In a sense, MDM is a new name for an old concept. Since the earliest days of mainframe computers, systems developers and database designers have been stumbling over the same basic truth: namely, that the most important data components correspond to the most important business entities. This
discovery provided the spark that has triggered some of the most important developments in the history of software. Now, after nearly 20 years of data warehousing — though significant improvements have been made — a third cycle of fragmentation has occurred within the industry. While some progressive organizations have built unified data warehousing solutions, a larger number have opted to develop quasi-independent data marts under the cover of creating a data warehousing architecture.

In this Executive Report, we first explore the fundamental concepts behind data architecture. We then examine some of the challenges that data architecture faces and some of the drivers and enablers that are pushing it forward. Finally, we look at the solutions enterprises are utilizing to help better manage their data.

FUNDAMENTAL IDEAS BEHIND DATA ARCHITECTURE

Before we begin discussing the basic concepts, I think it is important to get some things straight. The first thing to note is that data architecture is not just drawing bigger data models. Data architecture is about all the data and information that the organization keeps. It is mostly about the care and feeding of data/information kept in electronic form, but it includes the data in whatever form is important to the organization.

Data architecture is not just records management for electronic records, either. There is something fundamentally different about data architecture in this era of computers and communication. Today, someone can easily make off with the entire customer master file on a small laptop. Information that once filled floor upon floor of file cabinets currently occupies only a couple of disk drives. We have at our disposal more computing power than anyone could have imagined a few decades ago, but we also live in an age where a knowledgeable person armed with a desire to do harm to our organization could wreak havoc.

Data architecture in most organizations is a new discipline. It involves thinking broader and longer than we have in the past. It involves worrying more about the future, raising inconvenient questions, and taking risks. Throughout the history of IS, thinking about data has produced enormous benefits. No single tool has proved as valuable over time as DBMSs, but we are now in a period in which unstructured data is rapidly overwhelming our ability to understand, model, and manage it, and our tools have not yet caught up. But more important than tools, we need people with broad interests and experience who can think seriously about the future.

Elsewhere I have written that databases are not about tables, attributes, or relationships; rather, they are about the real world. A database, by its very essence, is a model of the real world, and therefore, the quality of a database is, at its heart, a function of how well our electronic models mirror what’s going on outside.

Over the years, it has occurred to me repeatedly that the key to data architecture is knowing what is important in our model of the real world. A major systems application, for example, may include two tables or 300 tables, but only about 10% of those tables are significant; the rest are typically reference tables that we use to make sure our data is correct and consistent. But it is that 10% that represents our core strategic data and therefore our data architecture. Ultimately, coming up with a real-world data architecture comes down to semantics — business semantics.

Business Semantics
I wrote an Executive Report entitled “Business Semantics,”2 which dealt with what is perhaps the most important thing I’ve learned in 30-plus years of software development — the critical role of semantics and ontology in IS and data architecture.
2 See my Business Intelligence Executive Report “Business Semantics” (Vol. 5, No. 7).
Semantics is the study of meaning, and ontology is the study of what exists in the real world. While terms like semantics and ontology may seem overly philosophical to today’s businesspeople and software developers, these subjects are as relevant now as they were 2,000 years ago when the Greeks first proposed them.3 Data architecture, databases, data warehouses, and MDM are all about the real world and the things (and classes of things) in the real world that affect our businesses. Enterprises keep data about products and customers, because products are what the enterprise sells, and customers are the real-world entities they sell them to. The same is true with purchased products, vendors, jobs, services, and employees.

Too many people think of data design as largely a technical or formal activity, which it is partially, but first of all it is about the things in the real world — the actors, transactions, events, locations, roles, organizational structures, and relations.

Actors and Messages

In my work on systems and data design over the last 30 years, one thing has increasingly stood out: all of the major characteristics of business IS are ultimately related to a small class of categories (classes of entities) that have to do with the business context for a given enterprise. One would like to suggest that this insight was rationally worked out top-down from first principles, but like most really important insights, this was the result of discovering a tool for doing requirements that focused on context. In the late 1970s, my colleagues, clients, and I began using something we now call context diagrams as the starting point for doing requirements definition (see Figure 1).

We were looking for some way to get groups of systems users to help us quickly understand a given systems domain. What we came up with was something we initially called an entity diagram (what we now call a context diagram). On these diagrams, there were only two classes (categories) of entities: (1) bubbles, which represented “actors,” and (2) arrows, which represented “messages.” Context diagrams are stylized communication diagrams in which actors communicate by sending and receiving messages. Actors are individuals, organizations, or systems. The distinguishing characteristic of actors is that they can autonomously send and receive messages.4 In this environment, actors “act.” They do something; more specifically, they communicate. As we will see later in this report, key actors are often the folks who are most important to our enterprise (e.g., customers, vendors, employees, and shareholders).

Messages are what are communicated. Messages can be in any physical form: documents, packages, letters, e-mail, voice, or multimedia. In an IS context, key messages are often the business transactions that make up the system (e.g., orders, shipments, bills, payments, and returns). On the surface, context diagrams are elegantly simple, but they are also amazingly effective. As far as we can tell, they have been used for decades and have been reinvented dozens of times. Context diagrams only show actors and messages, but we have found that they are also excellent guides to uncovering the other semantically important entities as well.

[Figure 1 — Business context diagram. The actors shown include Customer, Order entry, Credit manager, Sales manager, Warehouse, and Accounting; the messages include order, entered order, approved order, shipping notice, delivered equipment, billing notice, invoice, and payment.]

3 The earliest philosophers were concerned with the relationship between our thoughts, language, and the real world, but the first organized ontology was propounded by Aristotle in the 4th century BC. In a series of seven books — historically called the Organon (works) — Aristotle set forth the first definitive ontology and the first organized rules of logical and scientific reasoning. In the Organon, Aristotle defined three things that are now recognized to be part of any well-formed ontology: a set of categories (classes of entities), rules for forming logical sentences (propositions) using the categories, and a set of inference rules for reasoning from a given set of propositions.

4 Actors are very similar to agents in agent technology.
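As an illustration only, the sketch below captures the two categories on a context diagram, actors and messages, as simple data structures. The actor and message names come from Figure 1; the Python classes and the particular sender/receiver pairings are my own assumptions about one plausible reading of the diagram.

```python
# Minimal sketch: the two categories on a context diagram as data.
# Actor/message names are from Figure 1; the routing is an assumption.
from dataclasses import dataclass

@dataclass(frozen=True)
class Actor:
    name: str                  # an individual, organization, or system

@dataclass(frozen=True)
class Message:
    name: str                  # what is communicated
    sender: Actor
    receiver: Actor

customer    = Actor("Customer")
order_entry = Actor("Order entry")
credit_mgr  = Actor("Credit manager")
sales_mgr   = Actor("Sales manager")
warehouse   = Actor("Warehouse")
accounting  = Actor("Accounting")

context_diagram = [
    Message("order",               customer,    order_entry),
    Message("entered order",       order_entry, credit_mgr),
    Message("approved order",      credit_mgr,  sales_mgr),
    Message("shipping notice",     sales_mgr,   warehouse),
    Message("delivered equipment", warehouse,   customer),
    Message("billing notice",      warehouse,   accounting),
    Message("invoice",             accounting,  customer),
    Message("payment",             customer,    accounting),
]

# Every message names exactly two actors, so the key actors of the domain
# fall out of the message list automatically.
actors = {m.sender for m in context_diagram} | {m.receiver for m in context_diagram}
print(sorted(a.name for a in actors))
```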
Objects and Subjects

Any context diagram is about something; that something is what we call the object or subject. The objects involved in business semantics are the objects of the communication that goes on between the major actors. Consider the case where our business (the enterprise) does business with a customer. Here, the object is normally some kind of product or service. In modern terms, for a legal business exchange to take place, a number of messages have to go back and forth (e.g., order and shipment). What is
ultimately being exchanged are the objects of the communication — in this case, product and money. Objects can be something other than products or services. In the case of buying or selling real estate, the objects are parcels of land and improvements (e.g., houses and barns). In the case of employment, the object is a job or position. And in the case of intellectual property, music, or publishing, the objects are likely to be patents, copyrights, or trademarks. Finally, objects can be people. When that is the case, we refer to them as “subjects.” A few years ago, I worked on the requirements for a juvenile justice system, in which the juveniles were both actors in the system and the objects of the system at the same time. In corrections, welfare, medical, and any number of case systems, the objects of the communication (and therefore of the system) are the subjects that the agency deals with or is responsible for.
Events

In business semantics, the sending of a message is triggered by an event, and the receiving of a message is signaled by yet another event. Business events tend to fall into two subclasses: periodic and aperiodic. Common periodic events in the financial world are triggered by the end of month, end of quarter, or end of year. Aperiodic events might be a car accident, death, birth, customer order, or any other nonpredictable happening.

Locations

It is important to know where an actor or object is located or where a message should be sent; locations define physical or logical addresses. Physical locations are often referred to as addresses or coordinates. Historically, physical locations have been thought of as relatively static things, but with the advent of technologies like GPS and wireless communications, more and more things (actors, objects, subjects) will have transient addresses that tell you where they are right now (or at least where they have been recently). Logical locations are things like e-mail addresses, Web addresses, or telephone numbers. Historically, things like telephone numbers indicated physical areas, but with cell phones, modern switching systems, and VoIP, unless you’re the service provider, you have no way of knowing where somebody actually is.
Context
Part of what natural language does well and computer languages do poorly is taking into account the “context” we talked about previously. Because there are more things in the world than there are words, human language frequently uses the same word or phrase to mean different things in
different contexts or uses different words to refer to the same thing (or person) in different contexts. Identifying context turns out to be really important and difficult, since context can be both subtle and tricky. Ludwig Wittgenstein spent decades trying to tease out a common definition of “game” from all of the different ways that the term is used in normal language, and he was never really successful (which was his point, after all).

Roles
Perhaps the most significant topic of ongoing debate in business semantics today is the increasing importance of “roles.” When we talk about individuals and organizations by themselves, it is normally clear what we’re talking about. However, when we refer to a “customer” or an “employee,” these terms are highly dependent on context. For example, if we are talking about an individual walking around a department store, whether that person is a “customer,” an “employee,” or a “security guard” depends on whether he or she is buying products, selling products, or watching people. In business semantics, we begin by defining actors and segue into defining roles that actors play by the time we are done developing business process (workflow) diagrams.
Relationships
General relationships are important semantic categories as well. The most important things to people are their family, business, and social relationships. Organizational structure is usually modeled as relationships. And where humans are good at reasoning based on relationships (e.g., If Y is the son of Z, and X is the daughter of Y, then … ), computers have not been so good historically.
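To make that concrete, here is a tiny sketch (my illustration, not the report’s) of the kind of relationship reasoning people do effortlessly: deriving a grandparent fact from stored parent facts. The names and the parent_of mapping are invented.

```python
from typing import Optional

# child -> parent facts, mirroring the example in the text:
# "Y is the son of Z, and X is the daughter of Y"
parent_of = {"Y": "Z", "X": "Y"}

def grandparent_of(person: str) -> Optional[str]:
    """Follow two parent links; return None if either link is missing."""
    parent = parent_of.get(person)
    return parent_of.get(parent) if parent else None

print(grandparent_of("X"))  # -> 'Z': X is a grandchild of Z
```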
Business Exchanges, Processes, and Relationships

The final group of categories consists of extensions of the ones that have already been covered. Business exchanges are modeled by classic bartering transactions in which both actors get something.5 In today’s world, a complete business exchange often includes a series of messages or transactions (e.g., order, shipment, bill, and payment). Major business processes (also referred to as value chains) often relate to real business exchanges (e.g., order to collection, and requisition to acquisition). Finally, business relationships refer to all of the business exchanges between two business partners over their entire history.6
Some Comments on Business Semantics
Every Journalism 101 student is taught to ask six basic questions: Who? What? Where? When? Why? How? These fundamental questions reflect the recognition that the real world has only a small number of important categories that everything else revolves around. The study of business semantics has reinforced the importance of these major categories: actors, messages, objects, events, and locations. In every business application, the semantic categories dominate the important physical data organization. Actors refer to customers, objects refer to products, and messages refer to orders, shipments, bills, and payments. MDM is, after all, about the most fundamental actors and objects. Within enterprise data architecture, business semantics play an increasingly important role. While no one can possibly understand all the various tables and files that exist in a large organization, they can come to understand what the major actors, messages (transactions), and objects are, and those are the highest-level categories that the data architect has to deal with. The major entities then help us set boundaries.
5 In legal terms, a contract between two parties (legal persons) is not binding unless both parties get some consideration. That’s why, if you are using a contract even to give something valuable to someone else, they have to pay you a nominal amount (e.g., $1) as consideration.
6 While there has been a great deal of interest in customer relationship management (CRM) over the years, very few CRM systems actually attempt to bring together all of the interactions that the enterprise and the customer have had over the years. There is, however, much more interest in doing just this in many of the customer hub implementations that are underway.
Attributes, Views, and Databases
The previous section discussed the fundamental real-world semantic basis for data architecture. In this section, we discuss a small set of fundamental issues and confusions surrounding database management. One of the most fundamental confusions that hinders communication among users, database designers, and developers is the distinction between attributes, views, and databases. Over and over again, throughout the history of systems development, this confusion has kept organizations from developing sound databases and data architectures.

Attributes, Collections, and Identifiers
Attributes (data fields) are elements of data that represent some property of a business semantic entity (e.g., “name,” “address,” and “age”). A derived attribute is a value computed from other pieces of data in a given context. A record is a set of attributes belonging to a common subject.7 Identifiers are attributes that identify the subject of a record.

Views
A view is a structured (normally hierarchical) set of data attributes that a user sees (or wants to see)
in some report, screen, or other representation.8 In an ideal world, users would be able to define just the views that they need in their systems without any concern for what data attributes or database structures already exist or, conversely, use existing metadata to select existing attributes and add any that don’t currently exist.

Databases
Essentially, a database is the logical sum of all the views for a given application domain. In an ideal world, the logical database would be automatically derived from all the views defined.
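As a rough illustration of that idea (mine, not the report’s), the following sketch derives a logical database from a handful of view definitions by taking, for each subject, the union of the attributes every view requests. The view, subject, and attribute names are invented.

```python
# Each view lists the attributes a user wants to see about some subject;
# the logical database is the union of those requests, subject by subject.
from collections import defaultdict

views = {
    "order_entry_screen": {"customer": {"customer_no", "name", "credit_limit"},
                           "order": {"order_no", "order_date", "total"}},
    "invoice_report": {"customer": {"customer_no", "name", "address"},
                       "order": {"order_no", "total", "tax"}},
}

logical_db = defaultdict(set)
for view in views.values():
    for subject, attributes in view.items():
        logical_db[subject] |= attributes   # merge the attributes each view asks for

for subject, attributes in sorted(logical_db.items()):
    print(subject, sorted(attributes))
# customer ['address', 'credit_limit', 'customer_no', 'name']
# order ['order_date', 'order_no', 'tax', 'total']
```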
Data (Information) Organization Issues

Working down from business semantics and fundamental concepts, a data architect needs to understand the basic approaches to organizing data. Among the most popular approaches are network, hierarchical, relational (tabular), object-oriented (OO), inverted file, multidimensional, XML, and spreadsheets.

Network Databases
Network-oriented databases were one of the earliest approaches to organizing large amounts of data. The first industry database standard, for example — proposed by the CODASYL Consortium — was
a network database. Over time, network databases lost out, first to hierarchical databases and then to relational databases. But network databases have always excelled in flexibility, and the amazing growth and adaptability of the Internet is due in large part to the fact that it is built on something very much like a simple network database. The problem with network databases is the difficulty in handling updates, especially deleted nodes, which is the Internet’s biggest problem as well.

Hierarchical Databases
Hierarchical databases are another popular method of organizing data. If user views are hierarchical, the reasoning goes, why not store the data in the same sequence as the views? IBM’s IMS is perhaps the most famous hierarchical database. Historically, hierarchical databases have always provided high performance in terms of transactions per second. The problems with hierarchical organization have to do with flexibility and with the ability to support large numbers of multiple views at the same time. Both network and hierarchical databases are sometimes referred to as navigational databases because they require the user (i.e., the programmer) to navigate through the database to find information or create a view. Large numbers of network and hierarchical databases exist today in legacy applications, and hierarchical organization has been resurrected by the OO folks.

7 Here, records refer to logical, not physical, sets of data.

8 A view can be a series of sounds (e.g., voicegram) or tactile (e.g., braille), but logically, the data structure remains the same.

Relational Databases
Certain database researchers considered navigational databases merely a stepping stone to a more logical and mathematical approach for defining, updating, storing, and retrieving information. Edgar (Ted) Codd was the leader of a group at IBM who came up with what they considered a better way of database organization. In 1970, Codd published a landmark paper entitled, “A Relational Model of Data for Large Shared Data Banks,” which set forth a number of basic principles. The most significant were: (1) data independence from hardware and storage implementation and (2) automatic navigation, or a high-level, nonprocedural language for accessing data. Instead of processing one record at a time, a programmer could use the language to specify single operations that would be performed across the entire data set. [1]
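A small sketch may make that contrast concrete. It is my illustration, with invented table and column names, not an example from Codd’s paper: the loop navigates and filters record by record in application code, while the single SQL statement describes the whole result set and leaves the navigation to the DBMS.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT, region TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Acme", "EAST"), (2, "Globex", "WEST"), (3, "Initech", "EAST")])

# Record-at-a-time style: the program does the navigating and the filtering.
east_customers = []
for customer_no, name, region in con.execute(
        "SELECT customer_no, name, region FROM customer ORDER BY customer_no"):
    if region == "EAST":
        east_customers.append(name)

# Set-at-a-time style: one nonprocedural statement specifies the result.
east_customers_sql = [name for (name,) in con.execute(
    "SELECT name FROM customer WHERE region = 'EAST' ORDER BY customer_no")]

assert east_customers == east_customers_sql
print(east_customers_sql)   # -> ['Acme', 'Initech']
```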
Relational databases are made up of a number of tables, each of which has a variable number of rows (records), with every row made up of the same set of columns (attributes). Associations between tables are created through operations on common attributes
(identifiers, also called keys). The elegance of this construction allowed the development of mathematically provable operations on data, as well as a mathematical approach to database design. The database design approach was called normalization, which provided rules for developing a database from a set of views. Others have taken the same idea to create approaches for the automatic design of relational databases. Relational databases began as research projects in the 1970s and became commercial tools in the 1980s. By the mid-1990s, relational databases had overcome a number of performance and integrity problems to become the dominant form of database orientation. The vast majority of large databases developed over the last 15 years have been on relational databases. This is still true today, though new database approaches are increasingly popular.

OO Databases
At about the same time that relational databases were beginning to replace network and hierarchical databases as the principal form of commercial database, OO programming was beginning to gain popularity. Instead of being built on a simple tabular data model, OO programming and design is built on a set of object-class hierarchies. Object-class definitions contain not only data but also “methods” (behavior/
programs). The use of object data structures has led the way to the development of much more complex data types (structures) that, in turn, have made it possible to operate easily upon a wide variety of complex sets of information (documents, sound, video, and multimedia). Today, most content management systems employ some form of OO database. While OO has become the preferred programming approach, most large commercial databases are still stored and maintained on one of the more popular relational database platforms (Oracle, IBM DB2, Microsoft SQL Server, Informix, etc.). This has created what is referred to as the “relational-OO impedance mismatch.” Though more complex, OO databases share both the advantages and disadvantages of hierarchical databases. As my colleague Arun Majumdar has phrased it, “OO databases are good vertically but not so good horizontally.” Simply put, objects work well where the data is naturally hierarchical and not so well where the natural organization is some form of network or multiple simultaneous hierarchies. In something of a step backward, OO databases are very navigational in nature, which makes it difficult to build a common query language that will operate over large numbers of different objects as SQL can do over hundreds or thousands of tables in a very large database.
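The mismatch is easier to see in a tiny sketch (my illustration, with invented class and table layouts): the program works with a nested object graph, while a relational store wants flat rows linked by keys, so a mapping layer has to translate between the two.

```python
from dataclasses import dataclass, field

@dataclass
class OrderLine:
    product: str
    quantity: int

@dataclass
class Order:
    order_no: int
    customer: str
    lines: list = field(default_factory=list)

def to_rows(order: Order):
    """Flatten one object graph into an 'order' row plus 'order_line' rows."""
    order_row = (order.order_no, order.customer)
    line_rows = [(order.order_no, l.product, l.quantity) for l in order.lines]
    return order_row, line_rows

o = Order(1001, "Acme", [OrderLine("widget", 5), OrderLine("gasket", 12)])
print(to_rows(o))
# ((1001, 'Acme'), [(1001, 'widget', 5), (1001, 'gasket', 12)])
```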
Inverted-File Databases
Inverted-file databases were created to deal with unstructured document-oriented data. Inverted-file databases operate by: (1) indexing every word in every sentence of every document; (2) excluding noise words (a, and, or, the, etc.); (3) sorting all the word references alphabetically; and then (4) querying data by looking up words and phrases. This last trick is accomplished by applying logical operations against those indexed words (or common phrases) to retrieve all the appropriate sentences and documents where those words or phrases are used. Inverted-file technology is still a base technology of all the popular search engines.9
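A minimal sketch of that mechanism (mine, not a description of any particular product) follows: every non-noise word is indexed to the documents containing it, and a query becomes a logical set operation over those index entries. The documents and noise-word list are invented.

```python
from collections import defaultdict

NOISE = {"a", "and", "or", "the", "of"}

documents = {
    "doc1": "the customer sent the order",
    "doc2": "the order and the invoice",
    "doc3": "payment of the invoice",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        if word not in NOISE:
            index[word].add(doc_id)          # word -> documents containing it

# Query: documents containing both "order" AND "invoice" (a logical operation
# applied to the indexed word lists).
print(sorted(index["order"] & index["invoice"]))   # -> ['doc2']
```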
Multidimensional Databases

Historically, databases used for end-user (analytical) applications were largely hierarchical. In the 1980s, a number of multidimensional databases became popular to support the need to be able to quickly analyze and display management information. These databases were built around the ability to store and quickly retrieve data organized in multiple hierarchical dimensions at the same time (e.g., geographic region, product family, time). Such systems, which usually were built around proprietary data structures, made it
possible for end users to instantaneously drill down or collapse layers in these hierarchies. Conceptually, multidimensional databases can be thought of as very large matrices. While multidimensional databases are popular with end users, they take a lot of time to build. Hardware and software advances have hidden much of this background work, but large numbers of dimensions are still a problem. As the number of hierarchical dimensions increases arithmetically, the size of the matrices increases geometrically (i.e., exponentially). Another approach for creating multidimensional databases is Ralph Kimball’s star schema approach to data warehousing. In this case, a multidimensional structure is represented in a relational database with a central fact table and a series of connected dimensional tables. Here, relational technology is used to implement a multidimensional view.
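Below is a rough sketch of the star schema idea, my own example with invented table and column names rather than one of Kimball’s published designs: a central fact table of measures is keyed to dimension tables, and a drill-down by region and product family becomes a join plus a GROUP BY.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, family TEXT, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, region_id INTEGER, date_id INTEGER,
                          units INTEGER, revenue REAL);
INSERT INTO dim_product VALUES (1, 'Widgets', 'Widget A'), (2, 'Widgets', 'Widget B');
INSERT INTO dim_region  VALUES (1, 'EAST'), (2, 'WEST');
INSERT INTO dim_date    VALUES (1, 2006, 12), (2, 2007, 1);
INSERT INTO fact_sales  VALUES (1, 1, 1, 10, 1000.0), (2, 1, 2, 4, 480.0),
                               (1, 2, 2, 7, 700.0);
""")

# "Drill down" by product family and region: join the fact table to two of its
# dimensions and aggregate over the remaining one (time).
for row in con.execute("""
    SELECT p.family, r.region, SUM(f.units), SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_region  r ON r.region_id  = f.region_id
    GROUP BY p.family, r.region"""):
    print(row)
```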
XML

One of the major developments in data organization in the last decade has been the increasing importance of XML as a data exchange mechanism. XML is an outgrowth of the SGML markup language and emerged in the late 1990s as a way to produce
data structures that were both human- and machine-readable. XML has become increasingly popular as a way to communicate complex data structures (documents, images, video, etc.) and as the basis for information exchange in a number of domains. Like OODBMSs, XML is spawning XML databases that mirror the structure of the XML that is processed.

Spreadsheets
Any discussion of database organization strategies would be incomplete without a discussion of spreadsheets. While database administrators and most data architects disdain to discuss spreadsheets, they are by far the most widely used form of data organization among non-IT professionals. In finance and any number of other domains, mission-critical spreadsheets contain much of most enterprises’ critical data. Spreadsheets make it possible for people with limited understanding of programming or data processing to compute complex functions and to manipulate and analyze complex “what if” scenarios. In recent years, Microsoft has made Excel the principal user interface to tabular data. Because of the ubiquity of Excel, other database and BI vendors have made Excel their preferred user interface.
9 The most important addition to basic inverted-file technology for search engines has been the “page-rank strategy” developed by the founders of Google — Sergey Brin and Lawrence Page. This algorithm was an outgrowth of work by information scientists working in the field of research publication. These researchers discovered that one of the best ways to rank the importance of a specific scholarly publication was to determine how many times that publication was cited in other scholarly papers. The page-rank algorithm essentially refined the same approach to rank Web sites.
Persistence and Transience (Where Do Object-Class and XML Models Fit?)
One interesting development of the last decade has been the influence of OO thinking with respect to databases. One of the things that differentiated OO from previous programming paradigms was that OO focused on creating sophisticated objects and object-class hierarchies to simplify programming and that, historically, those objects have lived in transient memory rather than on some sort of persistent storage medium like a hard disk.10 By and large, data architecture deals with persistent data, not transient data, but with the importance of object-class models, some provision has to be made to incorporate these models into the overall data architecture.

Some Comments on Database Organization
Besides network, hierarchical, relational, OO, inverted-file, and multidimensional organization structures, there are a number of other approaches as well. Each method of data organization has advantages and disadvantages. Indeed, all of the major commercial databases utilize, under the covers at any rate, most of the major organizational strategies.11 And, for historical reasons, a great many of these organizational
schemes are found in most large organizations. So data architects need to be familiar with all (or nearly all) of the major approaches. To assume that one particular form of database organization is, or will ever be, the permanent de facto corporate standard is just wishful thinking. Data architecture, like the architecture of any large city, is a complex, ongoing, historical happening. Old patterns continue to exist alongside new ones used for new classes of applications. And old (classical) organization approaches are reborn with new names and slick technological implementations. The trick today is to be able to combine these various organizational strategies without introducing more complexity than already exists.

Distributed Database
A distributed database is a database that is under the control of a central DBMS in which storage devices are not all attached to a common CPU. It may be stored in multiple computers located in the same physical location or may be dispersed over a network of interconnected computers. Collections of data (e.g., in a database) can be distributed across multiple physical locations. A distributed database is distributed into
separate partitions/fragments. Each partition/fragment of a distributed database may be replicated. Besides distributed database replication and fragmentation, there are many other distributed database design technologies — for example, local autonomy, synchronous, and asynchronous distributed database technologies. How these technologies are implemented can and does depend on the needs of the business, on the sensitivity/confidentiality of the data to be stored in the database, and, hence, on the price the business is willing to pay to ensure data security, consistency, and integrity. One of the most important concepts a real data architect or data modeler needs to understand is “distributed database.” In textbooks, distributed database has to do with databases with planned redundancy of data distributed over multiple servers, normally in different physical locations. In practice, distributed database applies to all copies of the same (or similar) data stored anywhere in the enterprise. Historically, database designers considered any form of redundancy a bad idea because the data is often not updated correctly. But redundancy is often necessary, even desirable, especially where availability or
10 OO databases have been around for at least three decades, but only recently have they begun to gain much traction. This means, by and large, that OO developers have to map their in-memory objects onto relational tables for storage and retrieval.
11 For the sake of brevity, list structures were omitted from the set of major approaches to data organization. Lists are, in fact, a very popular way of organizing data in a variety of application domains.
performance are significant problems. As it turns out, redundant data is not bad from either a database or data architecture standpoint. It is inconsistency that is the real enemy! If one could guarantee that the network and all of the servers that contained portions of very large databases would always and instantly be available, then distributed database design wouldn’t be a problem. However, in the real world, at any one time, various network links may not be available, and one or more of the servers may be (or need to be) offline. As stated, the principal problem with distributed database is updating. If you have redundant data, how do you keep it up to date? If the data doesn’t have to be always in sync, then batch processing will work. In data warehousing, even today, batch updating of the data warehouse from existing databases is the most common approach. In the 1970s and 1980s, database researchers moved to solve the problem of distributed database updating. The basic problem, simply stated, is the following: if the same data (tables, rows, columns) exists in two databases, how do you apply the updating so that at the end of the update cycle, all of the redundant data in all of the databases is identical?
Two major solutions were created: two-phase commit and replication. In a two-phase commit, the server that initiates the update applies it to its local data and then sends independent update commands to each of the other servers on which the distributed data resides. Each of these remote servers, in turn, updates its local data and sends a “successful update” message back to the original server. If all of the distributed sites communicate back success within a specific period of time, then the originating site sends out a “commit message,” and all the servers commit their data. If one or more of the distributed sites fails to send its “successful update” message back in the allotted time, then the originating site sends out a “rollback message” to each of the distributed sites. Two-phase commit ensures that all of the persistent data in a system, however large, is always consistent; because of this, it is the gold standard for distributed updating. Unfortunately, two-phase commit assumes that the devices containing data are nearly always connected to a highly reliable network. As a result, two-phase commit is not widely used for distributed applications where the network is unreliable or where the various servers (or laptops) may be offline for significant periods. Today, an increasing number of organizations are providing their
workforce with portable computers (laptops, PDAs, cell phones), which, even with wireless capability, are often offline for significant periods. Such systems rely not on two-phase commit but on a form of data replication. Usually this approach works best where there is a central database server and many remote devices with a relatively small amount of shared data on the portable device. This approach allows users to work offline, update the data on their portable machine, and then synchronize with the central site when they next connect.12
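The two-phase commit flow described above can be sketched in a few lines of code. This is a toy illustration of the protocol as the report outlines it, with invented class and method names; timing windows, coordinator failure, logging, and recovery are all ignored.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.pending, self.committed = None, {}

    def prepare(self, key, value):           # phase 1: tentative local update
        if not self.healthy:
            return False                      # this site cannot report success
        self.pending = (key, value)
        return True

    def commit(self):                         # phase 2a: make the update durable
        key, value = self.pending
        self.committed[key] = value
        self.pending = None

    def rollback(self):                       # phase 2b: discard the tentative update
        self.pending = None

def two_phase_commit(replicas, key, value):
    """Commit everywhere only if every replica reports a successful prepare."""
    if all(r.prepare(key, value) for r in replicas):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.rollback()
    return False

sites = [Replica("hq"), Replica("branch"), Replica("laptop", healthy=False)]
print(two_phase_commit(sites, "customer.42.credit_limit", 5000))  # False: rolled back
```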
Straight Through Processing

If one looks at data architecture from an enterprise standpoint, the most common way to replicate data is batch updating (i.e., copying the data from one system and then using it to update common and redundant data on the second system). Historically, batch processing was the most common form of serial intersystem communication. In most large organizations, data exchange to move information from one system to another was accomplished either by batch processing or reentering data. In certain application domains (e.g., financial trading and payment areas), this has led to unacceptable delays. Because of the press of e-business, for example, there has been a strong push to
12 Microsoft’s version of this is called “Smart Client” technology, which includes talk about “optimistic” and “pessimistic” locking, but each one has some serious drawbacks in systems with very high transaction rates.
process trades and payments within a single business day. This process has come to be known in the financial world as straight through processing (STP). STP is also talked about with increasing frequency in data warehousing to mean near-real-time updating of data warehouse data, which historically has been on longer weekly or monthly cycles. STP represents one of the most concerted steps in any industry to move to near-real-time updating of massively distributed data. Most organizations today still have huge numbers of systems connected largely via copies of transactional batch data that is used in amazingly complex threads throughout their organization. Over time, the expectation is that the drive for faster and faster response time will, in turn, drive more and more application areas (e.g., supply chain) to some form of STP.

Search and Presentation Technology
It would be a major mistake to ignore the impact that the Internet has had on data architecture in major organizations. Before the advent of the Internet, most large organizations had private networks and/or operated over networks shared with other firms in their own industry. The Internet changed all that. Today, nearly every major enterprise in the industrialized world is connected to the Internet. Some — for privacy, confidentiality, and security
reasons — restrict or prohibit direct connection, but this is more the exception than the rule. The Internet is now the largest, most used database in the world, and Google is the most used search engine. Almost singlehandedly, Google has radically transformed the way people think about finding information. In developing a report like this, I suspect that I used Google at least 20 or 30 times, maybe more. I used it to find definitions, papers, links to other research, and so on. I also use Google Desktop to cross-reference my internal files and, most importantly, to find specific e-mail and attachments. I can’t imagine what I would do if Google suddenly wasn’t available. What I do know is that it would cost both me and those who pay me to do research a great deal more than it currently does. Google, as well as other search engines, is now reaching into corporate information retrieval. To date, this is mostly used to look up documents and files. But no doubt, advanced search engines in the near future will be able to look up and format almost any kind of information. Already there are Google Scholar, Google Patent, and Google Finance, as well as Google Analytics. Organizations that have made their living providing information to selected markets will increasingly find that they must compete or cooperate with the likes of Google, Yahoo!, or Microsoft.
Search and presentation technologies are revolutionizing nearly every field. First, we had map, weather, and restaurant information on different sites, and now we have intelligent mapping services where all the information exists on a map or globe. Already, with wireless GPS devices, computers can show us the information we’re looking for using our current location as a reference point. High-speed Internet connections coupled with new technologies for visualizing information are having profound implications on corporate (enterprise) systems. Increasingly, the expectations of the public, business partners, and corporate users are growing, fueled by what they see every day for free on the Internet.

CURRENT PROBLEMS WITH DATA/INFORMATION IN LARGE ORGANIZATIONS

What, then, are the principal data problems that organizations face in the early 21st century at the end of this third cycle?

Data quality. One of the dirty little secrets in large enterprises has to do with problems reconciling data. At month’s end, for example, the general ledger system may show one set of numbers for gross sales and net profit, while the sales order system often shows significantly different numbers. Getting the official numbers to balance requires hundreds of person-hours each quarter, and, even then, many of the changes required to make everything work out are suspect. Data quality is an ongoing problem that often undermines efforts like data warehousing initiatives. In addition, experience teaches that data quality, even within a single data silo, can be problematic. With the emergence of SOX and other rules and regulations around the world, the ongoing problems of data quality loom large.

Integration. The next most serious ongoing data problem has to do with bringing together common data from one application with that from another.13 Both operations and marketing organizations often keep massive customer files, but there always seem to be problems getting the data to match up. The same is true with product information. There are product files in the sales and marketing systems, product files in manufacturing systems, and yet other product files in engineering systems. But there is no one place where someone can look at all the product data. In multidivisional organizations, these integration problems are magnified.

Getting at critical data. Most organizations have a polyglot of different legacy and COTS systems that have to be interfaced to support enterprise data processes. Many of these systems, while efficient at capturing and updating their internal databases, are not nearly as good at allowing easy access for new querying and reporting purposes. Often, legacy databases lack adequate, up-to-date metadata that makes easy access possible.

Intra- and inter-enterprise sharing of data. Modern businesses depend increasingly on electronic supply chains and outsourcing. This means that data must be shared across not just departmental and divisional boundaries but also across enterprises. The post-9/11 analysis that intelligence and law enforcement agencies had trouble sharing data was not a surprise to anyone who understood the technical, organizational, and legal difficulties in setting up robust data sharing mechanisms.

Recovery of historical (archived) data. Never has the rate of technological change been greater. New generations of technology occur every three or four years today rather than every 10 or 15 years as was the case just a couple of decades ago. When organizations are required to dig back 20 or 30 years, for example, to satisfy some class action lawsuit, they often find that they no longer have any equipment that can even read the media on which the historical data is stored. Moreover, the older the data, the less likely that adequate metadata exists that describes the data files of databases.

Conceptual shifts. Not only are there problems that result from new generations of hardware and software, there are also shifts in systems and database paradigms. In database technology, the first successful databases were either hierarchical or network-based. During the 1980s and 1990s, relational databases came to dominate database strategy, and in the last decade, OO programming has created an impedance mismatch between programming and database management. Even recent developments like XML make it possible to send, receive, and store much more complex data/information artifacts. However, they do not match up well with persistent relational database implementations.

Even with all these problems, things are not hopeless. As planning gurus are apt to point out, every problem is an opportunity. The fact that there are so many problems with data today is in large part the result of the success of computers and communication systems in transforming our
13. It is clear that data quality, integration, and common metadata are interrelated. If you have multiple copies of the same thing, you will have problems integrating the data. The organizations that use the separate copies will develop "dialects" where "customer," "product," etc., will mean different things.
The world is increasingly data-centric. Today, millions upon millions of people and enterprises are capable of instantaneous worldwide communication. The Internet, which has created and/or revolutionized entire industries, is only a little over a dozen years old. Data needs have mushroomed because organizations have moved increasingly from transporting atoms (physical things) to transporting bits (electronic pulses). However, if organizations are going to be able to leverage these new advances, they will have to manage their data to be able to meet current and future needs.

DATA/INFORMATION FORCES

To go along with the problems and opportunities described previously, there is also a series of forces driving and enabling changes in our data architecture. For the purposes of this report, we classify business needs as drivers and technology opportunities as enablers. Some of these, as you will see, are attempts to solve the problems discussed earlier.

Business Drivers (Market Pull)
The multitude of things that top managers and professionals continually ask for includes:

An integrated view of major data classes. Some of the questions that one would assume are the easiest to answer — such as, "How much did we sell last month by product and customer?"; "How much business did our top 10 customers do with us last year?"; and "How profitable are our top 10 customers?" — turn out to be exceedingly difficult to answer in most big organizations. Getting integrated data is like pulling teeth. And as our data architecture becomes more baroque, integrating similar data becomes ever more difficult.

The need to get at key data faster. The world in Thomas L. Friedman's The World Is Flat is moving faster and faster. Many major decisions have to be made today or next week, not next month or next year. Oil prices spike, oil prices plummet. There is a strike in Italy. What do you do? The center of the world's economic attention is no longer North America and Europe, but China, India, Latin America, or somewhere else in the developing world. In order to be competitive in this world, information has to flow faster from operations to corporate headquarters, and it must incorporate external data sources.

The need to handle large amounts of structured and complex unstructured data. Today, large organizations are requiring their vendors to be closely tied to their supply chain systems. Large retail customers want to be able to supply their vendors with detail data on every transaction that includes one of their products in every store in the chain. And with radio frequency identification (RFID), the amount of detail data will continue to double or triple every couple of years. It won't just be transaction data that is captured; in some locations, video camera and recording system vendors are now supplying retail stores with software that tracks customers' movement within stores. The original purpose of these systems was to reduce shoplifting and pilfering, but smart video tracking software is now also providing vast amounts of information for data mining of customer behavior.

Immediate updating of all data. In order to make up-to-the-second decisions, transactions must be able to update not only operational databases but management/analytical ones as well (STP).

Technology Enablers
As fast as the business world changes, the technology world changes even faster. Each day, new discoveries, products, and processes are announced. Each of these products enables new kinds of business solutions. No organization can take advantage of all of these innovations; the trick is to figure out which technologies to buy into and when to make the move. Move too quickly, and the technology will not be ready; wait too long, and the window of opportunity may have closed. In the world of data, key technologies include:

The cost per byte of data storage. Probably no area of technology has tracked or exceeded the predictions of Moore's Law better than data storage. Just recently, for example, Hitachi announced a new 1-terabyte disk. It was not so long ago that 1 terabyte was considered a very large database. Indeed, even today, many organizations could put all of their structured data for multiple years on such a drive. But even with all this capacity, organizations are pressed to capture, index, and retrieve all the important information at their disposal.

The speed of data retrieval. Not only is data storage becoming cheaper every few months, but the speed is also increasing. The dominant form of storage, rotating magnetic memory, stands to be replaced for a great many applications by large flash memories, which have very low power requirements and are nonvolatile, thus eliminating many problems of backup and recovery.

Search technology. A business consultant, talking about Google's great success, recently described using Google as akin to "sitting next to the smartest kid in school." Google has made access to billions of documents, images, and Web sites an amazing exercise, one that increases the IQ of the Internet user by quite a lot. As a result, Google is moving to provide the same subsecond, highly relevant access to internal enterprise data.

Grid computing. Google works its magic not by using large, powerful server farms, but by using hundreds of thousands (millions) of very simple computers in centers around the world, running special grid-based indexing, organizing, and search algorithms.

Data mining. Data mining is a technology that provides software algorithms for analyzing millions of records and discovering often-unexpected patterns. This technology is being utilized in industries from retail to telecommunications to drug discovery.

Data visualization. Data visualization provides new ways to utilize human intelligence to literally see unexpected patterns in 3D images. As the amount of information that people have to work with increases, so does their need for ways to consolidate and visualize their data in new, intuitive ways.

Semantic modeling. Long term, the most important new technology affecting data is the ability to create truly semantic data models, which include ontologies, propositions, and inference rules.

There is one caveat: all of this technology works best if the data (and the metadata) is good and up-to-date, and someone understands what it means. Today, serious problems occur because no human ever looks at the detail data due to sheer volume. And despite decades of discussion about artificial intelligence, most computer algorithms only find the things they are looking for. Technological breakthroughs that will actually allow computers to understand and reason more like human beings are getting closer but still need a lot of human intervention.

The Netherworld of Legacy Data
Understanding the business drivers and technology enablers that propel our organizations into the future is not even half of the challenge large organizations face in developing (and maintaining) an effective data architecture. What holds them back is a very large anchor in the form of the existing legacy data environment, where legacy data is defined as any data in any system that is in operation. Table 1 gives just a small window into the nature of the problem by displaying the basic database statistics for a middle-sized government enterprise with an annual budget of about US $1.2 billion and about 3,000 employees. The organization has been involved in
computing since the late 1950s and uses computers in every aspect of its organization, from advanced engineering and planning through sophisticated project management. An early user of DBMSs, the agency has large investments in both mainframe (DB2) and server (Oracle) databases. These numbers, by the way, do not include instances of Microsoft SQL Server, Microsoft Access, and Excel spreadsheets, some of which also contain critical agency information. Now, this is not an extreme case. The agency in question has a well-run IT organization with an experienced DBA group. Indeed, this organization has been involved in serious systems and database planning and architecture for nearly two decades. More importantly, it is not a large organization as enterprises go. What stands out is just how large the numbers of tables and attributes are. The technical database group and senior developers in this organization know how integrated database systems should be designed, developed, and deployed. But over time, the push to deliver new functions in shorter and shorter periods has encouraged project teams to develop their own databases with similar (or the same) tables and lots of attributes in common.
Table 1 — Databases, Tables, and Attributes

                          DB2       Oracle      Total
Databases                  14           42         56
Tables                    560        3,700      4,260
Columns (Attributes)   80,000       17,600     97,600
From my experience over the last two decades, this organization is not unique. Indeed, my firsthand knowledge would suggest that many attributes, even many tables, that exist in a normal data center are not used, or not used in an appropriate way. One highly complex COTS application that I reviewed had something like 200 tables in its database. However, on examination, it turned out that almost half of the tables (94) had no rows (i.e., no data) in them! In addition, of the 20 or so major tables in the application, 50%-66% of the columns had either nulls or constants in them, meaning that they weren’t being used either! In this report, we focus on what the industry is now calling “master data” (customer, product, vendor, employee). Here, despite decades of concentration, the problem is even more difficult. Customer data is probably the worst case. Every commercial enterprise (and many public ones) focuses extensively on its customer data, but in organization after organization, it turns out there are multiple customer databases. This in and of itself would
not be too bad, but in a number of cases, these databases don't have either a common scheme for identifying their customers or an agreement from system to system on what actually constitutes a customer.14 One of the reasons that this happens is that the needs of operational divisions and the enterprise are quite different. From the standpoint of an individual division or department, its underlying databases do not have to be consistent or compatible across the enterprise; they just have to work for that department or division. As long as all the bills get out and someone pays them, the billing department for Division A doesn't care if there are many duplicated customer numbers, for example, but at the enterprise marketing level, duplicate customer numbers matter quite a lot — it is important to know who your best customers really are and how much they buy. And as enterprises gear up for operations in the 21st century, getting a handle on data quality across the enterprise becomes a major need.
14. One large manufacturing company that I worked with in the 1990s had two major (and lots of minor) customer databases. The one maintained by operations had some 500,000 customers on it, while the marketing organization had a customer database with more than 4 million customers!
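The kind of table-by-table and column-by-column review described above can be partially automated. The sketch below is hypothetical: it assumes a SQLite copy of an application schema under an illustrative file name, but it shows the sort of profiling that surfaces empty tables and columns that are always null or hold a single constant value.

# Hypothetical profiling sketch: flag empty tables and columns that are
# always NULL or constant. Assumes a SQLite extract of the application
# schema at "app_copy.db" (an illustrative name, not a real system).
import sqlite3

conn = sqlite3.connect("app_copy.db")
cur = conn.cursor()

tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

for table in tables:
    row_count = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if row_count == 0:
        print(f"{table}: EMPTY (no rows)")
        continue
    columns = [r[1] for r in cur.execute(f"PRAGMA table_info({table})")]
    for col in columns:
        distinct, non_null = cur.execute(
            f"SELECT COUNT(DISTINCT {col}), COUNT({col}) FROM {table}"
        ).fetchone()
        if non_null == 0:
            print(f"{table}.{col}: all NULL")
        elif distinct == 1:
            print(f"{table}.{col}: constant value")

conn.close()

A report like this does not say why a table is empty or a column is constant, but it gives the data architect a factual starting point for the conversation.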
All the business strategies and all the technology in the world will not solve the problems of legacy systems and legacy databases. Replacing some of the major applications with COTS packages will not solve the problem and in many cases may make it worse.15 What is required to actually make a dent in this data problem is a very long-term migration strategy guided by an informed, evolving data architecture, together with a program to separate operational data systems from analytical/managerial ones!

DATA ARCHITECTURE SOLUTIONS

I have written elsewhere that enterprise architecture is more like urban planning than it is like building architecture. While constructing buildings is a complex and interesting task, urban planning deals with much more complex interrelated factors and with many more dimensions. This way of looking at EA is beginning to get a wider following, with major organizations like Microsoft and Walt Disney promoting urban planning models [2]. If viewed as urban planning, EA can be thought of as depending upon a detailed understanding (documentation) of the current "as built" environment, as well as the collaborative development of master plans, rules, and regulations to help fashion (regulate)
the future. Like cities and suburbs in urban areas, application and data architectures grow organically over time, and despite the best-laid plans or well-funded infrastructure, unplanned things happen. In the same way that all cities have an as-is urban architecture, so too, it can be argued, all organizations have an as-is data architecture, even if they don't recognize it or consciously manage it. The sum of an organization's legacy data provides its as-is data architecture. Unfortunately, the problem with unplanned data architectures is that they often lead to many of the problems that we discussed earlier.

Distributed Database
… a data architecture describes: (i) how data is persistently stored, (ii) how components and processes reference and manipulate this data, (iii) how external/legacy systems access the data (interfaces to data managed by external/legacy systems), (iv) implementation of common data operations. [5]
To recapitulate: the as-is data architecture for an enterprise is the sum of all the data practices of the enterprise up to the current time. It represents both planned and unplanned activities; it represents internal and external data; and it represents large databases and small ones, enterprise-wide
projects, as well as local, departmental ones. It even represents those unauthorized Access databases and spreadsheets that are, in an enterprise data architecture, somewhat similar to the shanties and makeshift structures that dot urban landscapes. A serious enterprise data architect has to assume that everything that exists has value to someone in the organization. All of my experience shows that an organization's true as-is data architecture is enormously more complex than anyone suspects, even those involved in database management. And the as-is data architecture is not restricted to just its so-called persistent data either; it must also include all the data exchanges that move from system to system. These data exchanges range from transferring physical media like tapes and disks to nearly instantaneous electronic communication. While it was assumed in the 1970s and 1980s that all systems would be essentially online in a few years, batch approaches have persisted, even for online systems processing millions of transactions per day, as a means of updating data in other systems. This serial updating process has turned out to be easy to understand, explain, and manage. What is not so easy to explain and manage is the timing of updating and exactly which data is updated when. Most existing systems have many
15. COTS packages have their own unique databases, attribute names, and business rules. Most of these packages, for some very good reasons, do not want anyone messing with their database structure or data names. In a great many cases, highly integrated COTS applications have proven difficult to extract data from and equally difficult to interface with the hundreds of existing legacy and other COTS applications found in the organization.
versions of "the truth," depending on when you ask, what your definitions are, and, of course, who you ask. Quasi-independent databases tend to have different values for common information, so one of the driving factors in all data architecture efforts has to be improving the overall quality and reliability of the enterprise's core data.

Data Quality
The quality of data is a key (even critical) issue in every large organization. At base, if an organization's information isn't any good, the company will make serious, often fatal, mistakes. Ultimately, data quality is a pretty simple concept: it is a measure of how closely the data in our databases agrees with the real world. A data quality value of "1.0," for example, would mean that our data is in exact agreement with the corresponding values in the real world. All of our addresses for all of our customers would be correct, as would all of their telephone numbers. On the other hand, a data quality value of "0.0" would mean that there is no agreement at all between the data in our databases and the real world. Obviously, time is an issue. If it takes a month to get a name and address updated, then it is likely that the names and addresses will be at least a month out of date. If there is not a systematic way of getting data updated once it is entered, then databases will become less and less accurate.16 And there are a large number of other data quality issues that companies struggle with. Often, an enterprise's customers will have more than one customer number in one system. Indeed, business promotions often encourage cheating. For instance, I've been told that if I apply for a store credit card, I will get an extra 10% or 15% off. When I tell the salesperson that I already have one of their cards, they often say, "Well, that's all right, they'll [the central IT folks] never know. Just use your initials instead of your full name." So not only do customers end up with multiple customer numbers, but the best customers end up with the most. Ultimately, the most important indicator of data quality is use. The more an organization uses its data, the better the data will get. Conversely, data that is rarely or never used will not (cannot) be very good. Data warehousing has dramatically improved data in many organizations, though it often takes a while.
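To make the 0.0-to-1.0 idea concrete, here is a minimal, hypothetical sketch. It compares a sample of stored records against independently verified ("real world") values and reports the fraction of audited field values that agree. The field names and records are illustrative only; a real audit would sample systematically and weight fields by business importance.

# Minimal data quality score: fraction of audited field values that agree
# with independently verified values. Names and records are illustrative.

database_records = {
    101: {"name": "A. Smith", "address": "12 Elm St",  "phone": "555-0100"},
    102: {"name": "B. Jones", "address": "9 Oak Ave",  "phone": "555-0111"},
}
verified_records = {
    101: {"name": "A. Smith", "address": "12 Elm St",  "phone": "555-0199"},
    102: {"name": "B. Jones", "address": "14 Oak Ave", "phone": "555-0111"},
}

checked = agreed = 0
for key, verified in verified_records.items():
    stored = database_records.get(key, {})
    for field, true_value in verified.items():
        checked += 1
        if stored.get(field) == true_value:
            agreed += 1

quality = agreed / checked if checked else 0.0
print(f"Data quality score: {quality:.2f}")   # 4 of 6 fields agree -> 0.67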
Data Warehousing

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.
— Bill Inmon, "Father of Data Warehousing"
Now, the complexity created by the drivers, enablers, and legacy data world described earlier is not new to IT. Indeed, these problems have been around for decades. And the larger and more progressive the organization, the earlier this complexity is encountered. By the middle of the 1980s, a number of large, data-intensive enterprises around the world had already reached a point where their data pain threshold had been exceeded. A new generation of relational DBMSs was coming on-stream with the promise of a much more elegant way of creating, updating, and retrieving data. But these organizations already had hundreds (and in some cases, thousands) of applications with existing data files or first-generation DBMSs. A number of organizations had extensive experience with the explosion of databases occasioned by the lack of a clear strategy for integrating data and providing a single view of the truth. Analytical tools were coming on-stream that could provide managers and professionals with much improved reporting and visualization capabilities — if only they could get at the needed data. Many of these organizations began to look at the wholesale replacement of their existing legacy databases with true centralized, integrated databases. However, when the IT planners put pencil to paper, they found that it would be 10, maybe 20, years before this integration could be effected (remember this point).
16. Demographers estimate that approximately 16% of the population of the US changes their address every year.
What these planners came up with was a different strategy, not just a technology strategy — a data architecture strategy known today as data warehousing. The question that touched off this revolution was, "What if we develop an integrated analytical database where we could bring all our major operational data together and then use that database for our reporting, analysis, and simulation work?" These planners felt that although building data warehouses might take three to five years, it would provide much of the value of total data integration and data consistency without disrupting the operational systems that had become essential to the day-to-day operations of their businesses.
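A minimal sketch of the kind of periodic population job this strategy implies: pull records from two operational sources, reconcile them to a common customer identifier, and load the result into one analytical structure. Everything here (source names, fields, the cross-reference mapping) is hypothetical and stands in for what would normally be batch extracts and a warehouse loader.

# Hypothetical warehouse-population sketch: extract from two operational
# sources, conform customer identifiers, and load one analytical table.

billing_rows = [
    {"cust_no": "B-001", "amount": 250.0},
    {"cust_no": "B-002", "amount": 75.5},
]
marketing_rows = [
    {"customer_id": "M-9001", "segment": "retail"},
    {"customer_id": "M-9002", "segment": "wholesale"},
]

# Cross-reference maintained by the warehouse team (illustrative values).
to_enterprise_id = {"B-001": 1, "M-9001": 1, "B-002": 2, "M-9002": 2}

warehouse = {}  # enterprise_id -> integrated analytical record
for row in billing_rows:
    rec = warehouse.setdefault(to_enterprise_id[row["cust_no"]], {})
    rec["total_billed"] = rec.get("total_billed", 0.0) + row["amount"]
for row in marketing_rows:
    rec = warehouse.setdefault(to_enterprise_id[row["customer_id"]], {})
    rec["segment"] = row["segment"]

for eid, rec in sorted(warehouse.items()):
    print(eid, rec)
# 1 {'total_billed': 250.0, 'segment': 'retail'}
# 2 {'total_billed': 75.5, 'segment': 'wholesale'}

The point of the sketch is the separation of concerns: the operational systems keep running untouched, and the reconciliation work happens in the population job.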
This splitting of the reporting and analysis applications away from the operational domain created a new high-level enterprise data flow architecture that acknowledged the difference between operational systems and what many now call "informational systems" (see Figure 2). The term "informational" was coined to distinguish operational systems handling day-to-day business activities from those systems associated with planning, forecasting, and control. While EA was emerging at about the same time, it was in data architecture that one of the first major enterprise architectural solutions was put into operation.17
The next stage was to further divide the IS domain into data warehouses (or operational data stores) and data marts to allow end users to directly access highly structured informational databases (see Figure 3). As it turned out, data warehousing was much harder than anyone anticipated. Integrating data from disparate sources became much more difficult, expensive, and time-consuming than its early advocates suspected. One of the problems had to do with the best strategy for designing a data warehouse database; the other had to do with the quality of the data and metadata of the myriad operational databases.
[Figure 2 — Early data warehouse architecture (circa 1990). Elements shown: operational systems (operational applications and operational DBs); a data warehouse population application; the data warehouse; and business intelligence (BI) applications.]
[Figure 3 — Mature data warehouse architecture (circa 1995). Elements shown: operational applications and operational DBs; ETL applications; an operational data store (ODS); data mart population applications; and business intelligence (BI) applications.]
17. The other major enterprise initiative that occurred around this time, "business process engineering," was also beginning to have a major effect on business organizations, coupled with technology advances in workflow management.
Two major data warehousing design strategies emerged during this period: a detail, near-normalized data warehouse–based solution (the ODS [operational data store] approach18) and a dimensional solution (the star schema approach19). As a practical matter, the ODS approach tended to lead to larger, more centralized data warehouses, whereas the star schema approach tended to lead to more departmental, divisional, project-oriented data marts. Most seasoned data warehousing professionals now see both approaches as part of a mature informational data architecture solution. Most large organizations today have one or more data warehouses as well as hundreds of data marts installed to support their key businesses. Over the years, the data warehouses have become more and more valuable from an analytical and planning standpoint. But data warehouses have also increased the pressure for even more sophisticated integration of enterprise data, especially for those core data categories that are central to future business processes. Improving an organization's data architecture is fraught with many of the same problems that plague urban planners. Reshaping an
urban area is often difficult because of competing political forces. Developers, large employers, and local activists all have distinct interests that are opposed to one another. Data architecture activities show similar forces at work — major users and COTS vendors play the role in enterprise and data architecture that major employers and developers play in the urban landscape. In a recent Executive Report, Luke Hohmann and Ken Collier talk about the growing importance that COTS packages such as ERP, CRM, and SCM are having on data architecture design in lots of companies [3]. Because organizations typically make huge investments to bring in one of these packages, and because these COTS vendors have much more money to invest in design and architecture, enterprise data architecture is being set by these vendors.

Master Data Management
There are a huge number of things an organization could include in a data warehouse or a data mart. However, time and again, the same categories of data rise to the top of the agenda of things that managers and professionals want to see included in an enterprise data warehouse, especially customer and product data.
Obviously, the subjects that most interest management have to do with those things that most affect the outcome of the business, namely the actors and objects. So the actors and objects of most interest are usually things like "customers," "prospects," "vendors," "business partners," "products," "employees," "purchased parts," and "jobs." In fields like construction or software development, the object might be "project." But in no case would the primary subjects that management is interested in be much of a surprise to anyone familiar with the particular business or market. As data warehouses have matured, data warehouse tool vendors and consultants have noticed one constant theme: enterprises wanted more and more data about the customers and products that were most critical to their businesses. They also wanted more sophisticated ways of expressing the data. Moreover, enterprises wanted to integrate not only more of their internal data into these more sophisticated relationships but more external data as well. From these observations has come a push toward MDM. (In case of confusion, you can substitute the term "subject" for "master" and think in terms of "subject data management" [SDM]. It comes down to pretty much the same thing.)
18. One of the leading proponents of the detail, normalized data warehouse solution was Bill Inmon. One of his major contributions to data warehouse thinking was the idea of the "operational data store," a database that served as a staging area for the cleansing, integration, and loading of what I call the "core data warehouse."
19. As stated earlier, the leading proponent of the star schema dimensional approach was Ralph Kimball. Taking a cue from multidimensional databases that were just coming online, Kimball came up with a strategy for mapping operational data into two major groups of tables: "fact tables," which contain either detail or summarized business transaction data, and "dimension tables," which contain information on the major dimensions of business analysis (e.g., "customer," "product," "location," "time," "geography").
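A small, hypothetical illustration of the two table types the dimensional approach uses: a fact table of sales transactions keyed to customer and product dimension tables, plus the kind of roll-up query such a design makes easy. The table contents and keys are invented for the example.

# Hypothetical star schema: one fact table, two dimension tables, and a
# roll-up of sales by product category (a typical dimensional query).

customer_dim = {1: {"name": "Acme Corp", "region": "East"},
                2: {"name": "Globex",    "region": "West"}}
product_dim  = {10: {"name": "Widget", "category": "Hardware"},
                11: {"name": "Gadget", "category": "Electronics"}}

sales_fact = [  # each row references the dimensions by key
    {"customer_key": 1, "product_key": 10, "date": "2007-01-05", "amount": 1200.0},
    {"customer_key": 1, "product_key": 11, "date": "2007-01-06", "amount": 300.0},
    {"customer_key": 2, "product_key": 10, "date": "2007-01-06", "amount": 450.0},
]

sales_by_category = {}
for fact in sales_fact:
    category = product_dim[fact["product_key"]]["category"]
    sales_by_category[category] = sales_by_category.get(category, 0.0) + fact["amount"]

print(sales_by_category)   # {'Hardware': 1650.0, 'Electronics': 300.0}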
Master reference data is the common set of definitions, identifiers, and attributes for the most significant semantic data categories across the enterprise.20 Master data management is the use of the enterprise master reference data to construct centrally managed data hubs for those categories. MDM involves developing increasingly sophisticated sets of data centered around one or more of the major semantic components of an organization, typically actors, objects, and locations. A set of popular master data categories is shown in Table 2. Clearly, different master data is important in different industries and different organizations, but that said, the importance of "customer" and "product" master
data hubs far exceeds that of all the others. Indeed, customer and product hubs have become important enough to be considered specific software markets all by themselves.
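As a purely illustrative example of what master reference data can look like for one category, the sketch below captures an enterprise-wide definition, the agreed identifier, the shared attributes, and the systems that hold local copies. The structure and names are assumptions for the example, not any vendor's format.

# Illustrative master reference data entry for the "customer" category.
# The fields shown (definition, identifier rule, attributes, source systems)
# are an assumption about what such an entry might contain.

customer_master_reference = {
    "category": "customer",
    "definition": "A legal entity or person that has purchased, or has an "
                  "active agreement to purchase, goods or services from us.",
    "enterprise_identifier": "customer_id (surrogate key issued by the hub)",
    "attributes": ["legal_name", "billing_address", "tax_id",
                   "industry_code", "credit_rating"],
    "source_systems": {"billing": "CUST_MAST", "crm": "ACCOUNTS",
                       "order_entry": "CUSTOMER"},
}

print(customer_master_reference["definition"])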
MDM Models

There are a number of different forms of MDM models. The most prominent are hub-based MDM models and data warehouse–based MDM models.

Hub-Based MDM Models
From a marketing point of view, customer and product hubs are the MDMs that are getting the most interest. Large enterprises are vitally interested in understanding and controlling customer and product data at an enterprise level, and all the major players (IBM, Oracle, SAP) have announced such hubs. Hubs are specific applications provided by various vendors that support a specific major actor or object.
These hubs aim to provide organizations with a common place to find all of the data about a specific semantic category. These hubs differ from one another in some interesting ways. In the case of the largest players, the hub vendors provide a base set of reference metadata for an all-encompassing customer or product hub. Armed with this set of reference data, organizations can modify the reference data and then use the modified hub reference data as the target for building (or extending) the links to their operational data and/or their already developed data warehouses. In some other cases, the data hub comes more or less prebuilt. Often, these hubs have been developed by COTS vendors who specialized in specific marketplaces and provide links with their COTS operational data. In other cases, hub vendors come from specific industries (retail, auto, pharmaceuticals, manufacturing) and exploit their experience to provide predefined master reference data.
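Whatever the vendor, the working core of a customer hub is match-and-merge: deciding that records from different systems describe the same customer and choosing which attribute values survive. The sketch below is a deliberately simple, hypothetical illustration (matching on a normalized name plus postal code, with the most recently updated value winning); real hubs use far more sophisticated matching and survivorship rules.

# Hypothetical match-and-merge for a customer hub: records that share a
# normalized name and postal code are treated as the same customer, and
# the most recently updated value for each attribute survives.
from datetime import date

source_records = [
    {"system": "billing", "name": "ACME Corp.", "postal": "66603",
     "phone": "785-555-0100", "updated": date(2006, 3, 1)},
    {"system": "crm",     "name": "Acme Corporation", "postal": "66603",
     "phone": "785-555-0199", "updated": date(2006, 11, 15)},
]

def match_key(rec):
    # Crude normalization for the example: lowercase, strip punctuation and
    # common corporate suffixes.
    name = rec["name"].lower().replace(".", "").replace(",", "")
    for suffix in (" corporation", " corp", " inc"):
        name = name.removesuffix(suffix)
    return (name.strip(), rec["postal"])

hub = {}
for rec in sorted(source_records, key=lambda r: r["updated"]):
    golden = hub.setdefault(match_key(rec), {})
    golden.update({k: v for k, v in rec.items() if k != "system"})

for key, golden in hub.items():
    print(key, "->", golden["phone"])   # the later CRM phone number survives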
Table 2 — Popular Master Data Categories

Actors                Objects            Locations      Relationships
Customers             Products           Locations      Organizational hierarchies
Vendors               Purchased parts    Offices        Sales territories
Preferred suppliers   Services           Regionals
Trading partners      Accounts           Geographies
Employees             Assets
Shareholders          Policies
Data Warehouse–Based MDM Models
Another approach to MDM is to develop custom hubs based on existing data warehouses and BI initiatives. This makes sense, since the biggest obstacle in developing a serious MDM hub is identifying, finding, cleaning, and integrating data from operational and external data sources.
20. Notice that there are no significant differences between the definition of master reference data and the definition of any data within an organization, except for the explicit stipulation that the definitions be enterprise-wide.
If an organization has already spent multiple years and millions of dollars integrating data from a number of operational and external sources into an enterprise or divisional warehouse, it is natural that the next evolutionary step (i.e., developing an MDM hub) would be to build on these existing investments. Basically, just as there are two fundamental data warehouse design strategies (data warehouse–oriented or data mart–oriented), we are also seeing two major data warehouse–based MDM models: (1) MDMs built around core data warehouses/ODSs and (2) MDMs built around data marts using star schema models.
ODS-Based MDM Models
One of the most interesting approaches for getting from data warehousing to MDM is found in a set of ideas that uses the metadata of an ODS as the basis for creating custom master data hubs (see Figure 4). One ODS-based approach has been developed by ObjectRiver [4]. The strategy involves analyzing entity-relationship (ER) information from an ODS to identify "coarse grained business objects"21 by looking for the natural hierarchies within the data model. This is done by first identifying those entities with just primary keys (unique identifiers) and then looking at the relationships of the tables that are directly attached to those entities. In a way, the approach is somewhat equivalent to picking up the ER diagram of the ODS by one of these key entities and seeing what is attached. One advantage of this approach is that there are no major data conversion projects involved, since the data already exists, which means that organizations can start getting value sooner. Another advantage is that there is a wider range of master data subjects than one would find by just focusing on customer or product hubs. Finally, the approach works well with SOA, since it can automatically build a "data adapter" to facilitate OO access to existing relational data.
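A hypothetical sketch of the kind of analysis just described: given ER metadata (primary keys and foreign keys), it treats tables whose primary key is a single identifier and that carry no foreign keys as candidate roots, then collects the tables directly attached to each root. The metadata and the exact root rule are invented for the example; they are one possible reading of the approach, not ObjectRiver's implementation.

# Hypothetical sketch: find candidate "coarse-grained business objects" by
# locating tables with a single-column primary key and no foreign keys,
# then attaching the tables that reference them directly.

tables = {
    "CUSTOMER":      {"pk": ["customer_id"], "fks": {}},
    "CUST_ADDRESS":  {"pk": ["customer_id", "seq_no"],
                      "fks": {"customer_id": "CUSTOMER"}},
    "ORDER_HEADER":  {"pk": ["order_id"], "fks": {"customer_id": "CUSTOMER"}},
    "PRODUCT":       {"pk": ["product_id"], "fks": {}},
    "ORDER_LINE":    {"pk": ["order_id", "line_no"],
                      "fks": {"order_id": "ORDER_HEADER",
                              "product_id": "PRODUCT"}},
}

roots = [name for name, meta in tables.items()
         if len(meta["pk"]) == 1 and not meta["fks"]]

for root in roots:
    attached = [name for name, meta in tables.items()
                if root in meta["fks"].values()]
    print(root, "->", attached)
# CUSTOMER -> ['CUST_ADDRESS', 'ORDER_HEADER']
# PRODUCT -> ['ORDER_LINE']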
[Figure 4 — ODS-based MDM model. Elements shown: operational applications, COTS applications, and their operational/COTS DBs; ETL applications feeding an operational data store (ODS); data hub population applications; master data hubs incorporating external data; and data hub BI applications.]
21. In [4], Steven Lemmo writes: "We call these 'coarse grained' because they are aggregations of entities that are linked through defined relationships."
Data Mart–Based MDM Models
Another strategy for developing MDM models is to leverage existing data mart–based data warehouse projects to develop MDM hubs. The popularity of Kimball's star schema structures for data warehouse design makes them another natural starting point for organizations that want to develop their own custom data hubs. The problem here is a little more complex, since the focus of star schema designs is the fact tables and dimension tables, where the fact tables usually represent base transactions (messages) and the dimensions represent actors, objects, events (time), and locations. This means that customer data (which is naturally a dimension) is apt to be found in many data marts, connected to a number of different fact tables. Rationalizing the information here is somewhat more difficult than in the case where there is a single staging area with a single set of metadata to analyze. A number of organizations that specialize in combining metadata from different data marts and data warehouses are focusing on this area.
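A small, hypothetical illustration of the rationalization problem: two data marts each carry their own customer dimension with their own surrogate keys, so building one hub record per customer means lining the rows up on a shared natural key. The mart names, keys, and attributes are invented; which attribute value "wins" is a policy choice shown here in its simplest form.

# Hypothetical sketch: conform two mart-level customer dimensions into one
# hub record per customer by matching on a shared natural key (tax_id).

sales_mart_dim = [
    {"cust_key": 7, "tax_id": "48-111", "name": "Acme Corp", "segment": "retail"},
    {"cust_key": 8, "tax_id": "48-222", "name": "Globex",    "segment": "wholesale"},
]
service_mart_dim = [
    {"cust_key": 301, "tax_id": "48-111", "name": "ACME Corporation",
     "support_tier": "gold"},
]

hub = {}
for mart_name, rows in (("sales", sales_mart_dim), ("service", service_mart_dim)):
    for row in rows:
        rec = hub.setdefault(row["tax_id"], {"mart_keys": {}})
        rec["mart_keys"][mart_name] = row["cust_key"]
        for attr, value in row.items():
            if attr not in ("cust_key", "tax_id"):
                rec.setdefault(attr, value)   # first mart loaded wins (a policy choice)

for tax_id, rec in hub.items():
    print(tax_id, rec)
# 48-111 -> keys from both marts, name 'Acme Corp', segment 'retail', tier 'gold'
# 48-222 -> sales mart only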
MDM Issues

Most MDM hubs focus on either one of the primary business actors (customer, trading partner) or one of the primary business objects (product, service). By bringing this data together, organizations have much more control over their decision making and process management. The problem is
that the "glue" that holds data together is not the business actors or business objects; it is the messages or transactions. In an organization with both a customer hub and a product hub, where do the "orders" reside? The answer, of course, is that they reside in both places. Orders, shipments, bills, and payments all tie enterprises to their customers and their products. The same is true with vendors and purchased products — it is the purchase orders, receipts, vendor invoices, and vendor payments that tie them together. Ultimately, I expect that MDM will simply move our organizations to the next plateau of data architecture, where we begin to integrate the data from our MDM hubs to create a true enterprise data architecture. Another plus for MDM is the increasing interest in using these hubs as a point of integration for all the unstructured data we have about our customers and products. This data is growing rapidly and is enormously fragmented. Bringing it together under a common management strategy will have enormous long-range benefits for our organizations.

CONCLUSION

The key to better decisions and business process improvement is better data! Of all the elements of EA, data architecture is both the most important and the most mature, in
no small part because data is our most valuable enterprise IT asset. Over the last two decades, data architecture in large and medium-sized enterprises has undergone dramatic changes. The most significant of these changes has been the introduction of data warehousing as a means of "rationalizing" databases specifically designed for data access as opposed to operations. This has reenergized business intelligence. Data architecture is now undergoing yet another major change called master data management. MDM focuses enterprise data management efforts around the corporate jewels: the data associated largely with major actors and objects like customer and product (i.e., the critical data that is the most important from a strategic standpoint). These data hubs are increasingly comprehensive and increasingly sophisticated, incorporating data not only from most of the necessary enterprise sources but from external, syndicated sources as well. Data is the lifeblood of big organizations. The bigger the enterprise, the more important data becomes. The more electronic and real-time organizations become, the more significant data architecture, data warehousing, and MDM become. While there are no magic bullets, these are the initiatives that leading enterprises are depending on to manage the future.
REFERENCES
1. Codd, E.F. "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, Vol. 13, No. 6, June 1970.

2. Helland, Pat. "Metropolis." Microsoft Architect Journal, Microsoft Press, Vol. 2, April 2004.

3. Hohmann, Luke, and Ken Collier. "Do You Run from or to Embedded Business Intelligence?" Cutter Consortium Business Intelligence Executive Report, Vol. 6, No. 12, 2006.

4. Lemmo, Steven. "Leveraging Data Models to Create a Unified Business Vocabulary for SOAs." DM Review, 25 January 2007.

5. Lewis, Grace Alexandra, Santiago Comella-Dorda, Pat Place, Daniel Plakosh, and Robert C. Seacord. "An Enterprise Information System Data Architecture Guide." Technical Report CMU/SEI-2001-TR-018, Carnegie Mellon Software Engineering Institute, 2001.

6. Lyman, Peter, and Hal R. Varian. "How Much Information 2003?" University of California at Berkeley, October 2003.

ABOUT THE AUTHOR

Ken Orr is a Fellow of the Cutter Business Technology Council and a Senior Consultant with Cutter Consortium's Agile Project Management, Business Intelligence, Business-IT Strategies, and Enterprise Architecture practices. He is also a regular speaker at Cutter Summits and symposia. Mr. Orr is the founder of and Chief Researcher at the Ken Orr Institute, a business technology research organization. Previously, he was an Affiliate Professor and Director of the Center for the Innovative Application of Technology with the School of Technology and Information Management at Washington University. He is an internationally recognized expert on technology transfer, software engineering, information architecture, and data warehousing. Mr. Orr has more than 30 years' experience in analysis, design, project management, technology planning, and management consulting. He is the author of Structured Systems Development, Structured Requirements Definition, and The One Minute Methodology. He can be reached at korr@cutter.com.