JUNE 2012 TDWI E-BOOK
BIG DATA INTEGRATION
BIG DATA AND HADOOP: THE END OF ETL?
Q&A: CHALLENGES AND BEST PRACTICES FOR INTEGRATING BIG DATA
MORE DATA = MORE PROBLEMS
ABOUT SYNCSORT
BIG DATA AND HADOOP: THE END OF ETL?
BY STEPHEN SWOYER
MapReduce is typically seen as an enabling technology for big data analytics.

From the beginning, however, proponents have talked up another potential application for MapReduce: namely, as a means to supercharge ETL, one of the basic building blocks of data integration (DI). Does the advent of MapReduce-enabled ETL portend the end of DI as we know it? More to the point, is MapReduce ETL even up to the task of replacing its traditional DI counterpart? The short answer to this last question is—no. It isn’t.

In fact, experts say, there’s a sense in which the two technologies, as distinct approaches to ETL, can be said to complement one another. Ironically, there are likewise scenarios in which ETL—good, plain, old-fashioned, dependable ETL—can help to accelerate MapReduce performance.

The idea of MapReduce-based ETL is by no means new. Three years ago, Dan Graham, general manager for enterprise systems with data warehousing powerhouse Teradata Corp., famously described one potential use case for MapReduce as “ETL on steroids.” The idea, said Graham, was that MapReduce could accelerate—could, in fact, supercharge—certain kinds of ETL jobs.

Graham never trumpeted MapReduce as an ETL replacement, however. There’s a good reason for that, says veteran data warehouse (DW) architect Mark Madsen, a principal with consultancy Third Nature Inc.
Although MapReduce can be used to accelerate certain very specific kinds of ETL jobs, Madsen concedes, this isn’t necessarily a good idea. He uses the example of Hadoop, which comprises both an open source software (OSS) implementation of MapReduce and a distributed file system, along with an ecosystem of related OSS tools.

“Hadoop is brute force parallelism. If you can easily segregate data to each node and not have to re-sync it for another operation [by, for example,] broadcasting all the data again—then it’s fast. Usually you can’t [do that],” he explains. “So when Hadoop breaks up a program and runs it local to each node, and the data is nicely there, it’s fast, but it isn’t efficient. If you have some data such as user IDs and you want to get some other data via a look-up, you [have to] write your own join in code. There’s no facility for single-row look-up, so you
[have to] write your own joins and deal with sorting, also on your own.”

The problem, says Madsen, is that most application developers lack expertise with parallel programming or engineering. “Hadoop can’t pipeline. It does a step, then a global sync, then a step, [and so on]. ETL tools were all designed to pipeline,” he continues, arguing that “while there are Hadoop distros that do pipeline [or which] use MPI interfaces,” very few people are actually using them. The upshot, he concludes, is that “it’s incumbent upon the programmer to code it right, which is an advantage of SQL and [database] data flow architecture. Pig script is an attempt at declarative programming over H[adoop], but with the flaw that there’s no [mathematical logic] behind it, unlike SQL, therefore optimization is hard.”
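To make Madsen's point concrete, here is a minimal sketch of the kind of hand-rolled look-up join he describes, written against the standard Hadoop MapReduce API. It assumes a small user-ID-to-name table in a hypothetical users.tsv shipped to every node (via the distributed cache, for instance) and tab-delimited input with the user ID in the first field; those specifics are illustrative, not from the article.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> userNames = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Every mapper re-reads the whole look-up table into memory:
        // the "broadcast all the data again" cost Madsen describes.
        BufferedReader in = new BufferedReader(new FileReader("users.tsv"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("\t", 2); // userId <TAB> userName
            userNames.put(f[0], f[1]);
        }
        in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // No single-row look-up facility in the framework;
        // the join itself is ordinary application code.
        String[] f = value.toString().split("\t", 2);
        String name = userNames.get(f[0]);
        if (name != null) {
            context.write(new Text(f[0]), new Text(name + "\t" + f[1]));
        }
    }
}

Anything past this simple replicated join (reduce-side joins, secondary sorts, and so on) adds more plumbing of the same kind, which is exactly the burden Madsen says ETL tools and SQL engines were built to hide.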
Jorge Lopez, senior manager for data integration with Syncsort, says he isn’t sweating the idea of Hadoop or MapReduce as a replacement for ETL. Syncsort’s bread and butter, after all, is ETL: many of its competitors now field full-fledged data integration platforms—complete with data quality (DQ), master data management (MDM), and even enterprise service bus (ESB) components—but Syncsort has doubled down on ETL, choosing to emphasize the performance of its flagship DMExpress engine.

Performance matters, Lopez maintains, and Syncsort’s DMExpress is one of the fastest ETL engines on the market. In fact, he argues, DMExpress can be used to accelerate Hadoop MapReduce jobs. Syncsort even introduced DMExpress Hadoop Edition, a version of its ETL engine designed to run inside the Hadoop framework.
“You could install an instance of DMExpress on each of the [Hadoop] nodes, [and] by doing that, we would be able to ... actually shift some of the MapReduce jobs into our engine, and those [MapReduce jobs] would run significantly faster [in DMExpress] than on the Hadoop framework,” he claims, invoking the example of a “sort” function in MapReduce.

“Inside Hadoop, many of the operations involve sorting. The sort [capability] that’s included in Hadoop, the native sort, is not very scalable, and it doesn’t have a lot of functionality. If you are doing different types of sort to achieve performance or to achieve more functionality, you have to go Java,” Lopez continues, explaining that Syncsort developed a plug-in for Hadoop that permits it to call or to invoke DMExpress to perform sorts and other common transformations. “You can plug our sort capability into the Hadoop framework and you’re able to shift those MapReduce [sort] jobs into our [DMExpress] engine, and that gives you a much faster, more flexible sort.” Right now, he says, DMExpress Hadoop Edition supports sorts, aggregations, and copies, with support for joins promised for a future release.
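The article doesn't spell out how Syncsort's plug-in hooks into Hadoop, so the sketch below is only a rough illustration of where sort behavior gets customized in a stock MapReduce job: it swaps a custom raw comparator into the shuffle sort via Hadoop's standard hook. A product like DMExpress Hadoop Edition would go deeper, replacing the sort implementation itself rather than just the comparison logic.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class CustomSortSetup {
    // Compare serialized keys byte for byte, without deserializing them;
    // avoiding object creation is one classic way to speed up the sort.
    public static class RawTextComparator extends WritableComparator {
        public RawTextComparator() {
            super(Text.class);
        }
        @Override
        public int compare(byte[] b1, int s1, int l1,
                           byte[] b2, int s2, int l2) {
            return compareBytes(b1, s1, l1, b2, s2, l2);
        }
    }

    public static void configure(Job job) {
        // The framework sorts all map output with this comparator
        // during the shuffle phase.
        job.setSortComparatorClass(RawTextComparator.class);
    }
}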
Can a high-performance ETL tool really be used in this way? Third Nature’s Madsen thinks so. What’s more, he says, it isn’t a stretch at all.

“[Syncsort is] fast at sorting, and [they] can do some fast ... ETL when they’re dealing with big sets of data that can be pipelined,” he explains. “I imagine that slapping Syncsort on nodes and pumping data through them would speed up MapReduce jobs considerably when sorting, counting, and such are on the line.”

Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.
Q&A: CHALLENGES AND BEST PRACTICES FOR INTEGRATING BIG DATA

Data integration has long been a challenge for BI professionals. Now big data is adding a few wrinkles to the process. What problems do these technologies bring, what are our common misconceptions about their interaction, and what best practices can we follow to make big data integration a success? For answers, we turned to Jorge Lopez, senior manager, data integration at Syncsort.
TDWI: What data integration challenges have BI professionals faced in the past?

Jorge Lopez: Data integration tools were originally conceived as a means to easily extract information from multiple sources, transform it into consumable information, and consolidate it into one or more targets, with the objective of providing a unified and consistent version of the truth. Some of the key challenges included the ability to replace complex, hand-coded routines and point-to-point programs with easier tools that provided a simple graphical user interface to create information workflows and transformations without the need to hand code. With relatively small data volumes and relatively cheap hardware, most data integration tools focused on adding lots of functionality, which made their stacks more complex and neglected the performance aspect of ETL [extract, transform, and load].
How do you define big data, and what special challenges does it add to data integration tasks?

Globalization—along with mobile devices, social media, and the cloud—has created a world where data is being created and consumed at an unprecedented scale. Big data can be defined in terms of its three key dimensions: volume, velocity, and variety. In plain English, big data means organizations must process more data from an even wider variety of sources in less time. We see big data highlighting a fundamental issue of data integration tasks: the aspect of performance and efficiency at scale. Organizations increasingly are asking how they can transform all this new data into actionable information—and how they can do it within their budget. In the end, the question they face is: Can they afford not to do it?
What are the biggest misconceptions BI professionals have about big data as it relates to data integration?

Here are a few misconceptions that come to mind:

• Big data is only for big companies. Although few BI professionals would question the importance of big data overall, many perceive big data as an issue that only really concerns the largest organizations, such as those found in the finance and insurance industries. The truth is, although certain industries have dealt with big data challenges for a long time, mobile technologies, social media, and the cloud have turned big data into a mainstream problem. In fact, it’s not uncommon to see small and medium-sized organizations with just a few hundred employees struggling to keep up with growing data volumes and shrinking batch windows.

• Big data means all data is important. Sometimes it’s easy to get caught up in the hype about big data. However, trying to process larger data volumes can increase the amount of noise considerably, hindering your ability to uncover valuable insights. That’s why it is important to remember that not all data is created equal. Therefore, any big data strategy must include ways to efficiently and effectively process the required data while filtering out the noise.

• Big data means big costs. For many years, we’ve been told to approach data integration in certain ways. Although
many of these “best practices” may still be applicable today, most of them were not designed with performance and scalability in mind. They were not designed to tackle big data. Therefore, scaling can result in costs that may outweigh the benefits to the organization. The best example is staging data when joining heterogeneous data sources. This practice alone not only increases the complexity of DI environments, but also adds millions of dollars a year in database costs just to keep the lights on. The reality is that big data doesn’t have to call for big budgets; instead, it calls for a new approach.
What are some of the common mistakes BI professionals make in trying to successfully integrate big data?

Regardless of any challenges they face, IT departments must comply with the established SLAs [service-level agreements]. Therefore, it’s easy to fall prey to suboptimal approaches that might work in the short term but often prove to be extremely expensive, unproductive, and unsustainable in the long term.
A common workaround is fine-tuning their most expensive transformations—a continual effort to reclaim performance. It’s a task that consumes some of the most experienced and highly skilled resources for no net gain. For example, I recently spoke to a DI developer from an entertainment software company. They spent two weeks developing an ETL job and four weeks tuning it to run 40 percent faster! In the end, the job was still taking 2.5 days to process, which resulted in obsolete data.

When tuning fails to solve the problem, many IT organizations try adding more hardware, which can result in enormous clusters of servers to get the job done. Unfortunately, in most cases, data is growing much faster than computing power, so they are fighting a losing battle. More hardware also brings other headaches—such as the need for more data center
space, power and cooling, maintenance, software licenses, hardware failures, and so on. After a while, people realize they cannot hardware their way out of this problem. That’s when they take matters into their own hands, hand-coding SQL routines and pushing transformations out of the ETL tool and into the database. At this point, it’s like taking a huge step back in time—data lineage is lost, the database is overloaded, and costs and complexity go through the roof.
What best practices can you suggest to overcome these mistakes?

I believe there is more than one solution to any problem. However, any solution to successfully integrate big data must include four fundamental principles.

First, organizations must think about performance in strategic, rather than tactical, terms. This requires organizations to take a proactive rather than reactive approach. Performance and scalability should be at the core of any decision throughout the entire development cycle, from inception to evaluation to development and even ongoing maintenance. To do this, organizations need to attack the root of the problem with tools that are specifically designed for performance.

Second, organizations must improve the efficiency of the DI architecture. The objective here is to optimize hardware resource utilization in order to minimize infrastructure costs and complexity.

Third, organizations need tools that are simple and easy to use. This simplicity should also permeate all phases of the development cycle, but it is particularly critical for ongoing maintenance and tunability.

Finally, reducing costs and complexity must be an intrinsic part of the solution. Ultimately, a combination of performance, efficiency, and productivity will help organizations reduce costs significantly, so they can leverage big data for competitive advantage.
Where does Hadoop fit within this picture?

Hadoop is definitely playing an emerging role as one of the most viable solutions for big data. It’s a new approach that
promises to provide the performance and scalability needed to tackle big data by distributing work among relatively cheap commodity servers through an open source framework. Hadoop has tremendous potential, and its growing success is catching the attention of many of our customers.

However, organizations are still facing two major challenges with Hadoop. The first challenge is to minimize hardware resource utilization. Although servers might be relatively cheap, they still elevate capital costs, as well as operational costs, due to hardware maintenance, cooling, and power and data center costs. Second, Hadoop is still very difficult to develop. Among other things, coding MapReduce jobs and tuning sort operations requires very specific skills that are both expensive and hard to find. Using the criteria I provided in response to your previous question, it is clear that there is still a lot of work to be done before Hadoop becomes a mainstream solution for most companies.
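To give a flavor of the sort tuning Lopez mentions, the sketch below sets a few of the classic Hadoop 1.x sort parameters from Java. The property keys are the standard Hadoop configuration names of that era; the values are illustrative assumptions, not recommendations.

import org.apache.hadoop.conf.Configuration;

public class SortTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 256);          // memory buffer for sorting map output
        conf.setInt("io.sort.factor", 64);       // spill files merged in a single pass
        conf.setFloat("io.sort.spill.percent", 0.90f); // buffer fill level that triggers a spill
        conf.setBoolean("mapred.compress.map.output", true); // shrink shuffle traffic
        return conf;
    }
}

Getting values like these right for a given job and cluster is exactly the kind of specialized, trial-and-error work that Lopez argues is expensive and hard to staff.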
What products or services does Syncsort offer to integrate big data?

Syncsort has been managing big data volumes for more than 40 years—decades before the term big data came into fashion. Syncsort’s DMExpress high-performance data integration solutions have been used to integrate, optimize, and migrate data in thousands of applications around the world. Our solutions include high-performance ETL and Hadoop Optimization, among others.

DMExpress is fundamentally different from other data integration tools because it is based on four core principles: performance, efficiency, productivity, and cost. Customers using DMExpress typically experience up to 10x faster elapsed processing times than conventional DI tools, consume up to 75 percent less CPU and memory, and see TCO that is as much as 65 percent lower. Moreover, DMExpress is completely self-tuning and does not require database staging, which increases productivity and frees up valuable IT resources. I would encourage IT professionals to check out the Forrester Total Economic Impact study and online calculator at www.syncsort.com/tei, where they can learn more about the benefits of DMExpress and estimate their own ROI.
MORE DATA = MORE PROBLEMS
BY STEPHEN SWOYER
Although big data can bring big benefits for data analytics, BI professionals are finally taking to heart what IT has been warning about for years: big data can lead to big problems, especially when it comes to performance. On top of that, the more data you have, the more challenging the data integration.
Data integration stalwart Syncsort certainly knows this. The company’s reputation rests on its DMExpress offering, which it bills as one of the fastest ETL engines on the market. That’s any market, mind you, because Syncsort now offers a version of DMExpress for Hadoop, too. Company officials make a compelling case that one of the most hyped tools in the big data tool chest—MapReduce—could benefit from some honest-to-goodness ETL-style acceleration.

So, too, could traditional DI, says Jorge Lopez, senior manager for data integration with Syncsort. In fact, Lopez argues, most shops are still struggling with traditional data integration problems. They haven’t even started to wrap their heads around big data, which ups the ante with respect to the volume, variety, and (yes) velocity of data.

“We know that data integration tools as we know them today are failing, but why are they failing? We believe it’s because they tend to focus on the wrong things,” said Lopez, in an interview last autumn. He cited a Syncsort survey in which a surprising number of respondents—nearly one-third—said they were still using hand-coded or homegrown DI. The reason for this, according to Lopez and Syncsort, has a lot to do with performance: DI vendors don’t talk much, if at all, about ETL performance, and virtually no one—aside from Syncsort, it seems—emphasizes ETL performance as a differentiator.

Increasingly, in fact, ETL [extract, transform, and load] has given way to ELT: extract, load, and transform. It’s a crucial transposition, Lopez argues, because ELT effectively shifts the work of transformation—which was traditionally performed by the ETL engine itself—onto the database. This isn’t necessarily a good idea, Lopez argues: depending on the kinds of transformations you’re doing, it can drastically impede performance.

It can likewise prove to be prohibitively expensive, especially if you’re using costly data warehouse resources to perform ETL transformations. Lopez likens it to a misapplication—if not to a misallocation—of premium performance.
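As a hedged sketch of the transposition Lopez describes: in ELT, rows are loaded raw and the transformation runs as SQL inside the warehouse, so the warehouse pays for the work. The JDBC connection URL, table names, and columns below are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EltExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical warehouse connection; URL and schema are illustrative.
        Connection dw = DriverManager.getConnection("jdbc:postgresql://dw/warehouse");
        Statement stmt = dw.createStatement();
        // ELT: the transform executes as SQL inside the warehouse,
        // consuming its (expensive) cycles instead of the ETL engine's.
        stmt.executeUpdate(
            "INSERT INTO sales_fact (day, store_id, total) " +
            "SELECT CAST(sold_at AS DATE), store_id, SUM(amount) " +
            "FROM staging_sales GROUP BY CAST(sold_at AS DATE), store_id");
        stmt.close();
        dw.close();
    }
}

An ETL engine would instead perform that grouping and summing on its own hardware and deliver only finished rows to the target, which is the division of labor Lopez says the tools were meant to preserve.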
“This is true of Teradata customers, where they can scale very well [to perform this kind of ETL workload], but that takes a lot of money and resources,” he maintains. “[Instead,] the idea is to basically encourage [customers to] realize that ETL tools were meant to do that work in the first place, not just by pushing it to the database, but by providing the performance and the scalability that organizations need to leverage big data.”

If anything, Lopez continues, big data should help bring this problem into focus. “One of the reasons we are seeing this is precisely because of big data. Before, you would have the large data volumes, but not the pressing requirements in terms of batch windows. Or maybe you’d have a very short batch window, but the data volumes were not as big. Or you didn’t have as many sources. Now the combination of these three things is overwhelming,” he argues. “In some cases, particularly with big data, the challenges that organizations face are actually highly specialized data integration jobs, but they don’t realize that’s what they are.
They involve sorting, joining, aggregating, copying, [or] moving data from one place to another. This is data integration; they just don’t realize it.”

Increasingly, Lopez suggests, shops are falling back on less than ideal practices (such as hand coding) to tackle these problems. From a data management (DM) perspective, hand coding used to mean SQL: lots and lots of SQL. Big data changes that. In the case of projects that involve Hadoop—an open source software (OSS) framework that bundles a MapReduce implementation, a distributed file system, and a host of other amenities—SQL is effectively a third-rate language. After all, Hadoop developers tend to code in procedural programming languages such as Java or C++.

So while MapReduce could be used to perform some of the required transformations, this isn’t always—or even mostly—a good idea. At the very least, effectively parallelizing DI transformations across a Hadoop cluster will require specialized programming knowledge: the more sophisticated or unusual the transformations, the more specialized the programming knowledge. “The difficulty is that a tool like [Hadoop] is [only] as good as the developer who writes the code,” observes veteran industry watcher Mark Madsen, a principal with consultancy Third Nature Inc.

The best way to tackle the integration heavy lifting associated with big data and Hadoop is also the oldest—or most time-honored—of ways, Lopez argues: traditional, batch-oriented ETL. In practice, he claims, not all ETL tools are up to the task. After all, if they weren’t getting the job done before now, how can they be expected to scale to support the volumes—to say nothing of the varieties and velocities—associated with the integration or movement of big data?

“Big data means that ETL tools now have to process more data in less time and with minimum costs to the organization,” he indicates. “The problem is that most DI platforms generally struggle to meet the demands of the business precisely in terms of scalability and performance, and this is something we see every day: cases where data integration limits an organization’s ability to execute strategic marketing campaigns or even to deploy new products.”
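To make Lopez's "this is data integration" point concrete: even a routine aggregation, a sum per key, becomes a hand-written parallel program in Hadoop. Below is a minimal sketch against the standard MapReduce API; the tab-delimited record layout and field positions are hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SumByKey {
    // Map: emit (key, amount) pairs from tab-delimited records.
    public static class SumMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable k, Text v, Context ctx)
                throws IOException, InterruptedException {
            String[] f = v.toString().split("\t");
            ctx.write(new Text(f[0]), new LongWritable(Long.parseLong(f[1])));
        }
    }

    // Reduce: the framework has already sorted and grouped by key;
    // the "aggregate" half of the job is just a running sum.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text k, Iterable<LongWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : vals) {
                sum += v.get();
            }
            ctx.write(k, new LongWritable(sum));
        }
    }
}

The sorting, grouping, and copying happen implicitly in the framework's shuffle; what a DI tool would treat as a single drag-and-drop transformation here spans two classes plus a job driver and cluster configuration.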
Even though big data is still in its infancy, its analytic aspect has the potential to transform how organizations conduct (as well as understand) their strategic marketing campaigns or product development efforts. Syncsort is betting that the big data phenomenon will likewise invite shops to reconsider how they do DI. That’s one reason it markets ETL offerings designed for both conventional batch (DMExpress) and big data (DMExpress Hadoop Edition) integration scenarios.
Third Nature’s Madsen thinks pitching DMExpress as a big data accelerator isn’t a bad idea. The only rub, he points out, is cost, particularly given Hadoop’s OSS pedigree. “Many people are in [Hadoop] because it’s free-ish, whereas these commercial products aren’t inexpensive,” he observes. “But there is an opportunity in speed-up because Hadoop people don’t get ... [how inadequate the] architecture [is] for much [of this data integration] work.

“So you drop a Syncsort in and it runs local [on a Hadoop node] and sucks up those big files that have to be read sequentially, and it can run a series of local operations very quickly, far better than the average coder. So long as the overall job isn’t stupid about constant global syncs [i.e., it performs reduce and map operations in sequence], the speedup can be a lot for the job steps and therefore the jobs.”

Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.
ABOUT SYNCSORT

Syncsort is a leading provider of high-performance data integration and sort solutions ideally suited for today’s complex big data environments. For more than a decade, Syncsort DMExpress® software has been used to integrate, optimize, and migrate data in thousands of applications around the world. DMExpress solutions are proven to be fast, efficient, simple, and cost-effective when compared to conventional data integration solutions. A Forrester Research Total Economic Impact™ study of DMExpress identified an ROI of over 220% and a payback period of 9 months. Learn what DMExpress can do for you at www.syncsort.com/tei.
TDWI, a division of 1105 Media, Inc., is the premier provider of in-depth, high-quality education and research in the business intelligence and data warehousing industry. TDWI is dedicated to educating business and information technology professionals about the best practices, strategies, techniques, and tools required to successfully design, build, maintain, and enhance business intelligence and data warehousing solutions. TDWI also fosters the advancement of business intelligence and data warehousing research and contributes to knowledge transfer and the professional development of its members. TDWI offers a worldwide membership program, five major educational conferences, topical educational seminars, role-based training, onsite courses, certification, solution provider partnerships, an awards program for best practices, live Webinars, resourceful publications, an in-depth research program, and a comprehensive Web site, tdwi.org.
© 2012 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.