A quarterly journal
2012 Issue 1
Reshaping the workforce with the new analytics
Mike Driscoll, CEO, Metamarkets
06 The third wave of customer analytics
30 The art and science of new analytics technology
44 Natural language processing and social media intelligence
58 Building the foundation for a data science culture
Acknowledgments
Advisory Principal & Technology Leader: Tom DeGarmo
US Thought Leadership Partner-in-Charge: Tom Craren
Strategic Marketing: Natalie Kontra, Jordana Marx
Center for Technology & Innovation
Managing Editor: Bo Parker
Editors: Vinod Baya, Alan Morrison
Contributors: Galen Gruman, Steve Hamby and Orbis Technologies, Bud Mathaisel, Uche Ogbuji, Bill Roberts, Brian Suda
Editorial Advisors: Larry Marion
Copy Editor: Lea Anne Bantsari
Transcriber: Dawn Regan
US studio
Design Lead: Tatiana Pechenik
Designer: Peggy Fresenburg
Illustrators: Don Bernhardt, James Millefolie
Production: Jeff Ginsburg
Online
Managing Director, Online Marketing: Jack Teuber
Designer and Producer: Scott Schmidt
Animator: Roger Sano
Reviewers: Jeff Auker, Ken Campbell, Murali Chilakapati, Oliver Halter, Matt Moore, Rick Whitney
Special thanks: Cate Corcoran, WIT Strategy; Nisha Pathak, Metamarkets; Lisa Sheeran, Sheeran/Jager Communication
Industry perspectives
During the preparation of this publication, we benefited greatly from interviews and conversations with the following executives:
Kurt J. Bilafer, Regional Vice President, Analytics, Asia Pacific Japan, SAP
Jonathan Chihorek, Vice President, Global Supply Chain Systems, Ingram Micro
Zach Devereaux, Chief Analyst, Nexalogy Environics
Mike Driscoll, Chief Executive Officer, Metamarkets
Elissa Fink, Chief Marketing Officer, Tableau Software
Kaiser Fung, Adjunct Professor, New York University
Kent Kushar, Chief Information Officer, E. & J. Gallo Winery
Josée Latendresse, Owner, Latendresse Groupe Conseil
Mario Leone, Chief Information Officer, Ingram Micro
Jock Mackinlay, Director, Visual Analysis, Tableau Software
Jonathan Newman, Senior Director, Enterprise Web & EMEA eSolutions, Ingram Micro
Ashwin Rangan, Chief Information Officer, Edwards Lifesciences
Seth Redmore, Vice President, Marketing and Product Management, Lexalytics
Vince Schiavone, Co-founder and Executive Chairman, ListenLogic
Jon Slade, Global Online and Strategic Advertising Sales Director, Financial Times
Claude Théoret, President, Nexalogy Environics
Saul Zambrano, Senior Director, Customer Energy Solutions, Pacific Gas & Electric
The right data + the right resolution = a new culture of inquiry
Tom DeGarmo US Technology Consulting Leader thomas.p.degarmo@us.pwc.com
Message from the editor
James Balog1 may have more influence on the global warming debate than any scientist or politician. By using time-lapse photographic essays of shrinking glaciers, he brings art and science together to produce striking visualizations of real changes to the planet. In 60 seconds, Balog shows changes to glaciers that take place over a period of many years—introducing forehead-slapping insight to a topic that can be as difficult to see as carbon dioxide. Part of his success can be credited to creating the right perspective. If the photographs had been taken too close to or too far away from the glaciers, the insight would have been lost. Data at the right resolution is the key. Glaciers are immense, at times more than a mile deep. Amyloid particles that are the likely cause of Alzheimer’s
disease sit at the other end of the size spectrum. Scientists’ understanding of the role of amyloid particles in Alzheimer’s has relied heavily on technologies such as scanning tunneling microscopes.2 These devices generate visual data at sufficient resolution so that scientists can fully explore the physical geometry of amyloid particles in relation to the brain’s neurons. Once again, data at the right resolution together with the ability to visually understand a phenomenon are moving science forward. Science has long focused on data-driven understanding of phenomena. It’s called the scientific method. Enterprises also use data for the purposes of understanding their business outcomes and, more recently, the effectiveness and efficiency of their business processes. But because running a business is not the same as running a science experiment,
there has long been a divergence between analytics as applied to science and the methods and processes that define analytics in the enterprise. This difference partly has been a question of scale and instrumentation. Even a large science experiment (setting aside the Large Hadron Collider) will introduce sufficient control around the inquiry of interest to limit the amount of data collected and analyzed. Any large enterprise comprises tens of thousands of moving parts, from individual employees to customers to suppliers to products and services. Measuring and retaining the data on all aspects of an enterprise over all relevant periods of time are still extremely challenging, even with today’s IT capacities. But targeting the most important determinants of success in an enterprise context for greater instrumentation— often customer information—can be and is being done today. And with Moore’s Law continuing to pay dividends, this instrumentation will expand in the future. In the process, and with careful attention to the appropriate resolution of the data being collected, enterprises that have relied entirely on the art of management will increasingly blend in the science of advanced analytics. Not surprisingly, the new role emerging in the enterprise to support these efforts is often called a “data scientist.” This issue of the Technology Forecast examines advanced analytics through this lens of increasing instrumentation. PwC’s view is that the flow of data at this new, more complete level of resolution travels in an arc beginning
with big data techniques (including NoSQL and in-memory databases), through advanced statistical packages (from the traditional SPSS and SAS to open source offerings such as R), to analytic visualization tools that put interactive graphics in the control of business unit specialists. This arc is positioning the enterprise to establish a new culture of inquiry, where decisions are driven by analytical precision that rivals scientific insight. The first article, “The third wave of customer analytics,” on page 06 reviews the impact of basic computing trends on emerging analytics technologies. Enterprises have an unprecedented opportunity to reshape how business gets done, especially when it comes to customers. The second article, “The art and science of new analytics technology,” on page 30 explores the mix of different techniques involved in making the insights gained from analytics more useful, relevant, and visible. Some of these techniques are clearly in the data science realm, while others are more art than science. The third article, “Natural language processing and social media intelligence,” on page 44 reviews many different language analytics techniques in use for social media and considers how combinations of these can be most effective. “How CIOs can build the foundation for a data science culture” on page 58 considers new analytics as an unusually promising opportunity for CIOs. In the best-case scenario, the IT organization can become the go-to group, and the CIO can become the true information leader again.
This issue also includes interviews with executives who are using new analytics technologies and with subject matter experts who have been at the forefront of development in this area:
• Mike Driscoll of Metamarkets considers how NoSQL and other analytics methods are improving query speed and providing greater freedom to explore.
• Jon Slade of the Financial Times (FT.com) discusses the benefits of cloud analytics for online ad placement and pricing.
• Jock Mackinlay of Tableau Software describes the techniques behind interactive visualization and how more of the workforce can become engaged in analytics.
• Ashwin Rangan of Edwards Lifesciences highlights new ways that medical devices can be instrumented and how new business models can evolve.
Please visit pwc.com/techforecast to find these articles and other issues of the Technology Forecast online. If you would like to receive future issues of this quarterly publication as a PDF attachment, you can sign up at pwc.com/techforecast/subscribe. As always, we welcome your feedback and your ideas for future research and analysis topics to cover.
1 http://www.jamesbalog.com/.
2 Davide Brambilla, et al., “Nanotechnologies for Alzheimer’s disease: diagnosis, therapy, and safety issues,” Nanomedicine: Nanotechnology, Biology and Medicine 7, no. 5 (2011): 521–540.
Bahrain World Trade Center gets approximately 15% of its power from these wind turbines
The third wave of customer analytics
These days, there’s only one way to scale the analysis of customer-related information to increase sales and profits—by tapping the data and human resources of the extended enterprise.
By Alan Morrison and Bo Parker
As director of global online and strategic advertising sales for FT.com, the online face of the Financial Times, Jon Slade says he “looks at the 6 billion ad impressions [that FT.com offers] each year and works out which one is worth the most for any particular client who might buy.” This activity previously required labor-intensive extraction methods from a multitude of databases and spreadsheets. Slade made the process much faster and vastly more effective after working with Metamarkets, a company that offers a cloud-based, in-memory analytics service called Druid. “Before, the sales team would send an e-mail to ad operations for an inventory forecast, and it could take a minimum of eight working hours and as long as two business days to get an answer,” Slade says. Now, with a direct interface to the data, it takes a mere eight seconds, freeing up the ad operations team to focus on more
strategic issues. The parallel processing, in-memory technology, the interface, and many other enhancements led to better business results, including double-digit growth in ad yields and a 15 to 20 percent improvement in the accuracy of its ad impression supply metrics. The technology trends behind FT.com’s improvements in advertising operations—more accessible data; faster, less-expensive computing; new software tools; and improved user interfaces—are driving a new era in analytics use at large companies around the world, in which enterprises make decisions with a precision comparable to scientific insight. The new analytics uses a rigorous scientific method, including hypothesis formation and testing, with science-oriented statistical packages and visualization tools. It is spawning business unit “data scientists” who are replacing the centralized analytics units of the past. These trends will accelerate, and business leaders
Figure 1: How better customer analytics capabilities are affecting enterprises
More computing speed, storage, and ability to scale leads to more time and better tools, more data sources, more focus on key metrics, and better access to results. These in turn lead to a broader culture of inquiry, which leads to less guesswork, less bias, more awareness, and better decisions.
Processing power and memory keep increasing, the ability to leverage massive parallelization continues to expand in the cloud, and the cost per processed bit keeps falling. Data scientists are seeking larger data sets and iterating more to refine their questions and find better answers. Visualization capabilities and more intuitive user interfaces are making it possible for most people in the workforce to do at least basic exploration. Social media data is the most prominent example of the many large data clouds emerging that can help enterprises understand their customers better. These clouds augment data that business units have direct access to internally now, which is also growing. A core single metric can be a way to rally the entire organization’s workforce, especially when that core metric is informed by other metrics generated with the help of effective modeling. Whether an enterprise is a gaming or an e-commerce company that can instrument its own digital environment, or a smart grid utility that generates, slices, dices, and shares energy consumption analytics for its customers and partners, better analytics are going direct to the customer as well as other stakeholders. And they’re being embedded where users can more easily find them. Visualization and user interface improvements have made it possible to spread ad hoc analytics capabilities across the workplace to every user role. At the same time, data scientists—people who combine a creative ability to generate useful hypotheses with the savvy to simulate and model a business as it’s changing—have never been in more demand than now. The benefits of a broader culture of inquiry include new opportunities, a workforce that shares a better understanding of customer needs to be able to capitalize on the opportunities, and reduced risk. Enterprises that understand the trends described here and capitalize on them will be able to change company culture and improve how they attract and retain customers.
who embrace the new analytics will be able to create cultures of inquiry that lead to better decisions throughout their enterprises. (See Figure 1.) This issue of the Technology Forecast explores the impact of the new analytics and this culture of inquiry. This first article examines the essential ingredients of the new analytics, using several examples. The other articles
in this issue focus on the technologies behind these capabilities (see the article, “The art and science of new analytics technology,” on page 30) and identify the main elements of a CIO strategic framework for effectively taking advantage of the full range of analytics capabilities (see the article, “How CIOs can build the foundation for a data science culture,” on page 58).
More computing speed, storage, and ability to scale
Basic computing trends are providing the momentum for a third wave in analytics that PwC calls the new analytics. Processing power and memory keep increasing, the ability to leverage massive parallelization continues to expand in the cloud, and the cost per processed bit keeps falling. FT.com benefited from all of these trends. Slade needs multiple computer screens on his desk just to keep up. His job requires a deep understanding of the readership and which advertising suits them best. Ad impressions—appearances of ads on web pages—are the currency of high-volume media industry websites. The impressions need to be priced based on the reader segments most likely to see them and click through. Chief executives in France, for example, would be a reader segment FT.com would value highly. “The trail of data that users create when they look at content on a website like ours is huge,” Slade says. “The real challenge has been trying to understand what information is useful to us and what we do about it.” FT.com’s analytics capabilities were a challenge, too. “The way that data was held—the demographics data, the behavior data, the pricing, the available inventory—was across lots of different databases and spreadsheets,” Slade says. “We needed an almost witchcraft-like algorithm to provide answers to ‘How many impressions do I have?’ and ‘How much should I charge?’ It was an extremely labor-intensive process.” FT.com saw a possible solution when it first talked to Metamarkets about an initial concept, which evolved as they collaborated. Using Metamarkets’ analytics platform, FT.com could quickly iterate and investigate numerous questions to improve its
decision-making capabilities. “Because our technology is optimized for the cloud, we can harness the processing power of tens, hundreds, or thousands of servers depending on our customers’ data and their specific needs,” states Mike Driscoll, CEO of Metamarkets. “We can ask questions over billions of rows of data in milliseconds. That kind of speed combined with data science and visualization helps business users understand and consume information on top of big data sets.” Decades ago, in the first wave of analytics, small groups of specialists managed computer systems, and even smaller groups of specialists looked for answers in the data. Businesspeople typically needed to ask the specialists to query and analyze the data. As enterprise data grew, collected from enterprise resource planning (ERP) systems and other sources, IT stored the more structured data in warehouses so analysts could assess it in an integrated form. When business units began to ask for reports from collections of data relevant to them, data marts were born, but IT still controlled all the sources. The second wave of analytics saw variations of centralized top-down data collection, reporting, and analysis. In the 1980s, grassroots decentralization began to counter that trend as the PC era ushered in spreadsheets and other methods that quickly gained widespread use—and often a reputation for misuse. Data warehouses and marts continue to store a wealth of helpful data. In both waves, the challenge for centralized analytics was to respond to business needs when the business units themselves weren’t sure what findings they wanted or clues they were seeking. The third wave does that by giving access and tools to those who act on the findings. New analytics taps the expertise of the broad business
Figure 2: The three waves of analytics and the impact of decentralization
Cloud computing accelerates decentralization of the analytics function. The trend toward decentralization runs from analytics generated by central IT to self-service, then to data in the cloud, and now to cloud co-creation.
Analytics functions in enterprises were all centralized in the beginning, but not always responsive to business needs. PCs and then the web and an increasingly interconnected business ecosystem have provided more responsive alternatives.
ecosystem to address the lack of responsiveness from central analytics units. (See Figure 2.) Speed, storage, and scale improvements, with the help of cloud co-creation, have made this decentralized analytics possible. The decentralized analytics innovation has evolved faster than the centralized variety, and PwC expects this trend to continue. “In the middle of looking at some data, you can change your mind about what question you’re asking. You need to be able to head toward that new question on the fly,” says Jock Mackinlay, director of visual analysis at Tableau Software, one of the vendors of the new visualization front ends for analytics. “No automated system is going to keep up with the stream of human thought.”
The trend toward decentralization continues as business units, customers, and other stakeholders collaborate to diagnose and work on problems of mutual interest in the cloud.
More time and better tools
Big data techniques—including NoSQL1 and in-memory databases, advanced statistical packages (from SPSS and SAS to open source offerings such as R), visualization tools that put interactive graphics in the control of business unit specialists, and more intuitive user interfaces—are crucial to the new analytics. They make it possible for many people in the workforce to do some basic exploration. They allow business unit data scientists to use larger data sets and to iterate more as they test hypotheses, refine questions, and find better answers to business problems. Data scientists are nonspecialists who follow a scientific method of iterative and recursive analysis with a practical result in mind. Even without formal training, some business users in finance, marketing, operations, human capital, or other departments
Case study
How the E. & J. Gallo Winery matches outbound shipments to retail customers
E. & J. Gallo Winery, one of the world’s largest producers and distributors of wines, recognizes the need to precisely identify its customers for two reasons: some local and state regulations mandate restrictions on alcohol distribution, and marketing brands to individuals requires knowing customer preferences. “The majority of all wine is consumed within four hours and five miles of being purchased, so this makes it critical that we know which products need to be marketed and distributed by specific destination,” says Kent Kushar, Gallo’s CIO. Gallo knows exactly how its products move through distributors, but tracking beyond them is less clear. Some distributors are state liquor control boards, which supply the wine products to retail outlets and other end customers. Some sales are through military post exchanges, and in some cases there are restrictions and regulations because they are offshore. Gallo has a large compliance department to help it manage the regulatory environment in which Gallo products are sold, but Gallo wants to learn more about the customers who eventually buy and consume those products, and to learn from them information that helps it create new products tailored to local tastes. Gallo sometimes cannot obtain point-of-sale data from retailers to complete the match of what goes out to what is sold. Syndicated data, from sources such as Information Resources, Inc. (IRI), serves as the matching link between distribution and actual consumption. This results in the accumulation of more than 1GB of data each day as source information for compliance and marketing.
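The matching Kushar describes, linking what goes out the door to what syndicated sources report as sold, can be pictured with a small, hypothetical sketch. The tables, columns, and figures below are invented for illustration and assume the pandas library; they are not Gallo's systems or data.

```python
# Sketch: match outbound shipments (by brand and region) to syndicated
# retail sales data so distribution can be compared with actual consumption.
# All values below are invented for illustration.
import pandas as pd

shipments = pd.DataFrame({
    "brand":   ["BrandA", "BrandA", "BrandB"],
    "region":  ["CA", "TX", "CA"],
    "cases_shipped": [1200, 800, 500],
})

syndicated_sales = pd.DataFrame({          # IRI-style store movement data
    "brand":   ["BrandA", "BrandA", "BrandB"],
    "region":  ["CA", "TX", "CA"],
    "cases_sold": [1100, 760, 520],
})

# Join distribution to consumption and compute a simple sell-through rate.
matched = shipments.merge(syndicated_sales, on=["brand", "region"], how="left")
matched["sell_through_%"] = 100 * matched["cases_sold"] / matched["cases_shipped"]
print(matched)
```

A production pipeline would join on much finer keys (product, retailer, time period) and feed both the compliance and marketing uses described above.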
Years ago, Gallo’s senior management understood that customer analytics would be increasingly important. The company’s most recent investments are extensions of what it wanted to do 25 years ago but was limited by availability of data and tools. Since 1998, Gallo IT has been working on advanced data warehouses, analytics tools, and visualization. Gallo was an early adopter of visualization tools and created IT subgroups within brand marketing to leverage the information gathered. The success of these early efforts has spurred Gallo to invest even more in analytics. “We went from step function growth to logarithmic growth of analytics; we recently reinvested heavily in new appliances, a new system architecture, new ETL [extract, transform, and load] tools, and new ways our SQL calls were written; and we began to coalesce unstructured data with our traditional structured consumer data,” says Kushar. “Recognizing the power of these capabilities has resulted in our taking a 10-year horizon approach to analytics,” he adds. “Our successes with analytics to date have changed the way we think about and use analytics.” The result is that Gallo no longer relies on a single instance database, but has created several large purpose-specific databases. “We have also created new service level agreements for our internal customers that give them faster access and more timely analytics and reporting,” Kushar says. Internal customers for Gallo IT include supply chain, sales, finance, distribution, and the web presence design team.
already have the skills, experience, and mind-set to be data scientists. Others can be trained. The teaching of the discipline is an obvious new focus for the CIO. (See the article, “How CIOs can build the foundation for a data science culture,” on page 58.) Visualization tools have been especially useful for Ingram Micro, a technology products distributor, which uses them to choose optimal warehouse locations around the globe. Warehouse location is a strategic decision, and Ingram Micro can run many what-if scenarios before it decides. One business result is shorter-term warehouse leases that give Ingram Micro more flexibility as supply chain requirements shift due to cost and time.
1 See “Making sense of Big Data,” Technology Forecast 2010, Issue 3, http://www.pwc.com/us/en/technologyforecast/2010/issue3/index.jhtml, for more information on Hadoop and other NoSQL databases.
“Ensuring we are at the efficient frontier for our distribution is essential in this fast-paced and tight-margin business,” says Jonathan Chihorek, vice president of global supply chain systems at Ingram Micro. “Because of the complexity, size, and cost consequences of these warehouse location decisions, we run extensive models of where best to locate our distribution centers at least once a year, and often twice a year.” Modeling has become easier thanks to mixed-integer linear programming optimization tools that crunch large and diverse data sets encompassing many factors. “A major improvement came from the use of fast 64-bit processors and solid-state drives that reduced scenario run times from six to eight hours down to a fraction of that,” Chihorek says. “Another breakthrough for us has been improved visualization tools, such as spider and bathtub diagrams that help our analysts choose the efficient frontier curve from a complex array of data sets that otherwise look like lists of numbers.”
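As a rough illustration of the kind of model Chihorek describes, the sketch below sets up a miniature facility-location problem as a mixed-integer linear program. It is a minimal example under invented data, not Ingram Micro's model: the candidate sites, demand figures, and costs are made up, and the open source PuLP library is assumed.

```python
# Minimal facility-location sketch (hypothetical data) using the PuLP library.
# Decide which candidate warehouses to open and how to serve regional demand at lowest cost.
import pulp

sites = {"Memphis": 90000, "Reno": 120000, "Columbus": 80000}   # annual fixed lease/operating cost
regions = {"Northeast": 240, "Southeast": 180, "West": 300}     # demand in truckloads
ship_cost = {                                                   # cost per truckload shipped
    ("Memphis", "Northeast"): 410, ("Memphis", "Southeast"): 220, ("Memphis", "West"): 650,
    ("Reno", "Northeast"): 780, ("Reno", "Southeast"): 700, ("Reno", "West"): 150,
    ("Columbus", "Northeast"): 250, ("Columbus", "Southeast"): 340, ("Columbus", "West"): 690,
}

prob = pulp.LpProblem("warehouse_location", pulp.LpMinimize)
open_site = pulp.LpVariable.dicts("open", sites, cat="Binary")
flow = pulp.LpVariable.dicts("flow", ship_cost, lowBound=0)

# Objective: fixed costs of opened warehouses plus shipping costs.
prob += pulp.lpSum(sites[s] * open_site[s] for s in sites) + \
        pulp.lpSum(ship_cost[k] * flow[k] for k in ship_cost)

# Each region's demand must be met; a site can ship only if it is open.
for r in regions:
    prob += pulp.lpSum(flow[(s, r)] for s in sites) >= regions[r]
for s in sites:
    prob += pulp.lpSum(flow[(s, r)] for r in regions) <= open_site[s] * sum(regions.values())

prob.solve()
for s in sites:
    print(s, "open" if open_site[s].value() == 1 else "closed")
```

Running the what-if scenarios mentioned above then amounts to re-solving the same model with different demand patterns, lease costs, or transportation assumptions.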
Analytics tools were once the province of experts. They weren’t intuitive, and they took a long time to learn. Those who were able to use them tended to have deep backgrounds in mathematics, statistical analysis, or some scientific discipline. Only companies with dedicated teams of specialists could make use of these tools. Over time, academia and the business software community have collaborated to make analytics tools more user-friendly and more accessible to people who aren’t steeped in the mathematical expressions needed to query and get good answers from data. Products from QlikTech, Tableau Software, and others immerse users in fully graphical environments because most people gain understanding more quickly from visual displays of numbers rather than from tables. “We allow users to get quickly to a graphical view of the data,” says Tableau Software’s Mackinlay. “To begin with, they’re using drag and drop for the fields in the various blended data sources they’re working with. The software interprets the drag and drop as algebraic expressions, and that gets compiled into a query database. But users don’t need to know all that. They just need to know that they suddenly get to see their data in a visual form.” Tableau Software itself is a prime example of how these tools are changing the enterprise. “Inside Tableau we use Tableau everywhere, from the receptionist who’s keeping track of conference room utilization to the salespeople who are monitoring their pipelines,” Mackinlay says. These tools are also enabling more finance, marketing, and operational executives to become data scientists, because they help them navigate the data thickets.
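Mackinlay's description of drag-and-drop gestures being interpreted as expressions that compile into a database query can be illustrated with a deliberately tiny sketch. This is not Tableau's implementation; the field names and roles are invented, and the translation shown is far simpler than what a real visual analysis tool generates.

```python
# Toy sketch: translate drag-and-drop field placements into a SQL aggregate query.
# Field roles and names are hypothetical.
DIMENSIONS = {"region", "segment", "month"}   # categorical fields
MEASURES = {"sales", "ad_impressions"}        # numeric fields, aggregated by default

def build_query(dropped_on_rows, dropped_on_columns, table="fact_orders"):
    dims = [f for f in dropped_on_rows + dropped_on_columns if f in DIMENSIONS]
    meas = [f for f in dropped_on_rows + dropped_on_columns if f in MEASURES]
    select = dims + [f"SUM({m}) AS {m}" for m in meas]
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if dims:
        sql += f" GROUP BY {', '.join(dims)}"
    return sql

# A user drags 'region' to rows and 'sales' to columns and immediately sees a chart;
# behind the scenes the gesture becomes a query like this:
print(build_query(["region"], ["sales"]))
# SELECT region, SUM(sales) AS sales FROM fact_orders GROUP BY region
```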
Figure 3: Improving the signal-to-noise ratio in social media monitoring
Social media is a high-noise environment, but there are ways to reduce the noise and focus on significant conversations: illuminating and helpful dialogue.
An initial set of relevant terms (words such as “boots,” “leather,” “safety,” and “rugged construction”) is used to cut back on the noise dramatically, a first step toward uncovering useful conversations. With proper guidance, machines can do millions of correlations, clustering words by context and meaning. Visualization tools present “lexical maps” to help the enterprise unearth instances of useful customer dialog.
Source: Nexalogy Environics and PwC, 2012
More data sources
The huge quantities of data in the cloud and the availability of enormous low-cost processing power can help enterprises analyze various business problems—including efforts to understand customers better, especially through social media. These external clouds augment data that business units already have direct access to internally. Ingram Micro uses large, diverse data sets for warehouse location modeling, Chihorek says. Among them: size, weight, and other physical attributes of products; geographic patterns of consumers and anticipated demand for product categories; inbound and outbound transportation hubs, lead times, and costs; warehouse lease and operating costs, including utilities; and labor costs—to name a few. Social media can also augment internal data for enterprises willing to learn how to use it. Some companies ignore social media because so much of the conversation seems trivial, but they miss opportunities. Consider a North American apparel maker that was repositioning a brand
of shoes and boots. The manufacturer was mining conventional business data for insights about brand status, but it had not conducted any significant analysis of social media conversations about its products, according to Josée Latendresse, who runs Latendresse Groupe Conseil, which was advising the company on its repositioning effort. “We were neglecting the wealth of information that we could find via social media,” she says. To expand the analysis, Latendresse brought in technology and expertise from Nexalogy Environics, a company that analyzes the interest graph implied in online conversations—that is, the connections between people, places, and things. (See “Transforming collaboration with social tools,” Technology Forecast 2011, Issue 3, for more on interest graphs.) Nexalogy Environics studied millions of correlations in the interest graph and selected fewer than 1,000 relevant conversations from 90,000 that mentioned the products. In the process, Nexalogy Environics substantially increased the “signal” and reduced the “noise” in the social media about the manufacturer. (See Figure 3.)
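A greatly simplified version of that filtering and clustering can be sketched with standard open source tools. The example below is not Nexalogy Environics' method; it assumes scikit-learn is installed and uses a handful of invented posts to show the two steps: keep only posts that mention seed terms, then cluster the survivors by the words they use.

```python
# Sketch: filter social posts by seed terms, then cluster the remainder by vocabulary.
# Posts and seed terms are invented; real systems work on millions of conversations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

posts = [
    "these work boots held up on the construction site, great safety toe",
    "wore the boots off-road on my ATV all weekend, super rugged",
    "love the leather color, they look great with jeans",
    "what time is the game tonight?",          # noise
    "cheap flights to vegas, click here",      # noise
]
seed_terms = {"boots", "leather", "safety", "rugged"}

# Step 1: keep only posts that mention at least one relevant term.
relevant = [p for p in posts if seed_terms & set(p.lower().split())]

# Step 2: cluster the relevant posts by the words they use.
vectors = TfidfVectorizer(stop_words="english").fit_transform(relevant)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, post in sorted(zip(labels, relevant)):
    print(label, post)
```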
Figure 4: Adding social media analysis techniques suggests other changes to the BI process
Here is one example of how the larger business intelligence (BI) process might change with the addition of social media analysis (SMA) techniques.
One apparel maker started with its conventional BI analysis cycle. Conventional BI techniques used by an apparel company client ignored social media and required lots of data cleansing. The results often lacked insight: 1. Develop questions; 2. Collect data; 3. Clean data; 4. Analyze data; 5. Present results.
Then it added social media and targeted focus groups to the mix. The company’s revised approach added several elements such as social media analysis and expanded others, but kept the focus group phase near the beginning of the cycle. The company was able to mine new insights from social media conversations about market segments that hadn’t occurred to the company to target before: 1. Develop questions; 2. Refine conventional BI (collect data, clean data, analyze data); 3. Conduct focus groups (retailers and end users); 4. Select conversations; 5. Analyze social media; 6. Present results.
Then it tuned the process for maximum impact. The company’s current approach places focus groups near the end, where they can inform new questions more directly. This approach also stresses how the results get presented to executive leadership: 1. Develop questions; 2. Refine conventional BI (collect data, clean data, analyze data); 3. Select conversations; 4. Analyze social media; 5. Present results; 6. Tailor results to audience; 7. Conduct focus groups (retailers and end users).
What Nexalogy Environics discovered suggested the next step for the brand repositioning. “The company wasn’t marketing to people who were blogging about its stuff,” says Claude Théoret, president of Nexalogy Environics. The shoes and boots were designed for specific industrial purposes, but the blogging influencers noted their fashion appeal and their utility when riding off-road on all-terrain vehicles and in other recreational settings. “That’s a whole market segment the company hadn’t discovered.” Latendresse used the analysis to help the company expand and refine its intelligence process more
generally. “The key step,” she says, “is to define the questions that you want to have answered. You will definitely be surprised, because the system will reveal customer attitudes you didn’t anticipate.” Following the social media analysis (SMA), Latendresse saw the retailer and its user focus groups in a new light. The analysis “had more complete results than the focus groups did,” she says. “You could use the focus groups afterward to validate the information evident in the SMA.” The revised intelligence development process now places focus groups closer to the end of the cycle. (See Figure 4.)
Figure 5: The benefits of big data analytics: a carrier example
By analyzing billions of call records, carriers are able to obtain early warning of groups of subscribers likely to switch services. Here is how it works:
1. Carrier notes big peaks in churn.*
2. Dataspora brought in to analyze all call records: 14 billion call data records analyzed.
3. The initial analysis debunks some myths and raises new questions discussed with the carrier: Dropped calls/poor service? Merged to family plan? Preferred phone unavailable? Offer by competitor? Financial trouble? Dropped dead? Incarcerated? Friend dropped recently! The carrier’s prime hypothesis is disproved.
4. Further analysis confirms that friends influence other friends’ propensity to switch services. Pattern spotted: those with a relationship to a dropped customer (calls lasting longer than two minutes, more than twice in the previous month) are 500% more likely to drop.
5. Data group deploys a call record monitoring system that issues an alert that identifies at-risk subscribers.
6. Marketers begin campaigns that target at-risk subscriber groups with special offers.
* Churn: the proportion of contractual subscribers who leave during a given time period
Source: Metamarkets and PwC, 2012
Third parties such as Nexalogy Environics are among the first to take advantage of cloud analytics. Enterprises like the apparel maker may have good data collection methods but have overlooked opportunities to mine data in the cloud, especially social media. As cloud capabilities evolve, enterprises are learning to conduct more iteration, to question more assumptions, and to discover what else they can learn from data they already have.
More focus on key metrics
One way to start with new analytics is to rally the workforce around a single core metric, especially when that core metric is informed by other metrics generated with the help of effective modeling. The core metric and the model that helps everyone understand it can steep the culture in the language, methods, and tools around the process of obtaining that goal.
A telecom provider illustrates the point. The carrier was concerned about big peaks in churn—customers moving to another carrier—but hadn’t methodically mined the whole range of its call detail records to understand the issue. Big data analysis methods made a large-scale, iterative analysis possible. The carrier partnered with Dataspora, a consulting firm run by Driscoll before he founded Metamarkets. (See Figure 5.)2 “We analyzed 14 billion call data records,” Driscoll recalls, “and built a high-frequency call graph of customers who were calling each other. We found that if two subscribers who were friends spoke more than once for more than two minutes in a given month and the first subscriber cancelled their contract in October, then the second subscriber became 500 percent more likely to cancel their contract in November.”
2 For more best practices on methods to address churn, see Curing customer churn, PwC white paper, http://www.pwc.com/us/en/increasing-it-effectiveness/publications/curing-customer-churn.jhtml, accessed April 5, 2012.
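The rule Driscoll describes can be sketched at toy scale: build a graph of "friends" from qualifying calls, then flag subscribers whose friends have just cancelled. The call records and names below are invented, and the thresholds are taken from the quote; the production analysis ran over 14 billion records on distributed infrastructure.

```python
# Toy sketch of the churn signal described above: a subscriber is "at risk" if a
# frequent contact (2+ calls over 2 minutes in the prior month) has just cancelled.
from collections import defaultdict

# (caller, callee, duration_seconds) for the prior month -- invented sample data
calls = [
    ("alice", "bob", 400), ("alice", "bob", 150), ("bob", "alice", 300),
    ("carol", "dave", 500), ("carol", "dave", 610),
    ("erin", "frank", 90),
]
cancelled_last_month = {"bob", "dave"}

# Build an undirected "friendship" graph from qualifying calls (> 2 minutes).
qualifying = defaultdict(int)
for caller, callee, seconds in calls:
    if seconds > 120:
        qualifying[frozenset((caller, callee))] += 1

friends = defaultdict(set)
for pair, count in qualifying.items():
    if count >= 2:               # spoke more than once for more than two minutes
        a, b = tuple(pair)
        friends[a].add(b)
        friends[b].add(a)

# Flag active subscribers whose friends have cancelled.
at_risk = {sub for sub, fr in friends.items()
           if sub not in cancelled_last_month and fr & cancelled_last_month}
print(at_risk)   # {'alice', 'carol'}
```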
Data mining on that scale required distributed computing across hundreds of servers and repeated hypothesis testing. The carrier assumed that dropped calls might be one reason why clusters of subscribers were cancelling contracts, but the Dataspora analysis disproved that notion, finding no correlation between dropped calls and cancellation.
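The hypothesis testing mentioned here is easy to express with standard statistical tools. The sketch below uses SciPy and invented counts to show the shape of such a test, asking whether subscribers with many dropped calls cancel at a different rate; the numbers and the outcome are illustrative, not the carrier's.

```python
# Sketch: test whether dropped calls are associated with cancellation.
# Counts are invented for illustration; the real analysis ran over billions of records.
from scipy.stats import chi2_contingency

#                    cancelled   stayed
contingency = [[  520,  19480],   # subscribers with many dropped calls
               [ 2610,  97390]]   # subscribers with few dropped calls

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"p-value = {p_value:.3f}")
# A large p-value here would be consistent with the finding that dropped calls
# and cancellation are not meaningfully related.
```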
“There were a few steps we took. One was to get access to all the data and next do some engineering to build a social graph and other features that might be meaningful, but we also disproved some other hypotheses,” Driscoll says. Watching what people actually did confirmed that circles of friends were cancelling in waves, which led to the peaks in churn. Intense focus on the key metric illustrated to the carrier and its workforce the power of new analytics.
Better access to results
The more pervasive the online environment, the more common the sharing of information becomes. Whether an enterprise is a gaming or an e-commerce company that can instrument its own digital environment, or a smart grid utility that generates, slices, dices, and shares energy consumption analytics for its customers and partners, better analytics are going direct to the customer as well as other stakeholders. And they’re being embedded where users can more easily find them. For example, energy utilities preparing for the smart grid are starting to invite the help of customers by putting better data and more broadly shared operational and customer analytics at the center of a co-created energy efficiency collaboration. Saul Zambrano, senior director of customer energy solutions at Pacific Gas & Electric (PG&E), an early installer of smart meters, points out
that policymakers are encouraging more third-party access to the usage data from the meters. “One of the big policy pushes at the regulatory level is to create platforms where third parties can—assuming all privacy guidelines are met—access this data to build business models they can drive into the marketplace,” says Zambrano. “Grid management and energy management will be supplied by both the utilities and third parties.” Zambrano emphasizes the importance of customer participation to the energy efficiency push. The issue he raises is the extent to which blended operational and customer data can benefit the larger ecosystem, by involving millions of residential and business customers. “Through the power of information and presentation, you can start to show customers different ways that they can become stewards of energy,” he says. As a highly regulated business, the utility industry has many obstacles to overcome to get to the point where smart grids begin to reach their potential, but the vision is clear:
• Show customers a few key metrics and seasonal trends in an easy-to-understand form.
• Provide a means of improving those metrics with a deeper dive into where they’re spending the most on energy.
• Allow them an opportunity to benchmark their spending by providing comparison data (see the sketch below).
This new kind of data sharing could be a chance to stimulate an energy efficiency competition that’s never existed between homeowners and between business property owners. It is also an example of how broadening access to new analytics can help create a culture of inquiry throughout the extended enterprise.
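A customer-facing benchmark of the kind sketched in the list above can be reduced to a few lines of analysis. The example below is illustrative only, with invented meter readings and the pandas library assumed; it shows one household's monthly usage next to the median for comparable homes.

```python
# Sketch: compare one customer's monthly energy use to a peer-group benchmark.
# Readings and customer IDs are invented; a utility would draw these from smart meter data.
import pandas as pd

readings = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c3", "c3"],
    "month":    ["2012-01", "2012-02"] * 3,
    "kwh":      [620, 580, 510, 495, 540, 500],
})

benchmark = readings.groupby("month")["kwh"].median().rename("peer_median_kwh")
customer = readings[readings["customer"] == "c1"].set_index("month")["kwh"]

report = pd.concat([customer.rename("your_kwh"), benchmark], axis=1)
report["vs_peers_%"] = 100 * (report["your_kwh"] / report["peer_median_kwh"] - 1)
print(report)
```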
Case study
Smart shelving: How the E. & J. Gallo Winery analytics team helps its retail partners
Some of the data in the E. & J. Gallo Winery information architecture is for production and quality control, not just customer analytics. More recently, Gallo has adopted complex event processing methods on the source information, so it can look at successes and failures early in its manufacturing execution system, sales order management, and the accounting system that front ends the general ledger. Information and information flow are the lifeblood of Gallo, but it is clearly a team effort to make the best use of the information. In this team:
• Supply chain looks at the flows.
• Sales determines what information is needed to match supply and demand.
• R&D undertakes the heavy-duty customer data integration, and it designs pilots for brand consumption.
• IT provides the data and consulting on how to use the information.
Mining the information for patterns and insights in specific situations requires the team. A key goal is what Gallo refers to as demand sensing—to determine the stimulus that creates demand by brand and by product. This is not just a computer task, but is heavily based on human intervention to determine what the data reveal (for underlying trends of specific brands by location), or to conduct R&D in a test market, or to listen to the web platforms. These insights inform a specific design for “smart shelving,” which is the placement of products by geography and location within the store. Gallo offers a virtual wine shelf design schematic to retailers, which helps the retailer design the exact details of how wine will be displayed—by brand, by type, and by price. Gallo’s wine shelf design schematic will help the retailer optimize sales, not just for Gallo brands but for all wine offerings.
Before Gallo’s wine shelf design schematic, wine sales were not a major source of retail profits for grocery stores, but now they are the first or second highest profit generators in those stores. “Because of information models such as the wine shelf design schematic, Gallo has been the wine category captain for some grocery stores for 11 years in a row so far,” says Kent Kushar, CIO of Gallo.
Conclusion: A broader culture of inquiry
This article has explored how enterprises are embracing the big data, tools, and science of new analytics along a path that can lead them to a broader culture of inquiry, in which improved visualization and user interfaces make it possible to spread ad hoc analytics capabilities to every user role. This culture of inquiry appears likely to become the age of the data scientists—workers who combine a creative ability to generate useful hypotheses with the savvy to simulate and model a business as it’s changing. It’s logical that utilities are instrumenting their environments as a step toward smart grids. The data they’re generating can be overwhelming, but that data will also enable the analytics needed to reduce energy consumption to meet efficiency and environmental goals. It’s also logical that enterprises are starting to hunt for more effective ways to filter social media conversations, as apparel makers
have found. The return on investment for finding a new market segment can be the difference between long-term viability and stagnation or worse. Tackling the new kinds of data being generated is not the only analytics task ahead. Like the technology distributor, enterprises in all industries have concerns about scaling the analytics for data they’re accustomed to having and now have more. Publishers can serve readers better and optimize ad sales revenue by tuning their engines for timing, pricing, and pinpointing ad campaigns. Telecom carriers can mine all customer data more effectively to be able to reduce the expense of churn and improve margins. What all of these examples suggest is a greater need to immerse the extended workforce—employees, partners, and customers—in the data and analytical methods they need. Without a view into everyday customer behavior, there’s no leverage for employees to influence company direction when
markets shift and there are no insights into improving customer satisfaction. Computing speed, storage, and scale make those insights possible, and it is up to management to take advantage of what is becoming a co-creative work environment in all industries—to create a culture of inquiry. Of course, managing culture change is a much bigger challenge than simply rolling out more powerful analytics software. It is best to have several starting points and to continue to find ways to emphasize the value of analytics in new scenarios. One way to raise awareness about the power of new analytics comes from articulating the results in a visual form that everyone can understand. Another is to enable the broader workforce to work with the data themselves and to ask them to develop and share the results of their own analyses. Still another approach would be to designate, train, and compensate the more enthusiastic users in all units—finance, product groups, supply chain, human resources, and so forth—as data scientists. Table 1 presents examples of approaches to fostering a culture of inquiry.

Table 1: Key elements of a culture of inquiry
Element | How it is manifested within an organization | Value to the organization
Executive support | Senior executives asking for data to support any opinion or proposed action and using interactive visualization tools themselves | Set the tone for the rest of the organization with examples
Data availability | Cloud architecture (whether private or public) and semantically rich data integration methods | Find good ideas from any source
Analytics tools | Higher-profile data scientists embedded in the business units | Identify hidden opportunities
Interactive visualization | Visual user interfaces and the right tool for the right person | Encourage a culture of inquiry
Training | Power users in individual departments | Spread the word and highlight the most effective and user-friendly techniques
Sharing | Internal portals or other collaborative environments to publish and discuss inquiries and results | Prove that the culture of inquiry is real

The arc of all the trends explored in this article is leading enterprises toward establishing these cultures of inquiry, in which decisions can be informed by an analytical precision comparable to scientific insight. New market opportunities, an energized workforce with a stake in helping to achieve a better understanding of customer needs, and reduced risk are just some of the benefits of a culture of inquiry. Enterprises that understand the trends described here and capitalize on them will be able to improve how they attract and retain customers.
The nature of cloud-based data science
Mike Driscoll of Metamarkets talks about the analytics challenges and opportunities that businesses moving to the cloud face.
Interview conducted by Alan Morrison and Bo Parker
Mike Driscoll
Mike Driscoll is CEO of Metamarkets, a cloud-based analytics company he co-founded in San Francisco in 2010.
PwC: What’s your background, and how did you end up running a data science startup? MD: I came to Silicon Valley after studying computer science and biology for five years, and trying to reverse engineer the genome network for uranium-breathing bacteria. That was my thesis work in grad school. There was lots of modeling and causal inference. If you were to knock this gene out, could you increase the uptake of the reduction of uranium from a soluble to an insoluble state? I was trying all these simulations and testing with the bugs to see whether you could achieve that. PwC: You wanted to clean up radiation leaks at nuclear plants? MD: Yes. The Department of Energy funded the research work I did. Then I came out here and I gave up on the idea of building a biotech company, because I didn’t think there was enough commercial viability there from what I’d seen. I did think I could take this toolkit I’d developed and apply it to all these other businesses that have data. That was the genesis of the consultancy Dataspora. As we started working with companies at Dataspora, we found this huge gap between what was possible and what companies were actually doing. Right now the real shift is that companies are moving from this very high-latency-course era of reporting into one where they start to have lower latency, finer granularity, and better
visibility into their operations. They realize the problem with being walking amnesiacs, knowing what happened to their customers in the last 30 days and then forgetting every 30 days. Most businesses are just now figuring out that they have this wealth of information about their customers and how their customers interact with their products. PwC: On its own, the new availability of data creates demand for analytics. MD: Yes. The absolute number-one thing driving the current focus in analytics is the increase in data. What’s different now from what happened 30 years ago is that analytics is the province of people who have data to crunch. What’s causing the data growth? I’ve called it the attack of the exponentials—the exponential decline in the cost of compute, storage, and bandwidth, and the exponential increase in the number of nodes on the Internet. Suddenly the economics of computing over data has shifted so that almost all the data that businesses generate is worth keeping around for its analysis. PwC: And yet, companies are still throwing data away. MD: So many businesses keep only 60 days’ worth of data. The storage cost is so minimal! Why would you throw it away? This is the shift at the big data layer; when these companies store data, they store it in a very expensive relational database. There needs to be different temperatures of data, and companies need to put different values on the data—whether it’s hot or cold, whether it’s active. Most companies have only one temperature: they either keep it hot in a database, or they don’t keep it at all. PwC: So they could just keep it in the cloud? MD: Absolutely. We’re starting to see the emergence of cloud-based databases where you say, “I don’t need to maintain my own database on the premises. I can just rent some boxes in the cloud and they can persist our customer data that way.” Metamarkets is trying to deliver DaaS—data science as a service. If a company doesn’t have analytics as a core competency, it can use a service like ours instead. There’s no reason for companies to be doing a lot of tasks that they are doing in-house. You need to pick and choose your battles. We will see a lot of IT functions being delivered as cloud-based services. And now inside of those cloud-based services, you often will find an open source stack. Here at Metamarkets, we’ve drawn heavily on open source. We have Hadoop on the bottom of our stack, and then at the next layer we have our own in-memory distributed database. We’re running on Amazon Web Services and have hundreds of nodes there.
[Diagram: Driscoll’s data science Venn diagram. Three circles represent critical business questions, good data, and data science; value and change are created where they intersect. Some companies don’t have all the capabilities they need to create data science value; companies need these three capabilities to excel in creating data science value.]
PwC: How are companies that do have data science groups meeting the challenge? Take the example of an orphan drug that is proven to be safe but isn’t particularly effective for the application it was designed for. Data scientists won’t know enough about a broad range of potential biological systems for which that drug might be applicable, but the people who do have that knowledge don’t know the first thing about data science. How do you bring those two groups together? MD: My data science Venn diagram helps illustrate how you bring those groups together. The diagram has three circles. [See above.] The first circle is data science. Data scientists are good at this. They can take data strings, perform processing, and transform them into data structures. They have great modeling skills, so they can use something like R or SAS and start to build a hypothesis that, for example, if a metric is three standard deviations above or below the specific threshold then someone may be more likely to cancel their membership. And data scientists are great at visualization. But companies that have the tools and expertise may not be focused on a critical business question. A company is trying to build what it calls the technology genome. If you give them a list of parts in the iPhone, they can look and see how all those different parts are related to other parts in camcorders and laptops. They built this amazingly intricate graph of the
actual makeup. They’ve collected large amounts of data. They have PhDs from Caltech; they have Rhodes scholars; they have really brilliant people. But they don’t have any real critical business questions, like “How is this going to make me more money?” The second circle in the diagram is critical business questions. Some companies have only the critical business questions, and many enterprises fall in this category. For instance, the CEO says, “We just released a new product and no one is buying it. Why?” The third circle is good data. A beverage company or a retailer has lots of POS [point of sale] data, but it may not have the tools or expertise to dig in and figure out fast enough where a drink was selling and what demographics it was selling to, so that the company can react. On the other hand, sometimes some web companies or small companies have critical business questions and they have the tools and expertise. But because they have no customers, they don’t have any data. PwC: Without the data, they need to do a simulation. MD: Right. The intersection in the Venn diagram is where value is created. When you think of an e-commerce company that says, “How do we upsell people and reduce the number of abandoned
shopping carts?” Well, the company has 600 million shopping cart flows that it has collected in the last six years. So the company says, “All right, data science group, build a sequential model that shows what we need to do to intervene with people who have abandoned their shopping carts and get them to complete the purchase.” PwC: The questioning nature of business—the culture of inquiry— seems important here. Some who lack the critical business questions don’t ask enough questions to begin with. MD: It’s interesting—a lot of businesses have this focus on real-time data, and yet it’s not helping them get answers to critical business questions. Some companies have invested a lot in getting real-time monitoring of their systems, and it’s expensive. It’s harder to do and more fragile. A friend of mine worked on the data team at a web company. That company developed, with a real effort, a real-time log monitoring framework where they can see how many people are logging in every second with 15-second latency across the ecosystem. It was hard to keep up and it was fragile. It broke down and they kept bringing it up, and then they realized that they take very few business actions in real time. So why devote all this effort to a real-time system?
PwC: In many cases, the data is going to be fresh enough, because the nature of the business doesn’t change that fast. MD: Real time actually means two things. The first thing has to do with the freshness of data. The second has to do with the query speed. By query speed, I mean that if you have a question, how long it takes to answer a question such as, “What were your top products in Malaysia around Ramadan?” PwC: There’s a third one also, which is the speed to knowledge. The data could be staring you in the face, and you could have incredibly insightful things in the data, but you’re sitting there with your eyes saying, “I don’t know what the message is here.” MD: That’s right. This is about how fast can you pull the data and how fast can you actually develop an insight from it. For learning about things quickly enough after they happen, query speed is really important. This becomes a challenge at scale. One of the problems in the big data space is that databases used to be fast. You used to be able to ask a question of your inventory and you’d get an answer in seconds. SQL was quick when the scale wasn’t large; you could have an interactive dialogue with your data.
But now, because we’re collecting millions and millions of events a day, data platforms have seen real performance degradation. Lagging performance has led to degradation of insights. Companies literally are drowning in their data. In the 1970s, when the intelligence agencies first got reconnaissance satellites, there was this proliferation in the amount of photographic data they had, and they realized that it paralyzed their decision making. So to this point of speed, I think there are a number of dimensions here. Typically when things get big, they get slow. PwC: Isn’t that the problem the new in-memory database appliances are intended to solve? MD: Yes. Our Druid engine on the back end is directly competitive with those proprietary appliances. The biggest difference between those appliances and what we provide is that we’re cloud based and are available on Amazon.
If your data and operations are in the cloud, it does not make sense to have your analytics on some appliance. We solve the performance problem in the cloud. Our mantra is visibility and performance at scale. Data in the cloud liberates companies from some of these physical box confines and constraints. That means that your data can be used as inputs to other types of services. Being a cloud service really reduces friction. The coefficient of friction around data has for a long time been high, and I think we're seeing that start to drop. Not just the scale or amount of data being collected, but the ease with which data can interoperate with different services, both inside your company and out. I believe that's where tremendous value lies.

"Being a cloud service really reduces friction. The coefficient of friction around data has for a long time been high, and I think we're seeing that start to drop."
Online advertising analytics in the cloud Jon Slade of the Financial Times describes the 123-year-old business publication’s advanced approach to its online ad sales. Interview conducted by Alan Morrison, Bo Parker, and Bud Mathaisel
Jon Slade is global online and strategic advertising sales director at FT.com, the digital arm of the Financial Times.
PwC: What is your role at the FT [Financial Times], and how did you get into it? JS: I'm the global advertising sales director for all our digital products. I've been in advertising sales and in publishing for about 15 years and at the FT for about 7 years. And about three and a half years ago I took this role—after a quick diversion into landscape gardening, which really gave me the idea that digging holes for a living was not what I wanted to do. PwC: The media business has changed during that period of time. How has the business model at FT.com evolved over the years? JS: From the user's perspective, FT.com is like a funnel, really, where you have free access at the outer edge of the funnel, access in exchange for registration in the middle, and then the subscriber at the innermost part. The funnel is based on the volume of consumption. From an ad sales perspective, targeting the most relevant person is essential. So the types of clients that we're talking about—companies like PwC, Rolex, or Audi—are not interested in a scatter graph approach to advertising. The advertising business thrives on targeting advertising very, very specifically. On the one hand, we have an ad model that requires very precise, targeted information. And on the other hand, we have a metered model of access, which means we have lots of opportunity to collect information about our users.
“We have what we call the web app with FT.com. We’re not available through the iTunes Store anymore. We use the technology called HTML5, which essentially allows us to have the same kind of touch screen interaction as an app would, but we serve it through a web page.”
PwC: How does a company like the FT sell digital advertising space? JS: Every time you view a web page, you’ll see an advert appear at the top or the side, and that one appearance of the ad is what we call an ad impression. We usually sell those in groups of 1,000 ad impressions. Over a 12-month period, our total user base, including our 250,000 paying subscribers, generates about 6 billion advertising impressions across FT.com. That’s the currency that is bought and sold around advertising in the online world. In essence, my job is to look at those ad impressions and work out which one of those ad impressions is worth the most for any one particular client. And we have about 2,000 advertising campaigns a year that run across FT.com.
Impressions generated have different values to different advertisers. So we need to separate all the strands out of those 6 billion ad impressions and get as close a picture as we possibly can to generate the most revenue from those ad impressions.

PwC: It sounds like you have a lot of complexity on both the supply and the demand side. Is the supply side changing a lot? JS: Sure. Mobile is changing things pretty dramatically, actually. About 20 percent of our page views on digital channels are now generated by a mobile device or by someone who's using a mobile device, which is up from maybe 1 percent or 2 percent just three years ago. So that's a radically changing picture that we now need to understand as well. What are the consumption patterns around mobile? How many pages are people consuming? What type of content are they consuming? What content is more relevant to a chief executive versus a finance director versus somebody in Japan versus somebody in Dubai? Mobile is a very substantial platform that we now must look at in much more detail and with much greater care than we ever did before.

PwC: Yes, and regarding the mobile picture, have you seen any successes in terms of trying to address that channel in a new and different way? JS: Well, just with the FT, we have what we call the web app with FT.com. We're not available through the iTunes Store anymore. We use the technology called HTML5, which essentially allows us to have the same kind of touch screen interaction as an app would, but we serve it through a web page.
So a user points the browser on their iPad or other device to FT.com, and it takes you straight through to the app. There’s no downloading of the app; there’s no content update required. We can update the infrastructure of the app very, very easily. We don’t need to push it out through any third party such as Apple. We can retain a direct relationship with our customer.
One or two other publishers are starting to understand that this is a pretty good way to push content to mobile devices, and it’s an approach that we’ve been very successful with. We’ve had more than 1.4 million users of our new web app since we launched it in June 2011. It’s a very fast-growing opportunity for us. We see both subscription and advertising revenue opportunities. And with FT.com we try to balance both of those, both subscription revenue and advertising revenue. PwC: You chose the web app after having offered a native app, correct? JS: That’s right, yes. PwC: Could you compare and contrast the two and what the pros and cons are? JS: If we want to change how we display content in the web app, it’s a lot easier for us not to need to go to a new version of the app and push that through into the native app via an approval process with a third party. We can just make any changes at our end straight away. And as users go to the web app, those implemented changes are there for them. On the back end, it gives us a lot more agility to develop advertising opportunities. We can move faster to take advantage of a growing market, plus provide far better web-standard analytics around campaigns—something that native app providers struggle with.
Big data in online advertising “Every year, our total user base, including our 250,000 paying subscribers, generates about 6 billion advertising impressions across FT.com.”
One other benefit we’ve seen is that a far greater number of people use the web app than ever used the native app. So an advertiser is starting to get a bit more scale from the process, I guess. But it’s just a quicker way to make changes to the application with the web app. PwC: How about the demand side? How are things changing? You mentioned 6 billion annual impressions—or opportunities, we might phrase it. JS: Advertising online falls into two distinct areas. There is the scatter graph type of advertising where size matters. There are networks that can give you billions and billions of ad impressions, and as an advertiser, you throw as many messages into that mix as you possibly can. And then you try and work out over time which ones stuck the best, and then you try and optimize to that. That is how a lot of mainstream or major networks run their businesses. On the other side, there are very, very targeted websites that provide advertisers with real efficiency to reach only the type of demographic that they’re interested in reaching, and that’s very much the side that we fit into. Over the last two years, there’s been a shift to the extreme on both sides. We’ve seen advertisers go much more toward a very scattered environment, and equally other advertisers head much more toward investing more of their money into a very niche environment. And then some advertisers seem to try and play a little bit in the middle.
With the readers and users of FT.com, particularly in the last three years as the economic crisis has driven like a whirlwind around the globe, we’ve seen what we call a flight to quality. Users are aware—as are advertisers— that they could go to a thousand different places to get their news, but they don’t really have the time to do that. They’re going to fewer places and spending more time within them, and that’s certainly the experience that we’ve had with the Financial Times. PwC: To make a more targeted environment for advertising, you need to really learn more about the users themselves, yes? JS: Yes. Most of the opt-in really occurs at the point of registration and subscription. This is when the user declares demographic information: this is who I am, this is the industry that I work for, and here’s the ZIP code that I work from. Users who subscribe provide a little bit more. Most of the work that we do around understanding our users better occurs at the back end. We examine user actions, and we note that people who demonstrate this type of behavior tend to go on and do this type of thing later in the month or the week or the session or whatever it might be. Our back-end analytics allows us to extract certain groups who exhibit those behaviors. That’s probably most of the work that we’re focused on at the moment. And that applies not just to the advertising picture
but to our content development and our site development, too. If we know, for example, that people of type A-1-7 tend to read companies' pages between 8 a.m. and 10 a.m. and they go on to personal finance at lunchtime, then we can start to examine those groups and drive the right type of content toward them more specifically. It's an ongoing piece of the content and advertising optimization. PwC: Is this a test to tune and adjust the kind of environment that you've been able to create? JS: Absolutely, both in terms of how our advertising campaigns display and also the type of content that we display. If you and I both looked at FT.com right now, we'd probably see the home page, and 90 percent of what you would see would be the same as what I would see. But about 10 percent of it would not be. PwC: How does Metamarkets fit into this big picture? Could you shine some light on what you're doing with them and what the initial successes have been? JS: Sure. We've been working with Metamarkets in earnest for more than a year. The real challenge that Metamarkets relieves for us is to understand those 6 billion ad impressions—who's generating them, how many I'm likely to have tomorrow of any given sort, and how much I should charge for them. It gives me that single view, in a single place and in near real time, of what my exact
supply and my exact demand are. And that is really critical information. I increasingly feel a little bit like I'm on a flight deck, with the number of screens around me that I need to understand. When I got into advertising straight after my landscape gardening days, I didn't even have a screen. I didn't have a computer when I started. Previously, the way that data was held—the demographics data, the behavior data, the pricing, the available inventory—was across lots of different databases and spreadsheets. We needed an almost witchcraft-like algorithm to provide answers to "How many impressions do I have?" and "How much should I charge?" It was an extremely labor-intensive process. And that approach just didn't really fit the needs of the industry in which we work. Media advertising is purchased in real time now. The impression appears, and the process runs between three or four interested parties—one bid wins out, and the advert is served in the time it takes to open a web page.
PwC: In general, it seems like Metamarkets is doing a whole piece of your workflow rather than you doing it. Is that a fair characterization? JS: Yes. I’ll give you an example. I was talking to our sales manager in Paris the other day. I said to him, “If you wanted to know how many adverts of a certain size that you have available to you in Paris next Tuesday that will be created by chief executives in France, how would you go about getting that answer?” Before, the sales team would send an e-mail to ad operations in London for an inventory forecast, and it could take the ad operations team up to eight working hours to get back to them. It could even take as long as two business days to get an answer in times of high volume. Now, we’ve reduced that turnaround to about eight seconds of self-service, allowing our ad operations team time to focus on more strategic output. That’s the sort of magnitude of workflow change that this creates for us—a two-day turnaround down to about eight seconds.
Now, if advertising is being purchased in real time, we really need to understand what we have on our supermarket shelves in real time, too. That's what Metamarkets does for us—it helps us visualize our supply and demand in one place.
“Before, the sales team would send an e-mail to ad operations in London for an inventory forecast, and it could take the ad operations team up to eight working hours to get back to them. Now, we’ve reduced that turnaround to about eight seconds of self-service.”
PwC: When you were looking to resolve this problem, were there a lot of different services that did this sort of thing? JS: Not that we came across. I have to say our conversations with the Metamarkets team actually started about something not entirely different, but certainly not the product that we've come up with now. Originally we had a slightly different concept under discussion that didn't look at this part at all. As a company, Metamarkets was really prepared to say, "We don't have something on the shelves. We have some great minds and some really good technology, so why don't we try to figure out with you what your problem is, and then we'll come up with an answer." To be honest, we looked around a little bit at what else was out there, but I didn't want to buy anything off the shelf. I wanted to work with a company that could understand what I'm after, go away, and come back with the answer to that plus, plus, plus. And that seems to be the way Metamarkets has developed. Other vendors clearly do something similar or close, but most of what I've seen comes off the shelf. And we are—we're quite annoying to work with, I would say. We're not really a cookie-cutter business. You can slice and dice those 6 billion ad impressions in
thousands and thousands of ways, and you can’t always predict how a client or a customer or a colleague is going to want to split up that data. So rather than just say, “The only way you can do it is this way, and here’s the off-the-shelf solution,” we really wanted something that put the power in the hands of the user. And that seems to be what we’ve created here. The credit is entirely with Metamarkets, I have to say. We just said, “Help, we have a problem,” and they said, “OK, here’s a good answer.” So the credit for all the clever stuff behind this should go with them. PwC: So there continues to be a lot of back and forth between FT and Metamarkets as your needs change and the demand changes? JS: Yes. We have at least a weekly call. The Metamarkets team visits us in London about once a month, or we meet
in New York if I’m there. And there’s a lot of back and forth. What seems to happen is that every time we give it to one of the ultimate end users— one of the sales managers around the world—you can see the lights on in their head about the potential for it. And without fail they’ll say, “That’s brilliant, but how about this and this?” Or, “Could we use it for this?” Or, “How about this for an intervention?” It’s great. It’s really encouraging to see a product being taken up by internal customers with the enthusiasm that it is. We very much see this as an iterative project. We don’t see it as necessarily having a specific end in sight. We think there’s always more that we can add into this. It’s pretty close to a partnership really, a straight vendor and supplier relationship. It is a genuine partnership, I think.
“So rather than just say, ‘The only way you can do it is this way, and here’s the off-the-shelf solution,’ we really wanted something that put the power in the hands of the user.”
Supply accuracy in online advertising “Accuracy of supply is upward of 15 percent better than what we’ve seen before.”
PwC: How is this actually translated into the bottom line—yield and advertising dollars? JS: It would probably be a little hard for me to share with you any percentages or specifics, but I can say that it is driving up the yields we achieve. It is double-digit growth on yield as a result of being able to understand our supply and demand better. The degree of accuracy of supply that it provides for us is upward of 15 percent better than what we've seen before. I can't quantify the difference that it's made to workflows, but it's significant. To go from a two-day turnaround on a simple request to eight seconds is significant. PwC: Given our research focus, we have lots of friends in the publishing business, and many of them talked to us about the decline in return from impression advertising. It's interesting. Your story seems to be pushing in the other direction. JS: Yes. I've noticed that entirely. Whenever I talk to a buying customer, they always say, "Everybody else is getting cheaper, so how come you're getting more expensive?" I completely hear that. What I would say is we are getting better at understanding the attribution model. Ultimately, what these impressions create for a client or a customer is not just how many visits readers will make to your website, but how much money they will spend when they get there.
Now that piece is still pretty much embryonic, but we're certainly making the right moves in that direction. Essentially, an increase in yield implies that you put your price up, and we've found that putting the price up has been accepted because we've been able to offer a much tighter, more specific segmentation of that audience. Whereas, when people are buying on a spray basis across large networks, there is deservedly significant price pressure on that. Equally, if we understand our supply and demand picture in a much more granular sense, we know when it's a good time to walk away from a deal or whether we're being too bullish in the deal. That pricing piece is critical, and we're looking to get to a real-time dynamic pricing model in 2012. And Metamarkets is certainly along the right lines to help us with that. PwC: A lot of our clients are very conservative organizations, and they might be reluctant to subscribe to a cloud service like Metamarkets, offered by a company that has not been around for a long time. I'm assuming that the FT had to make the decision to go down this different route and that there was quite a bit of consideration of these factors. JS: Endless legal diligence would be one way to put it—back and forth a lot. We have 2,000 employees worldwide, so we still have a fairly entrepreneurial attitude toward suppliers. Of course we
do the legal diligence, and of course we do the contractual diligence, and of course we look around to see what else is available. But if you have a good instinct about working with somebody, then we're the size of organization where that instinct can still count for something. And I think that was the case with Metamarkets. We felt that we were on the same page here. We almost could put words in one another's mouths and the sentence would still kind of form. So it felt very good from the beginning. If we look at what's happening in the digital publishing world, some of the most exciting things are happening with very small startup businesses, and all of the big web powers now, such as Facebook and Amazon, were startups 8 or 10 years ago. We believe in that mentality. We believe in a personality in business. Metamarkets represented that to us very well. And yes, there's a little bit of a risk, but it has paid off. So we're happy.
The art and science of new analytics technology Left-brain analysis connects with right-brain creativity. By Alan Morrison
The new analytics is the art and science of turning the invisible into the visible. It's about finding "unknown unknowns," as former US Secretary of Defense Donald Rumsfeld famously called them, and learning at least something about them. It's about detecting opportunities and threats you hadn't anticipated, or finding people you didn't know existed who could be your next customers. It's about learning what's really important, rather than what you thought was important. It's about identifying, committing to, and following through on what your enterprise must change most. Achieving that kind of visibility requires a mix of techniques. Some of these are new, while others aren't. Some are clearly in the realm of data science because they make possible more iterative and precise analysis of large, mixed data sets. Others, like visualization and more contextual search, are as much art as science. This article explores some of the newer technologies that make feasible the case studies and the evolving cultures of inquiry described in "The third wave of customer analytics" on page 06. These technologies include the following:
• In-memory technology—Reducing response time and expanding the reach of business intelligence (BI) by extending the use of main (random access) memory

• Interactive visualization—Merging the user interface and the presentation of results into one responsive visual analytics environment

• Statistical rigor—Bringing more of the scientific method and evidence into corporate decision making

• Associative search—Navigating to specific names and terms by browsing the nearby context (see the sidebar, "Associative search," on page 41)

A companion piece to this article, "Natural language processing and social media intelligence," on page 44, reviews the methods that vendors use for the needle-in-a-haystack challenge of finding the most relevant social media conversations about particular products and services. Because social media is such a major data source for exploratory analytics and because natural language processing (NLP) techniques are so varied, this topic demands its own separate treatment.
Figure 1: Addressable analytics footprint for in-memory technology
In-memory technology augmented traditional business intelligence (BI) and predictive analytics to begin with, but its footprint will expand over the forecast period to become the base for corporate apps, where it will blur the boundary between transactional systems and data warehousing. Longer term, more of a 360-degree view of the customer can emerge.
[Figure: a 2011–2014 timeline in which the in-memory footprint grows from BI, to BI plus ERP and mobile, to other corporate apps, and ultimately to cross-functional, cross-source analytics.]
In-memory technology

Enterprises exploring the latest in-memory technology soon come to realize that the technology's fundamental advantage—expanding the capacity of main memory (solid-state memory that's directly accessible) and reducing reliance on disk drive storage to reduce latency—can be applied in many different ways. Some of those applications offer the advantage of being more feasible over the short term. For example, accelerating conventional BI is a short-term goal, one that's been feasible for several years through earlier products that use in-memory capability from some BI providers, including MicroStrategy, QlikTech QlikView, TIBCO Spotfire, and Tableau Software. Longer term, the ability of platforms such as Oracle Exalytics, SAP HANA, and the forthcoming SAS in-memory Hadoop-based platform1 to query across a wide range of disparate data sources will improve.

1 See Doug Henschen, "SAS Prepares Hadoop-Powered In-Memory BI Platform," InformationWeek, February 14, 2012, http://www.informationweek.com/news/hardware/grid_cluster/232600767, accessed February 15, 2012. SAS, which also claims interactive visualization capabilities in this appliance, expects to make this appliance available by the end of June 2012.

"Previously,
users were limited to BI suites such as BusinessObjects to push the information to mobile devices," says Murali Chilakapati, a manager in PwC's Information Management practice and a HANA implementer. "Now they're going beyond BI. I think in-memory is one of the best technologies that will help us to work toward a better overall mobile analytics experience." The full vision includes more cross-functional, cross-source analytics, but this will require extensive organizational and technological change. The fundamental technological change is already happening, and in time richer applications based on these changes will emerge and gain adoption. (See Figure 1.) "Users can already create a mashup of various data sets and technology to determine if there is a correlation, a trend," says Kurt J. Bilafer, regional vice president of analytics at SAP. To understand how in-memory advances will improve analytics, it will help to consider the technological advantages of hardware and software, and how they can be leveraged in new ways.
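As a small illustration of the kind of mashup Bilafer describes, the sketch below joins two in-memory extracts on a shared month field and checks whether they move together. The sources, column names, and figures are invented; the code stands in for what a user would do interactively in a BI tool rather than for any vendor's actual interface.

```python
import pandas as pd

# Two hypothetical extracts, small enough to hold entirely in memory:
# monthly revenue from a finance system and monthly sessions from web analytics.
revenue = pd.DataFrame({
    "month": ["2011-10", "2011-11", "2011-12", "2012-01"],
    "revenue_millions": [1.20, 1.35, 1.80, 1.42],
})
sessions = pd.DataFrame({
    "month": ["2011-10", "2011-11", "2011-12", "2012-01"],
    "web_sessions": [410000, 455000, 610000, 470000],
})

# Mash the two sources up on the shared month field and test for a relationship.
mashup = revenue.merge(sessions, on="month")
correlation = mashup["revenue_millions"].corr(mashup["web_sessions"])
print(mashup)
print("Correlation between revenue and web sessions:", round(correlation, 2))
```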
What in-memory technology does

For decades, business analytics has been plagued by slow response times (also known as latency), a problem that in-memory technology helps to overcome. Latency is due to input/output bottlenecks in a computer system's data path. These bottlenecks can be alleviated by using six approaches:

• Move the traffic through more paths (parallelization)

• Increase the speed of any single path (transmission)

• Reduce the time it takes to switch paths (switching)

• Reduce the time it takes to store bits (writing)

• Reduce the time it takes to retrieve bits (reading)

• Reduce computation time (processing)

Figure 2: Memory swapping
Swapping data from RAM to disk introduces latency that in-memory systems designs can now avoid. [Figure: blocks are swapped out of and back into RAM; each swap adds steps that introduce latency.]

To process and store data properly and cost-effectively, computer systems swap data from one kind of memory to another a lot. Each time they do, they encounter latency in transmitting, switching, writing, or reading bits. (See Figure 2.) Contrast this swapping requirement with processing alone. Processing is much faster because so much of it is on-chip or directly interconnected. The processing function always outpaces multitiered memory handling. If these systems can keep more data "in memory" or directly accessible to the central processing units (CPUs), they can avoid swapping and increase efficiency by accelerating inputs and outputs. Less swapping reduces the need for duplicative reading, writing, and moving data. The ability to load and work on whole data sets in main memory—that is, all in random
access memory (RAM) rather than frequently reading it from and writing it to disk—makes it possible to bypass many input/output bottlenecks. Systems have needed to do a lot of swapping, in part, because faster storage media were expensive. That’s why organizations have relied heavily on high-capacity, cheaper disks for storage. As transistor density per square millimeter of chip area has risen, the cost per bit to use semiconductor (or solid-state) memory has dropped and the ability to pack more bits in a given chip’s footprint has increased. It is now more feasible to use semiconductor memory in more places where it can help most, and thereby reduce reliance on high-latency disks. Of course, the solid-state memory used in direct access applications, dynamic random access memory (DRAM), is volatile. To avoid the higher risk of
data loss from expanding the use of DRAM, in-memory database systems incorporate a persistence layer with backup, restore, and transaction logging capability. Distributed caching systems or in-memory data grids such as Gigaspaces XAP data grid, memcached, and Oracle Coherence—which cache (or keep in a handy place) lots of data in DRAM to accelerate website performance—refer to this same technique as write-behind caching. These systems update databases on disk asynchronously from the writes to DRAM, so the rest of the system doesn't need to wait for the disk write process to complete before performing another write. (See Figure 3.)

Figure 3: Write-behind caching
Write-behind caching makes writes to disk independent of other write functions. [Figure: a reader and a writer interact with RAM through the CPU; writes to the disk-based database happen behind the scenes. Source: Gigaspaces and PwC, 2010 and 2012]

T-Mobile, one of SAP's customers for HANA, claims that reports that previously took hours to generate now take seconds.

How the technology benefits the analytics function

The additional speed of improved in-memory technology makes possible more analytics iterations within a given time. When an entire BI suite is contained in main memory, there are many more opportunities to query the data. Ken Campbell, a director in PwC's information and enterprise data
management practice, notes: "Having a big data set in one location gives you more flexibility." T-Mobile, one of SAP's customers for HANA, claims that reports that previously took hours to generate now take seconds. HANA did require extensive tuning for this purpose.2 Appliances with this level of main memory capacity started to appear in late 2010, when SAP first offered HANA to select customers. Oracle soon followed by announcing its Exalytics In-Memory Machine at OpenWorld in October 2011. Other vendors well known in BI, data warehousing, and database technology are not far behind. Taking full advantage of in-memory technology depends on hardware and software, which requires extensive supplier/provider partnerships even before any thoughts of implementation.

Rapid expansion of in-memory hardware. Increases in memory bit density (number of bits stored in a square millimeter) aren't qualitatively new; the difference now is quantitative. What seems to be a step-change in in-memory technology has actually been a gradual change in solid-state memory over many years. Beginning in 2011, vendors could install at least a terabyte of main memory, usually DRAM, in a single appliance. Besides adding DRAM, vendors are also incorporating large numbers of multicore processors in each appliance. The Exalytics appliance, for example, includes four 10-core processors.3 The networking capabilities of the new appliances are also improved.

2 Chris Kanaracus, "SAP's HANA in-memory database will run ERP this year," IDG News Service, via InfoWorld, January 25, 2012, http://www.infoworld.com/d/applications/saps-hana-in-memory-database-will-runerp-year-185040, accessed February 5, 2012.
3 Oracle Exalytics In-Memory Machine: A Brief Introduction, Oracle white paper, October 2011, http://www.oracle.com/us/solutions/ent-performance-bi/business-intelligence/exalytics-bi-machine/overview/exalytics-introduction-1372418.pdf, accessed February 1, 2012.
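To make the write-behind caching technique described earlier more concrete, here is a minimal sketch in which reads and writes hit an in-memory store and a background thread persists changes to disk on its own schedule, so callers never wait on disk I/O. It is an illustration of the pattern only, not how Gigaspaces, memcached, Oracle Coherence, or any in-memory appliance implements it; the file name and flush interval are arbitrary.

```python
import json
import queue
import threading

class WriteBehindCache:
    """Minimal write-behind cache: reads and writes hit an in-memory dict;
    a background thread persists changes to disk asynchronously."""

    def __init__(self, path="cache_backing.json", flush_interval=0.5):
        self._data = {}                 # in-memory store (the "RAM" tier)
        self._dirty = queue.Queue()     # keys waiting to be persisted
        self._path = path
        self._interval = flush_interval
        self._lock = threading.Lock()
        threading.Thread(target=self._flusher, daemon=True).start()

    def put(self, key, value):
        # The caller returns immediately; disk I/O happens later.
        with self._lock:
            self._data[key] = value
        self._dirty.put(key)

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def _flusher(self):
        # Periodically write a snapshot to disk, independent of callers.
        while True:
            try:
                self._dirty.get(timeout=self._interval)
            except queue.Empty:
                continue
            with self._lock:
                snapshot = dict(self._data)
            with open(self._path, "w") as f:
                json.dump(snapshot, f)

if __name__ == "__main__":
    cache = WriteBehindCache()
    cache.put("report:emea", {"rows": 1200})
    print(cache.get("report:emea"))   # served from memory immediately
```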
Use case examples
Business process advantages of in-memory technology

In-memory technology makes it possible to run queries that previously ran for hours in minutes, which has numerous implications. Running queries faster implies the ability to accelerate data-intensive business processes substantially. Take the case of supply chain optimization in the electronics industry. Sometimes it can take 30 hours or more to run a query from a business process to identify and fill gaps in TV replenishment at a retailer, for example. A TV maker using an in-memory appliance component in this process could reduce the query time to under an hour, allowing the maker to reduce considerably the time it takes to respond to supply shortfalls. Or consider the new ability to incorporate into a process more predictive analytics with the help of in-memory technology. Analysts could identify new patterns of fraud in tax return data in ways they hadn't been able to before, making it feasible to provide investigators more helpful leads, which in turn could make them more effective in finding and tracking down the most potentially harmful perpetrators before their methods become widespread. Competitive advantage in these cases hinges on blending effective strategy, means, and execution together, not just buying the new technology and installing it. In these examples, the challenge becomes not one of simply using a new technology, but using it effectively. How might the TV maker anticipate shortfalls in supply more readily? What algorithms might be most effective in detecting new patterns of tax return fraud? At its best, in-memory technology could trigger many creative ideas for process improvement.

Exalytics has two 40Gbps InfiniBand connections for low-latency database server connections and two 10 Gigabit Ethernet connections, in addition to lower-speed Ethernet connections. Effective data transfer rates are somewhat lower than the stated raw speeds. InfiniBand connections became more popular for high-speed data center applications in the late 2000s. With each succeeding generation, InfiniBand's effective data transfer rate has come closer to the raw rate. Fourteen data rate or FDR InfiniBand, which has a raw data lane rate of more than 14Gbps, became available in 2011.4

4 See "What is FDR InfiniBand?" at the InfiniBand Trade Association site (http://members.infinibandta.org/kwspub/home/7423_FDR_FactSheet.pdf, accessed February 10, 2012) for more information on InfiniBand availability.

Improvements in in-memory databases. In-memory databases are quite fast because they are designed to run entirely in main memory. In 2005, Oracle bought TimesTen, a high-speed, in-memory database provider serving the telecom and trading industries. With the help of memory technology improvements, by 2011, Oracle claimed that entire BI system implementations, such as Oracle BI server, could be held in main memory. Federated databases—multiple autonomous databases that can be run as one—are also possible. "I can federate data from five physical databases in one machine," says PwC Applied Analytics Principal Oliver Halter.

In 2005, SAP bought P*Time, a highly parallelized online transaction processing (OLTP) database, and has blended its in-memory database capabilities with those of TREX and MaxDB to create the HANA in-memory database appliance. HANA includes stores for both row (optimal for transactional data with many fields) and column (optimal for analytical data with fewer fields), with capabilities for both structured and less structured data. HANA will become the base for the full range of SAP's applications, with SAP porting its enterprise resource planning (ERP) module to HANA beginning in the fourth quarter of 2012, followed by other modules.5

5 Chris Kanaracus, op. cit.

Better compression. In-memory appliances use columnar compression, which stores similar data together to improve compression efficiency. Oracle claims a columnar compression capability of 5x, so physical capacity of 1TB is equivalent to having 5TB available. Other columnar database management system (DBMS) providers such as EMC/Greenplum, IBM/Netezza, and HP/Vertica have refined their own columnar compression capabilities over the years and will be able to apply these to their in-memory appliances.
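A toy example helps show why columnar storage compresses so well: values within a single column tend to repeat, so even a simple run-length encoding collapses them dramatically. This is a sketch of the general idea, with invented data, not the compression scheme any particular appliance uses.

```python
from itertools import groupby

# A column of repeated values, as a columnar store would hold it.
country_column = ["US"] * 5000 + ["UK"] * 3000 + ["DE"] * 2000

def run_length_encode(column):
    """Collapse runs of identical values into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

encoded = run_length_encode(country_column)
print(encoded)                      # [('US', 5000), ('UK', 3000), ('DE', 2000)]
print(len(country_column), "values compressed to", len(encoded), "pairs")
```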
More adaptive and efficient caching algorithms. Because main memory is still limited physically, appliances continue to make extensive use of advanced caching techniques that increase the effective amount of main memory. The newest caching algorithms—lists of computational procedures that specify which data to retain in memory—solve an old problem: tables that get dumped from memory when they should be maintained in the cache. “The caching strategy for the last 20 years relies on least frequently used algorithms,” Halter says. “These algorithms aren’t always the best approaches.” The term least frequently used refers to how these algorithms discard the data that hasn’t been used a lot, at least not lately. The method is good in theory, but in practice these algorithms can discard
“The caching strategy for the last 20 years relies on least frequently used algorithms. These algorithms aren’t always the best approaches.” —Oliver Halter, PwC
data such as fact tables (for example, a list of countries) that the system needs at hand. The algorithms haven’t been smart enough to recognize less used but clearly essential fact tables that could be easily cached in main memory because they are often small anyway. Generally speaking, progress has been made on many fronts to improve in-memory technology. Perhaps most importantly, system designers have been able to overcome some of the hardware obstacles preventing the direct connections the data requires so it can be processed. That’s a fundamental first step of a multistep process. Although the hardware, caching techniques, and some software exist, the software refinement and expansion that’s closer to the bigger vision will take years to accomplish.
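The eviction problem Halter describes can be sketched in a few lines: a least frequently used cache will happily discard a rarely queried but essential fact table unless small tables are deliberately pinned. The table names, sizes, and pinning rule below are invented for illustration.

```python
class LFUCache:
    """Toy least-frequently-used cache that can pin small, essential tables."""

    def __init__(self, capacity_mb, pin_threshold_mb=1):
        self.capacity_mb = capacity_mb
        self.pin_threshold_mb = pin_threshold_mb   # tables this small are never evicted
        self.tables = {}                           # name -> [size_mb, use_count]

    def load(self, name, size_mb):
        while self._used() + size_mb > self.capacity_mb:
            victim = self._pick_victim()
            if victim is None:
                raise MemoryError("nothing evictable")
            del self.tables[victim]
        self.tables[name] = [size_mb, 0]

    def use(self, name):
        self.tables[name][1] += 1

    def _used(self):
        return sum(size for size, _ in self.tables.values())

    def _pick_victim(self):
        # Evict the least frequently used table that is not pinned by size.
        # A pure LFU policy (no pin) would evict the tiny, rarely used
        # "countries" fact table first, even though queries depend on it.
        candidates = [(count, name) for name, (size, count) in self.tables.items()
                      if size > self.pin_threshold_mb]
        return min(candidates)[1] if candidates else None

cache = LFUCache(capacity_mb=100)
cache.load("countries", 0.1)     # tiny fact table, rarely queried but always needed
cache.load("orders_2011", 60)
for _ in range(50):
    cache.use("orders_2011")
cache.load("orders_2012", 60)    # forces eviction: orders_2011 goes, countries stays
print(sorted(cache.tables))      # ['countries', 'orders_2012']
```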
Figure 4: Data blending
In self-service BI software, the end user can act as an analyst. [Figure: a sales database and a territory spreadsheet are blended on their shared State field. A. Tableau recognizes identical fields in different data sets. B. Simple drag and drop replaces days of programming. C. You can combine, filter, and even perform calculations among different data sources right in the Tableau window. Source: Tableau Software, 2011; derived from a video at http://www.tableausoftware.com/videos/data-integration]
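The blend that Figure 4 depicts can be approximated outside any BI tool: join a database extract and a spreadsheet on the field they share, then calculate across both sources. The rows and column names below are hypothetical stand-ins for the figure's sales database and territory spreadsheet.

```python
import pandas as pd

# Extract from the sales database (hypothetical rows).
sales = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "state": ["CA", "CA", "WA", "NY"],
    "profit": [120.0, 80.0, 60.0, 95.0],
})

# Territory spreadsheet maintained by the sales team (hypothetical rows).
territories = pd.DataFrame({
    "state": ["CA", "WA", "NY"],
    "territory": ["West", "West", "East"],
    "population_2009": [36961664, 6664195, 19541453],
})

# The blend: recognize the shared "state" field, join on it,
# then aggregate profit by territory across both sources.
blended = sales.merge(territories, on="state")
print(blended.groupby("territory")["profit"].sum())
```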
Self-service BI and interactive visualization

One of BI's big challenges is to make it easier for a variety of end users to ask questions of the data and to do so in an iterative way. Self-service BI tools put a larger number of functions within reach of everyday users. These tools can also simplify a larger number of tasks in an analytics workflow. Many tools—QlikView, Tableau, and TIBCO Spotfire, to name a few—take some advantage of the new in-memory technology to reduce latency. But equally important to BI innovation are interfaces that meld visual ways of blending and manipulating the data with how it's displayed and how the results are shared.
In the most visually capable BI tools, the presentation of data becomes just another feature of the user interface. Figure 4 illustrates how Tableau, for instance, unifies data blending, analysis, and dashboard sharing within one person's interactive workflow.

How interactive visualization works

One important element that's been missing from BI and analytics platforms is a way to bridge human language in the user interface to machine language more effectively. User interfaces have included features such as drag and drop for decades, but drag and drop historically has been linked to only a single application function—moving a file from one folder to another, for example.
Figure 5: Bridging human, visual, and machine language
1. To the user, results come from a simple drag and drop, which encourages experimentation and further inquiry. 2. Behind the scenes, complex algebra actually makes the motor run. Hiding all the complexities of the VizQL computations saves time and frees the user to focus on the results of the query, rather than the construction of the query. [Figure: a drag and drop against a database or spreadsheet becomes a specification (for example, x: C*(A+B), y: D+E, z: F); the system computes the normalized set form of each table expression, constructs the table and sorting network, issues general queries, partitions the results into relations corresponding to panes, aggregates and sorts tuples per pane, and renders each tuple as a mark, with data encoded in color, size, and so on.]
Source: Chris Stolte, Diane Tang, and Pat Hanrahan, "Computer systems and methods for the query and visualization of multidimensional databases," United States Patent 7089266, Stanford University, 2006, http://www.freepatentsonline.com/7089266.html, accessed February 12, 2012.
To query the data, users have resorted to typing statements in languages such as SQL that take time to learn. What a tool such as Tableau does differently is to make manipulating the data through familiar techniques (like drag and drop) part of an ongoing dialogue with the database extracts that are in active memory. By doing so, the visual user interface offers a more seamless way to query the data layer. Tableau uses what it calls Visual Query Language (VizQL) to create that dialogue. What the user sees on the screen, VizQL encodes into algebraic expressions that machines interpret and execute in the data. VizQL uses table algebra developed for this approach that maps rows and columns to the x- and y-axes and layers to the z-axis.6 6 See Chris Stolte, Diane Tang, and Pat Hanrahan, “Polaris: A System for Query, Analysis, and Visualization of Multidimensional Databases,” Communications of the ACM, November 2008, 75–76, http:// mkt.tableausoftware.com/files/Tableau-CACM-Nov2008-Polaris-Article-by-Stolte-Tang-Hanrahan.pdf, accessed February 10, 2012, for more information on the table algebra Tableau uses.
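The gap between what a user drags onto the screen and the query a database understands can be illustrated with a deliberately simplified translator that turns a tiny visual specification into SQL. This is only a sketch of the general idea; it is not VizQL's grammar or Tableau's implementation, and the table and field names are invented.

```python
def spec_to_sql(table, x, y_aggregate, y_field):
    """Translate a toy visual specification into a SQL query.

    x           -> the field dragged to the columns shelf (becomes GROUP BY)
    y_aggregate -> the aggregate chosen for the rows shelf (SUM, AVG, ...)
    y_field     -> the measure being aggregated
    """
    return (
        f"SELECT {x}, {y_aggregate}({y_field}) AS {y_field}_{y_aggregate.lower()} "
        f"FROM {table} "
        f"GROUP BY {x} "
        f"ORDER BY {x}"
    )

# Dragging "state" to columns and SUM(sales) to rows might yield:
print(spec_to_sql("orders", x="state", y_aggregate="SUM", y_field="sales"))
# SELECT state, SUM(sales) AS sales_sum FROM orders GROUP BY state ORDER BY state
```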
Jock Mackinlay, director of visual analysis at Tableau Software, puts it this way: "The algebra is a crisp way to give the hardware a way to interpret the data views. That leads to a really simple user interface." (See Figure 5.)

The benefits of interactive visualization

Psychologists who study how humans learn have identified two types: left-brain thinkers, who are more analytical, logical, and linear in their thinking, and right-brain thinkers, who take a more synthetic parts-to-wholes approach that can be more visual and focused on relationships among elements. Visually oriented learners make up a substantial portion of the population, and adopting tools more friendly to them can be the difference between creating a culture of inquiry, in which different thinking styles are applied to problems, and making do with an isolated group of statisticians. (See the article, "How CIOs can build the foundation for a data science culture," on page 58.) The new class of visually interactive, self-service BI tools can engage parts of the workforce—including right-brain thinkers—who may not have been previously engaged with analytics.

Good visualizations without normalized data

Business analytics software generally assumes that the underlying data is reasonably well designed, providing powerful tools for visualization and the exploration of scenarios. Unfortunately, well-designed, structured information is a rarity in some domains. Interactive tools can help refine a user's questions and combine data, but often demand a reasonably normalized schematic framework.

Zepheira's Freemix product, the foundation of the Viewshare.org project from the US Library of Congress, works with less-structured data, even comma-separated values (CSV) files with no headers. Rather than assuming the data is already set up for the analytical processing that machines can undertake, the Freemix designers concluded that the machine needs help from the user to establish context, and made generating that context feasible for even an unsophisticated user. Freemix walks the user through the process of adding context to the data by using annotations and augmentation. It then provides plugins to normalize fields, and it enhances data with new, derived fields (from geolocation or entity extraction, for example). These capabilities help the user display and analyze data quickly, even when given only ragged inputs.

At Seattle Children's Hospital, the director of knowledge management, Ted Corbett, initially brought Tableau into the organization. Since then, according to Elissa Fink, chief marketing officer of Tableau Software, its use has spread to include these functions:

• Facilities optimization—Making the best use of scarce operating room resources

• Inventory optimization—Reducing the tendency for nurses to hoard or stockpile supplies by providing visibility into what's available hospital-wide

• Test order reporting—Ensuring tests ordered in one part of the hospital aren't duplicated in another part

• Financial aid identification and matching—Expediting a match between needy parents whose children are sick and a financial aid source

The proliferation of iPad devices, other tablets, and social networking inside the enterprise could further encourage the adoption of this class of tools. TIBCO Spotfire for iPad 4.0, for example, integrates with Microsoft SharePoint and tibbr, TIBCO's social tool.7

7 Chris Kanaracus, "Tibco ties Spotfire business intelligence to SharePoint, Tibbr social network," InfoWorld, November 14, 2011, http://www.infoworld.com/d/business-intelligence/tibco-tiesspotfire-business-intelligence-sharepoint-tibbr-socialnetwork-178907, accessed February 10, 2012.
The QlikTech QlikView 11 also integrates with Microsoft SharePoint and is based on an HTML5 web application architecture suitable for tablets and other handhelds.8

8 Erica Driver, "QlikView Supports Multiple Approaches to Social BI," QlikCommunity, June 24, 2011, http://community.qlikview.com/blogs/theqlikviewblog/2011/06/24/with-qlikview-you-cantake-various-approaches-to-social-bi, and Chris Mabardy, "QlikView 11—What's New On Mobile," QlikCommunity, October 19, 2011, http://community.qlikview.com/blogs/theqlikviewblog/2011/10/19, accessed February 10, 2012.

Bringing more statistical rigor to business decisions

Sports continue to provide examples of the broadening use of statistics. In the United States several years ago, Billy Beane and the Oakland Athletics baseball team, as documented in Moneyball by Michael Lewis, hired statisticians to help with recruiting and line-up decisions, using previously little-noticed player metrics. Beane had enough success with his method that it is now copied by most teams.
“There are certain statistical principles and concepts that lie underneath all the sophisticated methods. You can get a lot out of or you can go far without having to do complicated math.” —Kaiser Fung, New York University
In 2012, there’s a debate over whether US football teams should more seriously consider the analyses of academics such as Tobias Moskowitz, an economics professor at the University of Chicago, who co-authored a book called Scorecasting. He analyzed 7,000 fourthdown decisions and outcomes, including field positions after punts and various other factors. His conclusion? Teams should punt far less than they do. This conclusion contradicts the common wisdom among football coaches: even with a 75 percent chance of making a first down when there’s just two yards to go, coaches typically choose to punt on fourth down. Contrarians, such as Kevin Kelley of Pulaski Academy in Little Rock, Arkansas, have proven Moskowitz right. Since 2003, Kelley went for it on fourth down (in various yardage situations) 500 times and has a 49 percent success rate. Pulaski Academy has won the state championship three times since Kelley became head coach.9 Addressing the human factor As in the sports examples, statistical analysis applied to business can surface findings that contradict longheld assumptions. But the basic principles aren’t complicated. “There are certain statistical principles and concepts that lie underneath all the sophisticated methods. You can get a lot out of or you can go far without having to do complicated math,” says Kaiser Fung, an adjunct professor at New York University. Simply looking at variability is an example. Fung considers variability a neglected factor in comparison to averages, for example. If you run a 9 Seth Borenstein, “Unlike Patriots, NFL slow to embrace ‘Moneyball’,” Seattle Times, February 3, 2012, http://seattletimes.nwsource.com/html/ sports/2017409917_apfbnsuperbowlanalytics.html, accessed February 10, 2012.
theme park and can reduce the longest wait times for rides, that is a clear way to improve customer satisfaction, and it may pay off more and be less expensive than reducing the average wait time. Much of the utility of statistics is to confront old thinking habits with valid findings that may seem counterintuitive to those who aren't accustomed to working with statistics or acting on the basis of their findings. Clearly there is utility in counterintuitive but valid findings that have ties to practical business metrics. They get people's attention. To counter old thinking habits, businesses need to raise the profiles of statisticians, scientists, and engineers who are versed in statistical methods, and make their work more visible. That in turn may help to raise the visibility of statistical analysis by embedding statistical software in the day-to-day business software environment.

R: Statistical software's open source evolution

Until recently, statistical software packages were in a group by themselves. College students who took statistics classes used a particular package, and the language it used was quite different from programming languages such as Java. Those students had to learn not only a statistical language, but also other programming languages. Those who didn't have this breadth of knowledge of languages faced limitations in what they could do. Others who were versed in Python or Java but not a statistical package were similarly limited. What's happened since then is the proliferation of R, an open source statistical programming language that lends itself to more uses in
business environments. R has become popular in universities and now has thousands of ancillary open source applications in its ecosystem. In its latest incarnations, it has become part of the fabric of big data and more visually oriented analytics environments.

R in open source big data environments. Statisticians have typically worked with small data sets on their laptops, but now they can work with R directly on top of Hadoop, an open source cluster computing environment.10 Revolution Analytics, which offers a commercial R distribution, created a Hadoop interface for R in 2011, so R users will not be required to use MapReduce or Java.11 The result is a big data analytics capability for R statisticians and programmers that didn't exist before, one that requires no additional skills.

R convertible to SQL and part of the Oracle big data environment. In January 2012, Oracle announced Oracle R Enterprise, its own distribution of R, which is bundled with a Hadoop distribution in its big data appliance. With that distribution, R users can run their analyses in the Oracle 11G database. Oracle claims performance advantages when running in its own database.12

Integrating interactive visualization with R. One of the newest capabilities involving R is its integration with interactive visualization.13 R is best known for its statistical analysis capabilities, not its interface. However, interactive visualization tools such as Omniscope are beginning to offer integration with R, improving the interface significantly. The resulting integration makes it possible to preview data from various sources, drag and drop from those sources and individual R statistical operations, and drag and connect to combine and display results. Users can view results in either a data manager view or a graph view and refine the visualization within either or both of those views.

10 See "Making sense of Big Data," Technology Forecast 2010, Issue 3, http://www.pwc.com/us/en/technologyforecast/2010/issue3/index.jhtml, and Architecting the data layer for analytic applications, PwC white paper, Spring 2011, http://www.pwc.com/us/en/increasingit-effectiveness/assets/pwc-data-architecture.pdf, accessed April 5, 2012, to learn more about Hadoop and other NoSQL databases.
11 Timothy Prickett Morgan, "Revolution speeds stats on Hadoop clusters," The Register, September 27, 2011, http://www.theregister.co.uk/2011/09/27/revolution_r_hadoop_integration/, accessed February 10, 2012.
12 Doug Henschen, "Oracle Analytics Package Expands In-Database Processing Options," InformationWeek, February 8, 2012, http://informationweek.com/news/software/bi/232600448, accessed February 10, 2012.
13 See Steve Miller, "Omniscope and R," Information Management, February 7, 2012, http://www.information-management.com/blogs/datascience-agile-BI-visualization-Visokio-10021894-1.html and the R Statistics/Omniscope 2.7 video, http://www.visokio.com/featured-videos, accessed February 8, 2012.

Associative search

Particularly for the kinds of enterprise databases used in business intelligence, simple keyword search goes only so far. Keyword searches often come up empty for semantic reasons—the users doing the searching can't guess the term in a database that comes closest to what they're looking for.

To address this problem, self-service BI tools such as QlikView offer associative search. Associative search allows users to select two or more fields and search occurrences in both to find references to a third concept or name. With the help of this technique, users can gain unexpected insights and make discoveries by clearly seeing how data is associated—sometimes for the very first time. They ask a stream of questions by making a series of selections, and they instantly see all the fields in the application filter themselves based on their selections.

At any time, users can see not only what data is associated—but what data is not related. The data related to their selections is highlighted in white while unrelated data is highlighted in gray. In the case of QlikView's associative search, users type relevant words or phrases in any order and get quick, associative results. They can search across the entire data set, and with search boxes on individual lists, users can confine the search to just that field. Users can conduct both direct and indirect searches. For example, if a user wanted to identify a sales rep but couldn't remember the sales rep's name—just details about the person, such as that he sells fish to customers in the Nordic region—the user could search on the sales rep list box for "Nordic" and "fish" to narrow the search results to just sellers who meet those criteria.
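A rough way to see what associative search does is to filter a small record set on two fields and observe which values of a third field remain related and which fall away. The records below are invented, and the sketch only mimics the behavior the sidebar describes; it is not QlikView's implementation.

```python
# Hypothetical sales rep records.
reps = [
    {"name": "Anders", "region": "Nordic", "product": "fish"},
    {"name": "Birgit", "region": "Nordic", "product": "timber"},
    {"name": "Carlos", "region": "Iberia", "product": "fish"},
    {"name": "Dagny",  "region": "Nordic", "product": "fish"},
]

def associative_filter(records, **criteria):
    """Return (related, unrelated) record lists for the given field selections."""
    related = [r for r in records
               if all(r[field] == value for field, value in criteria.items())]
    unrelated = [r for r in records if r not in related]
    return related, unrelated

# Searching on "Nordic" and "fish" narrows the name field to the matching reps,
# while everything else is shown as unrelated (grayed out, in QlikView's terms).
related, unrelated = associative_filter(reps, region="Nordic", product="fish")
print("related:", [r["name"] for r in related])      # ['Anders', 'Dagny']
print("unrelated:", [r["name"] for r in unrelated])  # ['Birgit', 'Carlos']
```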
R has benefitted greatly from its status in the open source community, and this has brought it into a mainstream data analysis environment. There is potential now for more direct collaboration between the analysts and the statisticians. Better visualization and tablet interfaces imply an ability to convey statistically based information more powerfully and directly to an executive audience.

Conclusion: No lack of vision, resources, or technology

The new analytics certainly doesn't lack for ambition, vision, or technological innovation. SAP intends to base its new applications architecture on the HANA in-memory database appliance. Oracle envisions running whole application suites in memory, starting with BI. Others that offer BI or columnar database products have similar visions. Tableau Software and others in interactive visualization continue to refine and expand a visual language that allows even casual users to extract, analyze, and display data in a few drag-and-drop steps. More enterprises are keeping their customer data longer, so they can mine the historical record more effectively. Sensors are embedded in new places daily, generating ever more data to analyze.
42
PwC Technology Forecast 2012 Issue 1
There is clear promise in harnessing the power of a larger proportion of the whole workforce with one aspect or another of the new analytics. But that’s not the only promise. There’s also the promise of more data and more insight about the data for staff already fully engaged in BI, because of processes that are instrumented closer to the action; the parsing and interpretation of prose, not just numbers; the speed that questions about the data can be asked and answered; the ability to establish whether a difference is random error or real and repeatable; and the active engagement with analytics that interactive visualization makes possible. These changes can enable a company to be highly responsive to its environment, guided by a far more accurate understanding of that environment. There are so many different ways now to optimize pieces of business processes, to reach out to new customers, to debunk old myths, and to establish realities that haven’t been previously visible. Of course, the first steps are essential— putting the right technologies in place can set organizations in motion toward a culture of inquiry and engage those who haven’t been fully engaged.
Reshaping the workforce with the new analytics
43
44
PwC Technology Forecast 2012 Issue 1
Natural language processing and social media intelligence Mining insights from social media data requires more than sorting and counting words. By Alan Morrison and Steve Hamby
Most enterprises are more than eager to further develop their capabilities in social media intelligence (SMI)—the ability to mine the public social media cloud to glean business insights and act on them. They understand the essential value of finding customers who discuss products and services candidly in public forums. The impact SMI can have goes beyond basic market research and test marketing. In the best cases, companies can uncover clues to help them revisit product and marketing strategies. “Ideally, social media can function as a really big focus group,” says Jeff Auker, a director in PwC’s Customer Impact practice. Enterprises, which spend billions on focus groups, spent nearly $1.6 billion in 2011 on social media marketing, according to Forrester Research. That number is expected to grow to nearly $5 billion by 2016.1 1 Shar VanBoskirk, US Interactive Marketing Forecast, 2011 To 2016, Forrester Research report, August 24, 2011, http://www.forrester.com/rb/Research/us_ interactive_marketing_forecast%2C_2011_to_2016/q/ id/59379/t/2, accessed February 12, 2012.
Auker cites the example of a media company’s use of SocialRep,2 a tool that uses a mix of natural language processing (NLP) techniques to scan social media. Preliminary scanning for the company, which was looking for a gentler approach to countering piracy, led to insights about how motivations for movie piracy differ by geography. “In India, it’s the grinding poverty. In Eastern Europe, it’s the underlying socialist culture there, which is, ‘my stuff is your stuff.’ There, somebody would buy a film and freely copy it for their friends. In either place, though, intellectual property rights didn’t hold the same moral sway that they did in some other parts of the world,” Auker says. This article explores the primary characteristics of NLP, which is the key to SMI, and how NLP is applied to social media analytics. The article considers what’s in the realm of the possible when mining social media text, and how informed human analysis becomes essential when interpreting the conversations that machines are attempting to evaluate. 2 PwC has joint business relationships with SocialRep, ListenLogic, and some of the other vendors mentioned in this publication.
Reshaping the workforce with the new analytics
45
Natural language processing: Its components and social media applications
NLP technologies for SMI are just emerging. When used well, they serve as a more targeted, semantically based complement to pure statistical analysis, which is more scalable and able to tackle much larger data sets. While statistical analysis looks at the relative frequencies of word occurrences and the relationships between words, NLP tries to achieve deeper insights into the meanings of conversations. The best NLP tools can provide a level of competitive advantage, but it’s a challenging area for both users and vendors. “It takes very rare skill sets in the NLP community to figure this stuff out,” Auker says. “It’s incredibly processing and storage intensive, and it takes awhile. If you used pure NLP to tell me everything that’s going on, by the time you indexed all the conversations, it might be days or weeks later. By then, the whole universe isn’t what it used to be.” First-generation social media monitoring tools provided some direct business value, but they also left users with more questions than answers. And context was a key missing ingredient. Rick Whitney, a director in PwC’s Customer Impact practice, makes the following distinction between the first- and second-generation SMI tools: “Without good NLP, the first-generation tools don’t give you that same context,” he says. What constitutes good NLP is open to debate, but it’s clear that some of the more useful methods blend different detailed levels of analysis and sophisticated filtering, while others stay attuned to the full context of the conversations to ensure that novel and interesting findings that inadvertently could be screened out make it through the filters.
46
PwC Technology Forecast 2012 Issue 1
Types of NLP
NLP consists of several subareas of computer-assisted language analysis, ways to help scale the extraction of meaning from text or speech. NLP software has been used for several years to mine data from unstructured data sources, and the software had its origins in the intelligence community. During the past few years, the locus has shifted to social media intelligence and marketing, with literally hundreds of vendors springing up. NLP techniques span a wide range, from analysis of individual words and entities, to relationships and events, to phrases and sentences, to document-level analysis. (See Figure 1.) The primary techniques include these (a brief code sketch after the list illustrates a few of them):

Word or entity (individual element) analysis
• Word sense disambiguation—Identifies the most likely meaning of ambiguous words based on context and related words in the text. For example, it will determine if the word “bank” refers to a financial institution, the edge of a body of water, the act of relying on something, or one of the word’s many other possible meanings.
• Named entity recognition (NER)—Identifies proper nouns. Capitalization analysis can help with NER in English, for instance, but capitalization varies by language and is entirely absent in some.
• Entity classification—Assigns categories to recognized entities. For example, “John Smith” might be classified as a person, whereas “John Smith Agency” might be classified as an organization, or more specifically “insurance company.”
Figure 1: The varied paths to meaning in text analytics Machines need to review many different kinds of clues to be able to deliver meaningful results to users.
Documents Metadata Lexical graphs
Words
Sentences Social graphs
Meaning
• Part of speech (POS) tagging— Assigns a part of speech (such as noun, verb, or adjective) to every word to form a foundation for phrase- or sentence-level analysis.
Relationship and event analysis
• Relationship analysis—Determines relationships within and across sentences. For example, “John’s wife Sally …” implies a symmetric relationship of spouse.
• Event analysis—Determines the type of activity based on the verb and entities that have been assigned to a classification. For example, an event “BlogPost” may have two types associated with it—a blog post about a company versus a blog post about its competitors—even though a single verb “blogged” initiated the two events. Event analysis can also define relationships between entities in a sentence or phrase; the phrase “Sally shot John” might establish a relationship between John and Sally of murder, where John is also categorized as the murder victim.
• Co-reference resolution—Identifies words that refer to the same entity. For example, in these two sentences— “John bought a gun. He fired the gun when he went to the shooting range.”—the “He” in the second sentence refers to “John” in the first sentence; therefore, the events in the second sentence are about John.
Reshaping the workforce with the new analytics
47
Syntactic (phrase and sentence construction) analysis • Syntactic parsing—Generates a parse tree, or the structure of sentences and phrases within a document, which can lead to helpful distinctions at the document level. Syntactic parsing often involves the concept of sentence segmentation, which builds on tokenization, or word segmentation, in which words are discovered within a string of characters. In English and other languages, words are separated by spaces, but this is not true in some languages (for instance, Chinese). • Language services—Range from translation to parsing and extracting in native languages. For global organizations, these services are a major differentiator because of the different techniques required for different languages. Document analysis • Summarization and topic identification—Summarizes (in the case of topic identification) in a few words the topic of an entire document or subsection. Summarization, by contrast, provides a longer summary of a document or subsection. • Sentiment analysis—Recognizes subjective information in a document that can be used to identify “polarity” or distinguish between entirely opposite entities and topics. This analysis is often used to determine trends in public opinion, but it also has other uses, such as determining confidence in facts extracted using NLP. • Metadata analysis—Identifies and analyzes the document source, users, dates, and times created or modified.
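A few of these techniques can be seen in miniature with the open source Natural Language Toolkit (NLTK) for Python. The sketch below tokenizes a sentence, tags parts of speech, and chunks named entities; it assumes NLTK's standard models have been downloaded, and it stands in for the far more elaborate pipelines commercial tools assemble.

import nltk  # open source Natural Language Toolkit
# Assumes the standard tokenizer, tagger, and chunker models have
# already been fetched once with nltk.download().

sentence = "John Smith blogged about Acme Corporation from Oslo on Tuesday."

tokens = nltk.word_tokenize(sentence)   # word segmentation (tokenization)
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
entities = nltk.ne_chunk(tagged)        # named entity recognition and classification

print(tagged[:3])   # e.g. [('John', 'NNP'), ('Smith', 'NNP'), ('blogged', 'VBD')]
print(entities)     # a tree that typically labels "John Smith" PERSON and "Oslo" GPE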
48
PwC Technology Forecast 2012 Issue 1
NLP applications require the use of several of these techniques together. Some of the most compelling NLP applications for social media analytics include enhanced extraction, filtered keyword search, social graph analysis, and predictive and sentiment analysis. Enhanced extraction NLP tools are being used to mine both the text and the metadata in social media. For example, the inTTENSITY Social Media Command Center (SMCC) integrates Attensity Analyze with Inxight ThingFinder—both established tools—to provide a parser for social media sources that include metadata and text. The inTTENSITY solution uses Attensity Analyze for predicate analysis to provide relationship and event analysis, and it uses ThingFinder for noun identification. Filtered keyword search Many keyword search methods exist. Most require lists of keywords to be defined and generated. Documents containing those words are matched. WordStream is one of the prominent tools in keyword search for SMI. It provides several ways for enterprises to filter keyword searches. Social graph analysis Social graphs assist in the study of a subject of interest, such as a customer, employee, or brand. These graphs can be used to: • Determine key influencers in each major node section • Discover if one aspect of the brand needs more attention than others • Identify threats and opportunities based on competitors and industry • Provide a model for collaborative brainstorming
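At its simplest, finding key influencers is a graph computation. The sketch below is a minimal illustration rather than any vendor's method; it uses the open source NetworkX library to rank the accounts in a tiny mention network by degree centrality.

import networkx as nx  # open source graph analysis library

# Each edge records that one account mentioned or replied to another (illustrative data).
mentions = [
    ("ana", "brand_x"), ("bo", "brand_x"), ("cai", "brand_x"),
    ("bo", "ana"), ("cai", "ana"), ("dee", "ana"), ("dee", "bo"),
]

g = nx.Graph()
g.add_edges_from(mentions)

# Degree centrality: the share of the network each account touches directly.
centrality = nx.degree_centrality(g)
ranked = sorted(centrality, key=centrality.get, reverse=True)
print(ranked[:3])   # the accounts most worth watching first, e.g. ['ana', 'brand_x', 'bo']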
Many NLP-based social graph tools extract and classify entities and relationships in accordance with a defined ontology or graph. But some social media graph analytics vendors, such as Nexalogy Environics, rely on more flexible approaches outside standard NLP. “NLP rests upon what we call static ontologies—for example, the English language represented in a network of tags on about 30,000 concepts could be considered a static ontology,” Claude Théoret, president of Nexalogy Environics, explains. “The problem is that the moment you hit something that’s not in the ontology, then there’s no way of figuring out what the tags are.” In contrast, Nexalogy Environics generates an ontology for each data set, which makes it possible to capture meaning missed by techniques that are looking just for previously defined terms. “That’s why our stuff is not quite real time,” he says, “because the amount of number crunching you have to do is huge and there’s no human intervention whatsoever.” (For an example of Nexalogy’s approach, see the article, “The third wave of customer analytics,” on page 06.)
Predictive analysis and early warning
Predictive analysis can take many forms, and NLP can be involved, or it might not be. Predictive modeling and statistical analysis can be used effectively without the help of NLP to analyze a social network and find and target influencers in specific areas. Before he came to PwC, Mark Paich, a director in the firm’s advisory service, did some agent-based modeling3 for a Los Angeles–based manufacturer that hoped to
3 Agent-based modeling is a means of understanding the behavior of a system by simulating the behavior of individual actors, or agents, within that system. For more on agent-based modeling, see the article “Embracing unpredictability” and the interview with Mark Paich, “Using simulation tools for strategic decision making,” in Technology Forecast 2010, Issue 1, http://www.pwc.com/us/en/technology-forecast/ winter2010/index.jhtml, accessed February 14, 2012.
change public attitudes about its products. “We had data on what products people had from the competitors and what products people had from this particular firm. And we also had some survey data about attitudes that people had toward the product. We were able to say something about what type of people, according to demographic characteristics, had different attitudes.” Paich’s agent-based modeling effort matched attitudes with the manufacturer’s product types. “We calibrated the model on the basis of some fairly detailed geographic data to get a sense as to whose purchases influenced whose purchases,” Paich says. “We didn’t have direct data that said, ‘I influence you.’ We made some assumptions about what the network would look like, based on studies of who talks to whom. Birds of a feather flock together, so people in the same age groups who have other things in common
Reshaping the workforce with the new analytics
49
tend to talk to each other. We got a decent approximation of what a network might look like, and then we were able to do some statistical analysis.”
That statistical analysis helped with the influencer targeting. According to Paich, “It said that if you want to sell more of this product, here are the key neighborhoods. We identified the key neighborhood census tracts you want to target to best exploit the social network effect.”
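A toy version of such a model fits in a few lines. The sketch below illustrates the general idea of agent-based diffusion rather than Paich's actual model: a seed adopter sits in a small "birds of a feather" network, and each round every conversation has some chance of converting a neighbor.

import random

random.seed(7)

# Toy network: each person lists the neighbors they talk to (illustrative data).
network = {
    "p1": ["p2", "p3"], "p2": ["p1", "p3", "p4"], "p3": ["p1", "p2", "p5"],
    "p4": ["p2", "p5", "p6"], "p5": ["p3", "p4", "p6"], "p6": ["p4", "p5"],
}

adopters = {"p1"}    # initial seed
influence = 0.4      # assumed chance that a conversation converts a neighbor

for week in range(5):
    converts = {n for person in adopters for n in network[person]
                if n not in adopters and random.random() < influence}
    adopters |= converts
    print("week %d: %d adopters" % (week + 1, len(adopters)))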
PwC Technology Forecast 2012 Issue 1
Nearly all SMI products provide some form of timeline analysis of social media traffic with historical analysis and trending predictions.
Predictive modeling is helpful when the level of specificity needed is high (as in the Los Angeles manufacturer’s example), and it’s essential when the cost of a wrong decision is high.4 But in other cases, less formal social media intelligence collection and analysis are often sufficient. When it comes to brand awareness, NLP can help provide context surrounding a spike in social media traffic about a brand or a competitor’s brand.
That spike could be a key data point to initiate further action or research to remediate a problem before it gets worse or to take advantage of a market opportunity before a competitor does. (See the article, “The third wave of customer analytics,” on page 06.) Because social media is typically faster than other data sources in delivering early indications, it’s becoming a preferred means of identifying trends.
Sentiment analysis
Even when overall social media traffic is within expected norms or predicted trends, the difference between positive, neutral, and negative sentiment can stand out. Sentiment analysis can suggest whether a brand, customer support, or a service is better or worse than normal. Correlating sentiment to recent changes in product assembly, for example, could provide essential feedback.
Most customer sentiment analysis today is conducted only with statistical analysis. Government intelligence agencies have led with more advanced methods that include semantic analysis. In the US intelligence community, media intelligence generally provides early indications of events important to US interests, such as assessing the impact of terrorist activities on voting in countries the United States is aiding, or mining social media for early indications of a disease outbreak. In these examples, social media prove to be one of the fastest, most accurate sources for this analysis.
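A bare-bones version of the statistical approach shows why it is so widely used and where it runs out of road. The Python sketch below scores posts against a tiny hand-built polarity lexicon; it is an illustration of the simplest counting method, not of the semantic analysis the intelligence community layers on top.

import re

POSITIVE = {"love", "great", "fast", "reliable", "happy"}
NEGATIVE = {"broken", "slow", "hate", "refund", "disappointed"}

def polarity(post):
    """Score a post as positive (>0), negative (<0), or neutral (0)."""
    words = re.findall(r"[a-z']+", post.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = [
    "Love the new model, great battery and fast shipping",
    "Second unit arrived broken, want a refund",
]
for p in posts:
    print(polarity(p), p)   # 3 for the first post, -2 for the second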
4 For more information on best practices for the use of predictive analytics, see Putting predictive analytics to work, PwC white paper, January 2012, http:// www.pwc.com/us/en/increasing-it-effectiveness/ publications/predictive-analytics-to-work.jhtml, accessed February 14, 2012.
50
Many companies mine social media to determine who the key influencers are for a particular product. But mining the context of the conversations via interest graph analysis is important. “As Clay Shirky pointed out in 2003, influence is only influential within a context,” Théoret says.
Table 1: A few NLP best practices

Strategy: Mine the aggregated data.
Description: Many tools monitor individual accounts. Clearly enterprises need more than individual account monitoring.
Benefits: Scalability and efficiency of the mining effort are essential.

Strategy: Segment the interest graph in a meaningful way.
Description: Regional segmentation, for instance, is important because of differences in social media adoption by country.
Benefits: Orkut is larger than Facebook in Brazil, for instance, and Qzone is larger in China. Global companies need global social graph data.

Strategy: Conduct deep parsing.
Description: Deep parsing takes advantage of a range of NLP extraction techniques rather than just one.
Benefits: Multiple extractors that use the best approaches in individual areas—such as verb analysis, sentiment analysis, named entity recognition, language services, and so forth—provide better results than the all-in-one approach.

Strategy: Align internal models to the social model.
Description: After mining the data for social graph clues, the implicit model that results should be aligned to the models used for other data sources.
Benefits: With aligned customer models, enterprises can correlate social media insights with logistics problems and shipment delays, for example. Social media serves in this way as an early warning or feedback mechanism.

Strategy: Take advantage of alternatives to mainstream NLP.
Description: Approaches outside the mainstream can augment mainstream tools.
Benefits: Tools that take a bottom-up approach and surface more flexible ontologies, for example, can reveal insights other tools miss.
NLP-related best practices After considering the breadth of NLP, one key takeaway is to make effective use of a blend of methods. Too simple an approach can’t eliminate noise sufficiently or help users get to answers that are available. Too complicated an approach can filter out information that companies really need to have. Some tools classify many different relevant contexts. ListenLogic, for example, combines lexical, semantic, and statistical analysis, as well as models the company has developed to establish specific industry context. “Our models are built on seeds from analysts with years of experience in each industry. We can put in the word ‘Escort’ or ‘Suburban,’ and then behind that put a car brand such as ‘Ford’ or ‘Chevy,’” says Vince Schiavone, co-founder and executive chairman of ListenLogic. “The models combined could be strings of 250 filters of various types.” The models fall into five categories:
• Direct concept filtering—Filtering based on the language of social media • Ontological—Models describing specific clients and their product lines • Action—Activity associated with buyers of those products • Persona—Classes of social media users who are posting • Topic—Discovery algorithms for new topics and topic focusing Other tools, including those from Nexalogy Environics, take a bottom-up approach, using a data set as it comes and, with the help of several proprietary universally applicable algorithms, processing it with an eye toward categorization on the fly. Equally important, Nexalogy’s analysts provide interpretations of the data that might not be evident to customers using the same tool. Both kinds of tools have strengths and weaknesses. Table 1 summarizes some of the key best practices when collecting SMI.
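The "strings of filters" Schiavone describes can be pictured as a chain of predicates, each narrowing the stream. The Python below is a two-filter toy, not ListenLogic's technology: a model term such as "escort" or "suburban" counts only when a brand term appears in the same post, which is what keeps an Escort post about cars rather than something else entirely.

posts = [
    "Thinking about trading my Ford Escort for something bigger",
    "Booked an escort to the awards dinner",            # same word, different context
    "The Chevy Suburban handled the mountain pass easily",
]

model_terms = {"escort", "suburban"}
brand_terms = {"ford", "chevy", "chevrolet"}

def passes(post, *filters):
    """A post survives the chain only if every filter set matches at least one word."""
    words = set(post.lower().split())
    return all(words & f for f in filters)

relevant = [p for p in posts if passes(p, model_terms, brand_terms)]
print(relevant)   # keeps the Ford Escort and Chevy Suburban posts, drops the other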
Reshaping the workforce with the new analytics
51
Conclusion: A machine-assisted and iterative process, rather than just processing alone Good analysis requires consideration of a number of different clues and quite a bit of back-and-forth. It’s not a linear process. Some of that process can be automated, and certainly it’s in a company’s interest to push the level of automation. But it’s also essential not to put too much faith in a tool or assume that some kind of automated service will lead to insights that are truly game changing. It’s much more likely that the tool provides a way into some far more extensive investigation, which could lead to some helpful insights, which then must be acted upon effectively. One of the most promising aspects of NLP adoption is the acknowledgment that structuring the data is necessary to help machines interpret it. Developers have gone to great lengths to see how much knowledge they can extract with the help of statistical analysis methods, and it still has legs. Search engine companies, for example, have taken pure
statistical analysis to new levels, making it possible to pair a commonly used phrase in one language with a phrase in another based on some observation of how frequently those phrases are used. So statistically based processing is clearly useful. But it’s equally clear from seeing so many opaque social media analyses that it’s insufficient. Structuring textual data, as with numerical data, is important. Enterprises cannot get to the web of data if the data is not in an analysis-friendly form—a database of sorts. But even when something materializes resembling a better described and structured web, not everything in the text of a social media conversation will be clear. The hope is to glean useful clues and starting points from which individuals can begin their own explorations. Perhaps one of the more telling trends in social media is the rise of online word-of-mouth marketing and other similar approaches that borrow from anthropology. So-called social ethnographers are monitoring how online business users behave, and these ethnographers are using NLP-based tools to land them in a neighborhood of interest and help them zoom in once there. The challenge is how to create a new social science of online media, one in which the tools are integrated with the science.
52
PwC Technology Forecast 2012 Issue 1
An in-memory appliance to explore graph data YarcData’s uRiKA analytics appliance,1 announced at O’Reilly’s Strata data science conference in March 2012, is designed to analyze the relationships between nodes in large graph data sets. To accomplish this feat, the system can take advantage of as much as 512TB of DRAM and 8,192 processors with over a million active threads. In-memory appliances like these allow very large data sets to be stored and analyzed in active or main memory, avoiding memory swapping to disk that introduces lots of latency. It’s possible to load full business intelligence (BI) suites, for example, into RAM to speed up the response time as much as 100 times. (See “What in-memory technology does” on page 33 for more information on in-memory appliances.) With compression, it’s apparent that analysts can query true big data (data sets of greater than 1PB) directly in main memory with appliances of this size. Besides the sheer size of the system, uRiKA differs from other appliances because it’s designed to analyze graph data (edges and nodes) that take the form of subject-verb-object triples. This kind of graph data can describe relationships between people, places, and things scalably. Flexible and richly described data relationships constitute an additional data dimension users can mine, so it’s now possible, for example, to query for patterns evident in the graphs that aren’t evident otherwise, whether unknown or purposely hidden.2
1 The YarcData uRiKA Graph Appliance: Big Data Relationship Analytics, Cray white paper, http://www. yarcdata.com/productbrief.html, March 2012, accessed April 3, 2012.
But mining graph data, as YarcData (a unit of Cray) explains, demands a system that can process graphs without relying on caching, because mining graphs requires exploring many alternative paths individually with the help of millions of threads—a very memory- and processor-intensive task. Putting the full graph in a single random access memory space makes it possible to query it and retrieve results in a timely fashion. The first customers for uRiKA are government agencies and medical research institutes like the Mayo Clinic, but it’s evident that social media analytics developers and users would also benefit from this kind of appliance. Mining the social graph and the larger interest graph (the relationships between people, places, and things) is just beginning.3 Claude Théoret of Nexalogy Environics has pointed out that crunching the relationships between nodes at web scale hasn’t previously been possible. Analyzing the nodes themselves only goes so far.
3 See “The collaboration paradox,” Technology Forecast 2011, Issue 3, http://www.pwc.com/us/en/technologyforecast/2011/issue3/features/feature-socialinformation-paradox.jhtml#, for more information on the interest graph.
2 Michael Feldman, “Cray Parlays Supercomputing Technology Into Big Data Appliance,” Datanami, March 2, 2012, http://www.datanami.com/ datanami/2012-03-02/cray_parlays_supercomputing_ technology_into_big_data_appliance.html, accessed April 3, 2012.
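Subject-verb-object triples, and the pattern queries that run against them, can be illustrated in ordinary Python. The toy below shows only the shape of graph querying, not uRiKA's interface or anything close to its scale.

# Each fact is a (subject, verb, object) triple; the data is invented for the example.
triples = [
    ("alice", "works_for", "acme"),
    ("bob",   "works_for", "acme"),
    ("alice", "emailed",   "carol"),
    ("carol", "works_for", "globex"),
]

def match(pattern):
    """Return the triples that fit a pattern; None is a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Who has Alice emailed, and which of those contacts work somewhere other than Acme?
contacts = {obj for _, _, obj in match(("alice", "emailed", None))}
outside = [t for t in match((None, "works_for", None))
           if t[0] in contacts and t[2] != "acme"]
print(outside)   # [('carol', 'works_for', 'globex')]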
Reshaping the workforce with the new analytics
53
The payoff from interactive visualization Jock Mackinlay of Tableau Software discusses how more of the workforce has begun to use analytics tools. Interview conducted by Alan Morrison
Jock Mackinlay Jock Mackinlay is the director of visual analysis at Tableau Software.
PwC: When did you come to Tableau Software? JM: I came to Tableau in 2004 out of the research world. I spent a long time at Xerox Palo Alto Research Center working with some excellent people— Stuart Card and George Robertson, who are both recently retired. We worked in the area of data visualization for a long time. Before that, I was at Stanford University and did a PhD in the same area—data visualization. I received a Technical Achievement Award for that entire body of work from the IEEE organization in 2009. I’m one of the lucky few people who had the opportunity to take his research out into the world into a successful company. PwC: Our readers might appreciate some context on the whole area of interactive visualization. Is the innovation in this case task automation? JM: There’s a significant limit to how we can automate. It’s extremely difficult to understand what a person’s task is and what’s going on in their head. When I finished my dissertation, I chose a mixture of automated techniques plus giving humans a lot of power over thinking with data. And that’s the Tableau philosophy too. We want to provide people with good defaulting as best we can but also make it easy for people to make adjustments as their tasks change. When users are in the middle of looking at some data, they might change their minds about what questions they’re asking. They need to head toward that new question on the fly. No automated system is going to keep up with the stream of human thought.
54
PwC Technology Forecast 2012 Issue 1
PwC: Humans often don’t know themselves what question they’re ultimately interested in. JM: Yes, it’s an iterative exploration process. You cannot know up front what question a person may want to ask today. No amount of pre-computation or work by an IT department is going to be able to anticipate all the possible ways people might want to work with data. So you need to have a flexible, human-centered approach to give people a maximal ability to take advantage of data in their jobs. PwC: What did your research uncover that helps? JM: Part of the innovation of the dissertation at Stanford was that the algebra enables a simple drag-anddrop interface that anyone can use. They drag fields and place them in rows and columns or whatnot. Their actions actually specify an algebraic expression that gets compiled into a query database. But they don’t need to know all that. They just need to know that they suddenly get to see their data in a visual form. PwC: One of the issues we run into is that user interfaces are often rather cryptic. Users must be well versed in the tool from the designer’s perspective. What have you done to make it less cryptic, to make what’s happening more explicit, so that users don’t present results that they think are answering their questions in some way but they’re not? JM: The user experience in Tableau is that you connect to your data and
you see the fields on the side. You can drag out the fields and drop them on row, column, color, size, and so forth. And then the tool generates the graphical views, so users can see the data visualization. They’re probably familiar with their data. Most people are if they’re working with data that they care about. The graphical view by default codifies the best practices for putting data in the view. For example, if the user dragged out a profit and date measure, because it’s a date field, we would automatically generate a line mark and give that user a trend line view because that’s best practice for profit varying over time. If instead they dragged out product and profit, we would give them a bar graph view because that’s an appropriate way to show that information. If they selected a geographic field, they’ll get a map view because that’s an appropriate way to show geography. We work hard to make it a rapid exploration process, because not only are tables and numbers difficult for humans to process, but also because a slow user experience will interrupt cognition and users can’t answer the questions. Instead, they’re spending the time trying to make the tool work. The whole idea is to make the tool an extension of your hand. You don’t think about the hammer. You just think about the job of building a house.
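The defaulting Mackinlay describes can be thought of as a small rule table keyed on the types of the fields a user drags out. The sketch below is a drastically simplified illustration of that idea in Python, not Tableau's actual logic.

def default_chart(field_types):
    """Pick a chart type from the types of the dragged fields (toy rules only)."""
    types = set(field_types)
    if "geography" in types:
        return "map"
    if "date" in types and "measure" in types:
        return "line"   # a trend over time
    if "category" in types and "measure" in types:
        return "bar"
    return "table"

print(default_chart(["date", "measure"]))        # line
print(default_chart(["category", "measure"]))    # bar
print(default_chart(["geography", "measure"]))   # map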
PwC: Are there categories of more structured data that would lend themselves to this sort of approach? Most of this data presumably has been processed to the point where it could be fed into Tableau relatively easily and then worked with once it’s in the visual form. JM: At a high level, that’s accurate. One of the other key innovations of the dissertation out of Stanford by Chris Stolte and Pat Hanrahan was that they built a system that could compile those algebraic expressions into queries on databases. So Tableau is good with any information that you would find in a database, both SQL databases and MDX databases. Or, in other words, both relational databases and cube databases. But there is other data that doesn’t necessarily fall into that form. It is just data that’s sitting around in text files or in spreadsheets and hasn’t quite got into a database. Tableau can access that data pretty well if it has a basic table structure to it. A couple of releases ago, we introduced what we call data blending. A lot of people have lots of data in lots of databases or tables. They might be text files. They might be Microsoft Access files. They might be in SQL or Hyperion Essbase. But whatever it is, their questions often span across those tables of data.
Reshaping the workforce with the new analytics
55
Normally, the way to address that is to create a federated database that joins the tables together, which is a six-month or greater IT effort. It’s difficult to query across multiple data tables from multiple databases. Data blending is a way—in a lightweight drag-and-drop way—to bring in data from multiple sources.
Imagine you have a spreadsheet that you’re using to keep track of some information about your products, and you have your company-wide data mart that has a lot of additional information about those products. And you want to combine them. You can direct connect Tableau to the data mart and build a graphical view.
Then you can connect to your spreadsheet, and maybe you build a view about products. Or maybe you have your budget in your spreadsheet and you would like to compare the actuals to the budget you’re keeping in your spreadsheet. It’s a simple drag-and-drop operation or a simple calculation to do that.
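Outside a drag-and-drop tool, the blending step Mackinlay describes looks like a join on a shared field. The sketch below uses the open source pandas library rather than Tableau, and the products and figures are invented for the illustration.

import pandas as pd  # open source data analysis library

# Illustrative sources: a budget kept in a spreadsheet, actuals pulled from a data mart.
budget = pd.DataFrame({"product": ["A", "B", "C"], "budget": [100, 150, 80]})
actuals = pd.DataFrame({"product": ["A", "B", "C"], "actual": [112, 138, 95]})

blended = pd.merge(actuals, budget, on="product")   # the "blend" is a join on a shared field
blended["variance"] = blended["actual"] - blended["budget"]
print(blended)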
56
PwC Technology Forecast 2012 Issue 1
So, you asked me this big question about structured to unstructured data.
PwC: That’s right.
JM: We have functionality that allows you to generate additional structure for data that you might have brought in. One of the features gives you the ability—in a lightweight way—to combine fields that are related to each other, which we call grouping. At a fundamental level, it’s a way you can build up a hierarchical structure out of a flat dimension easily by grouping fields together. We also have some lightweight support for supporting those hierarchies.
We’ve also connected Tableau to Hadoop. Do you know about it?
PwC: We wrote about Hadoop in 2010. We did a full issue on it as a matter of fact.1
JM: We’re using a connector to Hadoop that Cloudera built that allows us to write SQL and then access data via the Hadoop architecture.
The key problem about it from a human performance point of view is that there’s high latency. It takes a long time for the programs to run and process the data. We’re interested in helping people answer their questions at the speed of their thought. And so latency is a killer.
In particular, whenever we do demos on stage, we like to look for real data sets. We found one from Kiva, the online micro-lending organization. Kiva published the huge XML file describing all of the organization’s lenders and all of the recipients of the loans. This is an XML file, so it’s not your normal structured data set. It’s also big, with multiple years and lots of details for each.
We processed that XML file in Hadoop and used our connector, which has string functions. We used those string functions to reach inside the XML and pull out what would be all the structured data about the lenders, their location, the amount, and the borrower right down to their photographs. And we built a graphical view in Tableau. We sliced and diced it first and then built some graphical views for the demo.
We used the connection to process the XML file and build a Tableau extract file. That file runs on top of our data engine, which is a high-performance columnar database system. Once we had the data in the Tableau extract format, it was drag and drop at human speed. We’re heading down this vector, but this is where we are right now in terms of being able to process less-structured information into a form that you could then use Tableau on effectively.
PwC: Interesting stuff. What about in-memory databases and how large they’re getting?
JM: Anytime there’s a technology that can process data at fast rates, whether it’s in-memory technology, columnar databases, or what have you, we’re excited. From its inception, Tableau involved direct connecting to databases and making it easy for anybody to be able to work with it. We’re not just about self-analytics; we’re also about data storytelling. That can have as much impact on the executive team as directly being able themselves to answer their own questions.
PwC: Is more of the workforce doing the analysis now?
JM: I just spent a week at the Tableau Customer Conference, and the people that I meet are extremely diverse. They’re not just the hardcore analysts who know about SPSS and R. They come from all different sizes of companies and nonprofits and on and on.
1 See “Making sense of Big Data,” Technology Forecast 2010, Issue 3, http://www.pwc.com/us/en/technologyforecast/2010/issue3/index.jhtml, for more information.
And the people at the customer conferences are pretty passionate. I think part of the passion is the realization that you can actually work with data. It doesn’t have to be this horribly arduous process. You can rapidly have a conversation with your data and answer your questions. Inside Tableau, we use Tableau everywhere—from the receptionist who’s tracking utilization of all the conference rooms to the sales team that’s monitoring their pipeline. My major job at Tableau is on the team that does forward product direction. Part of that work is to make the product easier to use. I love that I have authentic users all over the company and I can ask them, “Would this feature help?” So yes, I think the focus on the workforce is essential. The trend here is that data is being collected by our computers almost unmanned, no supervision necessary. It’s the process of utilizing that data that is the game changer. And the only way you’re going to do that is to put the data in the hands of the individuals inside your organization.
Reshaping the workforce with the new analytics
57
Palm tree nursery. Palm oil is being tested to be used in aviation fuel
58
PwC Technology Forecast 2012 Issue 1
How CIOs can build the foundation for a data science culture Helping to establish a new culture of inquiry can be a way for these executives to reclaim a leadership role in information. By Bud Mathaisel and Galen Gruman
The new analytics requires that CIOs and IT organizations find new ways to engage with their business partners. For all the strategic opportunities new analytics offers the enterprise, it also threatens the relevance of the CIO. The threat comes from the fact that the CIO’s business partners are being sold data analytics services and software outside normal IT procurement channels, which cuts out of the process the very experts who can add real value. Perhaps the vendors’ user-centric view is based on the premise that only users in functional areas can understand which data and conclusions from its analysis are meaningful. Perhaps the CIO and IT have not demonstrated the value they can offer, or they have dwelled too much on controlling security or costs to the detriment of showing the value IT can add. Or perhaps only the user groups have the funding to explore new analytics.
Whatever the reasons, CIOs must rise above them and find ways to provide important capabilities for new analytics while enjoying the thrill of analytics discovery, if only vicariously. The IT organization can become the go-to group, and the CIO can become the true information leader. Although it is a challenge, the new analytics is also an opportunity because it is something within the CIO’s scope of responsibility more than nearly any other development in information technology. The new analytics needs to be treated as a long-term collaboration between IT and business partners—similar to the relationship PwC has advocated1 for the general consumerization-of-IT phenomenon invoked by mobility, social media, and cloud services. This tight collaboration can be a win for the business and for the CIO. The new analytics is a chance for the CIO to shine, reclaim the “I” leadership in CIO, and provide a solid footing for a new culture of inquiry. 1 The consumerization of IT: The next-generation CIO, PwC white paper, November 2011, http:// www.pwc.com/us/en/technology-innovation-center/ consumerization-information-technology-transformingcio-role.jhtml, accessed February 1, 2012.
Reshaping the workforce with the new analytics
59
The many ways for CIOs to be new analytics leaders In businesses that provide information products or services—such as healthcare, finance, and some utilities— there is a clear added value from having the CIO directly contribute to the use of new analytics. Consider Edwards Lifesciences, where hemodynamic (blood circulation) modeling has benefited from the convergence of new data with new tools to which the CIO contributes. New digitally enabled medical devices, which are capable of generating a continuous flow of data, provide the opportunity to measure, analyze, establish pattern boundaries, and suggest diagnoses.
60
PwC Technology Forecast 2012 Issue 1
“In addition, a personal opportunity arises because I get to present our newest product, the EV1000, directly to our customers alongside our business team,” says Ashwin Rangan, CIO of Edwards Lifesciences. Rangan leverages his understanding of the underlying technologies, and, as CIO, he helps provision the necessary information infrastructure. As CIO, he also has credibility with customers when he talks to them about the information capabilities of Edwards’ products. For CIOs whose businesses are not in information products or services, there’s still a reason to engage in the new analytics beyond the traditional areas of enablement and of governance, risk, and compliance (GRC). That reason is to establish long-term relationships with the business partners. In this partnership, the business users decide which analytics are meaningful, and the IT professionals consult with them on the methods involved, including provisioning the data and tools. These CIOs may be less visible outside the enterprise, but they have a crucial role to play internally to jointly explore opportunities for analytics that yield useful results.
E. & J. Gallo Winery takes this approach. Its senior management understood the need for detailed customer analytics. “IT has partnered successfully with Gallo’s marketing, sales, R&D, and distribution to leverage the capabilities of information from multiple sources. IT is not the focus of the analytics; the business is,” says Kent Kushar, Gallo’s CIO. “After working together with the business partners for years, Gallo’s IT recently reinvested heavily in updated infrastructure and began to coalesce unstructured data with the traditional structured consumer data.” (See “How the E. & J. Gallo Winery matches outbound shipments to retail customers” on page 11.)
Regardless of the CIO’s relationship with the business, many technical investments IT makes are the foundation for new analytics. A CIO can often leverage this traditional role to lead new analytics from behind the scenes. But doing even that—rather than leading from the front as an advocate for business-valuable analytics—demands new skills, new data architectures, and new tools from IT.
At Ingram Micro, a technology distributor, CIO Mario Leone views a well-integrated IT architecture as a critical service to business partners to support the company’s diverse and dynamic sales model and what Ingram Micro calls the “frontier” analysis of distribution logistics. “IT designs the modular and scalable backplane architecture to deliver real-time and relevant analytics,” he says. On one side of the backplane are multiple data sources, primarily delivered through partner interactions; on the flip side of the backplane are analytics tools and capabilities, including such new features as pattern recognition, optimization, and visualization. Taken together, the flow of multiple data streams from different points and advanced tools for business users can permit more sophisticated and iterative analyses that give greater insight to product mix offerings, changing customer buying patterns, and electronic channel delivery preferences. The backplane is a convergence point of those data into a coherent repository. (See Figure 1.)

Figure 1: A CIO’s situationally specific roles. Multiple data sources feed the input side of the backplane; marketing, sales, distribution, and research and development sit on the output side. CIO #1 focuses on inputs when production innovation, for example, is at a premium; CIO #2 focuses on outputs when sales or marketing, for example, is the major concern.
Given these multiple ways for CIOs to engage in the new analytics—and the self-interest for doing so—the next issue is how to do it. After interviewing leading CIOs and other industry experts, PwC offers the following recommendations.
Enable the data scientist
One course of action is to strategically plan and provision the data and infrastructure for the new sources of data and new tools (discussed in the next section). However, the bigger challenge is to invoke the productive capability of the users. This challenge poses several questions:
• Analytics capabilities have been pursued for a long time, but several hurdles have hindered the attainment of the goal (such as difficult-to-use tools, limited data, and too much dependence on IT professionals). CIOs must ask: which of these impediments are eased by the new capabilities and which remain?
• How can CIOs do this without knowing in advance which users will harvest the capabilities?
Reshaping the workforce with the new analytics
61
• As analytics moves more broadly through the organization, there may be too few people trained to analyze and present data-driven conclusions. Who will be fastest up the learning curve of what to analyze, of how to obtain and process data, and of how to discover useful insights?
What the enterprise needs is the data scientist—actually, several of them. A data scientist follows a scientific method of iterative and recursive analysis, with a practical result in mind. Examples are easy to identify: an outcome that improves revenue, profitability, operations or supply chain efficiency, R&D, financing, business strategy, the use of human capital, and so forth. There is no sure way of knowing in advance where or when this insight will arrive, so it cannot be tackled in assembly line fashion with predetermined outcomes. The analytic approach involves trial and error and accepts that there will be dead ends, although a data scientist can even draw a useful conclusion—“this doesn’t work”—from a dead end. Even without formal training, some business users have the suitable skills, experience, and mind-set. Others need to be trained and encouraged to think like a scientist but behave like a—choose the function—financial analyst, marketer, sales analyst, operations quality analyst, or whatever. When it comes to repurposing parts of the workforce, it’s important to anticipate obstacles or frequent objections and consider ways to overcome them. (See Table 1.) Josée Latendresse of Latendresse Groupe Conseil says one of her clients, an apparel manufacturer based in Quebec, has been hiring PhDs to serve in this function. “They were able to know the factors and get very, very fine analysis of the information,” she says.
62
PwC Technology Forecast 2012 Issue 1
Gallo has tasked statisticians in IT, R&D, sales, and supply chain to determine what information to analyze, the questions to ask, the hypotheses to test, and where to go after that, Kushar says. The CIO has the opportunity to help identify the skills needed and then help train and support data scientists, who may not reside in IT. CIOs should work with the leaders of each business function to answer the questions: Where would information insights pay the highest dividends? Who are the likely candidates in their functions to be given access to these capabilities, as well as the training and support? Many can gain or sharpen analytic skills. The CIO is in the best position to ensure that the skills are developed and honed. The CIO must first provision the tools and data, but the data analytics requires the CIO and IT team to assume more responsibility for the effectiveness of the resources than in the past. Kushar says Gallo has a team within IT dedicated to managing and proliferating business intelligence tools, training, and processes. When major systems were deployed in the past, CIOs did their best to train users and support them, but CIOs only indirectly took responsibility for the users’ effectiveness. In data analytics, the responsibility is more directly correlated: the investments are not worth making unless IT steps up to enhance the users’ performance. Training should be comprehensive and go beyond teaching the tools to helping users establish an hypothesis, iteratively discover and look for insights from results that don’t match the hypothesis, understand the limitations of the data, and share the results with others (crowdsourcing, for example) who may see things the user does not.
Table 1: Barriers to adoption of analytics and ways to address them

Barrier: Too difficult to use
Solution: Ensure the tool and data are user friendly; use published application programming interfaces (APIs) against data warehouses; seed user groups with analytics-trained staff; offer frequent training broadly; establish an analytics help desk.

Barrier: Refusal to accept facts and resulting analysis, thereby discounting analytics
Solution: Require a 360-degree perspective and pay attention to dissenters; establish a culture of fact finding, inquiry, and learning.

Barrier: Lack of analytics incentives and performance review criteria
Solution: Make contributions to insights from analytics an explicit part of performance reviews; recognize and reward those who creatively use analytics.
Training should encompass multiple tools, since part of what enables discovery is the proper pairing of tool, person, and problem; these pairings vary from problem to problem and person to person. You want a toolset to handle a range of analytics, not a single tool that works only in limited domains and for specific modes of thinking. The CIO could also establish and reinforce a culture of information inquiry by getting involved in data analysis trials. This involvement lends direct and moral support to some of the most important people in the organization. For CIOs, the bottom line is to care for the infrastructure but focus more on the actual use of information services. Advanced analytics is adding insight and power to those services. Renew the IT infrastructure for the new analytics As with all IT investments, CIOs are accountable for the payback from analytics. For decades, much time and money has been spent on data architectures; identification of “interesting” data; collecting, filtering, storing, archiving, securing, processing, and reporting data; training users; and the associated software and hardware in pursuit of the unique
insights that would translate to improved marketing, increased sales, improved customer relationships, and more effective business operations. Because most enterprises have been frustrated by the lack of clear payoffs from large investments in data analysis, they may be tempted to treat the new analytics as not really new. This would be a mistake. As with most developments in IT, there is something old, something new, something borrowed, and possibly something blue in the new analytics. Not everything is new, but that doesn’t justify treating the new analytics as more of the same. In fact, doing so indicates that your adoption of the new analytics is merely applying new tools and perhaps personnel to your existing activities. It’s not the tool per se that solves problems or finds insights—it’s the people who are able to explore openly and freely and to think outside the box, aided by various tools. So don’t just re-create or refurbish the existing box. Even if the CIO is skeptical and believes analytics is in a major hype cycle, there is still reason to engage. At the very least, the new analytics extends IT’s prior initiatives; for example, the new analytics makes possible
Reshaping the workforce with the new analytics
63
the kind of analytics your company has needed for decades to enhance business decisions, such as complex, real-time events management, or it makes possible new, disruptive business opportunities, such as the on-location promotion of sales to mobile shoppers. Given limited resources, a portfolio approach is warranted. The portfolio should encompass many groups in the enterprise and the many functions they perform. It also should encompass the convergence of multiple data sources and multiple tools. If you follow Ingram Micro’s backplane approach, you get the data convergence side of the backplane from the combination of traditional information sources with new data sources. Traditional information sources include structured transaction data from enterprise resource planning (ERP) and customer relationship management (CRM) systems; new data sources include textual information from social media, clickstream transactions, web logs, radio frequency identification (RFID) sensors, and other forms of unstructured and/or disparate information. The analytics tools side of the backplane arises from the broad availability of new tools and infrastructure, such as mobile devices; improved in-memory systems; better user interfaces for search; significantly improved visualization technologies; improved pattern recognition, optimization, and analytics software; and the use of the cloud for storing and processing. (See the article, “The art and science of new analytics technology,” on page 30.) Understanding what remains the same and what is new is a key to profiting from the new analytics. Even for what remains the same, additional investments are required.
64
PwC Technology Forecast 2012 Issue 1
Develop the new analytics strategic plan As always, the CIO should start with a strategic plan. Gallo’s Kushar refers to the data analytics specific plan as a strategic plan for the “enterprise information fabric,” a reference to all the crossover threads that form an identifiable pattern. An important component of this fabric is the identification of the uses and users that have the highest potential for payback. Places to look for such payback include areas where the company has struggled, where traditional or nontraditional competition is making inroads, and where the data has not been available or granular enough until now. The strategic plan must include the data scientist talent required and the technologies in which investments need to be made, such as hardware and software, user tools, structured and unstructured data sources, reporting and visualization capabilities, and higher-capacity networks for moving larger volumes of data. The strategic planning process brings several benefits: it updates IT’s knowledge of emerging capabilities as well as traditional and new vendors, and it indirectly informs prospective vendors that the CIO and IT are not to be bypassed. Once the vendor channels are known to be open, the vendors will come. Criteria for selecting tools may vary by organization, but the fundamentals are the same. Tools must efficiently handle larger volumes within acceptable response times, be friendly to users and IT support teams, be sound technically, meet security standards, and be affordable. The new appliances and tools could each cost several millions of dollars, and millions more to support. The good news is some of the tools and infrastructure can be rented through the cloud, and then tested until the concepts and
You want a toolset to handle a range of analytics, not a single tool that works only in limited domains and for specific modes of thinking.
super-users have demonstrated their potential. (See the interview with Mike Driscoll on page 20.) “All of this doesn’t have to be done in-house with expensive computing platforms,” says Edwards’ Rangan. “You can throw it in the cloud … without investing in tremendous capital-intensive equipment.”

With an approved strategy, CIOs can begin to update IT’s internal capabilities. At a minimum, IT must first provision the new data, tools, and infrastructure, and then ensure the IT team is up to speed on the new tools and capabilities. Gallo’s IT organization, for example, recently reinvested heavily in new appliances; system architecture; extract, transform, and load (ETL) tools; and the ways in which SQL calls were written, and then began to coalesce unstructured data with its traditional structured consumer data.

Provision data, tools, and infrastructure

The talent, toolset, and infrastructure are prerequisites for data analytics. In the new analytics, CIOs and their
business partners are changing or extending the following:

• Data sources to include the traditional enterprise structured information in core systems such as ERP, CRM, manufacturing execution systems, and supply chain, plus newer sources such as syndicated data (point of sale, Nielsen, and so on) and unstructured data from social media and other sources—all without compromising the integrity of the production systems or their data and while managing data archives efficiently.

• Appliances to include faster processing and better in-memory caching. In-memory caching improves cycle time significantly, enabling information insights to follow human thought patterns closer to their native speeds.

• Software to include newer data management, analysis, reporting, and visualization tools—likely multiple tools, each tuned to a specific capability.
• Data architectures and flexible metadata to accommodate multiple streams of multiple types of data stored in multiple databases. In this environment, a single database architecture is unworkable.

• A cloud computing strategy that factors in the requirements of newly expanded analytics capability and how best to tap external as well as internal resources. Service-level expectations should be established for customers to ensure that these expanded sources of relevant data are always online and available in real time.

The adoption of new analytics is an opportunity for IT to augment or update the business’s current capabilities. According to Kushar, Gallo IT’s latest investments are extensions of what Gallo wanted to do 25 years ago but could not due to limited availability of data and tools. Of course, each change requires a new response from IT, and each raises the perpetual dilemma of how to be selective with investments (to conserve funds) while being as broad and heterogeneous as possible so a larger population can create analytic insights, which could come from almost anywhere.

Update IT capabilities: Leverage the cloud’s capacity

With a strategic plan in place and the tools provisioned, the next prerequisite is to ensure that the IT organization is ready to perform its new or extended job. One part of this preparation is the research on tools the team needs to undertake with vendors, consultancies, and researchers. The CIO should consider some organizational investments to add to the core human resources in IT, because once the business users get traction, IT must be prepared
to meet the increased demands for technical support. IT will need new skills and capabilities that include:

• Broader access to all relevant types of data, including data from transaction systems and new sources
• Broader use of nontraditional resources, such as big data analytics services
• Possible creation of specialized databases and data warehouses
• Competence in new tools and techniques, such as database appliances, column and row databases, compression techniques, and NoSQL frameworks
• Support in the use of tools for reporting and visualization
• Updated approaches for mobile access to data and analytic results
• New rules and approaches to data security
• Expanded help desk services

Without a parallel investment in IT skills, investments in tools and infrastructure could lie fallow, causing frustrated users to seek outside help. For example, without advanced compression and processing techniques, performance becomes a significant problem as databases grow larger and more varied. That’s an IT challenge that users would not anticipate, but it could result in a poor experience that leads them to third parties that have solved the issue (even if the users never knew what the issue was).
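The reference above to column databases and compression can be illustrated with a toy example. The sketch below is a conceptual illustration only, not any vendor’s engine: it contrasts a row-oriented layout, where an aggregate must touch every full record, with a column-oriented layout, where the same aggregate scans a single array of values that also compresses well because similar values sit together.

```python
# A minimal, illustrative sketch (not a real database engine) of why
# column-oriented storage speeds up analytic scans: an aggregate over one
# field reads a single contiguous array instead of every full record.

rows = [  # row store: each record kept together, as in an OLTP table
    {"order_id": 1, "region": "West", "amount": 120.0},
    {"order_id": 2, "region": "East", "amount": 75.5},
    {"order_id": 3, "region": "West", "amount": 210.0},
]

columns = {  # column store: one array per field
    "order_id": [1, 2, 3],
    "region": ["West", "East", "West"],
    "amount": [120.0, 75.5, 210.0],
}

# Row layout: every record is visited even though only one field is needed.
total_row_store = sum(r["amount"] for r in rows)

# Column layout: the scan reads one array, which also compresses well
# because similar values are stored together, reducing I/O on large tables.
total_column_store = sum(columns["amount"])

assert total_row_store == total_column_store
print(total_column_store)
```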
Most of the IT staff will welcome the opportunity to learn new tools and help support new capabilities, even if the first reaction might be to fret over any extra work. CIOs must lead this evolution by being a source for innovation and trends in analytics, encouraging adoption, having the courage to make the investments, demonstrating trust in IT teams and users, and ensuring that execution matches the strategy.

Conclusion

Data analytics is no longer an obscure science for specialists in the ivory tower. More analytics power is available to more people. Thanks to the new analytics, business users have been unchained from prior restrictions, and finding answers is easier, faster, and less costly. Developing insightful, actionable analytics is a necessary skill for every knowledge worker, researcher, consumer, teacher, and student. It is driven by a world in which faster insight is treasured, and insight often needs to be real time to be most effective. Fast-changing, real-time data demands real-time analytic insights and leaves no tolerance for insights from last quarter, last month, last week, or even yesterday.
Enabling the productive use of information tools is not a new obligation for the CIO, but the new analytics extends that obligation—in some cases, hyperextends it. Fulfilling that obligation requires the CIO to partner with human resources, sales, and other functional groups to establish the analytics credentials of knowledge workers and to take responsibility for their success. The CIO becomes a teacher and role model for the increasing number of data engineers, both formal and informal. Certainly, IT must do its part to plan and provision the raw enabling capabilities and handle governance, risk, and compliance (GRC), but more than ever, data analytics is the opportunity for the CIO to move out of the data center and into the front office. It is the chance for the CIO to demonstrate information leadership.
How visualization and clinical decision support can improve patient care

Ashwin Rangan details what’s different about hemodynamic monitoring methods these days.

Interview conducted by Bud Mathaisel and Alan Morrison
Ashwin Rangan
Ashwin Rangan is the CIO of Edwards Lifesciences, a medical device company.
PwC: What are Edwards Lifesciences’ main business intelligence concerns given its role as a medical device company?

AR: There’s the traditional application of BI [business intelligence], and then there’s the instrumentation part of our business that serves many different clinicians in the OR [operating room] and ICU [intensive care unit]. We make a hemodynamic [blood circulation and cardiac function] monitoring platform that communicates valuable information and hemodynamic parameters to the clinician through a variety of visualization tools and a rich graphical user interface. The clinician can use this information to make treatment decisions for his or her patients.

PwC: You’ve said that the form in which the device provides information adds value for the clinician or guides the clinician. What does the monitoring equipment do in this case?

AR: The EV1000 Clinical Platform provides information in a more meaningful way, intended to better inform the treating clinician and lead to earlier and better diagnosis and care. In the critical care setting, the earlier the clinician can identify an issue, the more choices the clinician has when treating the patient. The instrument’s intuitive screens and physiologic displays are also ideal for teaching, presenting the various hemodynamic parameters in the context of one another. Ultimately, the screens are intended to offer a more comprehensive view of the patient’s status in a very intuitive, user-friendly format.
PwC: How does this approach compare with the way the monitoring was done before?

AR: Traditional monitoring historically presented physiologic information, in this case hemodynamic parameters, in the form of a number and in some cases a trend line. When a parameter fell out of the defined target zones, the clinician would be alerted with an alarm and left to determine the best course of action based on the displayed number or line. By comparison, the EV1000 clinical platform can show physiologic animations and physiologic decision trees to better inform and guide the treating clinician, whether a physician or a nurse.

PwC: How did the physician view the information before?

AR: It has been traditional in movies, for example, to see a patient surrounded by devices displaying parameters, all of which looked like numbers and jagged lines on a timescale. In our view, and given where we are in the development of our technology, this is more basic hemodynamic monitoring. In our experience, “new-school” hemodynamic monitoring is a device that presents the dynamics of the circulatory system, the dampness of the lungs, and the cardiac output in real time in an intuitive display. The only lag between what’s happening in the patient and what’s being reflected on the monitor is the time between the analog body and the digital rendering.
PwC: Why is visualization important to this process?

AR: Before, when we constructed these monitors, we tended to tell doctors and nurses to think like engineers. Now we’ve taken inspiration from the glass display in Minority Report [a 2002 science-fiction movie] and let it influence the design of the EV1000 clinical platform screens. The EV1000 clinical platform is unlike any other monitoring tool because you have the ability to customize display screens to present parameters, color codes, time frames, and more according to specific patient needs and/or clinician preferences, truly offering clinicians what they need, when they need it, and how they need it. We are no longer asking clinicians to translate the next step in their heads. The goal now is to have the engineer reflect the data and articulate it in a contextual and intuitive language for the clinician. The clinician is already under pressure, caring for critically ill patients; our goal is to alleviate unnecessary pressure and provide not just information but also guidance, enabling the clinician to navigate more immediately to the best therapy decisions.

PwC: Looking toward the next couple of years and some of the emerging technical capability, what do you think is most promising?

AR: Visualization technologies. The human ability to discern patterns is not changing. That gap can only be bridged by rendering technologies that are visual in nature. And the visualization varies
Figure 1: Edwards Lifesciences EV1000 wireless monitor
Patton Design helped develop this monitor, which displays a range of blood-circulation parameters very simply.
Source: Patton Design, 2012
depending on the kind of statistics that people are looking to understand. I think we need to look at this more broadly and not just print bar graphs or pie graphs. What is the visualization that can really be contextually applicable with different applications? How do you make it easier? And more quickly understood?
To have a deeper conversation about this subject, please contact:
Tom DeGarmo US Technology Consulting Leader +1 (267) 330 2658 thomas.p.degarmo@us.pwc.com
Bill Abbott Principal, Applied Analytics +1 (312) 298 6889 william.abbott@us.pwc.com
Bo Parker Managing Director Center for Technology & Innovation +1 (408) 817 5733 bo.parker@us.pwc.com
Oliver Halter Principal, Applied Analytics +1 (312) 298 6886 oliver.halter@us.pwc.com
Robert Scott Global Consulting Technology Leader +1 (416) 815 5221 robert.w.scott@ca.pwc.com
Comments or requests? Please visit www.pwc.com/techforecast or send e-mail to techforecasteditors@us.pwc.com
This publication is printed on McCoy Silk. It is a Forest Stewardship Council™ (FSC®) certified stock containing 10% postconsumer waste (PCW) fiber and manufactured with 100% certified renewable energy. By using postconsumer recycled fiber in lieu of virgin fiber:
• 6 trees were preserved for the future
• 16 lbs of waterborne waste were not created
• 2,426 gallons of wastewater flow were saved
• 268 lbs of solid waste were not generated
• 529 lbs net of greenhouse gases were prevented
• 4,046,000 BTUs of energy were not consumed
Photography Catherine Hall: Cover, pages 06, 20 Gettyimages: pages 30, 44, 58
PwC (www.pwc.com) provides industry-focused assurance, tax and advisory services to build public trust and enhance value for its clients and their stakeholders. More than 155,000 people in 153 countries across our network share their thinking, experience and solutions to develop fresh perspectives and practical advice. © 2012 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. NY-12-0340
www.pwc.com/techforecast
Subtext Culture of inquiry
A business environment focused on asking better questions, getting better answers to those questions, and using the results to inform continual improvement. A culture of inquiry infuses the skills and capabilities of data scientists into business units and compels a collaborative effort to find answers to critical business questions. It also engages the workforce at large—whether or not the workforce is formally versed in data analysis methods—in enterprise discovery efforts.
In-memory
A method of running entire databases in random access memory (RAM) without direct reliance on disk storage. In this scheme, large amounts of dynamic random access memory (DRAM) constitute the operational memory, and an indirect backup method called write-behind caching is the only disk function. Running databases or entire suites in memory speeds up queries by eliminating the need to perform disk writes and reads for immediate database operations.
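As a rough illustration of the write-behind idea described above, the following sketch keeps all reads and writes in memory and lets a background thread persist changes to an append-only log afterward. It is a deliberately simplified, hypothetical example (the class name, log format, and single flush thread are assumptions), not a description of how any particular in-memory database product works.

```python
# A minimal sketch of write-behind caching: reads and writes hit an
# in-memory store immediately, while a background thread persists changes
# to disk after the fact. Unflushed changes can be lost on a crash, which
# is the durability trade-off write-behind designs accept for speed.

import json
import queue
import threading

class WriteBehindStore:
    def __init__(self, path="store.json.log"):
        self._data = {}                  # operational copy lives in RAM
        self._pending = queue.Queue()    # changes awaiting persistence
        self._path = path
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def put(self, key, value):
        self._data[key] = value          # caller returns at memory speed
        self._pending.put((key, value))  # disk write happens later

    def get(self, key):
        return self._data.get(key)       # reads never touch disk

    def _flush_loop(self):
        with open(self._path, "a") as log:
            while True:
                key, value = self._pending.get()
                log.write(json.dumps({key: value}) + "\n")
                log.flush()

store = WriteBehindStore()
store.put("sku-42", {"on_hand": 118})
print(store.get("sku-42"))
```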
Interactive visualization
The blending of a graphical user interface for data analysis with the presentation of the results, which makes possible more iterative analysis and broader use of the analytics tool.
Natural language processing (NLP)
Methods of modeling and enabling machines to extract meaning and context from human speech or writing, with the goal of improving overall text analytics results. The linguistics focus of NLP complements purely statistical methods of text analytics that can range from the very simple (such as pattern matching in word counting functions) to the more sophisticated (pattern recognition or “fuzzy” matching of various kinds).
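For a sense of the “very simple” statistical end of that range, the short sketch below uses only the Python standard library to count words and to fuzzy-match a misspelled term against a known vocabulary. It illustrates the baseline techniques the definition mentions, not full linguistic NLP; the sample text and product list are made up for the example.

```python
# The crude statistical end of text analytics: word counting and fuzzy
# string matching, using only the standard library.

import re
from collections import Counter
from difflib import get_close_matches

text = "Shoppers love the new tablet, though some shoppers report battery issues."

# Pattern matching / word counting: the simplest form of text analytics.
words = re.findall(r"[a-z']+", text.lower())
print(Counter(words).most_common(3))

# "Fuzzy" matching: map a misspelled mention to a known product name.
products = ["tablet", "battery pack", "charger"]
print(get_close_matches("tabelt", products, n=1))
```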