ISSUE 6
1ST ANNIVERSARY
We take a look back over the past year
Letter From The Editor
Welcome to the Big Data Innovation Annual Review. We have achieved so much in the past 12 months, from the launch of the first magazine to the first printed edition and the introduction of the iPad app.

Thanks to everybody involved, from those who have contributed fantastic stories to those who have shared the magazine and made it the truly global publication that it now is.

This issue is a celebration of everything that has been done in the past 12 months, and I have personally chosen two articles from each edition that I think you will find fascinating. These are some of the best articles that we have included within the magazine, and each has received significant praise.

From our humble beginnings in December last year we have grown to include some of the most influential data minds as contributors, readers and experts. In 2014 we hope to bring you more magazines, more innovative and thought-provoking articles, and the latest big data news. We are also hoping to add a dedicated website for the magazine, allowing people to catch up with the latest content daily as well as find more in-depth analysis in the magazines.

As mentioned before, this issue is a celebration of some of the best articles from the past 12 months. I have hand-picked these both as personal favourites and on the strength of audience appreciation. That this process was so difficult is testament to the quality of writing we have seen in the past year, a level of quality that we are dedicated to maintaining throughout 2014. If you are interested in writing, advertising or even an advisory role within the magazine, then please get in touch.

George Hill
Chief Editor

Managing Editor: George Hill
Assistant Editors: Richard Angus, Helena MacAdam
President: Josie King
Art Director: Gavin Bailey
Advertising: Hannah Sturgess, hsturgess@theiegroup.com
Contributors: Damien Demaj, Chris Towers, Tom Deutsch, Heather James, Claire Walmsley, Daniel Miller, Gil Press
General Enquiries: ghill@theiegroup.com
Contents (page – description – originally published issue)

4 – Drew Linzer lifts the lid on his famous work in predicting the Obama re-election win – Issue 1
7 – Ashok Srivastava talks NASA's Big Data in this interview from the first issue of the magazine – Issue 1
10 – One of our most popular articles, Chris Towers looks at the impact of Quantum Computing on Big Data – Issue 2
13 – The skills gap was discussed in detail by Tom Deutsch in the second issue of the magazine this year – Issue 2
17 – Damien Demaj's work on spatial analytics in tennis was well received when published in Issue 3 – Issue 3
24 – Gil Press wrote a fantastic piece on the history of Big Data, to much acclaim – Issue 3
33 – Andrew Claster, the man behind Obama's Big Data team, spoke to Daniel Miller in Issue 4 – Issue 4
37 – Education was the issue when Chris Towers spoke to Gregory Piatetsky-Shapiro – Issue 4
42 – Heather James spoke to the famous Stephen Wolfram in September about his work at Wolfram Alpha – Issue 5
45 – Data Transparency was the focus of this brilliant article by Claire Walmsley in Issue 5 – Issue 5
An Interview with Drew Linzer: The Man Who Predicted The 2012 Election
George Hill, Chief Editor
Drew Linzer is the analyst who predicted the results of the 2012 election four months in advance. His algorithms even correctly predicted the exact number of electoral votes and the winning margin for Obama. Drew has documented his analytical processes on his blog, votamatic.org, which details the algorithms used, their selections and also the results. He also appeared on multiple national and international news programs as a result. I caught up with him to discuss not only his analytical techniques but also his opinions on what was the biggest data driven election ever.

George: What kind of reaction has there been to your predictions?

Drew: Most of the reaction has focused on the difference in accuracy between those of us who studied the public opinion polls, and the "gut feeling" predictions of popular pundits and commentators. On Election Day, data analysts like me, Nate Silver (New York Times FiveThirtyEight blog), Simon Jackman (Stanford University and Huffington Post), and Sam Wang (Princeton Election Consortium) all placed Obama's reelection chances at over 90%, and correctly foresaw 332 electoral votes for Obama as the most likely outcome. Meanwhile, pundits such as Karl Rove, George Will, and Steve Forbes said Romney was going to win, and in some cases easily. This has led to talk of a "victory for the quants" which I'm hopeful will carry through to future elections.

How do you evaluate the algorithm used in your predictions?

My forecasting model estimated the state vote outcomes and the final electoral vote, on every day of the campaign, starting in June. I wanted the assessment of these forecasts to be as fair and objective as possible and not leave me any wiggle room if they were wrong. So, about a month before the election, I posted on my website a set of eight evaluation criteria I would use once the results were known. As it turned out, the model worked perfectly. It predicted over the summer that Obama would win all of his 2008 states minus Indiana and North Carolina, and barely budged from that prediction even after support for Obama inched upward in September, then dipped after the first presidential debate.

The amount of data used throughout this campaign, both by independent analysts and campaign teams, has been huge. What kind of implications does this have for data usage in 2016?

The 2012 campaign proved that multiple, diverse sources of quantitative information could be managed, trusted, and applied successfully towards a variety of ends. We outsiders were able to predict the election outcome far in advance. Inside the campaigns, there were enormous strides made in voter targeting, opinion tracking, fundraising, and voter turnout. Now that we know these methods can work, I think there's no going back. I expect reporters and campaign commentators to take survey aggregation much more seriously in 2016. And although Obama and the Democrats currently appear to hold an advantage in campaign technology, I would be surprised if the Republicans didn't quickly catch up.

Do you think that the success of this data driven campaign means that campaign managers now need to be analysts as well as strategists?

The campaign managers may not need to be analysts themselves, but they should have a greater appreciation for how data and technology can be harnessed to their advantage. Campaigns have always used survey research to formulate strategy and measure voter sentiment. But now there is a range of other powerful tools available: social networking websites, voter databases, mobile smartphones, and email marketing, to name only a few. That is in addition to recent advances in polling methodologies and statistical opinion modeling. There is a lot of innovation happening in American campaign politics right now.

You managed to predict the election results four months beforehand. What do you think is the realistic maximum timeframe to accurately predict a result using your analytics techniques?

About four or five months is as far back as the science lets us go right now, and that's even pushing it a bit. Prior to that, the polls just aren't sufficiently informative about the eventual outcome: too many people are either undecided or haven't started paying attention to the campaign. The historical, economic and political factors that have been shown to correlate with election outcomes also start to lose their predictive power once we get beyond the roughly 4-5 month range. Fortunately, that still gives the campaigns plenty of time to plot strategy and make decisions about how to allocate their resources.
NASA's Big Data: An Interview With Ashok Srivastava
George Hill, Chief Editor
Edwin Verin / Shutterstock.com
On a sunny September morning in Boston this year, Ashok Srivastava was waiting to stand at the podium and present to a room of 600 people at the Big Data Innovation Summit, the largest dedicated big data conference in the world. Giving his perspectives on the growth of big data, its uses in aviation safety and how his employer, NASA, has utilised and innovated through its uses, Ashok came out as one of the most popular speakers at the summit.

After the success of the presentation and the summit in general, I was lucky enough to sit down with Ashok to discuss the way that big data has changed within NASA and the success that it has had in the wider business community.

So why has big data come to prominence in the last three or four years?

Ashok argues that this is not a change that has taken place solely over the past three or four years. It is a reaction to society's change to becoming more data driven. The last 25 years have seen people increasingly needing data to either make or back up decisions. Recent advancements in technology and the ability of data scientists to analyze large data sets have meant that there has been an acceleration in the speed at which this happens. With new types of databases and the ability to record and analyze data quickly, the levels of technology required have been reached.

NASA has been at the forefront of technology innovation for the past 50 years, bringing us everything from the modern computer to instant coffee. Ashok explains how NASA is still innovating today and, with the huge amounts of data that it is consuming, what is happening in big data there is going to affect businesses. For instance, NASA is currently discussing the use of its big data systems and algorithms with companies ranging from medical specialists to CPG organizations. The work that it has done within data in the past few years has created the foundations allowing many companies to become successful.

One of the issues that is really affecting companies looking to adopt big data is the current gap in skilled big data professionals. The way to solve this, in Ashok's opinion, is through a different set of teaching parameters. The training should revolve around machine learning and optimization, allowing people to learn the "trade of big data", meaning that they can learn how systems work from the basics upwards, giving them full insight when analyzing.

Given the relative youth of big data, I wanted to know what Ashok thought would happen with big data at NASA in the next 10 years, in addition to the wider business community. NASA in ten years will be dealing with a huge amount of data, on a scale that is currently unimaginable. This could include things like full global observations as well as universe observations, gathering and analyzing petabytes of information within seconds.

With public money being spent on these big data projects, Ashok makes it clear that the key benefit should always boil down to ultimately providing value for the public. This is a refreshing view of NASA, which has traditionally been seen as secretive due to the highly confidential nature of its operations and the lack of public understanding.

Ashok also had some pieces of advice for people currently looking to make waves in the big data world: "It is important to understand the business problem that is being solved" and "Making sure the technologies that are being deployed are scalable and efficient".
Quantum Computing: The Future of Big Data?
Chris Towers, Organiser, Big Data Innovation Summit
When we look back at the computers of the 1980s, with their two-tone terminal screens and command driven format, and then look at our colorful and seemingly complex laptops, we assume there must have been a huge revolution in the basic building blocks of these machines. In reality, however, they are still built on the same principles and the same technological theories.

In essence, the way that all modern computers currently work is through silicon transistors on a chip. These transistors can be either on or off, dictating the function of the computer in order to perform tasks and calculations. The computers that you are using today are using the same technology as the first computers in the 1940s, albeit in a far more advanced state. Whereas the transistors in the 1940s were bulky, today IBM can fit millions of transistors onto a single chip, allowing superior processing power albeit using the same theories as when computers were first invented. One of the ways that this is possible is through the miniaturization of transistors, allowing far more to fit into a smaller space.

The issue with this, in terms of future iterations, is that a transistor can only be so small. When transistors reach an atomic scale they will cease to function, meaning that when this point is reached computers will again have to start growing in order to maintain the demand for faster computing. With the current need for faster computers driven by the data revolution and the ever increasing number of companies adopting analytics and big data, in a few years' time there will need to be a change.

The idea that there could be a different method of computing aside from transistors was originally theorized in 1982 by Richard Feynman, who pioneered the idea of quantum computing.

So what does this have to do with big data?

The idea behind quantum computing is complex, as it rests on the highly confusing quantum theory, famously described by Niels Bohr: "anyone who is not shocked by quantum theory has not understood it". Current computers work through 'bits', which are either on or off, which makes the way that quantum computing works difficult to grasp. Quantum computing works through 'qubits', which can be on, off or both, meaning that a qubit can be in two states at essentially the same time. The element that makes this complex is that it goes beyond regular, digital thinking. It is essentially watching somebody flick a coin and it landing on both heads and tails at the same time.

The implication for big data is that, due to the multiple options with states, the information processing power is almost incomprehensible when compared to current traditional computers. In fact, the processing power of these quantum computers has had experts describing quantum computing as the hydrogen bomb of cyber warfare.

Although at the moment the prohibitive pricing and logistics of quantum computers make them unviable for regular companies, in a decade's time the changes that will be wrung out by Google and other early adopters may well spark a genuine revolution in computing. The first commercially available quantum computer has been produced and one has been sold, although they are currently being sold by their maker, D-Wave, for $10 million and are housed in a 10m x 10m room. However, despite the issues with both the size and price for companies at the moment, Google has begun experimentations with quantum computers. After conducting tests with 20,000 images, half with cars and half without, the quantum computer could sort the photos into those including cars and those not including cars considerably faster than anything in Google's current data centers. This kind of real world experimentation and investment by major civilian companies will help push quantum computing into the mainstream, allowing companies working with big data to access and use larger data sets quicker than they ever could before.

Although several years away, it is predicted that with the current uptake of cloud usage amongst companies, within 15 years the majority of civilian companies will have access to quantum computers to deal with their ever increasing volumes of information. A potential way to overcome the cost and logistics would be through the use of the cloud to access quantum computers. This would not only drive prices down for infrequent usage but would also be a viable option for most companies.
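To make the bit/qubit distinction described above concrete, here is a minimal sketch (not from the original article) that represents a single qubit as a normalized two-element state vector and simulates repeated measurements. The equal-superposition state and the NumPy-based sampling are illustrative assumptions only, not a model of how D-Wave or any real quantum hardware works.

```python
import numpy as np

# A classical bit is either 0 or 1. A qubit is described by two amplitudes
# (a, b) with |a|^2 + |b|^2 = 1; measuring it yields 0 with probability |a|^2
# and 1 with probability |b|^2.
def measure(qubit, shots=10000, rng=np.random.default_rng(0)):
    probabilities = np.abs(np.asarray(qubit)) ** 2
    outcomes = rng.choice([0, 1], size=shots, p=probabilities)
    return np.bincount(outcomes, minlength=2) / shots

zero = np.array([1.0, 0.0])                         # behaves like a classical bit set to 0
superposition = np.array([1.0, 1.0]) / np.sqrt(2)   # "on and off at the same time"

print(measure(zero))           # -> [1. 0.]  always reads 0
print(measure(superposition))  # -> roughly [0.5 0.5]
```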
Bridging The Big Data Skills Gap
Tom Deutsch, Big Data Solution Architect
Big Data technologies certainly haven't suffered for a lack of market hype over the past 18 months, and if anything the noise in the space is continuing to grow. At the same time there's been a lot of discussion around a skills gap that stands in the way of making effective use of these new technologies. Not surprisingly, one of the most frequent questions I am asked is "how do those two come together in the real world?" So what should smart firms be aware of as they investigate how to get started using these new technologies with the people and skillsets they currently have?

It makes sense to break this question down into three skills components: administration, data manipulation and data science. The skills required in each of these areas vary, and functionally they represent different challenges. You are not likely to face the same challenges and staffing considerations in all three.

Administration – this includes the functional responsibilities for setting up, deploying and maintaining the underlying systems and architectures. I would argue that there is little real skills gap here today for Enterprises of any significant size. It may take a little bit of time to understand how a system such as Hadoop differs in scaling horizontally and handles availability, but generally speaking setup, configuration and administration tasks are reasonably well-documented, and existing server and network administrators can successfully manage them. In fact, in comparison to vertically scaled traditional technologies, the administration component here can be dramatically simplified compared to what you are used to. Keep in mind that this is a space where the hardware/software vendors are starting to compete on manageability, and an increasing number of appliance options exist. To oversimplify in the interest of space, if your team can manage a LAMP stack you should be fine here.

Data Manipulation – here is where the fun really starts and also where you may first encounter issues. This is when you start working with the data in new ways, and not surprisingly this is likely to be the first place that a skills gap appears. In practice I would suggest planning on a gap appearing here – how mild or severe a gap depends upon several factors. These factors boil down to two big buckets: first, can your teams manipulate the data in the new platforms, and second, do they know how to manipulate the data in valid ways.

The first issue – can you manipulate the data – often comes down to how much scripting and/or Java development experience your teams have. While the tools and abstraction options are improving rapidly across most of these technologies, there is usually no escaping having to dive into scripting or even write some Java code at some point. If your teams are already doing that, no big deal. If they aren't already familiar
with and using scripting languages, then there is some reason for pause. Similarly, while there are interface options that are increasingly SQL-like, if your teams aren't experienced in multiple development languages you should expect them to have some kind of learning curve. Can they push through it? Almost certainly, just budget some time to allow that to happen. As noted above, this will get easier and easier over time, but do not assume that tools will prevent the need for some coding. This is where you are going to spend the bulk of your time and people, so make sure you are being realistic about your entry-point skills.

Also keep in mind this isn't the hardest part. In many cases the second challenge here is the bigger one – not how can you manipulate the data but how should you manipulate the data. The real open question here is what to collect in the first place and how to actually use it in a meaningful way. That, of course, is a bigger issue, which brings us to the data scientist question.

Data Science – so finally to the hotly debated data scientist role. Popular press would have you believe that there is a plus or minus 10 year shortage of people that are skilled in data science. At the same time literally tens of thousands of people have completed open coursework from MIT and others on data science. Another variable is the evolution and progress of tools that make data collection and analytic routines more easily understood. So where does that put us?

First, it is important to note that there are many use cases that never get to this level, such as creating a data landing zone, data warehouse augmentation and alternative ELT approaches. No data science is needed there – and, as I've written elsewhere, diving directly into a data science driven project is a lousy idea. What if you have a project that has a data science dependency though, what should you expect?
Frankly, your experience here will vastly differ depending on the depth and robustness of your existing analytics practice. Most large Enterprises already have pockets of expertise to draw on here from their SPSS, SAS or R communities. The data sources may be new (and faster moving or bigger), but math is math and statistics are statistics. These tools increasingly work with these technologies (especially Hadoop), so in some cases those teams won't even have to leave their existing environments. If you already have the skills, so far so good.

If you don't have these skills you are going to have to grow, buy or rent them. Growing is slow, buying is expensive and renting is somewhere in between. Do not expect to be successful taking people with reporting or BI backgrounds and throwing them into data science issues. If you cannot honestly say "yes, we have advanced statisticians that are flexible in their thinking and understand the business", you are going to struggle and need to adopt a grow, buy or rent strategy. We'll pick up effective strategies for dealing with the grow, buy or rent issue, including notions of a Center of Excellence, in future topics.
Using Spatial Analytics To Study Spatio-Temporal Patterns In Sport
Damien Demaj, Geospatial Product Engineer
Late last year I introduced ArcGIS users to sports analytics, an emerging and exciting field within the GIS industry; that earlier post, Using ArcGIS for sports analytics, can be read here. Recently I expanded the work by using a number of spatial analysis tools in ArcGIS to study the spatial variation of serve patterns in the London Olympics Gold Medal match played between Roger Federer and Andy Murray. In this blog I present results that suggest there is potential to better understand players' serve tendencies using spatio-temporal analysis.

Figure 1: Igniting further exploration using visual analytics. Created in ArcScene, this 3D visualization depicts the effectiveness of Murray's return in each rally and what effect it had on Federer's second shot after his serve.

The Most Important Shot in Tennis?

The serve is arguably the most important shot in tennis. The location and predictability of a player's serve has a big influence on their overall winning serve percentage. A player who is unpredictable with their serve and can consistently place it wide into the service box, at the body or down the T is more likely to either win a point outright or at least weaken their opponent's return [1].

The results of tennis matches are often determined by a small number of important points during the game. It is common to see a player win a match having won the same number of points as his opponent, and the scoring system in tennis even makes it possible for a player to win fewer points than his opponent yet win the match [2]. Winning these big points is critical to a player's success. For the player serving, the aim is to produce an ace or force their opponent into an outright error, as this could make the difference between winning and losing. It is of particular interest to coaches and players to know how successful a player's serve is at these big points.

Geospatial Analysis

In order to demonstrate the effectiveness of geovisualizing spatio-temporal data using GIS, we conducted a case study to determine the following: which player served with more spatio-temporal variation at important points during the match?

To find out where each player served during the match we plotted the x,y coordinate of each serve bounce. A total of 86 points were mapped for Murray and 78 for Federer; only serves that landed in were included in the analysis. Visually we could see clusters formed by wide serves, serves into the body and serves hit down the T. The K Means algorithm [3] in the Grouping Analysis tool in ArcGIS (Figure 2) enabled us to statistically replicate the characteristics of the visual clusters and to tag each point as either a wide serve, a serve into the body or a serve down the T. The organization of the serves into each group was based on the direction of serve. Using the serve direction allowed us to know which service box the points belonged to, and gave us an advantage over proximity, which would have grouped points in neighbouring service boxes.

Figure 2. The K Means algorithm in the Grouping Analysis tool in ArcGIS groups features based on attributes and optional spatial temporal constraints.
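For readers who want to experiment with the same idea outside ArcGIS, the sketch below groups serve-bounce coordinates into three clusters with scikit-learn's KMeans. It is illustrative only: the coordinates are invented, and the article's actual grouping was done in ArcGIS's Grouping Analysis tool using serve direction rather than raw proximity.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical serve-bounce coordinates (metres, one service box), invented for
# illustration; in the article the grouping attribute was the direction of serve.
bounces = np.array([
    [1.0, 6.1], [1.2, 6.3], [0.9, 5.9],   # wide serves
    [2.1, 6.0], [2.3, 6.2],               # serves into the body
    [3.9, 6.4], [4.0, 6.1], [3.8, 6.3],   # serves down the T
])

# Three clusters: wide, body, T. A fixed random_state keeps the run repeatable.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(bounces)
for point, label in zip(bounces, labels):
    print(point, "-> cluster", label)
```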
To determine who changed the location of their serve the most, we arranged the serve bounces into a temporal sequence by ranking the data according to the side of the net (left or right), court location (deuce or ad court), game number and point number. The sequence of bounces then allowed us to create Euclidean lines (Figure 3) between p1 (x1,y1) and p2 (x2,y2), p2 (x2,y2) and p3 (x3,y3), p3 (x3,y3) and p4 (x4,y4), and so on in each court location.

Figure 3. Calculating the Euclidean distance (shortest path) between two sequential serve locations to identify spatial variation within a player's serve pattern.

Using the mean Euclidean distance between sequential serve locations it is possible to determine who was the more predictable server. For example, a player who served to the same part of the court each time would exhibit a smaller mean Euclidean distance than a player who frequently changed the position of their serve. The mean Euclidean distance was calculated by adding all of the distances linking the sequence of serves in each service box and dividing by the total number of distances.
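As a rough illustration of that calculation (a sketch with invented coordinates, not the ArcGIS workflow used in the article), the mean Euclidean distance between sequential serve bounces in a service box can be computed as follows:

```python
import numpy as np

def mean_sequential_distance(bounces):
    """Mean Euclidean distance between consecutive serve bounces (p1->p2, p2->p3, ...)."""
    points = np.asarray(bounces, dtype=float)
    step_distances = np.linalg.norm(np.diff(points, axis=0), axis=1)
    return step_distances.mean()

# Hypothetical bounce locations (x, y) in metres for one service box, in serve order.
serves = [(1.0, 6.1), (3.9, 6.4), (1.2, 6.3), (2.1, 6.0)]
print(round(mean_sequential_distance(serves), 2))  # larger value = less predictable server
```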
To identify where a player served at key points in the match we assigned an importance value to each point based on the work by Morris [4]. The table in Figure 4 shows the importance of points to winning a game when the server has a 0.62 probability of winning a point on serve. The two most important points in tennis are 30-40 and 40-Ad, highlighted in dark red. To simplify the rankings we grouped the data into three classes, as shown in Figure 4.

Figure 4. The importance of points in a tennis match as defined by Morris. The data for the match was classified into 3 categories as indicated by the sequential color scheme in the table (dark red, medium red and light red).

In order to see a relationship between outright success on serve and the important points, we mapped the distribution of successful serves and overlaid the results onto a layer containing the important points. If the player returning the serve made an error directly on their return, this was deemed to be an outright success for the server. An ace was also deemed to be an outright success for the server.

Results

Federer's spatial serve cluster in the ad court on the left side of the net was the most spread of all his clusters. However, he served out wide with great accuracy into the deuce court on the left side of the net, hugging the line 9 times out of 10 (Figure 5). Murray's clusters appeared to be grouped more tightly overall in each of the service boxes. He showed a clear bias by serving down the T in the deuce court on the right side of the net. Visually there appeared to be no other significant differences between each player's patterns of serve.

Figure 5. Mapping the spatial serve clusters using the K Means algorithm. Serves are grouped according to the direction they were hit. The direction of each serve is indicated by the thin green trajectory lines. The direction of serve was used to statistically group similar serve locations.

By mapping the location of the players' serve bounces and grouping them into spatial serve clusters we were able to quickly identify where in the service box each player was hitting their serves. The spatial serve clusters, wide, body or T, were symbolized using a unique color, making it easier for the user to identify each group on the map. To give the location of each serve some context we added the trajectory (direction) lines for each serve. These lines helped link where the serve was hit from to where the serve landed, enhance the visual structure of each cluster and improve the visual summary of the serve patterns.

The Euclidean distance calculations showed Federer's mean distance between sequential serve bounces was 1.72 m (5.64 ft), whereas Murray's was 1.45 m (4.76 ft). These results suggest that Federer's serve had greater spatial variation than Murray's. Visually, we could detect that the network of Federer's Euclidean lines showed a greater spread than Murray's in each service box. Murray served with more variation than Federer in only one service box, the ad service box on the right side of the net. The directional arrows in Figure 6 allow us to visually follow the temporal sequence of serves from each player in any given service box. We have maintained the colors for each spatial serve cluster (wide, body, T) so you can see when a player served from one group into another.

Figure 6. A comparison of spatial serve variation between each player. Federer's mean Euclidean distance was 1.72 m (5.64 ft); Murray's was 1.45 m (4.76 ft). The results suggest that Federer's serve had greater spatial variation than Murray's. The lines of connectivity represent the Euclidean distance (shortest path) between each sequential service bounce in each service box.

At the most important points in each game (30-40 and 40-Ad), Murray served out wide, targeting Federer's backhand, 7 times out of 8 (88%). He had success doing this 38% of the time, drawing 3 outright errors from Federer. Federer mixed up the location of his 4 serves at the big points across all of the spatial serve clusters: 2 wide, 1 body and 1 T. He had success 25% of the time, drawing 1 outright error from Murray. At other less important points Murray tended to favour going down the T, while Federer continued his trend of spreading his serve evenly across all spatial serve clusters (Figure 7).

Figure 7. A proportional symbol map showing the relationship of where each player served at big points during the match and their outright success at those points.

The proportional symbols in Figure 7 indicate a level of importance for each serve. The larger circles represent the most important points in each game, the smallest circles the least important. The ticks represent the success of each serve; by overlaying the ticks on top of the graduated circles we can clearly see the relationship between serve location and success at the big points. The map also indicates where each player served. The results suggest that Murray served with more spatial variation across the two most important point categories, recording a mean Euclidean distance of 1.73 m (5.68 ft) to Federer's 1.64 m (5.38 ft).

Conclusion

Successfully identifying patterns of behavior in sport is an on-going area of work [5] (see Figure 8), be that in tennis, football or basketball. The examples in this blog show that GIS can provide an effective means to geovisualize spatio-temporal sports data in order to reveal potential new patterns within a tennis match. By incorporating space-time into our analysis we were able to focus on relationships between events in the match, not the individual events themselves. The results of our analysis were presented using maps. These visualizations function as a convenient and comprehensive way to display the results, as well as acting as an inventory for the spatio-temporal component of the match [6].

Figure 8. The heatmap above shows Federer's frequency of shots passing through a given point on the court. The map displays stroke paths from both ends of the court, including serves. The heat map can be used to study potential anomalies in the data that may result in further analysis.

Expanding the scope of geospatial research in tennis and other sports relies on open access to reliable spatial data. At present, such data is not publicly available from the governing bodies of tennis. An integrated approach with these organizations, players, coaches and sports scientists would allow for further validation and development of geospatial analytics for tennis. The aim of this research is to evoke a new wave of geospatial analytics in the game of tennis and across other sports, and to encourage statistics published on tennis to become more time and space aware, to better improve the understanding of the game for everyone.
References
[1] United States Tennis Association, "Tennis tactics, winning patterns of play", Human Kinetics, 1st Edition, 1996.
[2] G. E. Parker, "Percentage Play in Tennis", In Mathematics and Sports Theme Articles, http://www.mathaware.org/mam/2010/essays/
[3] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A K-Means Clustering Algorithm", Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100-108, 1979.
[4] C. Morris, "The most important points in tennis", In Optimal Strategies in Sports, vol. 5 in Studies in Management Science and Systems, North-Holland Publishing, Amsterdam, pp. 131-140, 1977.
[5] M. Lames, "Modeling the interaction in games sports – relative phase and moving correlations", Journal of Sports Science and Medicine, vol. 5, pp. 556-560, 2006.
[6] J. Bertin, "Semiology of Graphics: Diagrams, Networks, Maps", Esri Press, 2nd Edition, 2010.
A Short History Of Big Data
Gil Press, Big Data Expert
The story of how data became big starts many years before the current buzz around big data. Already seventy years ago we encounter the first attempts to quantify the growth rate in the volume of data or what has popularly been known as the “information explosion” (a term first used in 1941, according to the Oxford English Dictionary). The following are the major milestones in the history of sizing data volumes plus other “firsts” in the evolution of the idea of “big data” and observations pertaining to data or information explosion. 1944 Fremont Rider, Wesleyan University Librarian, publishes The Scholar and the Future of the Research Library. He estimates that American university libraries were doubling in size every sixteen years. Given this growth rate, Rider speculates that the Yale Library in 2040 will have “approximately 200,000,000 volumes, which will occupy over 6,000 miles of shelves… [requiring] a cataloging staff of over six thousand persons.” 1961 Derek Price publishes Science Since Babylon, in which he charts the growth of scientific knowledge by looking at the growth in the number of scientific journals and papers. He concludes that the number of new journals has grown exponentially rather
than linearly, doubling every fifteen years and increasing by a factor of ten during every half-century. Price calls this the “law of exponential increase,” explaining that “each [scientific] advance generates a new series of advances at a reasonably constant birth rate, so that the number of births is strictly proportional to the size of the population of discoveries at any given time.” November 1967 B. A. Marron and P. A. D. de Maine publish “Automatic data compression” in the Communications of the ACM, stating that ”The ‘information explosion’ noted in recent years makes it essential that storage requirements for all information be kept to a minimum.” The paper describes “a fully automatic and rapid three-part compressor which can be used with ‘any’ body of information to greatly reduce slow external storage requirements and to increase the rate of information transmission through a computer.” 1971 Arthur Miller writes in The Assault on Privacy that “Too many information handlers seem to measure a man by the number of bits of storage capacity his dossier will occupy.” 1975 The Ministry of Posts and Telecommunications in Japan starts conducting the Information Flow Census,
tracking the volume of information circulating in Japan (the idea was first suggested in a 1969 paper). The census introduces “amount of words” as the unifying unit of measurement across all media. The 1975 census already finds that information supply is increasing much faster than information consumption and in 1978 it reports that “the demand for information provided by mass media, which are one-way communication, has become stagnant and the demand for information provided by personal telecommunications media, which are characterized by two-way communications, has drastically increased…. Our society is moving toward a new stage… in which more priority is placed on segmented, more detailed information to meet individual needs, instead of conventional mass-reproduced conformed information.” [Translated in Alistair D. Duff 2000; see also Martin Hilbert 2012 (PDF)] April 1980 I.A. Tjomsland gives a talk titled “Where Do We Go From Here?” at the Fourth IEEE Symposium on Mass Storage Systems, in which he says “Those associated with storage devices long ago realized that Parkinson’s First Law may be paraphrased to describe our industry—‘Data expands to fill the space available’…. I believe
that large amounts of data are being retained because users have no way of identifying obsolete data; the penalties for storing obsolete data are less apparent than the penalties for discarding potentially useful data.” 1981 The Hungarian Central Statistics Office starts a research project to account for the country’s information industries. Including measuring information volume in bits, the research continues to this day. In 1993, Istvan Dienes, chief scientist of the Hungarian Central Statistics Office, compiles a manual for a standard system of national information accounts. [See Istvan Dienes 1994 (PDF) and Martin Hilbert 2012 (PDF)] August 1983 Ithiel de Sola Pool publishes “Tracking the Flow of Information” in Science. Looking at growth trends in 17 major communications media from 1960 to 1977, he concludes that “words made available to Americans (over the age of 10) through these media grew at a rate of 8.9 percent per year… words actually attended to from those media grew at just 2.9 percent per year…. In the period of observation, much of the growth in the flow of information was due to the growth in broadcasting… But toward the end of that period [1977] the situation
was changing: point-to-point media were growing faster than broadcasting.” Pool, Inose, Takasaki and Hurwitz follow in 1984 with Communications Flows: A Census in the United States and Japan, a book comparing the volumes of information produced in the United States and Japan. July 1986 Hal B. Becker publishes “Can users really absorb data at today’s rates? Tomorrow’s?” in Data Communications. Becker estimates that “the recording density achieved by Gutenberg was approximately 500 symbols (characters) per cubic inch—500 times the density of [4,000 B.C. Sumerian] clay tablets. By the year 2000, semiconductor random access memory should be storing 1.25X10^11 bytes per cubic inch.” 1996 Digital storage becomes more cost-effective for storing data than paper according to R.J.T. Morris and B.J. Truskowski, in “The Evolution of Storage Systems,” IBM Systems Journal, July 1, 2003. October 1997 Michael Cox and David Ellsworth publish “Application-controlled demand paging for out-of-core visualization” in the Proceedings of the IEEE 8th conference on Visualization. They start the article with “Visualization provides an interesting challenge for
computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.” It is the first article in the ACM digital library to use the term “big data.” 1997 Michael Lesk publishes “How much information is there in the world?” Lesk concludes that “There may be a few thousand petabytes of information all told; and the production of tape and disk will reach that level by the year 2000. So in only a few years, (a) we will be able [to] save everything–no information will have to be thrown out and (b) the typical piece of information will never be looked at by a human being.” April 1998 John R. Masey, Chief Scientist at SGI, presents at a USENIX meeting a paper titled “Big Data… and the Next Wave of Infrastress.” October 1998 K.G. Coffman and Andrew Odlyzko publish “The Size and Growth Rate of the Internet.” They conclude that “the growth rate of traffic on the public Internet, while lower than is often cited, is still about 100% per year, much higher than for traffic on other
networks. Hence, if present growth trends continue, data traffic in the U. S. will overtake voice traffic around the year 2002 and will be dominated by the Internet.” Odlyzko later established the Minnesota Internet Traffic Studies (MINTS), tracking the growth in Internet traffic from 2002 to 2009. August 1999 Steve Bryson, David Kenwright, Michael Cox, David Ellsworth and Robert Haimes publish “Visually exploring gigabyte data sets in real time” in the Communications of the ACM. It is the first CACM article to use the term “Big Data” (the title of one of the article’s sections is “Big Data for Scientific Visualization”). The article opens with the following statement: “Very powerful computers are a blessing to many fields of inquiry. They are also a curse; fast computations spew out massive amounts of data. Where megabyte data sets were once considered large, we now find data sets from individual simulations in the 300GB range. But understanding the data resulting from high-end computations is a significant endeavor. As more than one scientist has put it, it is just plain difficult to look at all the numbers. And as Richard W. Hamming, mathematician and pioneer computer scientist, pointed out, the purpose of computing is insight, not
numbers.” October 1999 Bryson, Kenwright and Haimes join David Banks, Robert van Liere and Sam Uselton on a panel titled “Automation or interaction: what’s best for big data?” at the IEEE 1999 conference on Visualization. October 2000 Peter Lyman and Hal R. Varian at UC Berkeley publish “How Much Information?” It is the first comprehensive study to quantify, in computer storage terms, the total amount of new and original information (not counting copies) created in the world annually and stored in four physical media: paper, film, optical (CDs and DVDs) and magnetic. The study finds that in 1999, the world produced about 1.5 exabytes of unique information, or about 250 megabytes for every man, woman and child on earth. It also finds that “a vast amount of unique information is created and stored by individuals” (what it calls the “democratization of data”) and that “not only is digital information production the largest in total, it is also the most rapidly growing.” Calling this finding “dominance of digital,” Lyman and Varian state that “even today, most textual information is ‘born digital,’ and within a few years this will be true for images as well.” A similar study conducted in 2003
by the same researchers found that the world produced about 5 exabytes of new information in 2002 and that 92% of the new information was stored on magnetic media, mostly in hard disks. November 2000 Francis X. Diebold presents to the Eighth World Congress of the Econometric Society a paper titled “’Big Data’ Dynamic Factor Models for Macroeconomic Measurement and Forecasting (PDF),” in which he states “Recently, much good science, whether physical, biological, or social, has been forced to confront— and has often benefited from— the “Big Data” phenomenon. Big Data refers to the explosion in the quantity (and sometimes, quality) of available and potentially relevant data, largely the result of recent and unprecedented advancements in data recording and storage technology.” February 2001 Doug Laney, an analyst with the Meta Group, publishes a research note titled “3D Data Management: Controlling Data Volume, Velocity and Variety.” A decade later, the “3Vs” have become the generally-accepted three defining dimensions of big data, although the term itself does not appear in Laney’s note. September 2005 Tim O’Reilly publishes “What is Web 2.0” in
which he asserts that “data is the next Intel inside.” O’Reilly: “As Hal Varian remarked in a personal conversation last year, ‘SQL is the new HTML.’ Database management is a core competency of Web 2.0 companies, so much so that we have sometimes referred to these applications as ‘infoware’ rather than merely software.” March 2007 John F. Gantz, David Reinsel and other researchers at IDC release a white paper titled “The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010 (PDF).” It is the first study to estimate and forecast the amount of digital data created and replicated each year. IDC estimates that in 2006, the world created 161 exabytes of data and forecasts that between 2006 and 2010, the information added annually to the digital universe will increase more than six fold to 988 exabytes, or doubling every 18 months. According to the 2010 (PDF) and 2012 (PDF) releases of the same study, the amount of digital data created annually surpassed this forecast, reaching 1227 exabytes in 2010 and growing to 2837 exabytes in 2012. January 2008 Bret Swanson and George Gilder publish “Estimating the Exaflood (PDF),” in which they project
that U.S. IP traffic could reach one zettabyte by 2015 and that the U.S. Internet of 2015 will be at least 50 times larger than it was in 2006. June 2008 Cisco releases the “Cisco Visual Networking Index – Forecast and Methodology, 2007–2012 (PDF)”, part of an “ongoing initiative to track and forecast the impact of visual networking applications.” It predicts that “IP traffic will nearly double every two years through 2012” and that it will reach half a zettabyte in 2012. The forecast held well, as Cisco’s latest report (May 30, 2012) estimates IP traffic in 2012 at just over half a zettabyte and notes it “has increased eightfold over the past 5 years.” December 2008 Randal E. Bryant, Randy H. Katz and Edward D. Lazowska publish “Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society (PDF).” They write: “Just as search engines have transformed how we access information, other forms of big-data computing can and will transform the activities of companies, scientific researchers, medical practitioners and our nation’s defense and intelligence operations…. Big-data computing is perhaps the biggest innovation in computing in the last decade.
We have only begun to see its potential to collect, organize and process data in all walks of life. A modest investment by the federal government could greatly accelerate its development and deployment.” December 2009 Roger E. Bohn and James E. Short publish “How Much Information? 2009 Report on American Consumers.” The study finds that in 2008, “Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours per day. Consumption totaled 3.6 Zettabytes and 10,845 trillion words, corresponding to 100,500 words and 34 gigabytes for an average person on an average day.” Bohn, Short and Chattanya Baru follow this up in January 2011 with “How Much Information? 2010 Report on Enterprise Server Information,” in which they estimate that in 2008, “the world’s servers processed 9.57 Zettabytes of information, almost 10 to the 22nd power, or ten million million gigabytes. This was 12 gigabytes of information daily for the average worker, or about 3 terabytes of information per worker per year. The world’s companies on average processed 63 terabytes of information annually.” February 2010 Kenneth Cukier publishes in The Economist a
Special Report titled, “Data, data everywhere.” Writes Cukier: “…the world contains an unimaginably vast amount of digital information which is getting ever vaster more rapidly… The effect is being felt everywhere, from business to science, from governments to the arts. Scientists and computer engineers have coined a new term for the phenomenon: ‘big data.’” February 2011 Martin Hilbert and Priscila Lopez publish “The World’s Technological Capacity to Store, Communicate and Compute Information” in Science. They estimate that the world’s information storage capacity grew at a compound annual growth rate of 25% per year between 1986 and 2007. They also estimate that in 1986, 99.2% of all storage capacity was analog, but in 2007, 94% of storage capacity was digital, a complete reversal of roles (in 2002, digital information storage surpassed non-digital for the first time). May 2011 James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh and Angela Hung Byers of the McKinsey Global Institute publish “Big data: The next frontier for innovation, competition and
productivity.” They estimate that “by 2009, nearly all sectors in the US economy had at least an average of 200 terabytes of stored data (twice the size of US retailer Wal-Mart’s data warehouse in 1999) per company with more than 1,000 employees” and that the securities and investment services sector leads in terms of stored data per firm. In total, the study estimates that 7.4 exabytes of new data were stored by enterprises and 6.8 exabytes by consumers in 2010. April 2012 The International Journal of Communications publishes a Special Section titled “Info Capacity” on the methodologies and findings of various studies measuring the volume of information. In “Tracking the flow of information into the home (PDF),” Neuman, Park and Panek (following the methodology used by Japan’s MPT and Pool above) estimate that the total media supply to U.S. homes has risen from around 50,000 minutes per day in 1960 to close to 900,000 in 2005. Looking at the ratio of supply to demand in 2005, they estimate that people in the U.S. are “approaching a thousand minutes of mediated content available for every minute available for consumption.” In
“International Production and Dissemination of Information (PDF),” Bounie and Gille (following Lyman and Varian above) estimate that the world produced 14.7 exabytes of new information in 2008, nearly triple the volume of information in 2003. May 2012 danah boyd and Kate Crawford publish “Critical Questions for Big Data” in Information, Communications and Society. They define big data as “a cultural, technological and scholarly phenomenon that rests on the interplay of: (1) Technology: maximizing computation power and
algorithmic accuracy to gather, analyze, link and compare large data sets. (2) Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical and legal claims. (3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity and accuracy.”
You need a new strategy. You rally your team. Your brainstorming session begins. And the fun fizzles out. But what if you could bring your meetings to life with the power of visual storytelling? At ImageThink, we specialize in Graphic Facilitation, a creative, grasp-at-a-glance solution to your strategy problems. Using hand-drawn images, we’ll help you draw out a roadmap for success, bringing your goals into focus, illustrating a clear vision, and simplifying team communication with the power of visuals.
Call on us for:
- Productive meetings and brainstorming sessions
- Visual summaries for keynotes and conferences
- Team trainings and workshops
- Unique social media content
- Creative and engaging tradeshows
- Whiteboard videos
- Hand-drawn infographics
Learn more about ImageThink at
www.imagethink.net
-Picturing your big ideas-
www.facebook.com/imagethinknyc
@ImageThink
Data In An Election: An Interview With Andrew Claster, Deputy Chief Analytics Officer, Obama for America
Daniel Miller, Big Data Leader
Filip Fuxa / Shutterstock.com
We were lucky enough to talk to Andrew Claster, Deputy Chief Analytics Officer for President Barack Obama’s 2012 re-election campaign ahead of his presentation at the Big Data Innovation Summit in Boston, September 12 & 13 2013.
- Data Integration: We have several major database platforms – the national voter file, our proprietary email list, campaign donation history, volunteer list, field contact history, etc. How do we integrate these and use a unified dataset to inform Andrew Claster, Deputy campaign decisions? Chief Analytics Officer for - Online/Offline: How do we President Barack Obama’s encourage online activists to 2012 re-election campaign, take action online and vicehelped create and lead the versa? How do we facilitate largest, most innovative and and measure this activity? most successful political - Models: How do we develop analytics operation ever and validate our models about developed. Andrew previously what the electorate is going to developed microtargeting and look like in November 2012? communications strategies Our as Vice President at Penn, -Communications: opponents and the press are Schoen & Berland for clients including Hillary Rodham continually discussing areas in Clinton, Tony Blair, Gordon which they say we are falling Brown, Ehud Barak, Leonel short. When is it in our interest Fernandez, Verizon, Alcatel, to push back, when is it in our Microsoft, BP, KPMG, TXU and interest to let them believe the Washington Nationals their own spin, and what baseball team. Andrew information are we willing to completed his undergraduate share if we do push back? studies in political science - Cost: How do we evaluate at Yale University and everything we do in terms his graduate training in of cost per vote, cost per economics at the London volunteer hour or cost per School of Economics. staff hour? What was the biggest challenge for the data team during the Obama re-election campaign? It is difficult to identify just one. Here are some of the most important:
Sales and Marketing: How do we support every department within the campaign (communications, field, digital, finance, scheduling, advertising)? How do we demonstrate value? How do we build relationships? How do we ensure that data and analytics are used to inform decision-making across the campaign? - Hiring and Training: Where and how do we recruit more than 50 highly qualified analysts, statistical modelers and engineers who are committed to re-electing Barack Obama and willing to move to Chicago for a job that ends on Election Day 2012, requires that they work more than 80 hours a week for months with no vacation in a crowded room with no windows (nicknamed ‘The Cave’), and pays less with fewer benefits than they would earn doing a similar job in the private sector?
Many working within political statistics and analytics say that the incumbent candidate always has a significant advantage with their data - Prioritization: We don’t have effectiveness, do you think this enough resources to test is the case? everything, model everything The incumbent has many and do everything. How do advantages including the we efficiently allocate human following: and financial resources? - Incumbent has data, -Internal Communication, infrastructure and experience
- Incumbent has data, infrastructure and experience from the previous campaign.
- Incumbent is known in advance – no primary – and can start planning and implementing general election strategy earlier.
- Incumbent is known to voters – there is less uncertainty regarding underlying data and models.

However, the incumbent may also have certain disadvantages:

- Strategy is more likely to be known to the other side because it is likely to be similar to the previous campaign.
- With a similar strategy and many of the same strategists and vested interests as the previous campaign, it could be harder to innovate.

On balance, the incumbent has an opportunity to put herself or himself in a superior position regarding data, analytics and technology. However, it is not necessarily the case that s/he will do so – the incumbent must have the will and the ability to develop and invest in this potential advantage. When there is no incumbent and there is a competitive primary, it is the role of the national party and other affiliated groups to invest in and develop this data, analytics and technology infrastructure.

How much effect do you think data had on the election result?

The most important determinants of the election result were:

- Having a candidate with a record of accomplishment and policy positions that are consistent with the preferences of the majority of the electorate.
- Building a national organization of supporters, volunteers and donors to register likely supporters to vote, persuade likely voters to support our candidate, turn out likely supporters and protect the ballot to ensure their vote is counted.

Data, technology and analytics made us more effective and more efficient with every one of these steps. They helped us target the right people with the right message delivered in the right medium at the right time.
We conducted several tests to measure the impact of our work on the election result, but we will not be sharing those results publicly. As an example, however, I can point out that there were times during the campaign when the press and our opponent claimed that states such as Michigan and Minnesota were highly competitive, that we were losing in Ohio, Iowa, Colorado, Virginia and Wisconsin, and that Florida was firmly in our opponent's camp. We had internal data (and there was plenty of public data, for those who are able to analyze it properly) demonstrating that these statements were inaccurate. If we didn't have accurate internal data, our campaign might have made multi-million dollar mistakes that could have cost us several key states and the election.

Given the reaction of the public to the NSA and PRISM data gathering techniques, what kind of effect is this likely to have on the wider data gathering activities of others working within the data science community?

Consumers are becoming more aware of what data is available and to whom. It is increasingly important for those of us in the data science community to help educate consumers about what information is available, when and how they can opt out of sharing their information and how their information is being used.

Do you think that after the success of the data teams in the previous two elections it is no longer an advantage, but a necessity for a successful campaign?

Campaigns have always used data to make decisions, but new techniques and technology have made more data accessible and allowed it to be used in innovative ways. Campaigns that do not invest in data, technology or analytics are missing a huge opportunity that can help them make more intelligent decisions. Furthermore, their supporters, volunteers and donors want to know that the campaign is using their contributions of time and money as efficiently and effectively as possible, and that the campaign is making smart strategic decisions using the latest techniques.
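Claster's remark that properly analysed public data already contradicted the 'competitive state' narrative can be illustrated with a small sketch. The polls below are invented numbers used purely for demonstration; the approach, a sample-size- and recency-weighted average of public polls per state, is a generic one and not a description of the campaign's internal models.

    # Illustrative only: estimate a per-state margin from (invented) public polls,
    # weighting each poll by sample size and recency.
    import pandas as pd

    polls = pd.DataFrame({
        "state":       ["OH", "OH", "OH", "FL", "FL", "WI"],
        "days_old":    [2, 6, 12, 3, 9, 5],
        "sample_size": [900, 1100, 600, 1200, 800, 750],
        "margin":      [3.0, 2.0, 4.5, 0.5, -1.0, 5.0],   # candidate lead in points
    })

    # Larger samples and fresher polls count for more.
    polls["weight"] = polls["sample_size"] / (1 + polls["days_old"])
    polls["weighted_margin"] = polls["margin"] * polls["weight"]

    grouped = polls.groupby("state")[["weighted_margin", "weight"]].sum()
    grouped["estimated_margin"] = grouped["weighted_margin"] / grouped["weight"]
    print(grouped["estimated_margin"].sort_values(ascending=False))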
37
Do You Have The Spark? If you have a new idea that you want to tell the world, contact us to contribute an article or idea ghill@theiegroup.com
38
Gregory Piatetsky-Shapiro
Gregory Piatetsky-Shapiro Talks Big Data Education Chris Towers Organizer, Big Data Innovation Summit
39
Gregory Piatetsky-Shapiro
One of the aspects of big data that many in the industry are currently concerned by is the perceived skills gap. The lack of qualified and experienced data scientists has meant that many companies find themselves adrift of where they want to be in the data world. I thought I would talk to one of the most knowledgeable and influential big data leaders in the world, Gregory Piatetsky-Shapiro. After running the first ever Knowledge Discovery in Databases (KDD) workshop in 1989, he has stayed at the sharp end of analytics and big data for the past 25 years. His website and consultancy, KDnuggets, is one of the most widely read data information sources and he has worked with some of the largest companies in the world. The first thing that I wanted to discuss with Gregory was his perception of the big data skills gap. Many have claimed that this could just be a flash in the pan and something that has been manipulated, rather than something that actually exists. Gregory references the McKinsey report of May 2011, which states: "There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts
with the know-how to use the analysis of big data to make effective decisions." The report predicts that this skills gap will emerge by 2018, but Gregory believes that we are already seeing it. Using Indeed.com to look at the expertise companies are searching for, Gregory found that both MongoDB and Hadoop appear among the top 10 job trends. "Big Data is actually rising faster than any of them. This indicates that demand for Big Data skills exceeds the supply. My experience with KDnuggets jobs board confirms it - many companies are finding it hard to get enough candidates." People are responding to this, however, with many universities and colleges recognising not only the shortages, but also people's desire to learn. Companies looking to expand their data teams are also looking at both internal and external training. Companies such as EMC and IBM, for instance, are training their data scientists internally. Not only does this mean that they know they are getting a high quality of training, but also that the data scientists they employ are being educated in 'their ways'. With qualified candidates hard to find, training programs like this let companies hire promising candidates and make sure they are sufficiently qualified afterwards.
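Gregory's Indeed comparison boils down to comparing growth curves of job-posting counts for different skill terms. A toy version of that comparison, using made-up monthly counts rather than real Indeed data, might look like this:

    # Toy comparison of job-posting growth for a few skill terms.
    # The monthly counts are invented for illustration only.
    import pandas as pd

    postings = pd.DataFrame(
        {
            "big data": [120, 180, 260, 390, 560],
            "hadoop":   [300, 360, 430, 500, 570],
            "mongodb":  [150, 180, 210, 250, 290],
        },
        index=pd.period_range("2013-01", periods=5, freq="M"),
    )

    # Growth relative to the first month: the steeper the curve, the faster
    # demand for that skill is rising.
    relative_growth = postings / postings.iloc[0]
    print(relative_growth.round(2))
    print("Fastest-growing term:", relative_growth.iloc[-1].idxmax())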
40
Gregory Piatetsky-Shapiro
The IBMs and EMCs of this world are few and far between, however. The money that needs to be invested in in-depth internal training is considerable, and many companies would struggle with this proposition. So what about those other companies? How can they avoid falling through the big data skills gap? Gregory thinks that most companies have three options.

Do you need BIG data? Most companies confuse big data with basic data analysis. With the current buzz around big data, many companies are over-investing in technology that realistically isn't required. A company with 10,000 customers, for instance, does not necessarily need a big data solution with multiple Hadoop clusters. Gregory makes the point that on his standard laptop he could process the data of a large software company with 1 million customers. Companies need to ask whether they really need the depth of data skills they think they do.
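Gregory's 'do you actually need big data?' point is easy to sanity-check. The sketch below, using randomly generated customer records rather than any real dataset, shows that a million-customer table with a handful of fields occupies only tens of megabytes and can be summarised in memory on an ordinary laptop.

    # Back-of-envelope check that one million customer records is laptop-scale.
    # The data is randomly generated purely for illustration.
    import numpy as np
    import pandas as pd

    n_customers = 1_000_000
    rng = np.random.default_rng(0)
    customers = pd.DataFrame({
        "customer_id": np.arange(n_customers),
        "segment": rng.integers(0, 5, n_customers),
        "monthly_spend": rng.gamma(2.0, 30.0, n_customers),
        "support_tickets": rng.poisson(0.3, n_customers),
    })

    size_mb = customers.memory_usage(deep=True).sum() / 1e6
    print(f"In-memory size: {size_mb:.0f} MB")
    print(customers.groupby("segment")[["monthly_spend", "support_tickets"]].mean())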
What if you do need it? For large companies who may need to manage larger data sets, the reality is that it is not necessary to employ a big data expert straight from university. Gregory makes the point that somebody who is trained in MongoDB can become a data scientist relatively easily. If an internal training programme is not a realistic option, then external training may be the best route.
There are several companies, such as Cloudera among many others, who can offer this and train data scientists to a relatively high standard. Gregory also mentions that one way several companies are learning about big data and analytics is by attending conferences. There are now hundreds of conferences a year on Big Data and related topics, from leaders in the field such as Innovation Enterprise to smaller conferences all around the world.

What if these are untenable? Some Big Data and analytics work can be outsourced or given to consultants. This not only frees up the existing data team for specific tasks, but also means companies do not have to risk taking on a full-time employee who may not be sufficiently qualified. Here, the leading providers include IBM, Deloitte, Accenture and pure-play analytics outsourcing firms like Opera Solutions and Mu Sigma.

Having discussed the big data skills gap with several people who have worked in the field for years, one of their main concerns is the hype affecting the long-term viability of the business function. Gregory does not share this concern, but does make it
41
Gregory Piatetsky-Shapiro
clear that we need to make sure the buzzword 'big data' is separated from the technological trend. He has written in Harvard Business Review about his belief that the 'sexy' side of big data is being overhyped. The majority of companies who have implemented big data have done so in order to predict human behaviour, but this is not something that can be done consistently. Therefore, Gregory believes that any disillusionment with big data will not come from an inability to find the right talent, but from its build-up not living up to reality. On the other hand, Gregory is quick to point out that the amount of data we are producing will continue its rapid growth for the foreseeable future. This data will still need people to manage and analyze it, so we are going to continue to see growth even if the initial hype dies down. We are also seeing increasing interest in countries outside the US, the current market leader. This global interest is likely to increase the big data talent pool and therefore allow for expansion. Using Google Trends, Gregory also looked at the top 10 countries searching for 'Big Data'.
Given the interest from elsewhere, we are going to see an increasingly globalized talent pool and potentially the migration of the big data hub from the US to Asia. Gregory also points out that, because the top five countries do not have English as a primary language (the trend analysis covered only English-language searches), this likely does not represent every search for big data in those countries. This interest certainly shows that the appetite for big data education exists globally.
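For anyone who wants to repeat the exercise, the ranking itself is trivial once the regional interest numbers are in hand. A minimal sketch, assuming a CSV exported from Google Trends' interest-by-region view (the file and column names below are assumptions, not a fixed format):

    # Rank countries by relative search interest in 'Big Data'.
    # Assumes a CSV exported from Google Trends' interest-by-region view;
    # the file name and column names are assumptions for illustration.
    import pandas as pd

    regions = pd.read_csv("big_data_interest_by_region.csv")  # columns: country, interest
    top10 = regions.sort_values("interest", ascending=False).head(10)
    print(top10.to_string(index=False))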
Those working in the big data educational sphere are also utilising technology to increase their effectiveness. Gregory points out that many companies are using analytics within their online education to make the experience more productive for both students and teachers. Through this technology, big data education is becoming more effective and more accessible to a truly international audience.

One thing that is clear about big data is that in order to succeed you need curiosity and passion. The other aspects of the role can always be trained, and the range of options and platforms for this means that in the coming years we will see the skills gap closing. Gregory is a fine example of somebody who has not only managed to innovate within the industry for the past 25 years, but was also one of the first to try to share the practice widely. If we can find even one person working in data with the same passion and curiosity as him, then the quality and breadth of education can continue to grow at the same speed as this exciting industry.
42
Stephen Wolfram
Big Data Innovation with Stephen Wolfram Heather James Big Data Innovation Summit Curator
43
Stephen Wolfram
At the Big Data Innovation Summit in Boston in September 2013, Stephen Wolfram took the stage to deliver a presentation that many have described as the best of the hundreds that took place over the two-day event.
Discussing his use of data and the way that his Wolfram Alpha programme and Mathematica language are changing the ways that machines utilise data, he had the audience enthralled. I had initially arranged to sit down with Stephen immediately following his presentation, but I was forced to wait for several hours due to the crowds surrounding him as soon as he finished. The 20 people surrounding him for an hour after his presentation were testament to Stephen's achievements over the past 25 years. Having spoken to others around the conference, the most common adjective was 'brilliant'.

During the afternoon I did manage to sit down with Stephen. What I found was a down to earth, eloquent man with a genuine passion for data and the way that we are using it as a society.

Stephen is the CEO and founder of Wolfram Alpha, a computational knowledge engine designed to answer questions using data rather than suggesting results like a traditional search engine such as Google or Bing. Wolfram Alpha is the product of Stephen's ultimate goal: to make all knowledge computational, able to answer and rationalise natural language questions into data driven answers. He describes it as 'a major democratisation of access to knowledge', allowing people the opportunity to answer questions that previously would have required a significant amount of data and expert knowledge. According to Stephen the product is already used everywhere from education to big business; it is a product on the up.

Many will claim that they have never used the system, however anybody who has asked Apple's Siri system on the iPhone a question will have unwittingly experienced it. Along with Bing and Google, Wolfram Alpha powers the Siri platform, enabling users to ask questions in everyday language and have them translated into data driven answers.

What Wolfram Alpha is doing differently to everybody else at the moment is taking publicly held knowledge and using it to answer questions rather than simply showing people how to find the information. It allows users to draw on the information that others have gathered to find interesting and deep answers.

Stephen has a real passion for data, not only through Wolfram Alpha and his mission to make knowledge computational, but also on a personal level. He is the person who holds the record for collecting the most data about himself; he has been measuring this for the past 25 years.
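Wolfram Alpha's question answering is also available to developers. A minimal sketch of calling it from Python, assuming you have registered for an AppID; the Short Answers endpoint and parameter names below reflect Wolfram's public REST API, but check them against the current documentation before relying on them:

    # Sketch: ask Wolfram Alpha a natural-language question and print the
    # plain-text reply from its Short Answers endpoint.
    # APP_ID is a placeholder; obtain a real one from developer.wolframalpha.com.
    import requests

    APP_ID = "YOUR_APP_ID"

    def ask_wolfram(question: str) -> str:
        response = requests.get(
            "https://api.wolframalpha.com/v1/result",
            params={"appid": APP_ID, "i": question},
            timeout=10,
        )
        response.raise_for_status()
        return response.text

    if __name__ == "__main__":
        print(ask_wolfram("How far is the Moon from Earth?"))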
44
Stephen Wolfram
He can see this becoming more and more popular in wider society, with wearable personal measurement technologies becoming increasingly popular.

This change in the mindset of society as a whole, to a more data driven and accepting society, is what Stephen believes to be the key component in Wolfram Alpha becoming what it now is. Stephen says that he always knew there would come a time when society had created enough data to make Wolfram Alpha viable, and that time is now. This is testament to how far we have come as an industry, that we can now power something like Wolfram Alpha through the amount of data that we have now recorded. It is a real milestone in the development of a data driven society. The reason for this, according to Stephen, is that many of the key data sources haven't been around for a long time; things like social media and machine data have allowed this shift to occur. He only sees this trend continuing, with increasing numbers of machine driven sensors collecting data.

With the use of data at Wolfram Alpha now hitting an all time high, I was curious about where Stephen thought big data would be in five years' time. He believes that the upward curve will only continue; personal analytics will become part of a daily routine and this will only see the amount of data increase. He also sees the use of science and mechanics having a profound effect on the ways in which companies utilise their data. We will see analysis looking at more than just numbers, giving these numbers meaning through scientific principles.

Overall, what I have learnt from talking to Stephen is that data is the future in more than just a business context. Software that allows people to mine data without realising they are even doing it will be important to the development of how we use information. Wolfram Alpha is changing the data landscape and, with the passion and genius of Stephen Wolfram behind it, who knows how far it could go.
45
Data Transparency
Data Transparency Claire Walmsley Big Data Expert
46
Data Transparency
Recently, companies have gained a bad reputation for the way they hold individuals' information. There have been countless data leaks, hackers exposing personal details and the exploitation of individual data for criminal activities. The world's press has had its attention drawn towards data protection and individual data collection through the NSA and GCHQ spying scandals. Society in general is becoming more aware of the power that its data holds, and this, combined with the increased media attention, has led to consumers becoming more data savvy. Companies like Facebook and Google have made billions of dollars through their efficient use of data and are now looked at warily by many. Although major data secrecy violations are yet to occur at either organisation, the reality is that people know that data is held about them and need to trust the company keeping it. So how can companies become more trustworthy with their customer data? One of the keys to success within
a customer base is trust, and the best way to gain it is through transparency. Allowing people to see what kind of information a particular company holds on them creates trust. Outlining exactly what is held on people creates an understanding of what the
information is used for. A sure-fire way to lose trust is the 'if you don't ask, you don't get' approach to data collection visibility. This is the idea that, when reading complex or overly long agreements, the data protection aspects are available but not explicitly stated. In reality this is much of what has happened in several cases, with information management details buried in the small print, so that although they are technically accessible, they are not effectively communicated. The best way to avoid this is to make it clear: send an email, have a separate section or even a blog outlining how data is being used and why.
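One concrete way to do that is to publish, alongside the privacy policy, a plain summary of what is held, why, and how to opt out. The structure below is a purely hypothetical illustration of such a disclosure rather than any existing standard:

    # Hypothetical example of a per-customer data-use disclosure a company
    # could expose; the fields and values are invented for illustration.
    import json

    disclosure = {
        "customer_id": "example-123",
        "data_held": [
            {"category": "contact details", "purpose": "order delivery and receipts"},
            {"category": "purchase history", "purpose": "recommendations and support"},
            {"category": "browsing activity", "purpose": "advertising relevance"},
        ],
        "shared_with": ["payment processor", "delivery partner"],
        "retention": "deleted 24 months after last activity",
        "opt_out": "https://example.com/privacy/preferences",
    }
    print(json.dumps(disclosure, indent=2))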
47
Data Transparency
It is very seldom that people's data is used in manipulative or sinister ways; making them aware of how their data is improving their experiences will make an audience far more receptive to it being used.

At the moment there are ways you can check on certain elements of how your data is being used. With a Google account you can see what Google has matched to your profile here: google.com/dashboard/ This allows you to see who Google presumes you are based on your browsing history and what ads are therefore targeted towards you. It is often interesting to see what your actions online say about you. This level of detail is a move in the right direction for companies, but it still leaves the feeling that there isn't total transparency. With the pressures of data protection surrounding most companies today, this kind of move would allay many of the fears that consumers currently have when their data integrity is in question. What the industry needs today is consumer trust, and transparency is one of the key components to achieving it.
48
FOLLOW US @IE_BIGDATA
SUBSCRIBE BIT.LY/BIGDATASIGNUP
BIG DATA 2014 CALENDAR

Categories: Flagship, CXO, Women, Finance, Hadoop, High Tech, Healthcare, Government, Pharma, Oil & Gas, Expected

January
Big Data Innovation Summit, January 22 & 23, Las Vegas

February
Hadoop Innovation Summit, February 19 & 20, San Diego
The Digital Oilfield Innovation Summit, February 20 & 21, Buenos Aires
Big Data & Analytics Innovation Summit, February 27 & 28, Singapore

March
Big Data Innovation Summit, March 27 & 28, Hong Kong

April
Data Visualization Summit, April 9 & 10, Santa Clara
Big Data Innovation Summit, April 9 & 10, Santa Clara
Big Data Infrastructure Summit, April 9 & 10, Santa Clara

May
Big Data Innovation Summit, May 14 & 15, London
Big Data & Analytics in Healthcare, May 14 & 15, Philadelphia
Big Data & Advanced Analytics in Government, May 21 & 22, Washington, DC
Chief Data Officer Summit, May 21 & 22, San Francisco

June
Big Data Innovation Summit, June 4 & 5, Toronto
Big Data & Analytics for Pharma, June 11 & 12, Philadelphia
Big Data & Predictive Analytics Summit, June 18 & 19, Chicago

September
Big Data Innovation Summit, September 10 & 11, Boston
Data Visualization Summit, September 10 & 11, Boston
Big Data & Analytics Innovation Summit, September 17 & 18, Sydney

October
Big Data & Analytics in Retail, October 15 & 16, London
Big Data Innovation Summit, October 30 & 31, Bangalore

November
Big Data & Analytics for Pharma, November 5 & 6, Philadelphia
Big Data & Marketing Innovation Summit, November 12 & 13, Miami
Data Science Leadership Summit, November 12 & 13, Chicago
Big Data Fest, November 27, London
Big Data Innovation Summit, November 27 & 28, Beijing

December
Big Data in Finance Summit, December 3 & 4, New York
Big Data & Analytics in Banking Summit, December 3 & 4, New York

Partnership Opportunities: Giles Godwin-Brown | ggb@theiegroup.com | +1 415 692 5498
Attendee Invitation: Sean Foreman | sforeman@theiegroup.com | +1 415 692 5514