CONTENTS Big Data Goes Quantum - p.4 Chris Towers looks at what effect quantum computing will have on Big Data in the future
Big Data in Corporate America - p.8 Randy Bean and Paul Barth share the results of a recent survey around big data usage in US corporations
The Day Data in Healthcare Went Big - p.11 David Barton looks at February 19, the day that big data went big in healthcare
How Real is the Big Data Gap? - p.14 Tom Deutsch investigates the big data skills gap and whether it exists
Big Data in Tennis - p.16 Damien Demaj uses spatial analytics to investigate the Andy Murray vs Roger Federer Olympics final
Interview With Simon Thompson - p.21 George Hill speaks to BT’s head researcher about the changing role of data in companies and how it’s best utilized
HAVE AN IDEA? What would you like to see our writers covering or do you have any innovative Big data articles that you want to contribute? For more information contact George at ghill@theiegroup.com
WANT TO SEE YOURSELF HERE? We have several advertising opportunities available to cater to all budgets and can offer bespoke packages to suit you and your business. Contact Pip at pcurtis@theiegroup.com for more information.
LETTER FROM THE EDITOR
Welcome to this issue of Big Data Innovation. The pace of growth in big data has not relented as we have moved further into 2013, and predictions that growth would slow seem to have been wide of the mark. This is great, because all of us here still think that this is one of the fastest growing and most innovative industries to be a part of.
In this issue we discuss the ways in which big data could change in the future, with a special editorial discussing various possibilities in addition to Chris Towers looking at the role that quantum computing may play in the future of data collection and analysis. Tom Deutsch gives us an in-depth industry perspective on how the perceived skills gap is affecting the industry, and Randy Bean and Paul Barth release the findings from their recent big data survey. With big data moving into increasingly varied uses, Damien Demaj breaks down the spatio-temporal data he collected during the Andy Murray and Roger Federer gold medal match at the Olympics. All this, plus an interview with Simon Thompson, makes this a great edition.
We hope you enjoy this issue and if you like it, share it!
GEORGE HILL CHIEF EDITOR
Welcome to the neighborhood.
The leader in NoSQL Database Solutions for Big Data Analytics
Visit us at upcoming shows:
Big Data Innovation Summit - Booth S12, San Francisco, April 11-12
Enterprise Data World - Booth 207, San Diego, April 28-May 2
Big Data Boot Camp - New York City, May 21-22
www.objectivity.com
EDITORIAL: BIG DATA IN THE FUTURE, WHERE WILL WE BE?
In this magazine, Chris Towers examines the effect that quantum computing could have on the Big Data and Analytics communities. But what else can we expect to see from big data in the next few years? In the last issue I discussed big data with NASA's principal scientist, Ashok Srivastava, who mentioned the potential that some of their technologies will have in the civilian space. One of the aspects discussed was the idea that algorithms and hardware could be created to collect and analyze data on much bigger scales than our current capabilities allow. This will have a huge impact on the industry in the next few years. Combined with access to quantum computers, which is also being discussed, this could see the amount of data being analyzed multiply hundreds of times over compared to our current capabilities.
This could see companies truly become data focussed, able to drill down into customer insights and meanings to an extent they would never have dreamed of before. It could also produce a huge boom in targeted products, creating niche companies that would otherwise not have had the insight to create products and services for smaller audiences, simply because finding those audiences was too difficult. Conversely, with this kind of powerful analytics there is likely to be a significant increase in regulation around data and its uses. With the power that companies will gain from deeper and deeper knowledge of their customers, a line will need to be drawn in order to protect the consumer. We have seen small steps towards this before, with the regulation of email communications as well as the new rules on website cookies introduced in 2012. These steps show that governments and regulatory bodies are taking the protection of individuals' data seriously, something that will only increase in the future.
Combine this with the steady rise of social media and the amount of information that people are willing to share about themselves, and the distinction between public and private data will become increasingly blurred. At the moment there is still considerable confusion amongst the public about what defines public data and private data. With new social media tools demanding different information from their users every day, the chances of this being easily cleared up are slim.
So where does this leave big data? Essentially, big data will become vastly more powerful, with the amount of information available increasing exponentially over the next few years. What effect this has on companies and customers will depend heavily on the reaction of authorities to the new data influx. With companies such as Google and HP experimenting with new computing and processing techniques, this is not a thought for tomorrow, but something that is being built on today. Following on from the interview with Ashok, where we discussed the potential for growth in the scale of data collection, the power to process the data as well as collect it is going to become increasingly important. There are several ways this may be achieved, one being cloud computing, which may give people access to computing power above and beyond what they could afford on site, much as we saw with PCs once consumers became interested in their uses. Perhaps the use of big data by consumers will be the real catalyst driving innovation within companies, and with products like the Nike Fuel Band and sleep-recording apps, this is not a far-fetched concept.
GEORGE HILL CHIEF EDITOR
DATA GOES QUANTUM
Chris Towers looks at the ways that quantum computing could change the collection and processing of data within the next decade.
When we look back at the computers of the 1980s, with their two-tone terminal screens and command-driven formats, and then look at our colorful and seemingly complex laptops, we assume there must have been a huge revolution in the basic building blocks of these machines. In reality, however, they are still built on the same principles and the same technological theories.
In essence, all modern computers currently work by using silicon transistors on a chip. These transistors can be either on or off, dictating the function of the computer in order to perform tasks and calculations. The computers you are using today rely on the same technology as the first computers of the 1940s, albeit in a far more advanced state.
Whereas the transistors of the 1940s were bulky, today IBM can fit millions of transistors onto a single chip, allowing superior processing power while still using the same theories as when computers were first invented. One of the ways this is possible is through the miniaturization of transistors, allowing far more of them to fit into a smaller space. The issue, in terms of future iterations, is that a transistor can only be so small. At the atomic level transistors cease to function, meaning that when this point is reached, computers will again have to start growing in order to meet the demand for faster computing. With the current need for faster computers driven by the data revolution and the ever increasing number of companies adopting analytics and big data, in a few years' time there will need to be a change.
The idea that there could be a different method of computing aside from transistors was originally theorized in 1982 by Richard Feynman, who pioneered the idea of quantum computing. So what does this have to do with big data?
"Anyone who is not shocked by quantum theory has not understood it"
The idea behind quantum computing is complex because it rests on the highly counterintuitive quantum theory, famously described by Niels Bohr: "anyone who is not shocked by quantum theory has not understood it". Current computers work by utilizing 'bits', which are either on or off, which makes the way quantum computing works difficult to grasp. Quantum computers instead use 'qubits', which can be on, off or both, meaning that a qubit can essentially be in two states at the same time. What makes this complex is that it goes beyond ordinary digital thinking: it is essentially like watching somebody flip a coin and seeing it land on both heads and tails at the same time. The implication for big data is that, because of these multiple states, the information processing power is almost incomprehensible when compared to current traditional computers.
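To make the "both at once" idea a little more concrete, here is a minimal sketch of a single qubit as a pair of amplitudes, written in Python with NumPy. It is purely illustrative and is not how a real quantum machine such as D-Wave's is programmed; the equal-superposition amplitudes are an assumed example.

```python
import numpy as np

# A classical bit is either 0 or 1; a qubit is a normalized pair of
# complex amplitudes (alpha, beta) over the states |0> and |1>.
alpha, beta = 1 / np.sqrt(2), 1 / np.sqrt(2)   # an equal superposition
qubit = np.array([alpha, beta], dtype=complex)

# The amplitudes must satisfy |alpha|^2 + |beta|^2 = 1.
assert np.isclose(np.sum(np.abs(qubit) ** 2), 1.0)

# Measuring collapses the superposition: we observe 0 with probability
# |alpha|^2 and 1 with probability |beta|^2 -- the "coin that lands on
# both sides" only resolves to one side when it is looked at.
probabilities = np.abs(qubit) ** 2
samples = np.random.choice([0, 1], size=10, p=probabilities)
print(probabilities)   # [0.5 0.5]
print(samples)         # e.g. [0 1 1 0 1 0 0 1 1 0]
```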
In fact, the processing power of quantum computers has had experts describing them as the hydrogen bomb of cyber warfare. Although several years away, it is predicted that, with the current uptake of cloud usage amongst companies, within 15 years the majority of civilian companies will have access to quantum computers to deal with their ever increasing volumes of information.
"...within 15 years the majority of civilian companies will have access to quantum computers"
The first commercially available quantum computer has been produced and one has been sold, although they are currently being sold by their maker, D-Wave, for $10 million each and are housed in a 10m x 10m room. However, despite the issues with both size and price at the moment, Google has begun experimenting with quantum computers. After conducting tests with 20,000 images, half with cars and half without, the quantum computer could sort the photos into those including cars and those not considerably faster than anything in Google's current data centers.
"Considerably faster than anything in Google's current data centers."
This kind of real world experimentation and investment by major civilian companies will help push quantum computing into the mainstream, allowing companies working with big data to access and use larger data sets quicker than they ever could before. Although prohibitive pricing and logistics currently make quantum computers unviable for regular companies, in a decade's time the changes brought about by Google and other early adopters may well spark a genuine revolution in computing. A potential way to overcome the cost would be to use the cloud to access quantum computers; this would not only drive prices down for infrequent usage but would also make quantum computing a viable option for most companies.
CHRIS TOWERS BIG DATA LEADER
Meet Lyn. Lyn’s challenge: As an executive, she’s expected to break down her business’ goals into objectives and create the KPIs to determine if objectives are being met. It sounds simple, and it would be—if any of the data or systems were integrated. Which they’re not and never will be. Lyn’s biggest problem: The data itself provides no competitive advantage. The advantage comes from how fast her team can transform data into actionable intelligence. Lyn needs a way to bring data—from disparate sources—together into real-time visual analytics, like KPIs and dashboards, for insight. And Lyn’s not alone in this struggle. In a cross-industry survey of IT execs, NewVantage Partners found 52% want the ability to integrate and analyze a wide variety of data—ideally, in real time.
Real Time. Real Impact. Real-Time Visual Analytics. Lyn’s Solution: JackBe Presto IT’S EASY
By providing real-time intelligence, Lyn arms her team with the tools to make immediate, informed decisions about changing business conditions—as they happen.
IT’S FAST
With Presto, she can quickly combine data from all of her existing systems, including Big Data, into easy-to-use, real-time analytics dashboards.
IT’S VISUAL
But accessing real-time data isn’t enough. Lyn needs a customized look at specific pieces of information to decide if something is off-track, what’s causing it to be off-track and what to change to get back on course. Presto displays the exact data elements she needs to create intuitive data visualizations that guide decisions.
JackBe's special report on Real-Time Analytics offers insight into bringing your unconnected data together for real-time decision making.
Download it at goto.jackbe.com/RTBDA.
Real Time. Real Impact.
Big Data in Corporate America: A Current State View
“The results suggest that Big Data has arrived”
Commentary on the topic of Big Data can be found far and wide these days. However, it was barely 24 months ago when the term was largely unknown to business and technology executives alike. Data was often viewed as a domain inhabited by PhD statisticians, computer scientists and market researchers, despite Fortune 1000 corporations and Federal agencies having engaged in data-driven activities such as analytics, predictive modeling, database marketing and business intelligence for the better part of several decades.
In response to this surge of interest, we sought the views of the most senior executive decision-makers – C-suite and functional business line heads – with direct responsibility, ownership and influence on data initiatives and investments for Fortune 1000 firms and Federal agencies. The survey was conducted during the summer of 2012, with follow-on detailed discussions and briefings taking place through the end of the year. The result is our summary report, Big Data Executive Survey 2013, which reflects the perspectives of more than 50 senior business and technology executives representing Fortune 1000 corporate America and large Government agencies, comprising multiple divisions, operating globally as well as nationally, and serving many markets and customer constituencies.
85% of respondents have a big data initiative planned or in progress
The results suggest that Big Data has arrived, is being tested and is in use at major organizations. 85 percent of the respondents have a Big Data initiative planned or in progress, and almost half are using Big Data in some type of production/operational capacity, ranging from production reporting to 24/7 mission-critical applications. For these organizations, Big Data offers the promise of improved data-driven decision-making.
Why Big Data? Why Now?
The primary reason senior decision makers gave for investing in Big Data is to improve their organization's analytics capabilities and make smarter business decisions. Most respondents cited the ability to integrate and analyze a wide variety of data as the primary value of big data platforms. They are using big data to address significant gaps in accessing relevant data and developing analytic skills. Over half the respondents rate their "access to relevant, accurate and timely data" as less than adequate, and only 21.3% rank their company's analytic capabilities as "more than adequate" or "world class." Whether they rank their current analytics capabilities as "minimal" or as "world class," all respondents are looking for Big Data to have a major impact on their business.
25% of respondents employ over 500 analysts
Advanced companies with large analytic staffs (over 25 percent of respondents have over 500 data miners and analysts) are seeking to push the productivity of these teams with advanced tools and automation, while less mature companies hope to use big data to leapfrog their capabilities to get to parity.
Most are Exploring, But Some Leaders are Emerging
While most organizations have begun allocating budgets for exploration and growth, some leaders are making a major commitment to Big Data. About a quarter of the companies are spending under one million dollars annually to evaluate Big Data, while half are launching preliminary solutions with budgets between one and ten million dollars. A quarter of the respondents are committed, spending more than $10 million on Big Data this year, with some projecting spending over $100 million annually within 3 years. For half the respondents this is an incremental spend, and it could be just the beginning of increased Big Data budgets: on average, each respondent will spend more than $13.5 million annually on Big Data in 3 years.
Fulfilling the Promise of Big Data
It is important to put Big Data into proper perspective. There is a distinction between what's going on in Corporate America today and the experience of, and initiatives underway at, companies that were born into Big Data, like Facebook, Google, LinkedIn, eBay and Yahoo – companies that have powerful Big Data efforts underway, but have never faced the challenge of dealing with older legacy data and systems.
25% of respondents are spending over $10 million on Big Data in 2013
Big Data should not be viewed as a panacea. The promise of Big Data has been an emerging theme in recent years, yet companies have been trying to leverage data to gain critical insights and help improve their businesses for decades. Some of the tools and capabilities are new, and certainly the economics of accessing and managing data have improved, but many of the challenges remain the same. Big Data presents organizations with new capabilities for driving business value, but realizing the potential will only come with careful formulation of sound strategies accompanied by thoughtful execution plans. NewVantage Partners' Executive Survey, and our resulting discussions with firms that are at the forefront and in the mainstream of corporate Big Data initiatives, have highlighted both the opportunity and the challenges that lie ahead.
About the Authors
Paul Barth, PhD, and Randy Bean are managing partners at NewVantage Partners.
PAUL BARTH & RANDY BEAN CONTRIBUTORS
FEBRUARY 19, THE DAY THAT BIG DATA IN HEALTHCARE WENT MAINSTREAM
February 19 could mark the day in history when big data in healthcare and disease prevention hit the mainstream. Two significant events happened that will shape the future of big data usage in the prevention of diseases and the speed with which diseases can be cured. The announcement that three of Silicon Valley's stars were funding a life sciences prize to help promote scientific endeavours aimed at identifying and eradicating diseases was met with widespread praise. Mark Zuckerberg (founder and CEO of Facebook), Sergey Brin (co-founder of Google) and Yuri Milner (technology venture capitalist) have launched the annual prize, which will see 11 scientists who are leading global initiatives to cure or prevent diseases receive a $3 million payout.
"February 19 will mark the day in history where big data in healthcare and disease prevention hit the mainstream."
They are hoping that, with the recognition and financial reward these awards will bring to the winners, more people will be willing to take on what has traditionally been seen as an underfunded and unappreciated endeavour. The injection of an additional $33 million annually to help with this work will also see successful initiatives push forward with their thinking and experimenting. This means that where before a project might see success but be hindered by budget constraints, now it will be able to push on and make a real difference at a rapid pace. All three founders of these awards are innovators and forward thinkers. Brin has built one of the world's most innovative companies from the ground up, adopting new ideas and branching out from what would have been a traditional search engine. Zuckerberg forever changed the way the world uses social media and interacts with one another through a simple yet effective way of sharing information. Although less well known, Milner arguably has as much foresight as the other two, as you cannot become a successful venture capitalist without seeing the potential in products and investing in them at the right time. However, at first glance this kind of humanitarian work, although a fantastic idea, is not within their traditional remit.
The second piece of news on the 19th that works with this announcement in pushing big data to the forefront of medical thinking is the announcement from Bina that they have launched the first commercially viable big data product to be used in genomics. This may seem like it has been done before, but in reality the cost effectiveness of this will push forward the analysis of diseases and cures at a vastly increased rate. The cloud element of the technology allows institutions to quickly,
easily and, most importantly, cheaply test theories and experiment with potential new genetic codes to eradicate certain diseases.
"The injection of an additional $33 million to help with this work annually will also see successful initiatives push forward with their thinking and experimenting."
Before, the issue holding back real progress was that the only people really pushing on these subjects were institutes and universities. This meant that many of the projects were not adequately funded, and given that one traditional genome test would cost around $1,000, the number that could be undertaken was severely limited. With this new
technology, the numbers that can be done are vastly increased whilst the cost will be significantly cut. The mixture of cheaper analysis through this new technology and the increased investment and recognition through the new awards will have a significant impact on the speed of analysis. This will also offer increased hope of curing diseases like cancer within the coming years. Although it is not known if these two events were orchestrated to occur on the same day, one thing that is certain is that February 19 will go down in history as the day that big data and disease prevention started to save lives. Whether intentional or not, the implications that this will have on the future of healthcare will once again see these three Silicon Valley stars innovating another major aspect of our lives. David Barton ANALYTICS LEADER
SO HOW REAL IS THE BIG DATA SKILLS GAP?
Big Data technologies certainly haven't suffered for a lack of market hype over the past 18 months, and if anything the noise in the space is continuing to grow. At the same time there's been a lot of discussion of a skills gap that stands in the way of making effective use of these new technologies. Not surprisingly, one of the most frequent questions I am asked is "how do those two come together in the real world?" So what should smart firms be aware of as they investigate how to get started using these new technologies with the people and skills they currently have?
It makes sense to break this question down into three skills components: administration, data manipulation and data science. The skills required in each of these areas differ, and functionally they represent different challenges. You are not likely to face the same challenges and staffing considerations in all three.
Administration – this includes the functional responsibilities for setting up, deploying and maintaining the underlying systems and architectures. I would argue that no real skills gap exists here today for Enterprises of any significant size. It may take a little time to understand how some of the systems, such as Hadoop, differ in scaling horizontally and handling availability, but generally speaking setup, configuration and
administration tasks are reasonably well-documented and your existing server and network administrators can successfully manage them. In fact, compared to vertically scaled traditional technologies, the administration component here can be dramatically simpler than what you are used to. Keep in mind that this is a space where the hardware and software vendors are starting to compete on manageability and an increasing number of appliance options exist, so over time this will become easier and easier to manage. To oversimplify in the interest of space: if your team can manage a LAMP stack you should be fine here.
Data Manipulation – here is where the fun really starts and also where you may first encounter issues. This is when you actually start working with the data in new ways, and not surprisingly it is likely to be the first place that a skills gap appears. In practice I would suggest planning for a gap appearing here – how mild or severe a gap depends upon several factors. These factors boil down to two big buckets: first, can your people manipulate the data in the new platforms, and second, do they know how to manipulate the data in valid ways.
The first issue – can you manipulate the data – often comes down to how much scripting and/or Java development experience your teams have. While the tools and abstraction options are improving rapidly across most of these technologies, there is usually no escaping having to dive into scripting or even write some Java code at some point. If your teams are already doing that, no big deal. If they aren't already familiar with and using scripting languages then there is some reason for pause. Similarly, while there are interface options that are increasingly SQL-like, if your teams aren't experienced in multiple development languages you should expect some learning curve. Can they push through it? Almost certainly, just budget some time to allow that to happen. As noted above, this will get easier and easier over time, but do not assume that tools will prevent the need for some coding. This is where you are going to spend the bulk of your time and people, so make sure you are being realistic about your entry-point skills here.
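As a rough illustration of the sort of light scripting this layer tends to involve, here is a minimal Hadoop Streaming job written in Python. The log format and field positions are hypothetical, and this is a sketch rather than a recommended pipeline; the point is simply that a small amount of code is usually unavoidable.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming job, assuming web-log lines like
# "2013-03-01 10:22:01 GET /products/42 200" -- purely illustrative data.
# Run as the mapper with:   -mapper "python logcount.py map"
# and as the reducer with:  -reducer "python logcount.py reduce"
import sys


def map_phase():
    # Emit one (url, 1) pair per request line.
    for line in sys.stdin:
        parts = line.split()
        if len(parts) >= 4:
            print("%s\t1" % parts[3])


def reduce_phase():
    # Streaming sorts mapper output by key, so counts for a URL arrive together.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t", 1)
        if url != current_url:
            if current_url is not None:
                print("%s\t%d" % (current_url, count))
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))


if __name__ == "__main__":
    reduce_phase() if sys.argv[1:] == ["reduce"] else map_phase()
```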
Also keep in mind this isn't the hardest part. In many cases the second challenge is the bigger one – not can you manipulate the data, but how should you manipulate the data. The real open question is what to collect in the first place and how to actually use it in a meaningful way. That, of course, is a bigger issue, which brings us to the data scientist question.
Data Science – so finally to the hotly debated data scientist role. The popular press would have you believe that there is a 10-year-plus shortage of people skilled in data science. At the same time, literally tens of thousands of people have completed open coursework on data science from MIT and others. Another variable in the mix is the evolution of tools that make data collection and analytic routines more commonly available and more easily understood. So where does that put us?
First, it is important to note that there are many, many use cases that never get to this level, such as creating a data landing zone, data warehouse augmentation, and alternative ELT (yes, I wrote that correctly) approaches. No data science is needed there – and, as I've written elsewhere, diving directly into a data science driven project is a lousy idea. But what if you have a project that has a data science dependency – what should you expect?
Frankly, your experience here will differ vastly depending on the depth and robustness of your existing analytics practice. Most large Enterprises already have pockets of expertise to draw on here from their SPSS, SAS or R communities. The data sources may be new (and faster moving or bigger), but math is math and statistics is statistics. These tools increasingly work with these technologies (especially Hadoop), so in some cases your analysts won't even have to leave their existing environments. If you have the existing skills, so far so good. If you
don't have these skills you are going to have to grow, buy or rent them. Growing is slow, buying is expensive, and renting is somewhere in between. Do not expect to be successful taking people with reporting or BI backgrounds and throwing them into data science problems. If you cannot honestly say "yes, we have advanced statisticians who are flexible in their thinking and understand the business", you are going to struggle and will need a grow, buy or rent strategy. We'll pick up effective strategies for dealing with the grow, buy or rent issue, including notions of a Center of Excellence, in future articles.
TOM DEUTSCH CONTRIBUTOR
USING SPATIAL ANALYTICS TO STUDY SPATIO-TEMPORAL PATTERNS IN SPORT
Late last year I introduced ArcGIS users to sports analytics, an emerging and exciting field within the GIS industry. My earlier post, Using ArcGIS for sports analytics, can be read here. Recently I expanded that work by using a number of spatial analysis tools in ArcGIS to study the spatial variation of serve patterns in the London Olympics gold medal match played between Roger Federer and Andy Murray. In this blog I present results that suggest there is potential to better understand players' serve tendencies using spatio-temporal analysis.
The Most Important Shot in Tennis?
The serve is arguably the most important shot in tennis. The location and predictability of a player's serve has a big influence on their overall winning serve percentage. A player who is unpredictable with their serve, and can consistently place their serve wide into the service box, at the body or down the T, is more likely to either win a point outright or at least weaken their opponent's return [1]. The results of tennis matches are often determined by a small number of important points during the game. It is common to see a player win a match having won the same number of points as his opponent. The scoring system in tennis also makes it possible for a player to win fewer points than his opponent yet win the match [2]. Winning these big points is critical to a player's success. For the player serving, the aim is to produce an ace or force their opponent into an outright error, as this could make the difference between winning and losing. It is of particular interest to coaches and players to know how successful a player's serve is at these big points.
Figure 1: Igniting further exploration using visual analytics. Created in ArcScene, this 3D visualization depicts the effectiveness of Murray's return in each rally and what effect it had on Federer's second shot after his serve.
Geospatial Analysis
In order to demonstrate the effectiveness of geo-visualizing spatio-temporal data using GIS, we conducted a case study to determine the following: which player served with more spatio-temporal variation at important points during the match? To find out where each player served during the match we plotted the x,y coordinates of the serve bounces. A total of 86 points were mapped for Murray, and 78 for Federer. Only serves that landed in were included in the analysis. Visually we could see clusters formed by wide serves, serves into the body and serves hit down the T. The K Means algorithm [3] in the Grouping Analysis tool in ArcGIS (Figure 2) enabled us to statistically replicate the characteristics of the visual clusters. It enabled us to tag each point as either a wide serve, serve into the body or serve down the T. The organization of the serves into each group was based on the direction of serve. Using the serve direction allowed us to know which service box the points belonged to. Direction gave us an advantage over proximity, as proximity would have grouped points in neighbouring service boxes.
Figure 2. The K Means algorithm in the Grouping Analysis tool in ArcGIS groups features based on attributes and optional spatial temporal constraints.
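For readers outside ArcGIS, the grouping step can be sketched in a few lines of Python, with scikit-learn's KMeans playing the role of the Grouping Analysis tool. The coordinates and direction values below are made-up stand-ins, since the match data itself is not publicly available.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical serve-bounce records: x, y court coordinates (metres) plus a
# serve-direction angle in degrees -- invented stand-ins for the match data.
serves = np.array([
    [1.2, 5.8, -18.0],   # wide
    [0.1, 6.0,   1.0],   # body
    [-1.3, 5.9, 17.0],   # down the T
    [1.1, 6.2, -16.0],
    [-1.2, 6.1, 15.5],
    [0.2, 5.7,   2.5],
])

# Group the serves into three clusters (wide, body, T). The direction
# attribute does most of the separating, mirroring the grouping-by-direction
# described in the article.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(serves)
for serve, label in zip(serves, labels):
    print(f"bounce at ({serve[0]:.1f}, {serve[1]:.1f}) -> cluster {label}")
```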
To determine who changed the location of their serve the most, we arranged the serve bounces into a temporal sequence by ranking the data according to the side of the net (left or right), court location (deuce or ad court), game number and point number. The sequence of bounces then allowed us to create Euclidean lines (Figure 3) between p1 (x1,y1) and p2 (x2,y2), p2 (x2,y2) and p3 (x3,y3), p3 (x3,y3) and p4 (x4,y4), etc., in each court location. It is then possible to determine who served with greater spatial variation, and who was the more predictable server, using the mean Euclidean distance between each serve location. For example, a player who served to the same part of the court each time would exhibit a smaller mean Euclidean distance than a player who frequently changed the position of their serve. The mean Euclidean distance was calculated by summing all of the distances linking the sequence of serves in each service box and dividing by the total number of distances (a small worked sketch of this calculation follows below).
Figure 3. Calculating the Euclidean distance (shortest path) between two sequential serve locations to identify spatial variation within a player's serve pattern.
To identify where a player served at key points in the match we assigned an importance value to each point based on the work by Morris [4]. The table in Figure 4 shows the importance of points to winning a game when a server has a 0.62 probability of winning a point on serve. This shows the two most important points in tennis are 30-40 and 40-Ad, highlighted in dark red. To simplify the rankings we grouped the data into three classes, as shown in Figure 4. In order to see a relationship between outright success on serve at the important points, we mapped the distribution of successful serves and overlaid the results onto a layer containing the important points. If the player returning the serve made an error directly on their return, this was deemed to be an outright success for the server. An ace was also deemed to be an outright success for the server.
Figure 4. The importance of points in a tennis match as defined by Morris. The data for the match was classified into 3 categories as indicated by the sequential colour scheme in the table (dark red, medium red and light red).
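Here is the small worked sketch referred to above: the mean Euclidean distance between sequential serve bounces, computed with NumPy. The four bounce coordinates are invented for illustration; they are not values from the match.

```python
import numpy as np

# Hypothetical sequence of serve-bounce coordinates (metres) for one player in
# one service box, ordered by game and point number as described above.
bounces = np.array([
    [1.2, 5.8],
    [-1.3, 5.9],
    [0.1, 6.0],
    [1.1, 6.2],
])

# Euclidean distance between each pair of sequential bounces (p1->p2, p2->p3, ...).
steps = np.diff(bounces, axis=0)             # displacement vectors
distances = np.linalg.norm(steps, axis=1)    # shortest-path lengths

# Mean Euclidean distance: the sum of the sequential distances divided by the
# number of distances. A smaller value suggests a more predictable server.
mean_distance = distances.sum() / len(distances)
print(round(mean_distance, 2))
```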
Results
Federer's spatial serve cluster in the ad court on the left side of the net was the most spread of all his clusters. However, he served out wide with great accuracy into the deuce court on the left side of the net, hugging the line 9 times out of 10 (Figure 5). Murray's clusters appeared to be grouped more tightly overall in each of the service boxes. He showed a clear bias by serving down the T in the deuce court on the right side of the net. Visually there appeared to be no other significant differences between each player's patterns of serve.
Figure 5. Mapping the spatial serve clusters using the K Means algorithm. Serves are grouped according to the direction they were hit. The direction of each serve is indicated by the thin green trajectory lines. The direction of serve was used to statistically group similar serve locations.
By mapping the location of the players' serve bounces and grouping them into spatial serve clusters we were able to quickly identify where in the service box each player was hitting their serves. The spatial serve clusters – wide, body or T – were symbolized using a unique color, making it easier for the user to identify each group on the map. To give the location of each serve some context we added the trajectory (direction) lines for each serve.
These lines help to link where the serve was hit from with where the serve landed. They enhance the visual structure of each cluster and improve the visual summary of the serve patterns.
The Euclidean distance calculations showed that Federer's mean distance between sequential serve bounces was 1.72 m (5.64 ft), whereas Murray's was 1.45 m (4.76 ft). These results suggest that Federer's serve had greater spatial variation than Murray's. Visually, we could detect that the network of Federer's Euclidean lines showed a greater spread than Murray's in each service box. Murray served with more variation than Federer in only one service box, the ad service box on the right side of the net. The directional arrows in Figure 6 allow us to visually follow the temporal sequence of serves from each player in any given service box. We have maintained the colors for each spatial serve cluster (wide, body, T) so you can see when a player served from one group into another.
Figure 6. A comparison of spatial serve variation between each player. Federer's mean Euclidean distance was 1.72 m (5.64 ft) – Murray's was 1.45 m (4.76 ft). The results suggest that Federer's serve had greater spatial variation than Murray's. The lines of connectivity represent the Euclidean distance (shortest path) between each sequential serve bounce in each service box.
At the most important points in each game (30-40 and 40-Ad), Murray served out wide targeting Federer's backhand 7 times out of 8 (88%). He had success doing this 38% of the time, drawing 3 outright errors from Federer. Federer mixed up the location of his 4 serves at the big points across all of the spatial serve clusters: 2 wide, 1 body and 1 T. He had success 25% of the time, drawing 1 outright error from Murray. At other, less important points Murray tended to favour going down the T, while Federer continued his trend of spreading his serve evenly across all spatial serve clusters (Figure 7).
Figure 7. A proportional symbol map showing the relationship of where each player served at big points during the match, and their outright success at those points.
The proportional symbols in Figure 7 indicate a level of importance for each serve. The larger circles represent the most important points in each game – the smallest circles the least important. The ticks represent the success of each serve. By overlaying the ticks on top of the graduated circles we can clearly see a relationship between success on serve at the big points. The map also indicates where each player served. The results suggest that Murray served with more spatial variation across the two most important point categories, recording a mean Euclidean distance of 1.73 m (5.68 ft) to Federer's 1.64 m (5.38 ft).
Conclusion
Successfully identifying patterns of behavior in sport is an on-going area of work [5] (see Figure 8), be that in tennis, football or basketball. The examples in this blog show that GIS can provide an effective means to geovisualize spatio-temporal sports data, in order to reveal potential new patterns within a tennis match. By incorporating space-time into our analysis we were able to focus on relationships between events in the match, not the individual events themselves. The results of our analysis were presented using maps. These visualizations function as a convenient and comprehensive way to display the results, as well as acting as an inventory of the spatio-temporal component of the match [6].
Expanding the scope of geospatial research in tennis, and other sports, relies on open access to reliable spatial data. At present, such data is not publicly available from the governing bodies of tennis. An integrated approach with these organizations, players, coaches and sports scientists would allow for further validation and development of geospatial analytics for tennis. The aim of this research is to evoke a new wave of geospatial analytics in the game of tennis and across other sports, and to encourage statistics published on tennis to become more time and space aware, to better improve the understanding of the game for everyone.
DAMIEN DEMAJ CONTRIBUTOR
Figure 8. The heatmap above shows Federer's frequency of shots passing through a given point on the court. The map displays stroke paths from both ends of the court, including serves. The heat map can be used to study potential anomalies in the data that may result in further analysis.
References
[1] United States Tennis Association, "Tennis Tactics, Winning Patterns of Play", Human Kinetics, 1st Edition, 1996.
[2] G. E. Parker, "Percentage Play in Tennis", in Mathematics and Sports Theme Articles, http://www.mathaware.org/mam/2010/essays/
[3] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A K-Means Clustering Algorithm", Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100-108, 1979.
[4] C. Morris, "The most important points in tennis", in Optimal Strategies in Sports, vol. 5 of Studies in Management Science and Systems, North-Holland Publishing, Amsterdam, pp. 131-140, 1977.
[5] M. Lames, "Modeling the interaction in game sports – relative phase and moving correlations", Journal of Sports Science and Medicine, vol. 5, pp. 556-560, 2006.
[6] J. Bertin, "Semiology of Graphics: Diagrams, Networks, Maps", Esri Press, 2nd Edition, 2010.
Neale Cousland / Shutterstock.com
DATA IN TIME, AN INTERVIEW WITH SIMON THOMPSON CHIEF RESEARCHER, BT
The speed at which big data has grown in the past 5 years has taken many by surprise; even those with a thorough understanding of data in general have been learning more and more, simply because of the speed at which the technology has been created. Simon Thompson, Chief Scientist, Customer Experience at BT, is one of those who has learnt a huge amount about data management and usage because of this jump forward. Despite having a very impressive background in machine learning, three patents to his name and a PhD in computer science, he has still found that his experience at BT has taught him several important lessons. I was lucky enough to catch up with Simon ahead of his presentation at the Big Data Innovation Summit in London on April 30 and May 1.
"On its own data analysis is useless but needs to be used in an overarching business context and delivered at the right time to the right person"
Despite Simon's impressive background in complex data management and innovation within it, BT has thrown up new challenges for him. For instance, given the personal data implications of much of the data sets held by BT, there are several safeguards in place that Simon must go through in order to even access the data. Simon notes that "In a large corporate like BT there are lots of processes and controls around the data sets that you interact with. That can be as typical as gaining the required permissions to access the system on which data resides, but it can also be about talking to the controller about the use of the data and handling all of the implications around the legislation, and also the ethics of handling data." There is clearly a responsible approach to data protection here, and Simon is clear that there is a definite responsibility to use the data that is held in an ethical and productive manner.
Another challenge that Simon has found is that
the complexity of data resources has increased. Despite BT not holding the same amount of data as many other companies, the aspect of their data that makes it more complex is its history. BT is a mature company and the data reflects this, in that their data evolves along with the company. Each data set is acquired for a particular purpose, but still manages to have a long history associated with it. Given this long history, it can have several additions or changes that coincide with changes within the company. This often means that simply understanding the data source is challenging, making analysis more difficult than at other companies.
The last three years have also thrown up several new challenges in the field. For instance, Simon mentions that technological advancements have changed the industry. This has led to the realisation that there is a real opportunity in terms of data processing economics. Whereas processes used to take hours or days, given technological innovations these same processes can now take minutes or even seconds. Simon mentions Peter Norvig's assertion 5 years ago that we are overwhelmed by the data available to us. This argument focussed on the fact that, given the computing power at the time, there was no way to practically analyze the amount of data available. With the technological changes that have taken place, however, the data is now more manageable. Despite this, there is still the challenge of making sure that your work in analytics can be actioned in a wider business context.
So what are the keys to this business thinking? Here Simon makes some fairly profound assertions about data analysis and its overall business use. On its own data analysis is useless; it needs to be used in an overarching business context and delivered at the right time to the right person. Even the most obvious data can have the most profound effect if it is delivered to somebody who can use it at the right time. This has never been truer than today, when companies are looking to implement data driven processes and want solid numbers and reactions to base their decisions on. The industry itself has been looking at the importance of data delivery to other departments, meaning that many of those working within these areas are beginning to use analytics themselves.
"There is a definite responsibility to use the data that is held in an ethical and productive manner"
Following on from the article in the last issue, 'The Big Data Skills Gap', I ask Simon for his thoughts on this and how he sees the industry changing in the next 10 years. Having worked in machine learning and computer science for several years, Simon sees the industry going the same way as web programming went 10 years ago, when only a select few people could write basic HTML, meaning that those who could were very much sought after. After companies realised how relatively easy it was to write programming for the web, the pool of people specialising in pure HTML and web coding diminished, as their expertise was single tracked and could be brought in
significantly cheaper. This left the web developers who had additional unique elements as well as programming, such as design or UX/UI experience gained from time within the field.
"You should always be looking well ahead in order to allow for smoother movements"
Simon sees this as the way in which data analysis will evolve: doing relatively simple things like creating Hadoop data stacks will eventually be able to be done by relatively inexperienced people. However, when it comes to advanced analytics and analysis, those who have been working within the industry will reap the rewards of their years of experience.
In terms of technological innovations within the industry, Simon also thinks that the speed at which data can be processed will increase. Focussing on Moore's law (that over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years), he thinks that we are likely to see only two or three more iterations before we move onto different systems. This could mean moving into quantum computing (as mentioned earlier in this magazine) or other processing options, like those currently being looked at by Intel and HP. This will not only speed up the analysis process but also allow a larger amount of data to be processed at once.
There is also likely to be an increased interest in the overall end to end implications of analysis, creating a fuller picture of what the analysis actually means and what effect it may have. In addition, there is likely to be a focus on real time analysis to allow companies to be far more reactive. Simon likens this to driving: you should always be looking well ahead of you in order to allow for smoother movements. If you are always looking just in front of where you are driving, the changes will have to be far more severe and dangerous. This realtime analysis will allow companies to see further down the road and make changes using more predictive processes based not on historical data, but on data that is collected and analyzed as soon as possible.
So what innovations have BT customers seen through analytics use? Simon has been instrumental in BT's social CRM operations and has implemented systems that have allowed the customer care team to filter twitter feeds. This innovative software has allowed the team to correctly identify relevant tweets from customers and respond to them. One of the data issues that existed here was that 'BT' is a common shortening of 'but', used due to the limited characters of twitter. This meant that, prior to this implementation, the effectiveness of the team was hindered by the amount of time spent filtering through potential tweets rather than answering important ones. This has helped BT customers as well as the business, and incorporates some of the most important aspects that Simon discusses: bringing genuine business improvement through delivering simple but effective information to the right person at the right time.
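To make the 'BT versus but' problem concrete, here is a purely illustrative Python sketch of the kind of filtering involved. It is not BT's actual system; the keyword list and rules are invented assumptions.

```python
import re

# Purely illustrative -- not BT's actual system. The problem described above:
# "bt" in a tweet is often just shorthand for "but", so a naive keyword match
# buries genuine customer tweets in noise.
SERVICE_TERMS = re.compile(
    r"\b(broadband|wifi|router|line|bill|engineer|outage)\b", re.I)
ABBREVIATION = re.compile(r"\bbt\b,?\s+(i|you|he|she|it|we|they)\b", re.I)


def looks_relevant(tweet):
    """Keep tweets that mention BT alongside service vocabulary; skip ones
    where 'bt' reads like the abbreviation for 'but'."""
    if ABBREVIATION.search(tweet) and not SERVICE_TERMS.search(tweet):
        return False
    return bool(re.search(r"\bBT\b", tweet, re.I)) and bool(SERVICE_TERMS.search(tweet))


tweets = [
    "My BT broadband has been down all morning, any update?",
    "wanted to go out bt it started raining",
]
print([t for t in tweets if looks_relevant(t)])   # keeps only the first
```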
GEORGE HILL CHIEF EDITOR