CONTENTS
The Big Interview - We spoke to Ashok Srivastava about the impact of big data at NASA and on the wider business community, his presentation in Boston and data breakthroughs.
Bridging the Gap - The big data skills gap is growing. How can the industry and educators approach and overcome the problem?
Obama’s Big Data Win - Following Obama’s recent election victory, we discuss how big data and data mining were vital to his success.
A Chat with Drew Linzer - Drew Linzer speaks to us about his use of data and analytics to predict the election result four months before it happened.
LETTER FROM THE EDITOR
Welcome to this issue of Big Data Innovation. As we approach the holidays I thought I would create a present for all of you who are making big data and analytics one of the most exciting industries to be involved with.
We hope you enjoy the content of this issue. We have a fantastic interview with Ashok Srivastava, following on from the success of his impressive presentation in Boston and ahead of his highly anticipated presentation in San Francisco. In addition, we discuss how to bridge the ever-growing data skills gap within the industry. Drew Linzer also talks to us about his algorithms for the 2012 election, which saw him correctly predict the winning margin four months before the polls even opened.
Video: Big Data Innovation, Boston 2012 Highlights
We hope you enjoy this edition of Big Data Innovation as much as we enjoyed making it.
GEORGE HILL CHIEF EDITOR
WANT TO SEE YOURSELF HERE?
We have several advertising opportunities available to cater to all budgets and can offer bespoke packages to suit you and your business.
Contact Pip at pcurtis@theiegroup.com for more information.
HAVE AN IDEA?
What would you like to see our writers covering, or do you have any innovative big data articles that you want to contribute?
Contact George at ghill@theiegroup.com for more information.
WHAT’S BIG IN BIG DATA?
Welcome to this edition of Big Data Innovation, where we will be exploring all things big data. In this issue we talk to the chief scientist at NASA, discuss ways to bridge the big data skills gap and look at how Obama utilized big data during his election campaign.
Gartner Predicts Big Things
Gartner has predicted that Big Data will have driven $28 billion of IT spend through 2012. Currently the biggest spend on these solutions is not actually in their use but in the adaptation of existing systems to handle it. It also coincides with companies looking to utilize social media and wanting a way of tracking how users are interacting with their brand.
Obama’s Big Data Success
The recent election win by President Obama has been attributed to his thorough use of Big Data to target potential voters and make sure his core voters turned out. A data team five times the size of that in 2008 shows that Big Data has become commonplace in successful election campaigns, and bodes well for the future of the industry as a whole.
Data Expected to Create 6 Million New Jobs
According to a new study by the Gartner group, Big Data is expected to create around 4.4 million jobs in the tech industry and, as a result, around 6 million more across other industries. Around 1.9 million of these tech jobs are likely to be in the US, with the others spread out globally.
BIG DATA WORLD TOUR @IE_BigData: Big Data Set to Explode as 40 Billion New Devices Connect to Internet http://onforb.es/VQBo7p
Gil Press @GilPress: New #DataScience institute promises to stimulate research NY economy http://smrt.io/SrrCIN
What name would you give to a group of people working in Big Data?
Best Answers
Data Whisperers
Rock stars
Data Divas
Data Detectives
After a while, you can call them visually impaired! Mining all that data sure is hard on the eyes!
Follow Us at:
@IE Big Data
@iegroup
More Business Insight through Better and Easier Data Analysis
Combine Big Data from multiple sources and put it at your fingertips using our interactive Trillion-Row Spreadsheet to get better answers in seconds.
Join hundreds of blue-chip companies who have found an easy way to get more insight out of more data. For well over a decade our Cloud-based platform has pushed the limits of analytics on large amounts of data. 1010data enables enterprises to combine, analyze, and share any number of huge data sets across corporate boundaries, so that new insights can inform strategy like never before. From routine reporting to advanced analytics, the 1010data system allows businesses like yours to easily hone tactics and strategy, while reducing technology overhead, costs, and risk.
Learn more about 1010data, download our Gartner reports here.
www.1010data.com
THE BIG DATA SKILLS GAP
Big Data is one of the fastest growing industries in the world and companies are clamoring to get involved and reap the rewards of this new analytics revolution. The technique has many advocates, from President Obama, who famously leveraged it during his last two election campaigns, to Warren Buffett, who made his first ever high tech investment in IBM thanks to its new focus on big data.
The issue this new industry currently faces is not that people fail to understand the end results, but that there is a real lack of qualified and experienced people to use the software and carry out the analysis that produces them.
Gartner recently claimed that Big Data will create around 4.4 million IT jobs globally and is likely to surpass $3.7 trillion in company spend in 2013. These are hugely significant numbers in the context of the global economy, especially in countries that are still feeling the pinch from the recession. The disappointing aspect of the report is that it also forecasts a significant shortage of qualified professionals to fill these positions, meaning that the investment may not reach its full potential. According to the report, only one third of these IT jobs are likely to be filled, and the proportion of analyst roles filled is likely to be smaller still.
So why, with this huge potential for global economic improvement, are we falling short of these targets? There are two main reasons for this.
The current public and private schooling systems are not adequately preparing graduates
Although universities are increasing their offerings in these areas, there is still a significant gap in the number of courses teaching the level of skills required to go straight from the classroom to the office. The best professors are people who have worked within the industry and know the shortfalls that graduates will have and the ways to overcome them. It is great being able to build models and collect data but, in reality, without the analytical skills to back these up they are not enough.
Big Data could create 4.4 million IT jobs globally
We can expect a 40-60 per cent projected annual growth in the volume of data generated
Although analytics has been around for a while, big data has taken it to a whole new level and graduates need a different set of skills to be truly effective. The industry has grown out of analytics, meaning that the people who are using big data have only been doing it in its current form for five years at most. It is an exciting time to work in the area, so why would practitioners leave the profession to teach?
The industry is moving too quickly
Would you have heard of Hadoop five years ago? Perhaps, if you were at the crest of the wave, you would have heard the name, and you may even have had some practical experience of it. What is certain is that you would not have had the kind of knowledge needed to create a university course on it. People who are graduating right now would therefore not have had a course designed around the technologies they are likely to be using. It is the equivalent of starting a social media marketing course in 2009, learning everything there is to know about MySpace and graduating to use Facebook and Twitter.
So what are the solutions?
This question has been posed on several blogs and LinkedIn groups and hundreds of people have made their points. Three major points have come out of this:
Improve Education
You may not know the ins and outs of Hadoop and you may not know your quants from your Java, but if you come out of education with the basics of what you need to master these then you are likely to be in a good position to excel in the future.
Placements and Patience
The likelihood is that when people graduate they know the books, the words and the authors but have little knowledge of what is behind this knowledge. Employers need to have the patience to teach and nurture rather than expecting results straight away. Due to the speed at which the industry is moving, graduates should be seen as clay that needs to be moulded. Employing people based on potential rather than on what they have on their CV is going to be vital. Just as the best carpenters and bricklayers learn their trade on the building site, the best analysts will learn their trade through being surrounded by experts and numbers.
Big Data will drive $3.7 trillion in IT spend in 2013
Passion
Einstein did not go to the best university, he was heavily dyslexic and could barely string together a sentence, but he is now known as one of the most influential scientists in history. The reason for this is not purely down to talent and intelligence but also his passion for the subject and his perseverance. If we can find people who have a passion for analytics and a genuine drive to succeed then they should be accepted with open arms.
CHRIS TOWERS BIG DATA LEADER
A CHAT WITH DREW LINZER
Drew Linzer is the analyst who predicted the results of the election four months in advance. His algorithms even correctly predicted the exact number of electoral votes and the winning margin for Obama. Drew has documented his analytical processes in his blog, votamatic.org, which details the algorithms used, their selection and the results. He also appeared on multiple national and international news programs as a result of his predictions.
I caught up with him to discuss not only his analytical techniques but also his opinions on what was the biggest data driven election ever.
George: What kind of reaction has there been to your predictions?
Drew: Most of the reaction has focused on the difference in accuracy between those of us who studied the public opinion polls, and the "gut feeling" predictions of popular pundits and commentators. On Election Day, data analysts like me, Nate Silver (New York Times FiveThirtyEight blog), Simon Jackman (Stanford University and Huffington Post), and Sam Wang (Princeton Election Consortium) all placed Obama's reelection chances at over 90%, and correctly foresaw 332 electoral votes for Obama as the most likely outcome. Meanwhile, pundits such as Karl Rove, George Will, and Steve Forbes said Romney was going to win -- and in some cases, easily. This has led to talk of a "victory for the quants" which I'm hopeful will carry through to future elections.
George: How do you evaluate the algorithm used in your predictions?
Drew: My forecasting model estimated the state vote outcomes and the final electoral vote, on every day of the campaign, starting in June. I wanted the assessment of these forecasts to be as fair and objective as possible -- and not leave me any wiggle room if they were wrong. So, about a month before the election, I posted on my website a set of eight evaluation criteria I would use once the results were known. As it turned out, the model worked perfectly. It predicted over the summer that Obama would win all of his 2008 states minus Indiana and North Carolina, and barely budged from that prediction even after support for Obama inched upward in September, then dipped after the first presidential debate.
George: The amount of data used throughout this campaign, both by independent analysts and campaign teams, has been huge. What kind of implications does this have for data usage in 2016?
Drew: The 2012 campaign proved that multiple, diverse sources of quantitative information could be managed, trusted, and applied successfully towards a variety of ends. We outsiders were able to predict the election outcome far in advance. Inside the campaigns, there were enormous strides made in voter targeting, opinion tracking, fundraising, and voter turnout. Now that we know these methods can work, I think there's no going back. I expect reporters and campaign commentators to take survey aggregation much more seriously in 2016. And although Obama and the Democrats currently appear to hold an advantage in campaign technology, I would be surprised if the Republicans didn't quickly catch up.
George: Do you think that the success of this data driven campaign means that campaign managers now need to be analysts as well as strategists?
Drew: The campaign managers may not need to be analysts themselves, but they should have a greater appreciation for how data and technology can be harnessed to their advantage. Campaigns have always used survey research to formulate strategy and measure voter sentiment. But now there are a range of other powerful tools available: social networking websites, voter databases, mobile smartphones, and email marketing, to name only a few. And that is in addition to recent advances in polling methodologies and statistical opinion modeling. There is a lot of innovation happening in American campaign politics right now.
George: You managed to predict the election results four months beforehand. What do you think is the realistic maximum timeframe to accurately predict a result using your analytics techniques?
Drew: About four or five months is about as far back as the science lets us go right now; and that's even pushing it a bit. Prior to that, the polls just aren't sufficiently informative about the eventual outcome: too many people are either undecided or haven't started paying attention to the campaign. The historical economic and political factors that have been shown to correlate with election outcomes also start to lose their predictive power once we get beyond the roughly 4-5 month range. Fortunately, that still gives the campaigns plenty of time to plot strategy and make decisions about how to allocate their resources.
If you are interested in hearing more from Drew you can check out his blog at votamatic.org or attend his presentation at Big Data Las Vegas on January 30 & 31.
GEORGE HILL CHIEF EDITOR
WANT TO GET INVOLVED?
Do you want to write for Big Data Innovation? We are always looking for new innovative articles or new ideas for our writers to cover. Contact us at ghill@theiegroup.com for more information.
Do you want to have your company represented in the magazine? We have multiple advertisement opportunities to cater to all budgets. Contact Pip at pcurtis@theiegroup.com for more information.
Partnership Opportunities: Pip Curtis | pcurtis@theiegroup.com | +1415 992 5349
Attendee Invitation: Sean Foreman | sforeman@theiegroup.com | +1415 692 5514
OBAMA’S BIG DATA ELECTION WIN
In October 2012 US political commentators were claiming that the election race between Obama and Romney was going to be one of the tightest contests in living memory. Republicans were claiming a landslide victory and Democrats were constantly looking over their shoulders wondering how things had got this close. In July 2012 Drew Linzer had predicted a 332-206 win for Obama. When asked if he had changed his views after Romney’s gaffes and Obama’s poor first debate showing, he did not budge on these numbers, and he turned out to be totally correct on the morning of November 7.
This shows the power that algorithms and numbers have had in recent elections. Many attributed George Bush’s two consecutive election wins to the power of evangelists; few can doubt the power of data in Obama’s.
Before the true campaigning had even begun, the Obama campaign team led by Jim Messina had amassed a data team five times the size of that in 2008 and was using data to fundraise in a way that had not been done before. Emails went out inviting people to enter a competition to win a dinner with Sarah Jessica Parker. When people compared the emails they realised that there were several different variants, and these variants produced a much higher engagement rate with individuals, increasing the amount of money given.
In 2008 much was made of the Obama team’s attempts to utilize data mining within their campaign. However, the data mining drew on multiple databases with little or no interaction with one another. The first thing that the new data team built was a database pulling all of this information into a single usable source.
This source allowed the team to hold up to 80 separate information points for each individual, meaning that the previous groupings by gender, religion or ethnic group became far less relevant, as with these pieces of information the team could drill down and pinpoint individual needs. The level of information available to Obama’s team allowed significant reductions in advertising spend whilst also increasing interaction with their audience. Traditional TV advertisements before and after local news programs were not reaching all of their audience, so the team placed ads between shows popular with swing voters and the correct demographics. The information for all of this was available through the new integrated database.
Not only could they use this data to analyse what they were doing, but it also helped them to build algorithms to predict the likely actions of particular individuals. The use of predictive analytics is the main reason that Drew Linzer managed to predict the outcome of the election 5 months prior to the actual result, and with the campaign team running 66,000 scenarios each night, they had significant insight. Long gone were the old school ways of running campaigns based on assumptions and gut feelings; Obama’s team rarely ran with an idea without numbers and models to back up their actions. This led to strategic targeting and resource allocation which saw significant cuts in the number of ineffective man hours.
Obama’s data team was five times larger than in 2008
spirit of america / Shutterstock.com
This campaign was the first to have a dedicated Chief Data Scientist, Rayid Ghani, who managed all of the data and the data scientist team. This meant that objectives and priorities were well directed towards particular outcomes, further increasing the efficiency of labour assignment.
There were significant breakthroughs that allowed the campaign to target people in a way that had never been done before. For instance, through complex analytics in Chicago, it was found that people who signed up for the campaign’s Quick Donate programme were likely to donate around 4 times more than donors giving through traditional means.
14% improvement in ad effectiveness
One of the key metrics collected and measured was the ‘persuadability’ of voters. These could be potential supporters who needed persuading to vote, swing voters to be persuaded to support Obama, or supporters who could be persuaded to make either a financial or voluntary contribution. This metric alone saw a 14% increase in the effectiveness of targeted advertising. Aiming specific messages at these people in areas where they were more likely to interact with them meant that the effectiveness of the message was increased significantly.
The difference between the two parties was best described by Mike Lynch in his article in Computerworld:
“For those who had the stamina to watch the election campaign unfold over 22 long months, it became not just a battle of ideologies and campaign issues, but also a rivalry between old media pundits and new media analysts”
The understanding of new media and its analytical implications becomes clear when comparing the use of social media in both campaigns. Taking YouTube as an example, Obama had 240,000 subscribers and 246 million pageviews compared to Romney’s 23,700 subscribers and 26 million pageviews. This difference in social media interaction not only meant that Obama’s message was going out to roughly 10 times more people through this medium, but also that his team were getting roughly 10 times more data about their audience. This further improved the personalised messages and so further increased the number of responders.
So although the media speculated endlessly about the number of voters and the effect of gaffes, fluffs and poor debating, in reality the outcome of the election had been predicted 5 months beforehand thanks to the kind of analytics that ended up winning Obama the election.
So where will things go from here? With the well publicized use of analytics within this campaign there will be an inevitable trend towards using them more thoroughly in 2016 and beyond. Obama had the advantage of not having to go through the primaries this time around, meaning that his data team had significantly longer to build the algorithms and databases needed to piece together a complex data driven campaign. The Republicans had 7 months, meaning that they had little time to prepare to the same extent as the Democrats, but in four years’ time, when there will be primaries for both parties, there will be the potential for a truly data driven race.
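As a quick check on the YouTube figures quoted earlier in this article, the short sketch below works out how many times larger Obama’s reach was than Romney’s on each measure. It uses only the four numbers given in the text; the function and variable names are our own, chosen purely for illustration, and this is simple arithmetic rather than anything the campaign teams are known to have run.

```python
# YouTube figures as quoted in the article (no other data is assumed).
obama = {"subscribers": 240_000, "pageviews": 246_000_000}
romney = {"subscribers": 23_700, "pageviews": 26_000_000}


def reach_ratio(a, b):
    """Return how many times larger each metric in `a` is than in `b`."""
    return {metric: a[metric] / b[metric] for metric in a}


ratios = reach_ratio(obama, romney)
for metric, ratio in ratios.items():
    # One decimal place is enough to compare against the "10 times" claim.
    print(f"{metric}: {ratio:.1f}x")
```

On these numbers the subscriber ratio comes out at about 10.1 and the pageview ratio at about 9.5, so "10 times" is a fair round figure for both.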
The question will be how this will affect the campaigns overall, and what kinds of information will be available to government agencies in four years’ time, with new legislation on data privacy consistently being passed. The main lesson we can learn from this campaign is that big data is here and, based on this evidence, will be around for a while.
DAVID BARTON ANALYTICS LEADER
Snapshot Agenda
Big Data Innovation Summit
Day 1, San Francisco, April 11 & 12, 2013
Please check back regularly to see the latest additions to the Big Data Innovation Summit. Please note that this is a snapshot agenda and there are more confirmed speakers; please see http://analytics.theiegroup.com/bigdata-sanfrancisco/speakers for a full speaker list. Agenda snapshots will be released every 10 days.
The Big Data Innovation Summit is the largest gathering of Fortune 500 business executives leading Big Data initiatives. There are four tracks included at the Big Data Innovation Summit:
• Cross Industry
• Healthcare
• Finance
• Government
• Hadoop
• Data Visualization
Time: 08.30 | 09.00 | 09.30 | 10.00 | 10.30
Track: Big Data Innovation
Data Science & Analytics, Facebook
Team Lead, Mobile Data Science, LinkedIn
Data Analytics Science, BP
Senior Computer Engineer, General Electric
Track: Big Data in Healthcare
Chief Technology Officer, Dept. of Health
Chief Medical Officer, GE Healthcare
Chief Data Officer, Seattle Childrens
Sr. Director, Clinical Outcomes & Analytics, Walgreens
Track: Big Data in Finance
Chief Data Officer, NYSE
Director, Decision Science, Barclaycard
VP, IT & Risk Analytics, Deutsche Bank
VP, Digital Strategy, Wells Fargo
Track: Big Data in Government
Division Director, Dept. of Defense
Chief of Information, Dept. of Commerce
Principal Scientist, NASA
Director, Database Development, San Fran Police Dept.
Hadoop
The Hadoop Event will focus on best practices for Hadoop, including implementations, business problems and solutions with real case studies, methods, and linking Hadoop to unlocking the value of Big Data. Join speakers from Facebook, LinkedIn & more for two days of interactive sessions and presentations. More details coming soon...
Data Visualization
The Data Visualization Summit brings together leaders in Data Viz to explain and clarify the numerous benefits of using data visualization. One of the top benefits is that data is easier to understand and more accessible in this format, so that people can better interact with the data and analyze it.
New Speakers Released Soon
Subject to change
For speaking opportunities please contact Chris Towers on ctowers@theiegroup.com or +1 415 992 5339
For sponsorship opportunities please contact Pip Curtis on pcurtis@theiegroup.com or +1 415 992 5359
To attend as a delegate please contact Robert Shanley on rshanley@theiegroup.com or +1 415 992 7605
NASA’S BIG DATA: THE INTERVIEW
I spoke to Ashok Srivastava about NASA, big data and his presentation at the Big Data Innovation Summit in San Francisco in April
On a sunny September morning in Boston this year, Ashok Srivastava was waiting to stand at the podium and present to a room of 600 people at the Big Data Innovation Summit, the largest dedicated big data conference in the world. Giving his perspectives on the growth of big data, its uses in aviation safety and how his employer, NASA, has utilised and innovated through its use, Ashok came out as one of the most popular speakers at the summit.
After the success of the presentation and the summit in general, I was lucky enough to sit down with Ashok to discuss the way that big data has changed within NASA and the success that it has had in the wider business community.
So why has big data come to prominence in the last three or four years?
Ashok argues that this is not a change that has taken place solely over the past three or four years. It is a reaction to society in general becoming more data driven. The last 25 years have seen people increasingly needing data to either make or back up decisions.
“This is a change in culture that’s happened not in the last 3 or 4 years but probably over the last 25 years”
Recent advancements in technology and the ability of people to analyze large data sets have meant that there has been an acceleration in the speed at which this happens. With new types of databases and the ability to record and analyze data quickly, the levels of technology required have been reached.
NASA has been at the forefront of technology innovation for the past 50 years, bringing us everything from the modern computer to instant coffee. Ashok explains how NASA is still innovating today and, with the huge amounts of data that they are consuming, what is happening in big data there is going to affect businesses.
For instance, NASA are currently discussing the use of their big data algorithms and systems with companies ranging from medical specialists to CPG organizations. The work that they have done within data in the past few years has created the foundations allowing many companies to become successful.
Video: Predictive Analytics Through Innovations in Big Data, Ashok Srivastava, September 2012
One of the issues that is really affecting companies looking to adopt big data is the current gap in skilled big data professionals. The way to solve this, in Ashok’s opinion, is through a different set of teaching parameters. The training for these professionals should revolve around machine learning and optimization, allowing people to learn the “trade of big data”, meaning that they can learn how systems work from the basics upwards and so have full insight when analyzing.
Given the relative youth of big data, I wanted to know what Ashok thought would happen with big data at NASA, as well as in the wider business community, in the next 10 years.
NASA in ten years will be dealing with a huge amount of data, on a scale that is currently unimaginable. This could include things like full global observations as well as universe observations, gathering and analyzing petabytes of information within seconds.
With public money being spent on these big data projects, Ashok makes it clear that the key benefit should always boil down to ultimately providing value for the public. This is a refreshing view of NASA, who have traditionally been seen as secretive due to the highly confidential nature of their operations and the lack of public understanding.
Ashok also had some pieces of advice for people currently looking to make waves in the big data world:
“It is important to understand the business problem that is being solved”
“Making sure the technologies that are being deployed are scalable and efficient”
Ashok will be presenting at the 2013 Big Data Innovation Summit in San Francisco on April 11 & 12.
GEORGE HILL CHIEF EDITOR