Analytics Innovation Issue 2

THE LEADING VOICE IN ANALYTICS INNOVATION

Cover story: Analytics in Space: The Final Frontier

NASA’s discovery of water on Mars could have huge implications for space exploration. We examine how they’re using the vast amounts of data they produce | 16


The Morality of Machine Learning Can machine learning algorithms reflect biases from Big Data? | 6

Data Mining in the Deep Web The Deep Web is too big for data scientists to ignore, but how do you mine for data if you can’t see it? | 12


NOV 2015 | #2


Gaming

ie.

Summit

LONDON

MARCH 3—4 2016

Speakers Include + 44 207 193 0569 sbutton@theiegroup.com www.theinnovationenterprise.com analytics innovation


EDITOR'S LETTER
Welcome to the 2nd Edition of the Analytics Innovation Magazine

Welcome to the second issue of Analytics Innovation. The run-up to Christmas is a particularly stressful time for organizations across all industries, though especially so for retailers, for whom the season is often make or break. Retailers are operating in an increasingly competitive sector with ever-tighter margins. Their major advantage, however, is the sheer volume of data they are collecting about their customers, from loyalty cards and the increased use of ecommerce, while interaction with retailers' own social media pages can also provide a raft of usable information. There are numerous issues around the handling of this data, as Olivia Timson discusses, with reference to potential solutions, in this magazine.

Predictive analytics enables shops to use the wealth of data they have to make better and faster decisions, particularly around supply and demand. Retailers have to predict gaps in supply chain efficiency, supplier stability, customer churn and causes of attrition in order to minimize risk. By using analytics to pinpoint unusual patterns in the data, they can recognize where issues may arise in the future and attempt to mitigate them. They can also adjust stock and storage space accordingly, helping them to cut costs. The Fall is also a massive time for 'big event' sales such as Black Friday. By studying customers' buying behaviors, predictive analytics can reveal which products have the highest margins to reduce prices on, which demographics to target, and even how to plan logistics for the event itself to avoid stampedes.

While analytics can be extremely useful to organizations, it also has its downsides, with machine learning often picking up biases that are inherent to the data it is analyzing. I explore the issue, and how it is impacting the police force, later in this magazine. In this issue, we've also interviewed a number of leading data scientists, including Jonathan Morra, Director of Data Science at eHarmony, and Jules Malin, Product Analytics Manager at GoPro, to get their take on the current issues around data analytics.

James Ovenden, Managing Editor

As always, if you have any comment on the magazine or if you want to submit an article, please contact me at jovenden@theiegroup.com

ie. Big Data for Banking Summit
NEW YORK | DECEMBER 1-2 2015
+1 415 992 5352 | sbutton@theiegroup.com | www.theinnovationenterprise.com


contents

6 | THE MORALITY OF MACHINE LEARNING
It's often said that technology is neither inherently good, nor inherently bad, but can machine learning algorithms reflect biases in Big Data?

9 | COULD ANALYTICS HAVE PREVENTED THE BANKING CRISIS?
The financial crisis sent hundreds of millions tumbling into poverty. We look at whether data analytics could have prevented it

12 | DATA MINING IN THE DEEP WEB
The Deep Web is too big for data scientists to ignore, making up more than 99% of the entire World Wide Web. But how do you mine for data you can't see?

14 | HOW FAR AWAY ARE WE FROM DATA-DRIVEN SMART CITIES?
Data-driven smart cities are said to be just around the corner, and governments have invested huge sums. But are they really that near?

16 | ANALYTICS IN SPACE: THE FINAL FRONTIER
NASA's discovery of water on Mars could have tremendous implications for space exploration. We examine how they're using the huge amounts of data they collect

19 | BIG DATA IS DEAD - LONG LIVE FAST DATA
Big Data has gone from being just a buzzword to an accepted point of fact, but is it enough to simply be 'Big', or will fast data become more important?

21 | HOW CAN FINANCE COMPANIES USE ANALYTICS FOR CYBER SECURITY?
Data breaches are wreaking havoc for finance companies. We ask if analytics could be the answer

23 | WILL DATA ANALYTICS SOLVE THE OIL CRISIS?
Oil prices are plunging. Can data save them?

26 | 'IN CONVERSATION' INTERVIEW FEATURE
We talk to some of the biggest names in analytics, including Jonathan Morra, Director of Data Science at eHarmony, about how the dating site is innovating with data. We also talk to Jules Malin, Product Analytics Manager at GoPro, about the future of data science, and The Washington Post's Director of Big Data, Eui-Hong Han, about how data is driving personalization

WRITE FOR US
Do you want to contribute to our next issue? Contact jovenden@theiegroup.com for details

managing editor james ovenden | assistant editor simon barton | creative director chelsea carpenter
contributors olivia timson, sam button, euan hunter, alexander lane, elliot pannaman, dave barton


The Morality of Machine Learning
James Ovenden, Managing Editor

In Philip K. Dick's science fiction short story Minority Report, three mutants, known as 'precogs', have precognitive abilities that enable them to see up to two weeks into the future. The precogs are strapped into machines while a computer listens to their apparent gibberish and interprets it as predictions of crimes yet to take place.

The appeal of such a system to police forces is obvious. Saving a life is clearly of greater benefit than just catching the killer. And thanks to predictive analytics and machine learning, police forces across the US are now some way towards achieving this dream. However, in Dick's story, protagonist John A. Anderton ultimately reveals the central flaw in the whole system - that once people are aware of their future, they can change it. Some are now also arguing that using machine learning as a means to predict likely criminals is flawed too, accusing such algorithms of perpetuating discrimination and structural racism in the police force. But are they right?

The lack of bias in Big Data is often cited as one of its major plus points, and is central to why it has been taken up by so many organizations seeking to rid themselves of often-flawed human intuition. The Chicago Police Department recently joined forces with Miles Wernick, Professor of Electrical Engineering at the Illinois Institute of Technology, to create a predictive algorithm that generates a 'heat list' of the 400 individuals with the highest chance of committing a violent crime.



By focusing on likely suspects, the police say that they can concentrate their scarce resources where they are needed.

Wernick's algorithm looks at a variety of factors, including arrest histories, acquaintances' arrest histories, and whether any associates have previously been the victim of a shooting. Wernick argues that the algorithm uses no 'racial, neighborhood, or other such information', and that it is 'unbiased' and 'quantitative.'

The danger, however, is that these algorithms may reflect biases inherent in that data. Machine learning is so effective as a framework for making predictions because programs learn from human-provided examples rather than explicit rules and heuristics. Data mining looks to find patterns in data, so if, as Professor of Computer Science Jeremy Kun argues, 'race is disproportionately (but not explicitly) represented in the data fed to a data-mining algorithm, the algorithm can infer race and use race indirectly to make an ultimate decision'. The defining factor of crime is poverty, and this is an issue that still disproportionately impacts black people. It could also be argued that things like arrest histories have been affected by previously existing structural racism, and that by feeding this information into an algorithm all you are doing is scaling stereotypes and reinforcing them with something seemingly unbiased. However, predictive algorithms and machine learning can be easily and quickly adapted to interpret data and leverage predictions in real time. Prejudices that people hold regarding likely criminals have far more chance of lasting longer, and far more chance of being based on aesthetic factors, from skin color to clothes.

Los Angeles, Atlanta, Santa Cruz and many other police jurisdictions use a similar predictive policing tool called PredPol, and have subsequently reported double-digit reductions in crime. In the current climate, with racial tensions at boiling point and police mistrust within the black community at an all-time high, it is important that the police be as transparent as possible. This includes keeping us as informed as possible about how they are using predictive algorithms, lest public mistrust hamper the use of what could be an incredibly beneficial tool.
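Kun's point about indirect inference can be illustrated with a toy model. The sketch below uses synthetic data and scikit-learn; every feature and number is invented for illustration, and it has nothing to do with the Chicago or PredPol systems described above.

```python
# Toy illustration (synthetic data): a model never sees the protected attribute,
# yet learns it indirectly through correlated proxy features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

group = rng.integers(0, 2, n)                  # protected attribute, never given to the model
proxy = group + rng.normal(0, 0.5, n)          # e.g. a neighborhood code correlated with group
arrests = rng.poisson(1 + group, n)            # arrest history, itself skewed by historical bias
outcome = rng.random(n) < (0.2 + 0.1 * group)  # labels reflect the same historical skew

X = np.column_stack([proxy, arrests])          # no explicit group column in the inputs
model = LogisticRegression().fit(X, outcome)

scores = model.predict_proba(X)[:, 1]
print("mean risk score, group 0:", scores[group == 0].mean())
print("mean risk score, group 1:", scores[group == 1].mean())
# The two means diverge even though 'group' was never a feature: the proxies
# carry the signal, which is the bias-scaling effect described above.
```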



ie. Business Innovation Summit
FOUR SEASONS, LAS VEGAS | JANUARY 28-29 2016
+1 415 992 5352 | sbutton@theiegroup.com | www.theinnovationenterprise.com



Could Analytics Have Prevented the Banking Crisis?
American economist Edgar Fiedler once famously said...

Euan Hunter, Industry Expert

...that if you ask five economists the same question, you'll get five different answers - six if one went to Harvard. Nowhere is this more evident than when people discuss the causes of the 2008 financial crisis. However, if you'd asked them whether a crisis was coming at the start of 2007, as the four horsemen of the financial apocalypse were appearing on the horizon, you'd probably have received a far more uniform 'no'. One of the most baffling elements of the 2008 financial crisis is just how few saw it coming. Those who warned of its likelihood were ignored, and even derided. The current Governor of the Reserve Bank of India, Raghuram Rajan, was notoriously labelled a Luddite by Larry Summers for daring to warn of the risks from financial sector managers being encouraged to 'take risks that generate severe adverse consequences with small probability but, in return, offer generous compensation the rest of the time', something which is now widely accepted. Others, such as Meredith Whitney of Meredith Whitney Advisory Group, also predicted early on just how bad things would get when the housing bubble popped.

Economic theories, while so often considered a science by practitioners, are all based on political and ethical assumptions. Predictive analytics is a tool that has really come into its own post-recession, driven by the rise of big data. It is, arguably, free from the biases that economists have, and could have been used to reinforce the soothsayers' warnings that the prosperity of the early and mid noughties was built on foundations made of sand. There are a number of things that predictive analytics could have seen that might have helped governments and organizations to at least brace for impact, if not alter the course of the ship exactly.


One of the worst affected groups during the crisis was small investors. At the time, quantitative financial analysis was available only to high-frequency traders, but a substantial drop in the cost of the technology in recent years has made it available to retail investors too, giving them far greater insight into where they should put their money and reducing the chances that they will suffer massive losses. In terms of putting an end to subprime mortgages, one of the main drivers of the crash, it is not clear how much predictive analytics could have prevented them. Theoretically, it should make it clearer which borrowers are a risk, though lenders at the time had a pretty good idea of who was a risk and lent to them anyway because it made them money. In future, regulators may be able to use predictive analytics to leverage insights that could help them see when excessive sub-prime lending is occurring, and by whom, so that they can try to minimize the damage.

Predictive analytics is an extremely useful tool. However, it is unlikely, given the amount of cash being made by those about to bring down the economy, as well as the level of ideological conviction held by people who could have prevented it, like Larry Summers and Alan Greenspan, that it would have produced warnings that would have been heeded. At the moment, Wall Street and other such industries are looking to predictive analytics in ways that would have been useful had they been around then, looking for trends and deviations that could imply a future drop in the markets. If the warnings of the past are heeded, and predictive analytics is taken notice of, it might just be able to prevent another Lehman Brothers fiasco.
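As a sense of how such deviation-spotting works in practice, here is a minimal sketch that flags unusual moves in a price series using a rolling z-score. The data is synthetic and the three-sigma threshold is an assumption for the example, not a tested trading or regulatory model.

```python
# Minimal sketch: flag unusual deviations in a market series with a rolling z-score.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 500)))  # synthetic price path
returns = prices.pct_change()

window = 30
zscore = (returns - returns.rolling(window).mean()) / returns.rolling(window).std()

# Observations sitting more than three standard deviations from their recent
# average get flagged for a closer look.
anomalies = zscore[zscore.abs() > 3]
print(anomalies)
```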


Looking for insights to drive your business forward?
channels.theinnovationenterprise.com | Inform. Inspire. Innovate.


Data Mining in the Deep Web
Dave Barton, Head of Analytics

The Deep Web has a bad reputation, mostly thanks to its association with the Dark Web. However, while the Dark Web may live within the Deep Web, there is far more there than just drugs and illegal pornography, and most of it is completely benign. The Deep Web actually includes any website that cannot be indexed by a search engine - any page that cannot be detected by the 'crawlers' used by Google and its competitors to search the web for sites to fill their results pages. It consists primarily of database-driven websites, and any part of a website that sits behind a login page. It also includes sites blocked by local webmasters, sites with special formats, and ephemeral sites. Google and other engines cannot reach these pages because their crawlers aren't programmed to fill out search forms and click the submit button; rather, a crawler must interact with the web server that's presenting the form, and send it the information that specifies the query and any other data the web server needs.

Estimates vary as to how much of the internet the Deep Web accounts for, but some top university researchers say it is more than 99% of the entire World Wide Web. There are tens of trillions of pages in the Deep Web, dwarfing the number that search engines can find, of which there are mere billions. For data scientists, the Deep Web presents a huge problem. Obviously, there are a number of difficulties inherent in mining something that's hidden, but the data held is so vast that failure to at least try would be a huge folly. There is great value held within the Deep Web too, particularly in its searchable databases.



There are thousands of high-quality, authoritative online specialty databases, and they are extremely useful for focused search. PubMed is one example: it consists of documents that focus on very specific medical conditions, authored by professional writers and published in professional journals. There is also the Tor network, which hosts much of the Dark Web, the mining of which is of obvious benefit to law enforcement agencies. However, browsers such as Tor do not use JavaScript, which is what most analytics programs need to run, making it all the harder for analytics software to mine it. To mine the Deep Web manually would be an impossible task, and there are currently a number of bots available that attempt to solve the problem. Such crawlers must be designed to automatically parse, process, and interact with form-based search interfaces that are designed primarily for human consumption. They must also provide input in the form of search queries, raising the issue of how best to equip crawlers with the necessary input values for use in constructing those queries. Stanford has built a prototype engine named the Hidden Web Exposer (HiWE), which tries to scrape the Deep Web for information using a task-specific, human-assisted approach to overcome such issues. Others that are publicly accessible include Infoplease, PubMed and the University of California's Infomine. There is also BrightPlanet's Big Data mining tool, the Deep Web Monitor, which allows you to set a specific query, such as a location or keyword, and harvest the entire web for relevant information.
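To make the form-interaction step concrete, here is a minimal sketch of a crawler submitting a query to a form-based search page and parsing the result list. The URL, field names and HTML structure are invented placeholders for illustration, not the interface of any of the tools named above.

```python
# Minimal sketch of the form-submission step a Deep Web crawler has to automate.
import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://example.org/database/search"  # hypothetical form-based search page

def query_hidden_database(term: str) -> list[str]:
    # Instead of following links, the crawler POSTs the query the way a human
    # submitting the form would, then parses the result page it gets back.
    response = requests.post(SEARCH_URL, data={"q": term, "submit": "Search"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assume results are rendered as links inside a container with id="results".
    return [link.get_text(strip=True) for link in soup.select("#results a")]

if __name__ == "__main__":
    for title in query_hidden_database("machine learning"):
        print(title)
```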



How Far Away Are We From Data-Driven Smart Cities?
There is a clamor for them, but how close are they really?

Sam Button, Analytics Evangelist

Metropolitan areas around the world are looking for new ways to be 'smart'. They are investing heavily in technology that makes them more sustainable, improves the efficiency of public services, and enhances their citizens' quality of life. It is estimated that by 2020, governments across the world will be spending in excess of $400bn a year building smart cities, incorporating the Internet of Things and Big Data on an unprecedented scale. The latest report from Navigant Research forecasts that annual smart city technology investment will amount to $27.5 billion by 2023 in the US alone, with a total cumulative investment over the next decade of nearly $175 billion.

Big Data is already being used to develop systems for waste management agencies, public transportation, law enforcement, and energy use. Sensors are being installed citywide to monitor all aspects of public life. In waste management, for example, garbage trucks are set to be alerted to the location of refuse that needs collecting. Such methods allow resources to be directed to where they are needed, cutting back on the waste of unnecessary journeys.


Sustainability is one of the key areas where big data will have a tremendous impact, particularly in helping to improve energy efficiency. Sensors attached to street lights and other outside urban furniture will measure footfall, noise levels and air pollution so that they are only used when necessary, and so that strategies can be put in place to keep them at an acceptable level. The City of Seattle recently announced that it was joining forces with Microsoft and Accenture to reduce the area's energy usage by 25%. The project will collect and analyze hundreds of data sets collected from four downtown buildings' management systems and store them on Microsoft's Azure cloud. Predictive analytics will then be used to discover what's working and what's not, for example where energy can be used less, or not at all.

$27.5 billion: estimated annual smart city technology investment in 2023
$175 billion: estimated total cumulative investment over the next decade

There are also numerous implications for public transport in a city, which could be of huge benefit to the environment and to people's convenience. Sensors in our cars will direct us towards available parking spaces. Ben Wellington's popular TED Talk went into detail about how much big data could tell us about city transportation. He looked at average taxi trip speeds to discover that they peak at 5.18am, at 24mph, then decline until 8.35am, when they level out for the rest of the day. Such discoveries may seem trivial, but they have the capability to revolutionize how city transportation is managed if leveraged correctly by municipalities.
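An analysis along Wellington's lines takes only a few lines of standard tooling. A minimal sketch, assuming a hypothetical trip-record CSV with pickup/dropoff timestamps and trip distances (the file name and column names are illustrative):

```python
# Average taxi speed by minute of day, in the spirit of the analysis described above.
import pandas as pd

trips = pd.read_csv("taxi_trips.csv", parse_dates=["pickup_datetime", "dropoff_datetime"])

duration_hours = (trips["dropoff_datetime"] - trips["pickup_datetime"]).dt.total_seconds() / 3600
trips["speed_mph"] = trips["trip_distance_miles"] / duration_hours

trips["pickup_minute"] = trips["pickup_datetime"].dt.hour * 60 + trips["pickup_datetime"].dt.minute
speed_by_minute = trips.groupby("pickup_minute")["speed_mph"].mean()

# Minute of day with the fastest average trips, and that average speed.
print(speed_by_minute.idxmax(), speed_by_minute.max())
```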

This will not happen overnight. Many have valid concerns about the ability of hackers to get into the technology and wreak havoc. There are also privacy issues around the collection of data from some of the most intimate aspects of people's lives. Even with the massive investment expected, there is still much that needs to change in order for it to work. Installing sensors across an existing city is terrifically time-consuming and expensive, and an element of selectivity is required; in a new city in China, for example, it makes sense to build sensors in from the ground up as construction proceeds. Big data is brilliant at finding correlations, but it still needs humans to establish causation, and applying analytics at the scale required to tackle the problems of a city, with its huge, complex population and billions of moving parts, could take years more. We are, however, on our way there.

One example of a city using such measures is Minneapolis. They are now using IBM's Smarter Cities analytics solution for a growing list of functions, and are even looking to target landlords who violate city codes by pulling data from multiple, formerly unlinked databases and looking at it in conjunction with citizen complaint reports.


Analytics in Space: The Final Frontier
What can data from the great unknown tell us?

James Ovenden, Managing Editor

Frank White, author of 'The Overview Effect: Space Exploration and Human Evolution', once noted that 'if fish could think at our level of intelligence, back before humanity existed, and some fish were starting to venture up on land, a lot of them would be saying, just as we do now about space: 'Why would we want to go there? What's the point?' And they'd have literally no idea of what venturing onto land was going to mean.'


When man first landed on the moon in 1969, many hoped and believed that it would herald a new era of space exploration. For people who thought that we'd be living like the Jetsons by now, it may still be unclear how it has impacted humanity. It is important to remember, though, that it was a huge evolutionary step. There is also now a vast amount of data being collected by the thousands of satellites and telescopes in operation, and it is being leveraged for insights that could help us in many ways. In 2018, a group of organizations from across the world will start constructing what will be the largest radio telescope ever built, the Square Kilometre Array (SKA). SKA will generate 700 terabytes of data per second, and when working at full capacity its aperture arrays are expected to produce 100 times more data than the entire Internet. This is to say nothing of the hundreds of terabytes of data generated by NASA's unmanned space missions every hour, and by the thousands of satellites. NASA faces a number of challenges with this data: firstly, finding where to store such huge quantities, and then how to go about making sense of it. Amazon and NASA have joined forces to produce the NASA Earth Exchange (NEX) platform. NEX is a collaboration and analytical tool that combines state-of-the-art supercomputing, Earth system modeling, workflow management and NASA remote-sensing data. It opens up the possibility of storing information in the Cloud, which data visualization tools can then make accessible to a wide range of people with different skill sets to analyze.

Another challenge is beaming the data back to Earth from space. Most current space missions use radio frequency to transfer data, which is relatively slow - roughly the same speed as a 1990s phone modem. The New Horizons spacecraft, for instance, finally reached Pluto this year after a decade of travel. It is the first space project to have sent a vehicle to explore a world so far from Earth, three billion miles away, and NASA estimates that it is going to take roughly 16 months for it to send back all the data it has been storing for the past ten years.

Is overcoming these challenges worth it? Using all of the information sent back from Pluto, NASA can create a geological history of the former planet, building finely detailed topographic maps to determine the depths and heights of Pluto's terrain, and how that terrain has been shaped over the years. And it is not just other worlds that we can learn about; satellites are also looking down on us, providing data to leverage for insights about Earth itself. The reflections of microwaves beamed at forests can show where their ecosystems are under stress, differences in gravity across the Earth's surface can be mapped to calculate how much groundwater is stored in aquifers, and sensors can also warn of issues with soil conditions.

Space is infinite, and there is always new data being sent. Analyzing this will take more than just governments; it will likely also need substantial private investment, and there is a huge opportunity for tech start-ups looking to find ways to help with it. And it's vital they do. As science fiction writer Larry Niven once said, 'the dinosaurs became extinct because they didn't have a space program. And if we become extinct because we don't have a space program, it'll serve us right!'
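To get a feel for why the downlink takes so long, a rough back-of-envelope calculation helps. The 2 kbit/s link rate below is an assumed, modem-like figure used purely for illustration, not a published mission specification.

```python
# Back-of-envelope: how long does it take to downlink data at modem-like rates?
LINK_RATE_BITS_PER_SEC = 2_000          # assumed, roughly 1990s-modem territory
GIGABYTE_BITS = 8 * 10**9

seconds_per_gb = GIGABYTE_BITS / LINK_RATE_BITS_PER_SEC
print(f"~{seconds_per_gb / 86_400:.0f} days to return a single gigabyte")  # roughly 46 days
# At that rate, a few gigabytes of stored observations already stretches into
# many months, which is consistent with the 16-month figure quoted above.
```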

Innovation Enterprise Media Pack

Need to Engage your Audience? Digital Lead Generation | Online Advertising | Marketing Services

YOU CREATE THE MESSAGE, WE DO THE REST
LEADING BRANDS CHOOSE CHANNEL VOICE
CREATING USER ENGAGEMENT ACROSS ALL MEDIA

Channel Voice: exclusive one-hour webinars to a targeted demographic that enable brands to demonstrate their thought leadership and expertise. Join an integrated, content-sharing platform, increase your company's visibility, and host your white paper on our platform. Maximize your exposure through our services: social media promotion, digital and print advertising, e-mail marketing.

ggb@theiegroup.com | +1 (415) 692 5498


Big Data is Dead - Long Live Fast Data
Elliot Pannaman, Deputy Head of Analytics

Companies have really embraced big data over the last few years. They've had to, or they've fallen by the wayside. However, while it's clearly still useful, it is no longer enough simply for big data to be 'big'. Data quality and real-time insight are equally - if not more - important.

Big Data often consists of tremendous amounts of data produced at tremendous speed. Financial ticker data is one example, as is sensor data. The Internet of Things is also set to increase the volume, variety and velocity of data in the future, as it becomes a more prominent feature across society. Events in these data streams often occur thousands to tens of thousands of times per second, requiring what has become known as 'Fast Data'.

To be of real benefit to organizations, big data has to be processed and actionable insights garnered in real time. This has been enabled by huge leaps in ‘stream processing’. Up until just a few years ago, building a stream processing system was too complex for most businesses to tackle, but thanks to innovations by firms such as Typeface, it is steadily becoming a ubiquitous tool for companies that employ big data.



Stream processing solutions are designed to handle high volumes of data in real time thanks to a scalable, highly available and fault-tolerant architecture. Live data is captured and processed as it flows into applications, in contrast to traditional database models in which it has to be stored and indexed before being processed by queries. The solutions can then power real-time dashboards and on-the-fly analytics. Stream processing also connects to external data sources, which enables applications to incorporate selected data into the application flow, or to update an external database with processed information. These solutions analyze and act on real-time streaming data using what are known as 'continuous queries'. These are SQL-type queries that operate over time and buffer windows, allowing users to receive new results as and when they become available.
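The idea behind a continuous query can be sketched in a few lines: rather than storing events and querying them later, an aggregate is maintained over a sliding window and updated as each event arrives. The snippet below is a plain-Python illustration of that pattern, not any particular vendor's API.

```python
# Maintain a rolling average over a sliding time window as events stream in.
from collections import deque
import time

def windowed_average(events, window_seconds=60):
    """Yield (timestamp, rolling average) for events inside the last window_seconds."""
    window = deque()   # (timestamp, value) pairs currently inside the window
    total = 0.0
    for timestamp, value in events:
        window.append((timestamp, value))
        total += value
        # Evict events that have slid out of the time window.
        while window and window[0][0] < timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield timestamp, total / len(window)

# Example: a synthetic stream of (timestamp, trade_price) tuples.
stream = [(time.time() + i, 100 + (i % 7)) for i in range(300)]
latest = None
for ts, avg in windowed_average(stream, window_seconds=30):
    latest = avg   # each new event immediately yields an updated aggregate
print(latest)
```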

A recent development in the stream processing industry is the invention of the 'live data mart'. The live data mart gives end users ad hoc continuous query access to streaming data that's aggregated in memory. For business-user-oriented analytics tools, this means that they can go into the data mart and see a continuously live view of streaming data.

The applications for Fast Data are many, basically aiding any business that requires data in real time. Fraud detection and cyber security is one area where it can be especially useful. Banks need to monitor machine-driven algorithms and look for suspicious patterns. Whereas before, the data needed for finding these patterns was loaded into a data warehouse (DWH) and reports were checked daily, a stream processing implementation now intercepts the data before it hits the DWH by connecting directly to the source of trading. This also helps firms abide by new regulations in the capital markets, which require businesses to understand trading patterns in real time.



How can Finance Companies use Analytics for Cyber Security?

Olivia Timson, Analytics Thought Leader

Many are referring to the Ashley Madison leak as one of the worst acts of cyber crime to have ever taken place, with devastating consequences on a personal level for nearly everyone involved. It even has the potential to destabilize entire nation states, with people in positions of power on the list open to blackmail - either by those seeking money or by security services looking to leverage the information for intelligence. The leak has already caused at least two suicides and will likely lead to multiple divorces and broken families.

For the finance industry, cyber crime on this scale is nothing new. The main motivations for hackers are money, protest, and simply proving that they can do it, and banks fit the brief on all these fronts. The volume of attacks and the devastating consequences they can have mean that their systems must always be at the cutting edge; however, despite heavy investment, they have still often been found lacking. John F. Kennedy once said that 'if anyone is crazy enough to want to kill a president of the United States, he can do it.'


Which, other than being oddly sexist, also applies to today's cyber threat landscape. There is no way, in the current climate, that companies can prevent determined adversaries from getting into their systems. According to an FBI official quoted in USA Today, more than 500 million records have been stolen from US financial institutions over the past year as a result of cyber attacks, with the average consolidated total cost of a data breach now $3.8 million according to IBM - up 23% on 2013.

Big data analytics could, however, provide a solution for finance companies. Brendan Hannigan, General Manager at IBM Security, has claimed that, 'with the rate, pace and sophistication of cyber-attacks continuing to grow exponentially, security has become a big data problem. Real-time analytics are required as the foundation of today's security strategy'. The International Institute for Analytics (IIA), meanwhile, has predicted that big data analytics tools are set to become the first line of defense - bringing together machine learning, text mining and ontology modeling to provide holistic and integrated security threat prediction, detection, deterrence and prevention programs. Securely authenticating who is coming into the network is a primary issue facing companies' IT security, as is the identification of anomalies that occur in the network in real time. Once in, though, hackers can spend months at a time in companies' systems undetected, with perimeter-based defenses often to blame. Government cybersecurity professionals estimate that cyber threats exist on government networks for an average of 16 days. According to the 'Go Big Security' report, underwritten by Splunk, 61% of government cybersecurity professionals say they could better detect a breach already in progress using big data and analytics, 51% say they could improve their monitoring of data streams in real time, and 49% say they could conduct a conclusive root-cause analysis following a breach.

The increasing reliance on big data for dealing with threats is being recognized, and firms like Rapid7, a provider of security analytics software and services, are getting heavy investment from VCs. FBR Capital Markets has predicted a 20% increase in 'next-generation cybersecurity spending' in 2015, and financial organizations looking to defend their data should be looking to adopt it if they want to stay ahead of those trying to infiltrate their systems.
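The real-time anomaly identification described above is commonly approached with unsupervised outlier detection. Below is a minimal sketch using synthetic session data and scikit-learn's Isolation Forest; the features, figures and contamination setting are illustrative assumptions, not any institution's production setup.

```python
# Flag network sessions that look unlike the historical baseline.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Synthetic "normal" sessions: bytes transferred, session length, failed logins.
baseline = np.column_stack([
    rng.normal(5_000, 1_500, 5_000),  # bytes
    rng.normal(300, 90, 5_000),       # seconds
    rng.poisson(0.2, 5_000),          # failed login attempts
])
model = IsolationForest(contamination=0.01, random_state=0).fit(baseline)

# A new batch of sessions arrives; flag the ones that deviate from the baseline.
new_sessions = np.array([
    [5_200, 280, 0],     # ordinary
    [95_000, 20, 14],    # large, short transfer with many failed logins
])
print(model.predict(new_sessions))  # +1 = looks normal, -1 = anomalous
```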


Will Data Analytics Solve the Oil Crisis?
Alexander Lane, International Events Director

Fracking, while highly controversial, has had an undeniably huge impact on the oil industry. United States domestic production has nearly doubled over the last six years, causing traditional big oil producers like Saudi Arabia to look elsewhere for new markets to export their wares to. This is easier said than done, though. The economies of Europe and developing countries are weakening, and vehicles are becoming increasingly energy-efficient. This drop-off in demand is likely to continue, as companies invest huge sums in developing electric cars and governments attempt to wean their countries off fossil fuels. BP posted a $6.3 billion loss in the most recent quarter, and other firms such as ConocoPhillips are laying off staff in huge numbers. Oil companies, used to having things their own way, are now having to cut costs and manage their resources more efficiently than ever before. As part of the effort to do this, a report by Lux Research has revealed that many are utilizing data analytics - using smarter sensors and big data to manage risks, cut costs by increasing efficiency, and increase revenues.

BP, for one, has embraced big data in response to its losses. The oil behemoth has been besieged by scandal in recent years, and is also liable for massive compensation payouts as a result of the Deepwater Horizon spill. It has now joined forces with GE, adopting its Predix software to get 650 wells connected. Each well will dump around half a million data points every 15 seconds into GE's software for analysis, which can then be leveraged for a variety of purposes, especially optimizing equipment efficiency. Hardware sensors also provide data backups for high-value measurements of equipment-performance data.
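For a sense of scale, a quick back-of-envelope calculation on the figures quoted above (assuming every well actually reports at that cadence):

```python
# Rough arithmetic for the ingest rate implied by the figures above (illustrative only).
wells = 650
points_per_well_per_interval = 500_000
interval_seconds = 15

points_per_second = wells * points_per_well_per_interval / interval_seconds
print(f"{points_per_second:,.0f} data points per second")        # ~21.7 million
print(f"{points_per_second * 86_400:,.0f} data points per day")  # ~1.9 trillion
```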


BP also has a substantial amount of data power in-house, owning what it calls the world's largest supercomputer for commercial research. The computer has 2.2 petaflops of computing power and handles enough data to 'fill 30 miles worth of 1 gigabyte memory sticks lined up end to end.' Data analytics also enables the Internet of Things, which is becoming particularly important in the oil industry, helping to reduce the number of staff needed and removing many from extremely dangerous roles. Automation adds substantially to the upstream value chain of exploration, development, and production. But some of the biggest opportunities are in production operations, for example in reducing unplanned downtime. Given the oil and gas industry's substantial increases in upstream capital investment, optimizing production efficiency is essential. Automation also helps to maximize asset and well integrity, increase field recovery, and improve oil throughput.

Embracing data analytics does not completely remove the need for humans from decision making on oil fields. Creating a digital oilfield doesn’t just need algorithms. Companies must create a harmony between the two and use that harmony to their advantage, adjusting their recruitment policy and helping adapt the skills of those staff already in their roles, to develop a greater understanding of the data and how to use it.


Turn over for our Special 'In Conversation' Feature with some of the Speakers

ie. Predictive Analytics Innovation Summit
MARRIOTT MARQUIS, SAN DIEGO | FEBRUARY 18-19 2016
+1 415 992 5352 | sbutton@theiegroup.com | www.theinnovationenterprise.com


IN CONVERSATION

Ahead of the Predictive Analytics Innovation Summit in San Diego, we sat down with three of the speakers to discuss their thoughts on the future of data science

Name: Jonathan Morra
Position: Director, Data Science
Company: eHarmony

How did you get into data science?

I have a Ph.D. in Biomedical Engineering. For my dissertation work, I studied subcortical segmentation of brain MRIs using machine learning methods. This was before data science was a term used as ubiquitously as it is today, so I thought of myself as a computer scientist with machine learning knowledge who focused on the medical domain. After noticing a huge increase in the number of uses of machine learning in other domains, I got excited to join a company that could help me fill in the gaps in my data science knowledge, including grid computing. Fortunately for me, eHarmony proved to be a place where I could learn and experiment very quickly and translate my engineering and machine learning background into a modern definition of a data scientist.

How does eHarmony approach using data science? It's clearly central to how you match couples, but is it used across all facets of your organization?

We are currently seeing data science creep into many different facets of eHarmony's business. A machine learning approach was adopted for optimizing our matching system to predict communication, but it has seen growth in other areas as well, including match delivery, fraud prevention, email marketing, churn prediction, and our new product, Elevated Careers.

What do you feel your greatest challenge has been at eHarmony? Does working with concepts such as attraction present a new challenge compared to your previous roles?

Some of our greatest challenges are moulding legacy systems to gather and produce data which can provide us new insights today. eHarmony has been around since 2000, and some of the systems weren't designed with post hoc data analysis in mind. We've had to work hard to make sure our data is available when we need it and properly validated. As far as attraction is concerned, I don't think this particular concept is that different from other concepts. We are interested in matching people such that we maximize the total communication amongst all matches made. Attraction is certainly one element of our matching algorithms, but it doesn't receive any special treatment. However, because of our domain, our job is very difficult, because people's preferences for their romantic relationships vary so much. This is why one of our big pushes for this year is personal matchmaking, whereby we include each individual user ID in the model. This will allow us to hone in on everyone's specific desires and not just the optimal global trend.


27 You’ve had over 11,000 marriages as a result of people meeting on eHarmony Australia since its launch in 2007. Do you think that data can better judge who you are likely to be attracted to? That’s a tricky question because our data has shown that whom you are attracted to and whom you’ll have a long lasting relationship with are not necessarily the same person. Because of this divergence in needs we use two different systems for matching. Our compatibility system creates pairings (potential matches) for long term compatibility. These are models made by our team of psychologists who have studied marriages and are used to predict long term marriage satisfaction. Only those pairs whom are psychologically compatible are then sent to our affinity scoring system to predict their probability of communication. We use two way communication as a signal for mutual attractiveness. Therefore our matching system uses long term success as a gateway, and if you meet someone on eHarmony and get married, your marriage should be happier than a strong majority of marriages in the wild.

Do you currently apply deep learning to images to get an idea of who people may be physically attracted to? Do you think this is a direction the industry could go in as the technology evolves? What else do you see as being a game changer for the use of data in the industry, and for data science in general?

We do ingest information from images when doing affinity matching. I will talk about some of the things that we are using, but we attempt to extract information about users' faces, including hair color, eye color, and facial hair. Judging attractiveness based on images in general is very hard and very subjective. We have done it in the past and found limited success. Using extracted features, though, has proven successful. I think image analysis is currently making great strides with all the work on deep learning, and I think that definitely has a place at eHarmony. I think the next big step forward is a more unified data model. eHarmony is very good at extracting psychological information from individuals, but we are not good at assessing other characteristics such as musical taste, food preferences, or career ambitions. If we could partner with other data sources, we could get a unified user representation that is much deeper than what we currently have and create even more satisfying matches in both the short and long terms.

What will you be discussing at the summit?

At the summit I'll be going over our data science framework at eHarmony, focusing on both our data translation layer, using a newly open-sourced project called Aloha (eharmony.github.io/aloha), and how this leads to the various models we are currently using. I'll then speak in depth about some of those models.

IN CONVERSATION

Name: Jules Malin
Position: Product Analytics Manager
Company: GoPro

How did you get started in data science?

My journey into the world of data science began in 2012, when I started working on my Master's degree in Predictive Analytics at Northwestern University. Since then it has really been the combination of new job roles, work projects, the Master's program, MOOCs, and meet-ups that has enabled me to develop the skills and explore the complex world of data science. There is so much to learn and so many applications that no single program, course or job can fully prepare you for work in advanced analytics or data science. A Master's or PhD in data science, for example, is just the warm-up act to a long journey of learning and discovery.

What are the unique challenges of working at GoPro?

A few of the challenges I'm facing include keeping up with the exponential growth of our data and implementing the tools that can handle performing analytics at scale. For example, an Impala query that works today and runs in 5 minutes may take 5 hours the following week because of the massive data growth. Therefore, it becomes imperative to think exponentially in terms of data growth, data diversity and the tools that are capable of keeping up with that exponential growth and diversity of data. I've tested and broken more tools at GoPro than at all the other companies I've worked for combined. And that's a good thing.

How much has predictive analytics changed since you've been working in the field?

Predictive analytics has gone from an interesting set of tools and techniques used by insurance companies and data scientists working in obscurity in basements, to being mainstream and an imperative for all companies across all industries that want to compete on analytics, or compete period. In the next 10 years, companies that are not in the process of operationalizing advanced analytics today, including predictive analytics, will be out of business or on their way out. The change that I've observed is not in the predictive analytics models and techniques themselves but in the sense of urgency to apply them to solve business problems, to answer business questions, and to enable strategic and tactical data-informed decision-making.

What do you see as the next game changer in data science and predictive analytics?

My fellow colleagues and I are observing two things: a dramatic shift in the adoption and implementation of massively scalable and flexible data architecture technologies, and the development and adoption of advanced AI-based technologies like IBM's Watson and self-driving car tech. So I would say AI and deep learning being applied to business applications is a game changer. They will also unlock hidden value and enable currently hidden business models no one has thought of today, like Uber did.

What will you be discussing at the summit?

My presentation will cover the type and characteristics of data architecture and tools that are necessary to enable analytics on massive data sets. In addition, I will cover how the analytics provided by the data architecture and tools can drive and enable data-informed decision making across an entire organization, with a focus on how it can enable informed product design and engineering decisions.

IN CONVERSATION

Name: Eui-Hong (Sam) Han
Position: Director, Big Data and Personalization
Company: The Washington Post

How did you get started in data science?

When I was working at IBM in the early 1990s, one of the key future areas that IBM wanted to focus on was data mining. When I decided to pursue a PhD program in Computer Science, I chose data mining as my research focus. I was a founding member of the Data Mining Research Group at the University of Minnesota with Professors Vipin Kumar and George Karypis. That's how I started my work in data science.

Do you feel the Post has an advantage over other papers in terms of how it uses data because of Amazon's influence?

While we are a completely separate company from Amazon, our engineering team at The Post certainly has the kind of engineering DNA you might expect to see at a technology company. Data drives our decision making. We are experts at A/B testing, rapid prototyping and more. We do believe this gives us a unique advantage when it comes to giving users the best possible experience, and keeping them coming back for more. The Post's unique visitors increased 65% year-over-year in March, according to comScore, and pageviews were up 96%.

How much influence does data have on the kind of content that you put out on various platforms? I read that you recently changed your video platform, moving away from longer videos to more social-media-appropriate lengths. How much was data an influence on that decision?

The Post is committed to understanding user behaviors and the makeup of its audience, and to associating measures and KPIs that allow us to quantify and track that over time. This data gives us powerful insight into how the site is performing, which we use to hypothesize on improvements and new features, define success criteria for those enhancements, and ultimately test. The decision to focus more on shorter-length videos was one example of how we used KPIs to make a decision.

How much is data helping to personalize content for the site, and for the app? Is this the direction the industry is going?

We use a wide variety of data points to personalize content for the site, and have seen a 3-4 times better click-through rate in the article recommendation modules. I think that personalization technology is an essential part of news media in a networked society, where news outlets are finding their audience instead of the audience finding news outlets.


And finally, what will you be discussing at the summit?

Improving the user experience by predicting article virality. The Washington Post newsroom publishes more than 1,000 pieces of content a day. What if it was possible to predict which stories, out of all that content, were likely to go viral? Editors could target those stories to add photos, videos, links to related content and more, in order to more deeply engage the new and occasional readers clicking through to a popular story. I will explore how the media organization is approaching predicting virality for the benefit of its readers, and will discuss the features, algorithms and evaluation methods of a tool currently in development.
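The general shape of such a tool, predicting virality from early engagement signals, might look something like the sketch below. The features, labels and model choice are assumptions made for illustration, not The Washington Post's system.

```python
# Train a classifier to predict virality from synthetic early-engagement signals.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 5_000

shares_first_hour = rng.poisson(20, n)     # shares shortly after publication
homepage_ctr = rng.beta(2, 50, n)          # click-through rate from the homepage
social_referrals = rng.poisson(15, n)      # visits arriving from social networks

X = np.column_stack([shares_first_hour, homepage_ctr, social_referrals])
# Synthetic label: strong early sharing plus social referral marks a story "viral".
viral = (shares_first_hour + social_referrals + rng.normal(0, 5, n)) > 45

X_train, X_test, y_train, y_test = train_test_split(X, viral, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```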

Sound Interesting? Hear Jonathan, Jules and Eui-Hong talk about these topics and a lot more at the Predictive Analytics Innovation Summit in San Diego on February 18 & 19 2016

Contribute to our Channels channels.theinnovationenterprise.com


For more great content, go to ieOnDemand

Over 4,000 hours of interactive on-demand video content. View today's presentations and so much more. All attendees will receive trial access to ieOnDemand. Listen. Watch. Learn. Stay on the cutting edge with innovative, convenient content, updated regularly with the latest ideas from the sharpest minds in your industry.

www.ieondemand.com | +1 (415) 692 5514 | sforeman@theiegroup.com

