Big Data Innovation, Issue 11

Page 1

1


2

Editor’s Letter

Welcome to this issue of Big Data Innovation. Science has always used data as a way of finding new innovations and scientific discoveries, but with the widespread use of Big Data technologies, it has been possible to utilize more data and speed up the pace of discovery. This issue is dedicated to the use of Big Data in science and the ways in which it is helping to solve the world's most elusive problems, as well as allowing scientists to investigate things in a way that was never possible before.

We will be looking at the ways it is used in oceanography, life sciences and even the ways it is being put to use in finding the cure for cancer. Before, we have had articles about how data is being used on an industrial scale to help with breaking genetic codes and helping with animal conservation projects. We wanted to take a more in-depth look at the ways in which scientists are utilizing the new technologies available to them.

Regular readers may have noticed that the gap between this issue and the previous one has been longer than usual. The reason for this is that, as of this edition, we are moving to a bi-monthly timetable. This gives us a better opportunity to create better content for the magazines, as well as to develop new avenues for the magazine to work in.

We are pleased to announce that we are going to be launching a brand new website for this magazine as well as its partner publications. This is going to give us the opportunity to create more Big Data stories for our readers. It will allow us to quickly find and post news, views and features rather than needing to wait until the magazine is released.

As always, if you like the magazine please share it. The magazine will always be a hub for Big Data knowledge, and getting as many people as possible interested in data is the ultimate goal.

If you are interested in contributing, please get in contact.

George Hill
Managing Editor

For advertising opportunities please contact Hannah at hsturgess@theiegroup.com


Managing Editor: George Hill
Assistant Editors: Simon Barton, Chris Towers
President: Josie King
Art Director: Gavin Bailey
Advertising: Hannah Sturgess
hsturgess@theiegroup.com

Contributors: Baiju NT, Christopher Hebert, Heather James, Gil Press
General Enquiries: ghill@theiegroup.com


Contents

5 We talk to Anirudh Todi about his work with open source software at Twitter

10 Heather James talks to John Hogue, Data Scientist at General Mills, ahead of his presentation in Boston

14 We talk to Jack Norris, CMO of MapR, about their recent $110 million funding

17 Gil Press gives us the lowdown on the history of the Internet of Things

24 Christopher Hebert from TechnologyAdvice asks 'how much data do you need to invest in a data warehouse?'

28 As in-memory becomes the go-to tech for many, we speak to expert Stephen Dillon from Schneider Electric

33 How is Big Data being used in oceanography? Chris Towers looks at how deep-dive data is helping underwater

37 Simon Barton investigates how Big Data is changing the way that people are combating cancer

40 Baiju NT gives us an insight into how life science companies are utilizing Big Data

GOT SOMETHING TO SAY?

We are always looking for new contributors. If you have an interesting idea or a passion for a subject, please contact George at ghill@theiegroup.com



News #PAChicago


Predictive Analytics Innovation Summit "Achieve Actionable Data Insight"

November 12 & 13, Chicago, 2014

For more information contact Euan Hunter +1 (415) 992 5510 ehunter@theiegroup.com theinnovationenterprise.com/summits theinnovationenterprise.com/summits/predictive-analytics-summit-chicago-2014


Twitter Interview

Interview With Anirudh Todi, Software Engineer, Twitter
George Hill, Managing Editor


In a recent article by Ashlee Vance in Businessweek, it was claimed that because Google has moved past Hadoop and is offering Cloud Dataflow to its customers, including pipelines that incorporate both batch and stream-processing capabilities, it is essentially making Hadoop redundant.

The new Cloud Dataflow software incorporates the technology utilized by Google, arguably the largest data consumer in the world. The majority of this has been built on open source software. The fact that this kind of analytical power is going to be offered to their customers is a major breakthrough, but they are certainly not the first company to utilize internal processes and systems initially built on open source platforms. To get a more in-depth knowledge of this, I spoke to Anirudh Todi, Software Engineer at Twitter, who was instrumental in the creation of the new and much discussed TSAR system that has been implemented recently at the social network.

Anirudh described to me how this new software works: 'TSAR is used at Twitter to count billions of events per day and has been built from the ground up, almost entirely on open-source technologies (Storm, Summingbird, Kafka, Aurora, and others).'

The importance of open source software on this project cannot be overstated; as Anirudh points out, the platform was created almost exclusively on it. Having worked on this project with its core based around open source software, I wanted to know what Anirudh's opinion was on its influence in the rise of Big Data and the popularity of data-led company initiatives.

'In my opinion, Open Source Software has been crucial to the rise of Big Data. In the last two decades and beyond, its adoption has been one of the most significant cultural developments in the software industry, and has shown that individuals, working together over the Internet, can create products that rival and sometimes beat proprietary ones.'

It has clearly been one of the most important aspects in terms of the speed with which Big Data has spread, not only in data's importance but also the speed with which it has been adopted. Going from almost unknown to a vital business function within five years has been made possible by the collaboration fostered by open source technologies.
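To give a concrete feel for the kind of work TSAR does (counting typed events into time buckets), here is a deliberately simplified, single-machine sketch in plain Python. It is not the TSAR, Summingbird or Storm API: the event fields are invented, and a real deployment would distribute this aggregation across a streaming cluster.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative only: a toy time-series event counter. TSAR itself runs on
# Storm/Summingbird/Kafka at a vastly larger scale; the fields are hypothetical.

def minute_bucket(ts: float) -> str:
    """Truncate a Unix timestamp to its minute, the aggregation key."""
    return datetime.utcfromtimestamp(ts).strftime("%Y-%m-%d %H:%M")

def count_events(stream):
    """Aggregate (event_type, minute) -> count over an event stream."""
    counts = defaultdict(int)
    for event in stream:  # each event is a dict: {"type": ..., "ts": ...}
        counts[(event["type"], minute_bucket(event["ts"]))] += 1
    return counts

# Example usage with a few made-up events
events = [
    {"type": "tweet_impression", "ts": 1409875200.0},
    {"type": "tweet_impression", "ts": 1409875230.5},
    {"type": "favorite", "ts": 1409875260.0},
]
for key, n in count_events(events).items():
    print(key, n)
```

The interesting engineering in a system like TSAR is not the counting itself but making this aggregation fault-tolerant and horizontally scalable, which is exactly what the open source components listed above provide.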



Anirudh describes the impact that this open source approach has had on the ways that companies have approached these kinds of projects: 'It has shown how companies can become more innovative, more nimble and more cost-effective by building on the efforts of community work'. Much of his work on open source platforms has taken place at both Twitter and Facebook, where Anirudh has worked in the past. As these are widely acknowledged as being two of the most data-focussed companies, I wanted to know what set them apart from others and why they have been so successful in their data projects. 'Facebook and Twitter are two companies that are squarely focussed on providing the best user experience that they possibly can. There is an intense focus on collecting as much data as possible and then building an entire ecosystem around it, making it possible to analyze the petabytes of data and transform it into useful information. It is about more than simply monetising data; it is about utilizing the data to make themselves better.'

Specifically with Twitter, I was curious about how scaling occurred, given the steep upward curve in the amount of data produced and collected by the company. Anirudh said that there are two main aspects to the scalability challenge: 'Building infrastructure to be able to seamlessly collect and store so much data' and 'Building tools to make it easy to analyze this data and produce useful information that can then be used to drive Twitter products'. Twitter has managed these first through the creation of 'Manhattan', a next-generation distributed database, and Summingbird, which allows for real-time distribution and writing streaming MapReduce programs. These programmes have allowed Anirudh and the team at Twitter to effectively manage, analyze and collect data. The systems they have created already, and the future iterations of these, represent Twitter's bright future in data management.



Before we finished the interview I was curious to hear from Anirudh about his thoughts on the future of data as a whole:

'Data in aggregate is growing so fast that everything about how we think about data now is going to change radically in the next ten years. If you're a software engineer or work in technology in any way, this should sound like an opportunity. Everything from hardware to networking to database technology to presentation layer is already changing rapidly to allow us more efficient access to data that will let us live and work better'.

'For the last five years or so, there has been an intense focus on Big Data analytics and I only see it increasing in the years to come'.

Anirudh will be speaking at Big Data Innovation, Boston, in September.


IMPACT INSIGHT INFORMATION Customer Intelligence Solutions

Customer Experience Management Analytics Personalized Lifecycle Marketing Analytics

Phone: +1 858 312 1075 | Email: enquiries@bridgei2i.com

© 2014 BRIDGEi2i Analytics Solutions | www.bridgei2i.com



Connect With Decision Makers Through The Innovation Enterprise E-Newsletters

Digital Magazines Email Marketing

Webinars

On-Demand White Papers

Reach a targeted, localised and engaged community of decision makers through our customisable suite of online marketing services.

For further information CONTACT: +1 415 992 7502 US +44 207 193 1346 UK

hsturgess@theiegroup.com

@IE_Hannah



John Hogue Interview

Interview With John Hogue, Data Scientist, General Mills
Heather James, Big Data Leader



Ahead of his presentation at the Big Data Innovation Summit in Boston, we spoke to John Hogue, Data Scientist at General Mills about how he sees the future of data in business and society.

John has a wide array of interests within data and analytics. He currently works as a Data Scientist at one of the world's largest consumer goods companies, General Mills, following several years at US Bank. It is clear that despite both companies working within different sectors and to different rules, they have some key similarities. According to John, 'both roles highlighted the need for companies to look for new, more effective ways to engage customers while keeping costs low'.

Big Data has gone beyond the experimentation phase and into the useable business tool stage. It is not just about finding new products or innovations but, as John points out, 'The advent of mobile devices has given companies like US Bank and General Mills ways to individually influence customers at transaction time. Look at a recipe; get a coupon. Make a banking transaction; get a recommendation for a banking service. Relevant recommendations are key to building brand loyalty'. Therefore, the future of Big Data may well fall into brand loyalty and putting the right message in front of the right person at the right time.

John also points out how the effective use of data is not about understanding all of it, but about being able to sort it and understand the right parts: 'With the deluge of data that companies are storing, I see an increased need to sift and understand which pieces are significant to drive business decisions. I also see an increased need for technically savvy people to critically think, debate and interpret the quantitative insights they get from Big Data as the barriers to create them become lower and lower'.

With this increased need to sift through data comes the importance of effective and safe storage for this information, much of which will be personal and financial data that needs to be properly stored and safely monitored. 'I also see a need for transparency as consumers need to understand what data will be used for, who it will be shared with and how long it will be retained. I think we sit at a point where there is a growing concern by consumers that their information will be used against them by corporations or their government'.


John has a strong opinion on the ways in which companies store their data and the implications that the NSA scandal has had on this important corporate role. John holds a belief that many within the industry share, which is that they trust the companies who store the data more than the government. This is not new within Big Data, but it has gained added relevance since the NSA scandal, where gagging orders and secret data requests have allowed governments to seemingly access any data they want. 'I am gravely concerned with the NSA mass surveillance revelations. Information is not secure if companies that hold the keys are secretly being forced to hand them over to government agencies without a warrant. I worry that with court gag orders we will never know what is happening until someone comes forward and says 'this is wrong'. The Constitution is already in place to protect us from this; instead we need to enforce it and hold those accountable for breaching it'.

Although the truth is that there are several issues surrounding data security, one of which is the NSA revelations, there is the belief that companies may also be able to abuse the data held on their customers. Even from seemingly innocuous data, companies can uniquely identify somebody. John points out that companies do walk a thin line when utilizing data, and that perhaps government legislation is needed in order to make sure that there isn't an abuse of personal data within corporations. 'I am less worried about corporations misusing personal data; however the danger is there. Latanya Sweeney's working paper found that 87% of the people in the United States were uniquely identified by their birthdate, 5-digit zip and gender alone', John says. 'By singling out individuals, companies walk a razor-thin line between helping customers and discriminating against them. Unfortunately, I think there is little to do to prevent this from happening and instead government legislation should be in place to severely punish companies found to be employing discriminatory practices'.
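The 87% figure is a statement about how often one combination of quasi-identifiers matches exactly one person. As a rough illustration only (this is not Sweeney's methodology or data; the handful of records below are invented), here is how that kind of uniqueness could be measured over a table of records:

```python
import pandas as pd

# Illustrative only: measuring quasi-identifier uniqueness on made-up records.
people = pd.DataFrame({
    "birthdate": ["1980-01-02", "1980-01-02", "1975-06-30", "1990-11-11"],
    "zip5":      ["02139",      "02139",      "55414",      "94103"],
    "gender":    ["F",          "F",          "F",          "M"],
})

quasi_identifiers = ["birthdate", "zip5", "gender"]

# Size of the group each record belongs to; a record is "uniquely identified"
# if its (birthdate, zip, gender) combination occurs exactly once.
group_sizes = people.groupby(quasi_identifiers)["birthdate"].transform("size")
unique_fraction = (group_sizes == 1).mean()

print(f"{unique_fraction:.0%} of records are unique on {quasi_identifiers}")
```

Run over census-scale data, the same group-size calculation is what turns three seemingly harmless fields into a near-unique fingerprint.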



Aside from his work at General Mills, John has also used data in humanitarian and disaster response scenarios. He has worked on mapping earthquake data, something that in time could help to predict earthquakes. Although this may be thousands of years in the future, it is the baby steps taken now that will set the foundations to make this a possibility.

'After speaking with some of my geology friends about my work on researching earthquakes, I've come to the conclusion that the patterns that I'm looking for may span millennia. Perhaps in a few hundred thousand years we might have enough data to predict earthquakes, or perhaps we are just missing some key attributes to analyze. However, I do think that Big Data can be used to great effect to model weather systems and their potential impact. Likewise, I think that Big Data may be able to help detect and mitigate outbreaks of infectious diseases by listening for signals in the noise of humanity and responding to them appropriately. Google Flu Trends is a proof of concept of this, although it is far from perfect'.

Talking to John about the future of data in society is a reminder that although we think we know quite a bit about how to use data, in reality we are in its infancy, and in future it will become something much bigger. John's work is currently setting the bar for the future of data, be it in his professional capacity with General Mills or in his work on earthquake data; either way, it is setting the standard for what we need to know about data in the coming years.

#DATAPHARMA

Big Data & Analytics for Pharma Summit "Driving Business Success Through Big Data"

November 5 & 6, Philadelphia, 2014



MapR Funding

Interview With Jack Norris, CMO, MapR
George Hill, Managing Editor



Big Data Innovation and Innovation Enterprise are always happy when we hear about the good fortunes of our long-term partners. So we were glad to hear the news that MapR Technologies, who we have worked with for a long time, have recently closed a $110 million round of funding. An added bonus to the money is that one of the main investors is Google Capital, stemming from the most data-driven company in the world, Google. I spoke to Jack Norris, Chief Marketing Officer at MapR, about the funding and what it represents for MapR.

Given the data-centric elements of MapR, being given such a ringing endorsement by Google is particularly pleasing for the company, as it shows an appreciation for their work from the data community. 'Google has been the inspiration behind a lot of Big Data technologies. Google became Google because of their backend architecture', said Jack. It is very much a nod not only to the importance of the internet giant, but also to M.C. Srivas, the co-founder of MapR. Srivas had been instrumental in the early data architecture of Google, helping them to dominate the search space before founding MapR.

Srivas then took this understanding of data and the importance of backend architecture to MapR, where, according to Jack, one of the most important aspects was a redesign of the storage layer, creating the groundwork for the MapR systems that we know today. Talking about the cycle at Google, Jack says 'If you look at the innovation system, they went from Google file system to map reduce framework, big table and then did Dremel. It is quite an innovation path'. It is similar to the innovation path that Srivas has set for the company. Having a founder who understands the way that Google works has given the company a similar approach, with the core product at the centre of the company ethos, allowing users to do what they need to whilst also allowing for quick and easy scaling. The expansion of Google and the way they innovate within their use of data is also close, something that is likely to have played a major part in their decision to invest.

Jack also told me why he believed that Google invested so heavily in MapR, and about the process that the investors went through in the due diligence process.


One of the key components of this, aside from the financials that every due diligence process goes through, is access to customers to ask about the product and their thoughts on the company as a whole. After talking to customers, Google found that a significant number had driven business success through the use of the system and the ways in which it allowed their data to be processed.

However, the question that everybody wants answered is: what are they going to be spending the money on?

According to Jack, 'The fund really allows us to accelerate our growth'. It is going to allow MapR to make real strides forward in their international expansion, offering services across more than the 11 countries in which they currently operate. Investments will take place across the company, from speeding up innovations to the operational costs that will come with the increased global expansion.

A point that Jack was keen to make was the development of their 'MapR DB', a technology that allows them to process data in real time. This creates an environment for reduced writing times and instant saves, rather than waiting for data to be written in the traditional sense. This is similar to many of the data innovations used by Google, and one of the key reasons that saw them become the dominant search engine within two years of inception.

An aspect that I was particularly interested in was the investment in open source software. As a company using Hadoop, it was great to hear that they are not turning their back on open source, but instead investing heavily in it. The next iteration of Hadoop (2.2) is in the works and Apache Drill, which is soon to be released, have been two areas in particular where Jack and the people at MapR have been interested.

What this investment really communicates, though, is not only that MapR is making the right noises, but that big companies have faith in Big Data. Companies like Google, who have one of the strongest pedigrees in data, clearly still believe in the power of data across the globe and have put their money where their mouth is. From what we know of our work with MapR, as well as our other partners, they have clearly put their money in safe hands.

Come see MapR at Big Data Innovation in Boston, on September 25 & 26.


History Of IoT

A Very Short History Of The Internet Of Things
Gil Press, Managing Partner, gPress


There have been visions of smart, communicating objects even before the global computer network was launched forty-five years ago. As the Internet has grown to link all signs of intelligence (i.e. software) around the world, a number of other terms associated with the idea and practice of connecting everything to everything have made their appearance, including machine-to-machine (M2M), Radio Frequency Identification (RFID), context-aware computing, wearables, ubiquitous computing, and the Web of Things. Here are a few milestones in the evolution of the mashing of the physical with the digital.

1932 Jay B. Nash writes in Spectatoritis: 'Within our grasp is the leisure of the Greek citizen, made possible by our mechanical slaves, which far outnumber his twelve to fifteen per free man… As we step into a room, at the touch of a button a dozen light our way. Another slave sits twenty-four hours a day at our thermostat, regulating the heat of our home. Another sits night and day at our automatic refrigerator. They start our car; run our motors; shine our shoes; and cut our hair. They practically eliminate time and space by their very fleetness.'

January 13, 1946 The 2-Way Wrist Radio, worn as a wristwatch by Dick Tracy and members of the police force, makes its first appearance and becomes one of the comic strip's most recognizable icons.

1949 The bar code is conceived when 27 year-old Norman Joseph Woodland draws four lines in the sand on a Miami beach. Woodland, who later became an IBM engineer, received (with Bernard Silver) the first patent for a linear bar code in 1952. More than twenty years later, another IBMer, George Laurer, was one of those primarily responsible for refining the idea for use by supermarkets.

1955 Edward O. Thorp conceives of the first wearable computer, a cigarette pack-sized analog device, used for the sole purpose of predicting roulette wheels. Developed further with the help of Claude Shannon, it was tested in Las Vegas in the summer of 1961, but its existence was revealed only in 1966.

October 4, 1960 Morton Heilig receives a patent for the first-ever head-mounted display.

1967 Hubert Upton invents an analog wearable computer with an eyeglass-mounted display to aid lip reading.


October 29, 1969 The first message is sent over the ARPANET, the predecessor of the Internet.

January 23, 1973 Mario Cardullo receives the first patent for a passive, read-write RFID tag.

June 26, 1974 A Universal Product Code (UPC) label is used to ring up purchases at a supermarket for the first time.

1977 CC Collins develops an aid for the blind, a five-pound wearable with a head-mounted camera that converted images into a tactile grid on a vest.

Early 1980s Members of the Carnegie-Mellon Computer Science department install micro-switches in the Coke vending machine and connect them to the PDP-10 departmental computer so they could see on their computer terminals how many bottles were present in the machine and whether they were cold or not.

1981 While still in high school, Steve Mann develops a backpack-mounted 'wearable personal computer-imaging system and lighting kit.'

1990 Olivetti develops an active badge system, using infrared signals to communicate a person's location.

September 1991 Xerox PARC's Mark Weiser publishes 'The Computer in the 21st Century' in Scientific American, using the terms 'ubiquitous computing' and 'embodied virtuality' to describe his vision of how 'specialized elements of hardware and software, connected by wires, radio waves and infrared, will be so ubiquitous that no one will notice their presence.'

1993 MIT's Thad Starner starts using a specially-rigged computer and heads-up display as a wearable.

1993 Columbia University's Steven Feiner, Blair MacIntyre, and Dorée Seligmann develop KARMA (Knowledge-based Augmented Reality for Maintenance Assistance). KARMA overlaid wireframe schematics and maintenance instructions on top of whatever was being repaired.

1994 Xerox EuroPARC's Mik Lamming and Mike Flynn demonstrate the Forget-Me-Not, a wearable device that communicates via wireless transmitters and records interactions with people and devices, storing the information in a database.

1994 Steve Mann develops a wearable wireless webcam, considered the first example of lifelogging.

September 1994 The term 'context-aware' is first used by B.N. Schilit and M.M. Theimer in 'Disseminating active map information to mobile hosts,' Network, Vol. 8, Issue 5.

1995 Siemens sets up a dedicated department inside its mobile phones business unit to develop and launch a GSM data module called 'M1' for machine-to-machine (M2M) industrial applications, enabling machines to communicate over wireless networks. The first M1 module was used for point of sale (POS) terminals, in-vehicle telematics, remote monitoring, and tracking and tracing applications.

December 1995 MIT's Nicholas Negroponte and Neil Gershenfeld write in 'Wearable Computing' in Wired: 'For hardware and software to comfortably follow you around, they must merge into softwear… The difference in time between loony ideas and shipped products is shrinking so fast that it's now, oh, about a week.'

October 13-14, 1997 Carnegie-Mellon, MIT, and Georgia Tech co-host the first IEEE International Symposium on Wearable Computers, in Cambridge, MA.

1999 The Auto-ID (for Automatic Identification) Center is established at MIT. Sanjay Sarma, David Brock and Kevin Ashton turned RFID into a networking technology by linking objects to the Internet through the RFID tag.

1999 Neil Gershenfeld writes in When Things Start to Think: 'Beyond seeking to make computers ubiquitous, we should try to make them unobtrusive…. For all the coverage of the growth of the Internet and the World Wide Web, a far bigger change is coming as the number of things using the Net dwarf the number of people. The real promise of connecting computers is to free people, by embedding the means to solve problems in the things around us.'

January 1, 2001 David Brock, co-director of MIT's Auto-ID Center, writes in a white paper titled 'The Electronic Product Code (EPC): A Naming Scheme for Physical Objects': 'For over twenty-five years, the Universal Product Code (UPC or 'bar code') has helped streamline retail checkout and inventory processes… To take advantage of [the Internet's] infrastructure, we propose a new object identification scheme, the Electronic Product Code (EPC), which uniquely identifies objects and facilitates tracking throughout the product life cycle.'


March 18, 2002 Chana Schoenberger and Bruce Upbin publish 'The Internet of Things' in Forbes. They quote Kevin Ashton of MIT's Auto-ID Center: 'We need an internet for things, a standardized way for computers to understand the real world.'

April 2002 Jim Waldo writes in 'Virtual Organizations, Pervasive Computing, and an Infrastructure for Networking at the Edge,' in the journal Information Systems Frontiers: '…the Internet is becoming the communication fabric for devices to talk to services, which in turn talk to other services. Humans are quickly becoming a minority on the Internet, and the majority stakeholders are computational entities that are interacting with other computational entities without human intervention.'

June 2002 Glover Ferguson, chief scientist for Accenture, writes in 'Have Your Objects Call My Objects' in the Harvard Business Review: 'It's no exaggeration to say that a tiny tag may one day transform your own business. And that day may not be very far off.'

January 2003 Bernard Traversat et al. publish 'Project JXTA-C: Enabling a Web of Things' in HICSS '03 Proceedings of the 36th Annual Hawaii International Conference on System Sciences. They write: 'The open-source Project JXTA was initiated a year ago to specify a standard set of protocols for ad hoc, pervasive, peer-to-peer computing as a foundation of the upcoming Web of Things.'

October 2003 Sean Dodson writes in the Guardian: 'Last month, a controversial network to connect many of the millions of tags that are already in the world (and the billions more on their way) was launched at the McCormick Place conference centre on the banks of Lake Michigan. Roughly 1,000 delegates from across the worlds of retail, technology and academia gathered for the launch of the electronic product code (EPC) network. Their aim was to replace the global barcode with a universal system that can provide a unique number for every object in the world. Some have already started calling this network 'the internet of things'.'

August 2004 Science-fiction writer Bruce Sterling introduces the concept of 'Spime' at SIGGRAPH, describing it as 'a neologism for an imaginary object that is still speculative. A Spime also has a kind of person who makes it and uses it, and that kind of person is somebody called a 'Wrangler.' … The most important thing to know about Spimes is that they are precisely located in space and time. They have histories. They are recorded, tracked, inventoried, and always associated with a story… In the future, an object's life begins on a graphics screen. It is born digital. Its design specs accompany it throughout its life. It is inseparable from that original digital blueprint, which rules the material world. This object is going to tell you – if you ask – everything that an expert would tell you about it. Because it WANTS you to become an expert.'

September 2004 G. Lawton writes in 'Machine-to-machine technology gears up for growth' in Computer: 'There are many more machines—defined as things with mechanical, electrical, or electronic properties—in the world than people. And a growing number of machines are networked… M2M is based on the idea that a machine has more value when it is networked and that the network becomes more valuable as more machines are connected.'

October 2004 Neil Gershenfeld, Raffi Krikorian and Danny Cohen write in 'The Internet of Things' in Scientific American: 'Giving everyday objects the ability to connect to a data network would have a range of benefits: making it easier for homeowners to configure their lights and switches, reducing the cost and complexity of building construction, assisting with home health care. Many alternative standards currently compete to do just that—a situation reminiscent of the early days of the Internet, when computers and networks came in multiple incompatible types.'

October 25, 2004 Robert Weisman writes in the Boston Globe: 'The ultimate vision, hatched in university laboratories at MIT and Berkeley in the 1990s, is an 'Internet of things' linking tens of thousands of sensor mesh networks. They'll monitor the cargo in shipping containers, the air ducts in hotels, the fish in refrigerated trucks, and the lighting and heating in homes and industrial plants. But the nascent sensor industry faces a number of obstacles, including the need for a networking standard that can encompass its diverse applications, competition from other wireless standards, security jitters over the transmitting of corporate data, and some of the same privacy concerns that have dogged other emerging technologies.'

2005 A team of faculty members at the Interaction Design Institute Ivrea (IDII) in Ivrea, Italy, develops Arduino, a cheap and easy-to-use single-board microcontroller, for their students to use in developing interactive projects. Adrian McEwen and Hakim Cassamally write in Designing the Internet of Things: 'Combined with an extension of the wiring software environment, it made a huge impact on the world of physical computing.'

November 2005 The International Telecommunications Union publishes the 7th in its series of reports on the Internet, titled 'The Internet of Things.'

June 22, 2009 Kevin Ashton writes in 'That 'Internet of Things' Thing' in RFID Journal: 'I could be wrong, but I'm fairly sure the phrase 'Internet of Things' started life as the title of a presentation I made at Procter & Gamble (P&G) in 1999. Linking the new idea of RFID in P&G's supply chain to the then-red-hot topic of the Internet was more than just a good way to get executive attention. It summed up an important insight—one that 10 years later, after the Internet of Things has become the title of everything from an article in Scientific American to the name of a European Union conference, is still often misunderstood.'



Data Warehousing

How Much Data Do You Need to Invest In A Data Warehouse?
Christopher Hebert, technologyadvice.com



As companies collect more data and leave it stored in their source locations, whether it's a CRM, ERP, or POS, they may reach a point where they want to consolidate all that data into a single, consistently structured location.

The question, then, is at what point is a full-blown data warehousing solution necessary?

The primary function of a data warehouse is to bring data from multiple sources into one system, either to be analyzed in that centralized location or to be transferred more efficiently between systems. The amount of data that warrants a data warehouse is less dependent on the sheer volume of data in giga-, tera-, or petabytes than on the number of sources that need to be integrated and managed.

Data Warehouse as a Central Repository

You may want to start consolidating that data in a data warehouse for several reasons, including security concerns, backup, and business intelligence (BI). With security concerns, bringing all of the reporting functionality of your data into one centralized location such as a data warehouse allows for greater control over who has access to what information. Data warehouses typically log the actions taken by users, which allows you to keep a closer eye on users who may, when given access to a less monitored system, perform ill-intended queries. If the security demands of monitoring user access across various systems are spreading your information security team too thin, a data warehouse can help.

Data warehouses can also serve as a centralized backup tool for your company's disparate systems. By design, all modifications, additions, and removals in a data warehouse can be recorded and backed up per your specifications. This is beneficial when the backup systems provided by your various data sources are either insufficient or too difficult to keep up with individually. In short, if the backup process for your data sources is fractured and difficult to maintain, collecting all your data in one central location allows for complete backup in one location.

If new business intelligence efforts are on your horizon, it may be easier for the sake of consistent analysis to bring all your data into one format under one roof.


This may also open up options when selecting business intelligence software, as it's typically easier to integrate your BI software with one source than with multiple, disparate sources. Many popular data warehousing tools also have prebuilt connectors that allow for easy integration with media services and CRM platforms. If a BI tool you're looking to use does not integrate with your various data sources, it may be cheaper and more efficient in the long run to bring all data into one reservoir, so that at most only one integration is necessary, and you can change your sources without changing your BI tool.

You don't need a data warehouse as a central repository if the number of systems between which data needs to be aggregated is a mere handful, and the quantity of data in them is only tens of gigabytes. In this case it may be more appropriate to have a simple database with the data from each system stored in tables. The volume of data is small enough to be analyzed in a single database (or perhaps even a spreadsheet), and the transfer of data into that system can be done manually without the large investment of a data warehouse.

Data Warehouse as a Transactional Platform

Some data warehouses exist not only for the purpose of having all data in one location, but also to ease the transfer of data between various business systems. The Extract, Transform, Load (ETL) process of a data warehouse means that once the data from one system is extracted, transformed into a normalized format and structure, then loaded into the data warehouse, it is consistent with all other data of its kind. This consistency allows data from different sources to be exported together out into another system, as the sketch below illustrates.

For example, patient records from an EHR may enter a data warehouse via ETL, and then be exported from the data warehouse to an accounting software package. This would be beneficial if there are multiple EHR systems in use. If one of those EHR systems were switched out for a new one (which research suggests happens often among medical organizations), then only the schema for transforming data from that EHR to the data warehouse would have to be redone. The schema for transforming data for export to the accounting software would not be affected.

A data warehouse as a transactional platform is unnecessary if there are few enough systems that the transfer of data between systems can be reasonably done either manually or with automated processes developed by analysts. Such solutions are less capital intensive than the implementation of a data warehouse.
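To make the ETL idea above concrete, here is a deliberately minimal sketch in Python, using SQLite as a stand-in for the warehouse. The field names, source format and target schema are invented for illustration; a production warehouse load would rely on dedicated ETL tooling and a far richer schema.

```python
import sqlite3

# Illustrative only: a toy extract-transform-load step into an in-memory SQLite
# "warehouse". The source records and schema below are hypothetical.

def extract(source_rows):
    """Extract: pull raw records from a source system (here, a list of dicts)."""
    return list(source_rows)

def transform(raw):
    """Transform: normalize each record into the warehouse's common structure."""
    return [
        (r["patient_id"], r["visit_date"].strip(), float(r["charge_usd"]))
        for r in raw
    ]

def load(rows, conn):
    """Load: write the normalized rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visits (patient_id TEXT, visit_date TEXT, charge_usd REAL)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    source = [  # e.g. records exported from one EHR system
        {"patient_id": "p-001", "visit_date": " 2014-09-01 ", "charge_usd": "120.50"},
        {"patient_id": "p-002", "visit_date": "2014-09-02", "charge_usd": "75.00"},
    ]
    warehouse = sqlite3.connect(":memory:")
    load(transform(extract(source)), warehouse)
    print(warehouse.execute("SELECT * FROM visits").fetchall())
```

The transform step is where the consistency described above comes from: whichever source a record arrives from, it lands in the warehouse in one agreed shape, ready to be exported to any downstream system.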

Beyond situational descriptions of typical business needs, quantifying a minimum data requirement is difficult. Data warehouses are unique to each scenario, and it's best to thoroughly research your needs and requirements before committing to an installation.

Author Bio: Christopher Hebert is a content writer at TechnologyAdvice. He covers business intelligence, project management, and other emerging technology. Connect with him on Google+.

Social Data Innovation Summit November 6 & 7, Miami, 2014

#SocialDataMiami

"Drive Success Through Innovative Digital Analytics"



Talking In-Memory

Talking In-Memory Computing With Stephen Dillon, Engineering Fellow at Schneider Electric
Simon Barton, Assistant Editor



When we talk about the use of in-memory databases (IMDBs), we often discuss them in relatively vague terms. People often know what they are, the advantages of using them and the names of some of the large platforms that use them. However, in-depth knowledge is rare, and finding somebody with considerable expertise on the subject is difficult. They are comparatively new, which has seen many look at them with a degree of trepidation, especially when previous iterations of the technology we see today have been relatively unreliable. So we wanted to talk to somebody who knows an almost unparalleled amount about it. Stephen Dillon is an Engineering Fellow and Data Architect at Schneider Electric, a company that is world renowned for its innovative use of data in the energy sector. This has given Stephen a unique insight, especially because energy companies were some of the first and widest adopters of IMDBs. The reason for more widespread use of this kind of technology in the energy sector is the need for real-time data, something that IMDBs excel in compared to more traditional databases. They solve the issues that many companies have found in the new big data era, where it is not only the volume but the velocity of the data that has caused an issue with management.

So, I felt compelled to pick Stephen's brain on the subject, as well as the future and rise of a data-led business landscape, when I spoke to him before his presentation at the Big Data Innovation Summit in Boston.

'In-memory databases (IMDBs), especially those in the NewSQL domain, are solving the issues that many companies have. Many of the next generation of solutions are being built atop in-memory databases. Just look at anything with SmartCity or the Internet of Things in the title and often you will find an in-memory database solution within the technology stack where low latency is required. There is still a long way to go however'.

Stephen is right when discussing the potential that IMDBs have, and we have seen many companies taking advantage of this new database option, but it has not taken off to the extent that some had thought. As mentioned before, this could be down to pre-existing ideas about the system, a misunderstanding or simply not knowing about it. However, Stephen believes that there are two major obstacles to the more widespread use of IMDBs.

'First is that they are often perceived as highly disruptive to users of traditional relational databases. In my experience, many folks are inexperienced with modern IMDBs and the associated architectures. They may have been familiar with older offers from five years to a decade ago and are not familiar with the modern capabilities, including the Cloud, which many of these IMDBs are designed for and which poses a learning curve for many users.'

'The second obstacle is that people do not truly understand their own use cases. One size database technology does not fit all needs and I often see people seeking the wrong solution to their problems. Companies need to not only better understand the tools in their toolbox and know when to use a hammer versus a power drill, but they need to first understand what they are building in the first place.'

Stephen tells us that IMDBs are solving business problems today and are going to be solving even more problems in the future. The main barrier to that is not the technology itself, but the education of the community in its benefits and the ways that it can be used. Stephen has developed expert status with IMDBs due to his career within the energy sector, where he has gained extensive and practical knowledge of IMDBs and real-time data streaming. However, Stephen has several other examples of where IMDBs are being implemented.

'There are systems that analyze sensor readings on an oil pipeline, which allows maintenance personnel to monitor vast expanses of piping in remote locations and to determine where an issue occurred for optimized responses.' This allows companies to pinpoint potential problem areas where maintenance is needed, saving time and the costs associated with sending a large maintenance team to find technical issues. It also means that if a problem occurs in a particular area, pipes can be turned off to avoid ecological accidents.

Stephen also believes that 'If an IMDB was not a solution for one domain they tend to cast it aside despite having many potential use cases elsewhere. A big advantage of IMDBs is that they allow you to solve problems you could not previously do with traditional technologies. The challenge is to expand a business' potential and not confine yourself to 'how it has been done before''.

Much like the spread of data initiatives in general, it seems that there needs to be an impetus to push IMDBs forward. I wanted to know Stephen's thoughts on the spread of data-driven thinking in society and why he believed this to be the case. Much like others, he agrees that a combination of the Cloud, Big Data and analytics have all played important parts, but 'since I am biased towards data technologies and distributed architectures I will say it starts with the need to store and process all of this data we are dealing with. Analytics is certainly not new but it has gained in popularity because of the availability of all of this data we've begun to accumulate; perhaps mistakenly so'.

The increased amount of data can also be attributed to sensors and the Internet of Things allowing for increased velocity of data collection. This further goes into the need for IMDBs, as 'legacy database systems simply have not been able to support our needs or they've been unable to do so at the same metrics of time, effort, and costs as newer offers in the NoSQL and NewSQL domains. The fact we can do more with these new architectures and technologies has also been a catalyst to many such initiatives'.

Finally, I wanted to discuss the particular business functions that Stephen found to be most amenable to using data. As we have seen across a multitude of industries, some, such as banking and retail, have taken to it more than others. Stephen is in a unique position at Schneider Electric, with his role meaning that he works across almost every business unit at the company, so he has a thorough insight. 'In my experience, the marketing and traditional C-level business functions really lend themselves to data primarily because of analytics. The use of analytics allows those businesses to tell a story and make an impact.'


He has found that the more technically minded (especially those who are more experienced) are the slowest adopters of new data technologies and initiatives. He has had several experiences of this, but the one that sticks in the mind is this: 'Recently I gave a presentation at a Big Data event on the topic of in-memory databases and an executive, from a major IT company, asked me why one should even bother storing data at all. As comical as many of the other attendees found that question, it bemoans the level of misunderstanding about data that is still prevalent circa 2014'. Stephen puts it best when describing the importance of adopting data technologies: 'In my opinion, if one is resistant to at least understanding the new types of technologies and architectures then they are quickly antiquating themselves'.



Big Data In Science

BIG DATA IN SCIENCE


Big Data In Oceanography

Big Data In Oceanography
Chris Towers, Big Data Leader




The vast ocean that encompasses 71% of the Earth's surface is one of the most unforgiving and fascinating places known to man. But with areas on land still yet to be explored, in territories such as Papua New Guinea, the Amazon rainforest and Antarctica, some feel we are better off exploring on land than diving head first into the sea. Despite this, the sea has been at the forefront of exploration attempts for hundreds of years. In 2013, when marine ecologist Jon Copley embarked on the first manned mission five thousand meters underwater, he described it as 'my journey to another world'.

We're finding out more about aquatic ecosystems all the time, and it's through this better understanding that we're developing a clearer idea of how the ocean operates. As with Jon Copley's mission, huge strides are being made in this area, and continual advances in technology have made this once unfeasible task highly achievable. In April 2014, a team of researchers funded by the National Science Foundation embarked on a 40-day expedition exploring the Kermadec Trench, located just off New Zealand. Due to the extreme pressure and temperatures in these deep-sea trenches, they remain among the world's least explored areas, with little known about the potential life forms circulating down there. In the words of Tim Shank, a biologist at the Woods Hole Oceanographic Institution who was part of the team of researchers: 'We didn't have the technology to do these kinds of detailed studies before. This will be a first-order look at community structure, adaptation and evolution of how life exists in the trenches'. This continual advancement in technology has manifested itself in the Nereus, a remotely operated vehicle which records the darkened ocean floor. However, machine mechanics are not the only scientific area that is beginning to map the oceans; data is also becoming imperative.

Data is important, but it isn't without its limits. The tragic events that unfolded after the take-off of Flight MH370 prompted a number of rescue efforts, but many of us were left wondering why, in our data-rich society, a plane emitting large amounts of information couldn't be found. The disappearance of Flight MH370 throws up far more questions about the efficacy of Big Data than it answers. 'Connecting the dots' has been the overriding issue, with a palpable inability to organize and structure the information at the fingertips of governments. What it boils down to is the fact that the sheer volume of data across the globe overwhelms the ability of any one authority to make sense of it. This explosion of information has been coined 'the Big Data problem', and it seems that the search for Flight MH370 encapsulates these fears, as there are clearly serious limitations in regard to data coordination. These doubts are unlikely to be quelled until the aircraft has been found.

Other data-centric efforts have been seen in the form of crowdsourcing, with DigitalGlobe at the centre of the developments. The company turns its satellites to a particular part of the world, gathers images and creates a mosaic of information that is then investigated by volunteers. DigitalGlobe's Tomnod platform processes data in around 90 minutes so that it can be of assistance to government officials as quickly as possible. It's certainly a useful tool and, as we have discussed before, crowdsourcing has been a really exciting digital development, but in relation to locating an item in the ocean it has yet to bear any real fruit.

Outside the search for Flight MH370, Big Data is contributing to our knowledge of the seas. There have been a lot of improvements in regard to our knowledge of water temperature, wind speed and even other data points like chlorophyll levels. These findings have come to the fore through oceanographic instruments that are aggregating and making sense of information. This information is being made comprehensible through the Marinexplore project, the brainchild of Rainer Sternfeld, a board member of the World Ocean Council. This data isn't just about having information for information's sake - having access to real-time weather and climate data could be a vital tool for ships embarking on long and dangerous voyages, and is a tangible benefit of Big Data for oceanography.

Through technology, the ocean is becoming a far more approachable landscape for exploration, and despite Big Data's failure to find Flight MH370, its ability to help measure climate and water temperature is a real breakthrough. With technology and data working in unison the possibilities for discovery are incredibly exciting, and with new expeditions being embarked on regularly, who knows what findings will be uncovered.


Data To Cure Cancer

Big Data To Cure Cancer
Simon Barton, Assistant Editor


Cancer is one of the most feared and deadly diseases in the world. For many of us, it has either affected us personally or affected a member of our family. Within cancer treatment there is a lot of emphasis placed on catching the root of the problem quickly so that it doesn't spread. Unfortunately for doctors, this early detection often requires the patient to be alert and ultimately aware of the ominous signs that cancer can bring. For patients, this is the real-life equivalent of walking a tightrope - but do analytics have the capacity to reshape the way we think about cancer prevention? According to Jennifer Quigley, Director, Registry & Bio, they do, and she is keen to point out that cancer care and analytics can work in perfect unison.

At the Big Data and Analytics for Pharma Summit in Boston, Jennifer gave us a heartfelt account of cancer care from both the perspective of a data enthusiast and someone who has been deeply affected by the disease. She expresses a desire 'to put names to faces as people often get lost in science and data'.

Jennifer's objective is simple and it can be articulated in six words - to improve cancer care and medicine. It's a simple sentence, but a much more difficult concept to implement, especially when including the needs of sufferers. For Jennifer, continually improving upon and developing cancer care should have data at its heart. She says: 'it's not a product or a program or even a study, it's simply an idea - but it will require co-operation and collaboration across organizations'.

Cancer can be described as an uncontrolled growth of cells, but that is a very sweeping definition - in reality every cancer case is unique to the individual. In order to counteract this uncertainty, Jennifer has put together a proposal which she feels will benefit cancer care greatly, and she was kind enough to share it with us at the summit.

The principle of the project is to collect patient data from pre-treatment and pre-diagnosis, to diagnosis and the end stage. By aggregating this data and coupling it with clinical and diagnostic data, predicting cancer in a patient can become more feasible. The reason why predictive analytics are so important for the improvement of cancer treatment is that no two cancers are alike - they may well be categorised as the same, but like humans every cancer is unique. It's just like us: no one human is the same, but we may have similarities that allow us to be categorised and understood. Jennifer states 'no longer is stage 3 colon cancer the same in every individual' - there are always unique differences.



Jennifer points her attention to Target, the U.S. retailer. She refers to an article written by Charles Duhigg, a journalist at the New York Times, who wrote about how, by using extremely extensive data-gathering techniques, the retailer found out a girl was pregnant before her own father did. This story is often mentioned and heralded by Big Data pessimists as a step too far in data privacy, but whether you agree with Target's operations or not, if it were transferred to cancer prevention it could be a game changer. Imagine if data could predict when someone was most likely to be hit by cancer, and the implications that would have for their health and continued cancer care.

The longitudinal data process that Target implements could be used for cancer care. Clearly it's easier said than done, but the collection of patients' characteristics, demographics, medical history, locational data, and molecular and genetic data is already happening and could help in creating predictive algorithms. As Jennifer states: 'These predictions could be the catalyst for a discovery or a road map for successful studies and trials - but first the ground rules need to be laid. Data collaboration must include structure to normalise data and a way to match records and protect the patient's PHI - this must also include health organisations, labs, hospitals and clinics collaborating to make sure data is correct'.

With the amount of data available to us rising exponentially, now is the time to invest in healthcare analytics. As Jennifer states: 'there has been an explosion of testing that is creating a lot of data - if we are to realise its potential this has to be started today'. Perhaps Target have influenced one of the major breakthroughs in cancer prevention, but whatever the outcome, Jennifer's data-centric ideas are both refreshing and exciting for the people who are affected by cancer every day.


Data In Life Sciences

Big Data In Life Science
Baiju NT, Editor & Product Manager, BigData-MadeSimple.com



We live in an era of digital revolution. The result? Big Data. Typically, the term 'Big Data' refers to data sets so large and complex that they are difficult to process using traditional applications, and it has the potential to impact almost everything around us: travel, shopping, banking, education, sport - and life science is no exception.

Life science companies have been early adopters of Big Data because of the rapid generation of large and complex biomedical data created by devices across the globe every day. Even smaller machines are capable of producing piles of data sets, and many scientists are concerned that over the next 10 years the magnitude, diversity and dispersed nature of this data deluge will escalate, making it increasingly difficult to find relevant data and to derive meaningful patterns and insights. Now, think of its impact and the challenges accompanying the detonation of biomedical data accumulated in computers and servers around the world. As a matter of fact, the management of data presents increasingly difficult issues, from data privacy to the infrastructure that is needed to securely generate, maintain, transfer and analyze data in enormous volumes in largely disconnected environments.

As data will continue to play a critical role in the future of life science companies, especially in research and development (R&D), finance, marketing and risk management, experts and researchers seek to find an answer to a fundamental question: how do we harvest and leverage data to gain a competitive advantage? While most life science organizations seek new ways to gain competitive advantages in the marketplace, many experts believe no single approach is optimal for all analyses. They believe a combination of heterogeneous computing and cloud computing is emerging as a powerful new option, as it meets three critical needs for life sciences computing: crunching more data faster, flexibility to shift from one architecture to another in a public or private cloud environment, and increasing access to accelerated performance.

This new approach provides many of the other benefits of cloud computing, such as lower costs (pay as you go), reduced IT support, ease of adopting technology upgrades, rapid scalability and a dedicated 24/7 public or private cloud service for a single organisation, in place of greater cost, integration, and management challenges. This architecture can provide even small research groups with affordable access to diverse computing resources. Besides, it also allows research and clinical collaborators worldwide to work together, establish statistically significant patient cohorts and use expertise across different institutions. Data management and resource utilization across departments, in shared research HPC cluster environments, analytics clusters, storage archives, and with external collaborators, become easy and affordable. It can even address critical shortages in bioinformatics expertise and burst capacity for high-performance computing (HPC) clusters.

Now, there is a confounding aspect in life sciences and healthcare: most biologists, physicians and other users are usually not IT experts. Although some life scientists have significant computational skills, others do not understand computer semantics enough to know that, in the tech world, Python is not a snake and Perl is not a gem (they are programming languages). For them, it's challenging enough to choose the best tech solutions. Even today, a vast majority of bioinformatics and healthcare applications run on standard clusters, but luckily the trend is changing, as many research organisations have already hit technical and financial roadblocks which prevent them from obtaining sufficient HPC resources for analysing all the output of high-data-rate experimental instruments.

There will be no shortage of challenges when companies begin to adopt Big Data strategies into their operations. However, the benefits of Big Data that come with an effective, robust and holistic Big Data management strategy are too great for any company to ignore. Many life sciences organizations who were early adopters of Big Data as a core asset of their operations have already staked out a distinct advantage in the marketplace, reaping tangible benefits that include:

1. A secure environment for approved researchers to analyze anonymized patient and clinical data, from a variety of sources, in a controlled and auditable manner.

2. Several million dollars of savings in R&D and new hire training by reducing knowledge workers.

3. Improved and unified sales and marketing processes in multiple geographies through integration of clinical data from multiple healthcare data providers.

Big Data isn't just hype; it's here. Companies know they can't escape Big Data, though perhaps they can hide for now. They have already taken steps to understand their use cases — the questions that were buried in Big Data for decades.


Email cgomez@theiegroup.com for more information

