
ISSN 2395-695X (Print), ISSN 2395-695X (Online). Available online at www.ijarbest.com. International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST), Vol. I, Special Issue I, August 2015, in association with Vel Tech High Tech Dr. Rangarajan Dr. Sakunthala Engineering College, Chennai. National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG'15], 28th August 2015.

Trends in Big Data Analytics

1 LEELAMBIKA K.V., 2 VISHAL G., 3 HASHIR AHAMED R.
1 leelambika@velhightech.com, 2 vishalnishanth3@gmail.com, 3 hashir.ahamed93@gmail.com
1 Assistant Professor, 2,3 UG Scholars
Department of Computer Science and Engineering
Vel Tech High Tech Dr. Rangarajan Dr. Sakunthala Engineering College

ABSTRACT

One of the major applications of future generation parallel and distributed systems is in big-data analytics. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size. Beyond their sheer magnitude, these datasets and associated applications' considerations pose significant challenges for method and software development. Datasets are often distributed, and their size and privacy considerations warrant distributed techniques. Data often resides on platforms with widely varying computational and network capabilities. Considerations of fault-tolerance, security, and access control are critical in many applications (such as Dean and Ghemawat, 2004; Apache Hadoop). Analysis tasks often have hard deadlines, and data quality is a major concern in yet other applications. Data-driven models and methods, capable of operating at scale, are as-yet unknown, and even when known methods can be scaled, validation of results is a major issue. Characteristics of hardware platforms and the software stack fundamentally impact data analytics. In this article, we provide an overview of the state of the art and focus on emerging trends to highlight the hardware, software, and application landscape of big-data analytics.


Perhaps the most visible application of big-data analytics has been in business enterprises. It is estimated that a retailer fully utilizing the power of analytics can increase its operating margin by 60%. Utilizing new opportunities (e.g., location-aware and location-based services) leads to significant potential for new revenues. A comprehensive analytics framework would require integration of supply chain management and many models of customer management, after-sales support, advertising, etc. Business enterprises collect vast amounts of multi-modal data, including customer transactions, inventory management, store-based video feeds, advertising and customer relations, customer preferences and sentiments, sales management infrastructure, and financial data, among others. The aggregate of all data for large retailers is easily estimated to be in the exabytes, currently. With comprehensive deployment of RFIDs to track inventory, links to suppliers' databases, and integration with customer preferences and profiles, much of this is the kind of data that people in the social sciences have been trained to work with.

Also, user behavior in social networks may be correlated, which requires new methods to make sense of the causal relationships that are present in the observed stimulus–response loops. Thus, it is appropriate to embrace methodology-focused system collaboration to develop an interdisciplinary, multi-perspective approach that will yield the most interesting results for the new decision support and e-commerce issues, and other social science problems of interest.

Recent hardware advances have played a major role in realizing the distributed software platforms needed for big-data analytics. Future hardware innovations — in processor technology, newer kinds of memory/storage or hierarchies, network architecture (software-defined networks) — will continue to drive software innovations. A strong emphasis in the design of these systems will be on minimizing the time spent in moving the data from storage to the processor, or between storage/compute nodes in a distributed setting.

1. Introduction

With the development of critical Internet technologies, the vision of computing as a utility took shape in the mid 1990s [19]. These early efforts on Grid computing [35] typically viewed hardware as the primary resource. Grid computing technologies focused on sharing, selection, and aggregation of a wide variety of geographically distributed resources. These resources included supercomputers, storage, and other devices for solving large-scale compute-intensive problems in science, engineering, and commerce.


A key feature of these frameworks was their support for transparent domain-crossing administration and resource management capability. The concept of 'data as a resource' was popularized by peer-to-peer systems. Networks such as Napster, Gnutella, and BitTorrent allowed peer nodes to share content – typically multimedia data – directly among one another in a decentralized manner. These frameworks emphasized interoperability and dynamic, ad-hoc communication and collaboration for cost reduction, resource sharing, and aggregation. However, in many of these platforms, considerations of anonymity or privacy issues and scalability were secondary.

More recently, Cloud computing environments [95] target reliable, robust services, ubiquitously accessible (often through browsers) from clients ranging from mass-produced mobile devices to general purpose computers. Cloud computing generalizes prior notions of service, from infrastructure-as-a-service (computing resources available in the cloud) and data-as-a-service (data available in the cloud) to software-as-a-service (access to programs that execute in the cloud). This offers considerable benefits from the points-of-view of service providers (cost reductions in hardware and administration), overall resource utilization, and better client interfaces. The compute back-end of cloud environments typically relies on efficient and resilient data center architectures, built on virtualized compute and storage technologies, efficiently abstracting commodity hardware components. Current data centers typically scale to tens of thousands of nodes, and computations in the cloud often span multiple data centers. The emerging landscape of cloud-based environments with distributed data-centers hosting large data repositories, while also providing the processing resources for analytics, strongly motivates the need for effective parallel/distributed algorithms. The underlying socio-economic benefits of big-data analytics and the diversity of application characteristics pose significant challenges.

In the rest of this article, we highlight the scale and scope of data analytics problems. We describe commonly used hardware platforms for executing analytics applications, and associated considerations of storage, processing, networking, and energy. We then focus on the software substrates for applications, namely virtualization technologies, runtime systems/execution environments, and programming models. We conclude with a brief discussion of the diverse applications of data analytics, ranging from health and human welfare to computational modeling and simulation.

1.1. Scale and scope of data analytics

Recent conservative studies estimate that enterprise server systems in the world have processed 9.57 × 10^21 bytes of data in 2008 [83]. This number is expected to have doubled every two years from that point. As an example, Walmart servers handle more than one million customer transactions every hour, and this information is inserted into databases that store more than 2.5 petabytes of data—the equivalent of 167 times the


number of books in the Library of Congress [94]. The Large Hadron Collider at CERN will produce roughly 15 petabytes of data annually—enough to fill more than 1.7 million dual-layer DVDs per year [60]. Each day, Facebook operates on nearly 500 terabytes of user log data and several hundreds of terabytes of image data. Every minute, 100 h of video are uploaded on to YouTube and upwards of 135,000 h are watched [98]. Over 28,000 multi-media (MMS) messages are sent every second [3]. Roughly 46 million mobile apps were downloaded in 2012, each app collecting more data. Twitter [87] serves more than 550 million active users, who produce 9100 tweets every second. eBay systems process more than 100 petabytes of data every day [64]. In other domains, Boeing jet engines can produce 10 terabytes of operational information for every 30 min of operation. This corresponds to a few hundred terabytes of data for a single Atlantic crossing, which, if multiplied by the 25,000 flights each day, highlights the data footprint of sensor and machine-produced information.

These examples provide a small glimpse into the rapidly expanding ecosystem of diverse sources of massive datasets currently in existence. Data can be structured (e.g., financial, electronic medical records, government statistics), semi-structured (e.g., text, tweets, emails), unstructured (e.g., audio and video), and real-time (e.g., network traces, generic monitoring logs). All of these applications share the potential for providing invaluable insights, if organized and analyzed appropriately.

Applications requiring effective analyses of large datasets are widely recognized today. Such applications include health care analytics (e.g., personalized genomics), business process optimization, and social-network-based recommendations. However, projections suggest that data growth will largely outpace foreseeable improvements in the cost and density of storage technologies, the available computational power for processing it, and the associated energy footprint. For example, between 2002 and 2009 data traffic grew 56-fold, compared to a corresponding 16-fold increase in computing power (largely tracking Moore's law). In comparison, between 1998 and 2005 data centers grew in size by 173% per year [68]. Extrapolating these trends, it will take about 13 years for a 1000-fold increase in computational power (or theoretically 1000× more energy). However, energy efficiency is not expected to increase by a factor of over 25 over the same time period. This generates a severe mismatch of almost a 40-fold increase in the data analytics energy footprint.
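To make the growth arithmetic above concrete, the short Python sketch below annualizes the 56-fold and 16-fold figures quoted for 2002-2009 and projects the 2008 enterprise-data estimate under the stated two-year doubling assumption. The helper names and the 2014 target year are illustrative choices, not values from the text.

    # Sketch: annualize the growth figures quoted above and project the 2008
    # enterprise-data estimate under the stated two-year doubling assumption.

    def annual_growth(total_factor: float, years: float) -> float:
        """Annualized growth rate implied by a total growth factor over a period."""
        return total_factor ** (1.0 / years) - 1.0

    # 2002-2009 (7 years): data traffic grew 56-fold, computing power 16-fold.
    print(f"data traffic : {annual_growth(56, 7):.0%} per year")   # roughly 78% per year
    print(f"compute power: {annual_growth(16, 7):.0%} per year")   # roughly 49% per year

    def projected_bytes(year: int, base: float = 9.57e21, base_year: int = 2008,
                        doubling_years: float = 2.0) -> float:
        """Bytes processed per year if the 2008 figure doubles every two years."""
        return base * 2 ** ((year - base_year) / doubling_years)

    print(f"projected 2014 volume: {projected_bytes(2014):.2e} bytes")  # about 8x the 2008 figure

The widening gap between the two annualized rates is the source of the mismatch discussed above: data volumes compound much faster than the compute (and energy) available to process them.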

261


ISSN 2395-695X (Print) ISSN 2395-695X (Online) Available online at www.ijarbest.com International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST) Vol. I, Special Issue I, August 2015 in association with VEL TECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE, CHENNAI th National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG’15] - 28 August 2015

Workload characteristics. A comprehensive study of big-data workloads can help understand their implications on hardware and software design. Inspired by the seven dwarfs of numerical computation [11], Mehul Shah et al. [82] attempt to define a set of "data dwarfs"—meaning key data processing kernels—that provide current and future coverage of data-centric workloads. Drawing from an extensive set of workloads, they establish a set of classifying dimensions (response time, access pattern, working set, data type, read vs write, processing complexity) and conclude that five workload models could satisfactorily cover data-centric workloads as of 2010: (i) distributed sort at petabyte scale, (ii) in-memory index search, (iii) recommendation system, featuring high processing load and regular communication patterns, (iv) sequential-access based data de-duplication, and (v) video uploading and streaming server at interactive response rates. While Online Analytic Processing (OLAP) workloads can readily be expressed as a combination of (i), (iii) and (iv), Online Transaction Processing (OLTP) workloads can only be partially captured and might require another category in the future; in-memory index and query support captures some facets of these workloads, but the working sets can become too large to fit in memory.
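As a rough illustration of how such a classification might be applied in practice, the sketch below encodes the six classifying dimensions as a small record and maps a hypothetical workload onto one of the five models. The field values and the mapping rules are invented for illustration; they are not the procedure used by Shah et al.

    # Sketch: the six classifying dimensions as a record, plus a toy rule that
    # tags a workload with one of the five 2010-era models. Purely illustrative.
    from dataclasses import dataclass

    WORKLOAD_MODELS = [
        "distributed sort (petabyte scale)",
        "in-memory index search",
        "recommendation system",
        "sequential-access data de-duplication",
        "interactive video upload/streaming",
    ]

    @dataclass
    class WorkloadProfile:
        response_time: str          # e.g. "interactive", "batch"
        access_pattern: str         # e.g. "sequential", "random"
        working_set: str            # e.g. "fits-in-memory", "petabyte"
        data_type: str              # e.g. "text", "video"
        read_write_mix: str         # e.g. "read-heavy", "write-heavy"
        processing_complexity: str  # e.g. "low", "high"

    def closest_model(p: WorkloadProfile) -> str:
        # Toy mapping: a few hand-written rules standing in for a real classifier.
        if p.data_type == "video" and p.response_time == "interactive":
            return WORKLOAD_MODELS[4]
        if p.working_set == "fits-in-memory" and p.access_pattern == "random":
            return WORKLOAD_MODELS[1]
        if p.access_pattern == "sequential" and p.read_write_mix == "write-heavy":
            return WORKLOAD_MODELS[3]
        if p.processing_complexity == "high":
            return WORKLOAD_MODELS[2]
        return WORKLOAD_MODELS[0]

    print(closest_model(WorkloadProfile("interactive", "random", "fits-in-memory",
                                        "text", "read-heavy", "low")))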

1.2. Design considerations

The scale, scope, and nature (workload characteristics) of big data analytics applications, individually, provide interesting insights into the design and architecture of future hardware and software systems.

Impact on hardware. The data access patterns, and more specifically the frequency with which data is accessed (cold versus hot data), can drive future memory hierarchy optimizations: data generally starts out being hot; however, as time progresses, it becomes archival, cold, and most suitable for storage in NVMs. However, there are notable exceptions of periodicity or churn in access patterns (season-related topics, celebrity headlines) and concurrently hot massive datasets (comparative genomic calculations) that should be taken into consideration. Furthermore, latent correlations among dimensions can arise with hardware stack projections: a single video, due to multiple formats or language subtitles, results in many versions. These could either be generated offline and stored (thus needing ample storage) or generated on the fly (transcoding and translation on demand), putting pressure on the computing infrastructure of the data-center, or alternatively on the user's device (client-side computing). Alternatively, one might have to rethink the relative prioritization of advances in processor designs over the performance of the I/O subsystem—a common assumption in current architecture design. At the extreme of such an alternative, an option would be the consideration of a possible "inversion": a hierarchy of compute elements supporting the data store, instead of today's designs of memory hierarchies to serve the compute element. Gradually collapsing existing storage hierarchies would smoothen such a transition and further provide savings in energy consumption.
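A minimal sketch of the hot/cold placement policy this suggests is shown below: newly written objects start in a hot tier, are demoted to an archival (NVM-like) tier when they have not been accessed for a while, and are promoted back on access, which also accommodates the periodic-churn exceptions noted above. The tier names and the one-hour threshold are assumptions made only for illustration.

    # Sketch: recency-based hot/cold placement; "nvm_archive" stands in for an
    # energy-efficient archival tier. Thresholds are arbitrary illustrations.
    import time

    class TieredStore:
        def __init__(self, cold_after_seconds: float = 3600.0):
            self.hot, self.nvm_archive = {}, {}
            self.last_access = {}
            self.cold_after = cold_after_seconds

        def put(self, key, value):
            self.hot[key] = value                  # new data starts hot
            self.last_access[key] = time.time()

        def get(self, key):
            if key in self.nvm_archive:            # churn/periodic reuse: promote back
                self.hot[key] = self.nvm_archive.pop(key)
            self.last_access[key] = time.time()
            return self.hot[key]

        def demote_cold(self):
            now = time.time()
            for key in [k for k, t in self.last_access.items()
                        if now - t > self.cold_after and k in self.hot]:
                self.nvm_archive[key] = self.hot.pop(key)   # archival, cold data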


Understanding the workloads could also identify opportunities for implementing special purpose processing elements directly in hardware. GPUs, field-programmable gate arrays (FPGAs), specialized application-specific integrated circuits (ASICs), and dedicated video encoders/decoders warrant consideration. Such hardware accelerators drastically reduce the energy consumption compared to their general-purpose processing counterparts. These could be integrated on-chip, leading to families of data-centric asymmetric multiprocessors [66].

NVMs offer latency reductions of approximately three orders of magnitude (microseconds) over conventional disks. There are proposals for using flash-based solid-state disks (SSDs) to support key-value store abstractions for workloads that favor it. Yet others propose to organize SSDs as caches for conventional disks (hybrid design).
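The hybrid design mentioned here can be pictured with the small sketch below, where a bounded, LRU-managed map stands in for the SSD cache placed in front of a slower disk-resident store. The class, the write-through policy, and the default capacity are illustrative assumptions rather than a description of any particular proposal.

    # Sketch: an SSD-like LRU cache in front of a disk-like backing store
    # (write-through). Capacities and names are illustrative.
    from collections import OrderedDict

    class HybridStore:
        def __init__(self, ssd_capacity: int = 4):
            self.capacity = ssd_capacity
            self.ssd = OrderedDict()   # small, fast tier (stands in for the SSD cache)
            self.disk = {}             # large, slow tier (stands in for conventional disks)

        def write(self, key, value):
            self.disk[key] = value     # write-through: the disk copy stays authoritative
            self._cache(key, value)

        def read(self, key):
            if key in self.ssd:        # cache hit: serve from the fast tier
                self.ssd.move_to_end(key)
                return self.ssd[key]
            value = self.disk[key]     # cache miss: fetch from disk, then cache
            self._cache(key, value)
            return value

        def _cache(self, key, value):
            self.ssd[key] = value
            self.ssd.move_to_end(key)
            while len(self.ssd) > self.capacity:
                self.ssd.popitem(last=False)   # evict the least-recently-used entry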

Ideally, the persistence of NVMs should be exposed at the instruction-set level (ISA), so that operating systems can utilize them efficiently (e.g., by redesigning parts that assume memory volatility, or by providing, to the upper layers, an API for placing archival data on energy-efficient NVM modules). On the other hand, the capability of durable memory writes reduces isolation; this issue could be addressed via durable memory transactions [93,24]. From the perspective of algorithm design and related data structures, non-volatility could push towards alternate, optimized designs and implementations of index structures [22], key-value stores [91], and database and file systems [27,12], all integral components of big-data analytics.
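Durable memory transactions are a hardware/runtime mechanism; as a loose, file-backed analogy of the underlying discipline (make an update durable before it becomes visible), the sketch below logs each update and flushes it to stable storage before applying it to the in-memory state. The log file name and record format are assumptions for illustration, not part of the cited proposals.

    # Sketch: log-then-apply update, forcing the log record to stable storage
    # before the in-memory state changes. File name is an illustrative assumption.
    import json, os

    state = {}

    def durable_update(key, value, log_path="updates.log"):
        record = json.dumps({"key": key, "value": value})
        with open(log_path, "a") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())   # the update is durable before it becomes visible
        state[key] = value           # apply only after the flush succeeds

    durable_update("session-42", {"status": "archived"})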

2.2. Processing landscape for data analytics

Chip multiprocessors (CMPs) are expected to be the computational work-horses for big data analytics. It seems, however, that there is no consensus on the specifics of the core ensemble hosted on a chip. Basically, there are two dimensions of differentiation.

One virtualization approach replaces non-virtualizable operations with suitably engineered, virtualization-friendly equivalents. However, altering the source code of an operating system can be problematic due to licensing issues, and it potentially introduces incompatibilities. In an alternate approach, a binary translator runs the non-virtualizable, privileged parts and dynamically patches the "offending" instructions, also retaining in a trace cache the translated blocks for optimization purposes.

For memory management, the VMM maintains a shadow of each virtual machine's memory-management data structure, its shadow page table. The VMM updates these structures to reflect the operating system's changes and establishes the mapping to actual pages in the hardware memory. Challenges here include enabling the VMM to leverage the operating system's internal state for efficient paging in/out, and sharing identical physical pages across multiple virtual machines monitored by a single VMM. This sharing will be particularly important for homogeneous pools (in terms of software configuration) of virtual machines executing, over multicore multiprocessors on-a-chip, the workloads of big data analytics in the future.


References

[1] Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, Stan Zdonik, Aurora: a new model and architecture for data stream management, VLDB J. 12 (2) (2003) 120–139.
[2] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Alexander Rasin, Avi Silberschatz, HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads, in: VLDB, 2009.
[3] http://agbeat.com/tech-news/how-carriers-gather-track-and-sell-yourprivate-data/.
[4] Yanif Ahmad, Bradley Berg, Uğur Çetintemel, Mark Humphrey, Jeong-Hyon Hwang, Anjali Jhingran, Anurag Maskey, Olga Papaemmanouil, Alexander Rasin, Nesime Tatbul, Wenjuan Xing, Ying Xing, Stan Zdonik, Distributed operation in the borealis stream processing engine, in: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD'05, ACM, New York, NY, USA, 2005, pp. 882–884.
[5] Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat, A scalable, commodity data center network architecture, in: Victor Bahl, David Wetherall, Stefan Savage, Ion Stoica (Eds.), SIGCOMM, ACM, 2008, pp. 63–74.
[6] Mohammad Alizadeh, Albert G. Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan, Data center TCP (DCTCP), in: Shivkumar Kalyanaraman, Venkata N. Padmanabhan, K.K. Ramakrishnan, Rajeev Shorey, Geoffrey M. Voelker (Eds.), SIGCOMM, ACM, 2010, pp. 63–74.
[7] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan, FAWN: a fast array of wimpy nodes, Commun. ACM 54 (7) (2011) 101–109.
[8] Rasmus Andersen, Brian Vinter, The scientific byte code virtual machine, in: GCA, 2008, pp. 175–181.
[9] H. Andrade, B. Gedik, K.L. Wu, P.S. Yu, Processing high data rate streams in system S, J. Parallel Distrib. Comput. 71 (2) (2011) 145–156.
[10] Luiz André Barroso, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, in: Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2009.
[11] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick, The Landscape of Parallel Computing Research: A View from Berkeley, Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.
[12] M. Athanassoulis, A. Ailamaki, S. Chen, P. Gibbons, R. Stoica, Flash in a DBMS: where and how? Bull. IEEE Comput. Soc. Tech. Committee Data Eng. (2010).
[13] Jason Baker, Chris Bond, James C. Corbett, J.J. Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh, Megastore: Providing Scalable, Highly Available Storage for Interactive Services, in: CIDR'11, 2011.
[14] J. Baliga, R.W.A. Ayre, K. Hinton, R.S. Tucker, Green cloud computing: balancing energy in processing, storage, and transport, Proc. IEEE 99 (1) (2011) 149–167.
[15] Costas Bekas, Alessandro Curioni, A new energy aware performance metric, Comput. Sci.-Res. Dev. 25 (3–4) (2010) 187–195.
[16] K. Birman, D. Freedman, Qi Huang, P. Dowell, Overcoming cap with consistent soft-state replication, Computer 45 (2) (2012) 50–58.
[17] E.A. Brewer, Towards robust distributed systems, in: Proc. 19th Annual ACM Symposium on Principles of Distributed Computing, PODC, 2000, pp. 7–10.
[18] Mike Burrows, The chubby lock service for loosely-coupled distributed systems, in: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI'06, USENIX Association, Berkeley, CA, USA, 2006, pp. 335–350.
