ISSN 2395-695X (Print) ISSN 2395-695X (Online) Available online at www.ijarbest.com International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST) Vol. I, Special Issue I, August 2015 in association with VEL TECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE, CHENNAI th National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG’15] - 28 August 2015
Trends in Big Data Analytics 1
LEELAMBIKA.K.V, 2VISHAL.G, 3HASHIR AHAMED.R
1
leelambika@velhightech.com, 2vishalnishanth3@gmail.com, 3hashir.ahamed93@gmail.com 1 1,2,3 1,2,3
Assistant Professor, 2,3 UG Scholars
Department of Computer Science and Engineering
Vel Tech High Tech Dr.Rangarajan Dr.Sakunthala Engineering College
focus on emerging trends to highlight the hardware,
ABSTRACT One of the major applications of future generation
software, and application
landscapeof big-data
parallel and distributed systems is in big-data
analytics. in the cloud often span multiple data
analytics.Data repositories for such applications
centers The compute back-end of cloud made and
currently exceed exabytes and are rapidly increasing
environments typically relies on efficient and
in size.Beyond their sheer magnitude, these datasets
resilient data centerarchitectures built on virtualized
and associated applications’ considerations pose
compute and storage technologies,
significantchallenges
abstracting
for
method
and
software
commodity
hardware
efficiently components.
development. Datasets are often distributed and their
Current data centerstypically scale to tens of
size and privacyconsiderations warrant distributed
thousandsWith the development of critical Internet
techniques. Data often resides on platforms with
technologies, the visionof computing as a utility took
widely
network
shape in the mid 1990sThese early efforts on Grid
fault-tolerance,
computing typically viewed hardwareas the primary
security, and access control arecritical in many
resource. Grid computing technologies focusedon
applications
sharing, selection, andaggregation of a wide variety
varyingcomputational
capabilities.
Apache
Considerations
and of
such asDean and Ghemawat, 2004;
hadoop.
Analysis
tasks
often
have
ofgeographically
distributed
resources.
These
harddeadlines, and data quality is a major concern in
resources include variety of function of the system of
yet other use applications. data-driven models and
supercomputers,storage, and other devices for solving
methods, capable of operating at scale, are as-yet
large-scalecompute-intensive problems in science,
unknown. Even when knownmethods can be scaled,
engineering,
validation of results is a major issue. Characteristics
domain-crossing
of hardware platforms and the
stack
resourcemanagementcapability.The concept of ‘data
fundamentally impact data analytics. In this article,
as a resource’ was popularized by peerto-peer
we provide an overview of the state of the art and
systems
software
and
various
commerce.transparent
administration
and
Networks such as Napster, Gnutella,
258
ISSN 2395-695X (Print) ISSN 2395-695X (Online) Available online at www.ijarbest.com International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST) Vol. I, Special Issue I, August 2015 in association with VEL TECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE, CHENNAI th National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG’15] - 28 August 2015
andBitTorrent allowed peer nodes to share content –
sense of the causal relationships that are present in
typically multimediadata – directly among one
the observed stimulus–response loops. Thus, it is
another in a decentralized manner.These frameworks
appropriate to embrace methodology-focused system
emphasized interoperability and dynamic,ad-hoc
collaboration to develop an interdisciplinary, multi-
communication
cost
perspective approach that will yield the most
aggregation.
interesting results for the new decision support and e-
However, in many of these platforms,considerations
commerce issues, and other social science problems
of anonymity or privacy issues and scalabilitywere
of interest. Recent hardware advances have played a
secondary. Perhaps, the most visible application of
major role in realizing the distributed software
big-data analytics hasbeen in business enterprises. It
platforms
is estimated that a retailer fully utilizing the power of
hardware innovations — in processor technology,
analytics can increase its operating margin by 60%.
newer
Utilizing new opportunities (for e.g., location-aware
network architecture software-defined networks will
and location-based services) leads to significant
continue to drive software innovations. Strong
potential
for new revenues.A comprehensive
emphasis in design of these systems will be on
analytics framework would require integrationof
minimizing the time spent in moving the data from
supply chain management. and many models of
storage to the processor or between storage/compute
customer management,after-sales support advertising,
nodes in a distributed setting.
and
reduction,resource
collaboration
sharing,
and
for
needed
kinds
of
for
big-data
analytics.Future
memory/storage
orhierarchies,
etc. Business enterprises collectvast amounts of multi-modal
data,
transactions,inventory
including
customer
management,
store-based
1. Introduction With
the
development
of
critical
Internet
video feeds, advertising and customer relations,
technologies, the vision of computing as a utility took
customer
shape in the mid 1990s [19].
preferences
and
sentiments,sales
management infrastructure, and financial data, among
These early efforts on Grid computing [35] typically
others.The aggregate of all data for large retailers is
viewed hardware as the primary resource. Grid
easily estimated to be in the exabytes, currently. With
computing
comprehensive deployment of
to track
selection, and aggregation of a wide variety of
inventory, links to suppliers databases, integration
geographically distributed resources. These resources
with customer preferences and profiles . people in
included supercomputers, storage, and other devices
the social sciences andhave been trained to work
for solving large-scale compute-intensive problems in
with. Also user behavior in social networks may be
science, engineering, and commerce.
RFIDs
technologies
focused
on
sharing,
correlated, which requires new methods to make
259
ISSN 2395-695X (Print) ISSN 2395-695X (Online) Available online at www.ijarbest.com International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST) Vol. I, Special Issue I, August 2015 in association with VEL TECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE, CHENNAI th National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG’15] - 28 August 2015
A key feature of these frameworks was their support
tens of thousands of nodes and computations in the
fortransparent domain-crossing administration and
cloud often span multiple data centers.
resource management capability.
The
Networks such as Napster, Gnutella, anBitTorrent
environments with distributed data-centers hosting
allowed peer nodes to share content – typically
large data repositories, while also providing the
multimedia data – directly among one another in a
processing resources for analytics strongly motivates
decentralized manner.
need for effective parallel/distributed algorithms. The
These frameworks emphasized interoperability and
underlying socio-economic benefits of big-data
dynamic,ad-hoc communication and collaboration for
analytics
cost reduction, resource sharing, and aggregation.
characteristics pose significant challenges. In the rest
However, in many of these platforms, considerations
of this article, we highlight the scale and scope of
of anonymity or privacy issues and scalability
data analytics problems.
were secondary. More recently, Cloud computing
We describe commonly used hardware platforms for
environments [95] target reliable, robust services,
executing analytics applications, and associated
ubiquitously accessible (often throughbrowsers) from
considerations of storage, processing, networking,
clients ranging from mass-produced mobile devices
and energy. We then focus on the software substrates
to general purpose computers. Cloud computing
for applications, namely virtualization technologies,
generalizes
runtime
prior
notions
of
service
from
emerging
and
landscape
the
of
diversity
systems/execution
of
cloud-based
application
environments,
and
infrastructure-as-a-service (computing
programming models. We conclude with a brief
resources available in the cloud), and data-as-a-
discussion of the diverse applications of data
service (dataavailable in the cloud) to software-as-a-
analytics, ranging from health and human welfare
service (access to programs that execute in the
to computational modeling and simulation.
cloud). This offers considerable benefits from points-
1.1. Scale and scope of data analytics
of-view of service providers (cost reductions in
Recent conservative studies estimate that enterprise
hardware and administration), overall resource
server systems in the world have processed
utilization, and better client interfaces. The compute
9.57×1021 bytes of data in 2008
back-end of cloud environments typically relies on
[83]. This number is expected to have doubled every
efficient and resilient data center architectures, built
two years from that point. As an example, Walmart
on virtualized compute and storage technologies,
servers handle more than one million customer
efficiently
hardware
transactions every hour, and this information is
components. Current data centers typically scale to
inserted into databases that store more than 2.5
abstracting
commodity
petabytes of data—the equivalent of 167 times the
260
ISSN 2395-695X (Print) ISSN 2395-695X (Online) Available online at www.ijarbest.com International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST) Vol. I, Special Issue I, August 2015 in association with VEL TECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE, CHENNAI th National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG’15] - 28 August 2015
number of books in the Library of Congress [94]. The
share the potential for providing invaluable insights,
Large Hadron Collider at CERN will produce
if organized and analyzed appropriately.
roughly 15 petabytes of data annually—enough to fill
Applications requiring effective analyses of large
more than 1.7 million dual-layer DVDs per year [60].
datasets
Each day, Facebook operates on nearly 500 terabytes
applications include health care analytics (e.g.,
of user log data and several hundreds of terabytes of
personalized
genomics),
image data. Every minute, 100 h of video are
optimization,
and
uploaded on to YouTube and upwards of 135,000 h
recommendations. However, projections suggest that
are watched [98]. Over 28,000 multi-media (MMS)
data
messages are sent every second [3]. Roughly 46
improvements in the cost and density of storage
million mobile apps were downloaded in 2012, each
technologies, the available computational power for
app collecting more data. Twitter [87] serves more
processing it, and the associated energy footprint. For
than 550 million active users, who produce 9100
example, between 2002 and 2009 data traffic grew
tweets every second. eBay systems process more than
56-fold, compared to a corresponding 16-fold
100 petabytes of data every day [64]. In other
increase in computing power (largely tracking
domains, Boeing jet engines can produce 10 terabytes
Moore’s law). In comparison, between 1998 and
of operational information for every 30 min of
2005 data centers grew in size by 173%
operation.
per year [68]. Extrapolating these trends, it will take
This corresponds to a few hundred terabytes of data
about 13 years for a 1000-fold increase in
for a single Atlantic crossing, which, if multiplied by
computational power (or theoretically 1000× more
the 25,000 flights each day, highlights the data
energy). However, energy efficiency is not expected
footprint
to increase by a factor of over 25 over the same time
of
sensor
and
machine-produced
are
growth
widely
will
recognized
largely
today.
business
Such
process
social-network-based
outpace
foreseeable
nformation.
period. This generates a severe mismatch of almost a
These examples provide a small glimpse into the
40-fold increase in the data analytics energy
rapidly expanding ecosystem of diverse sources of
footprint.
massive datasets currently in existence. Data can be
Workload characteristics. A comprehensive study of
structured (e.g., financial, electronic medical records,
big-data workloads can help understand their
government statistics), semi-structured (e.g., text,
implications on hardware and software design.
tweets, emails), unstructured (e.g., audio and video),
Inspired
and
computation [11], Mehul Shah et al. [82] attempt to
real-time
(e.g.,
network
traces,
monitoring logs). All of these applications
generic
by
the
seven
dwarfs
of
numerical
define a set of ‘‘data dwarfs’’—meaning key data
261
ISSN 2395-695X (Print) ISSN 2395-695X (Online) Available online at www.ijarbest.com International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST) Vol. I, Special Issue I, August 2015 in association with VEL TECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE, CHENNAI th National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG’15] - 28 August 2015
processing kernels—that provide current and future
being hot; however as time progresses, it becomes
coverage of data-centric workloads. Drawing from
archival, cold, most suitable for storage in NVMs.
an extensive set of workloads, they establish a set of
However, there are notable exceptions of periodicity
classifying dimensions (response time, access pattern,
or churn in access patterns (season-related topics,
working set, data type, read vs write, processing
celebrity headlines) and concurrently hot massive
complexity) and conclude that five workload
datasets (comparative genomic calculations) that
models
should be taken into consideration.
could
satisfactorily
cover
data-centric
workloads as of 2010: (i) distributed sort at petabytes
Furthermore, latent correlations among dimensions
scale,
can arise with hardware stack projections: a single
(ii)
in-memory
index
search,
(iii)
recommendation system, featuring high processing
video, due to multiple formats or language subtitles,
load and regular communication patterns, (iv)
results in many versions. These could either be
sequential-access based data de-duplication and (v)
generated offline and stored (thus needing ample
video uploading and streaming server at interactive
storage) or generated on the fly (transcoding and
response rates. While Online Analytic Processing
translation on demand) putting pressure on the
(OLAP) workloads can readily be expressed as a
computing infrastructure of the data-center, or
combination of (i), (iii) and (iv), Online Transaction
alternatively on the user’s device (client-side
Processing (OLTP) workloads can only be partially
computing).
captured and might require another category in the
Alternatively, one might have to rethink the relative
future; in-memory index and query support captures
prioritization of advances in processor designs over
some facets of these workloads, but the working sets
the performance of the I/O subsystem—a common
can become too large to fit in memory.
assumption in current architecture design. At the
1.2. Design considerations
extreme of such an alternative, an option would be (workload
the consideration of a possible ‘‘inversion’’: a
characteristics) of big data analytics applications,
hierarchy of compute elements supporting the data
individually, provide interesting insights into the
store instead of today’s designs of
design and architecture of future hardware and
hierarchies to serve the compute element. Gradually
software systems.
collapsing
Impact on hardware. The data access patterns and
smoothen such a transition and further provide
more specifically the frequency of how data is
savings in energy consumption.
accessed (cold versus hot data) can drive future
Understanding the workloads could also identify
memory hierarchy optimizations: data generally starts
opportunities for implementing special purpose
The
scale,
scope
and
nature
existing
storage
hierarchies
memory
would
262
ISSN 2395-695X (Print) ISSN 2395-695X (Online) Available online at www.ijarbest.com International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST) Vol. I, Special Issue I, August 2015 in association with VEL TECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE, CHENNAI th National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG’15] - 28 August 2015
processing elements directly K. Kambatla et al. / J.
database and file systems [27,12], all integral
Parallel Distrib. Comput. 74 (2014) 2561–2573 2563
components of bigdata analytics.
in hardware. GPUs, field-programmable gate arrays
2.2. Processing landscape for data analytics
(FPGAs), specialized application-specific integrated
Chip multiprocessors (CMPs) are expected to be the
circuits
computational work-horses for big data analytics. It
(ASICs),
and
dedicated
video
encoders/decoders warrant consideration. Such
seems however that there is no consensus on the
hardware accelerators drastically reduce the energy
specifics of the core ensemble hosted on a chip.
consumption, compared to their general-purpose
Basically there are two dimensions of differentiation,
processing counterparts. These could be integrated
operations with suitably engineered, virtualization
on-chip,
friendly equivalents. However, altering the source
leading
to
families
of
data-centric
asymmetric multiprocessors [66].
code of an operating system can be problematic due
NVMs offer latency reduction of approximately three
to licensing issues, and it potentially introduces
orders of magnitude (microseconds) over this time.
incompatibilities. In an alternate approach, a binary
There are proposals for using flash-based solid-state-
translator runs the non-virtualizable, privileged parts
disks (SSDs) to support key-value store abstractions,
and
for workloads that favor it. Yet others propose to
instructions, also retaining in a trace cache the
organize SSDs as caches for conventional disks
translated blocks for optimization purposes.
(hybrid design).
For memory management, VMM maintains a shadow
Ideally the persistence of NVMs should be exposed
of each virtual machine’s memory-management data
at the instruction-set level (ISA), so that the operating
structure, its shadow page table. VMM updates these
systems can utilize them efficiently (e.g., by
structures reflecting operating system’s changes and
redesigning parts that assume memory volatility or
establishes the mapping to actual pages in the
provide, to the upper layers, an API for placing
hardware memory. Challenges here include enabling
archival data on energy-efficient NVM modules). On
the VMM to leverage the operating system’s internal
the other hand, the capability of durable memory
state for efficient paging in/out and sharing identical
writes reduces isolation; this issue could be addressed
physical pages across multiple virtual machines
via durable memory transactions [93,24]. From
monitored by a single VMM. This sharing will be
the perspective of algorithm design and related data
particularly important for homogeneous pools (in
structures,
towards
terms of software configuration) of virtual machines
alternate, optimized designs and implementations of
executing, over multicore multiprocessors on-a-chip,
index structures, [22], key-value stores [91],
the workloads of big data analytics in the future.
non-volatility
could
push
dynamically
patches
the
‘‘offending’’
263
ISSN 2395-695X (Print) ISSN 2395-695X (Online) Available online at www.ijarbest.com International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST) Vol. I, Special Issue I, August 2015 in association with VEL TECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE, CHENNAI th National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG’15] - 28 August 2015
References
Savage, Ion Stoica (Eds.), SIGCOMM, ACM, 2008,
[1] Daniel J. Abadi, Don Carney, Ugur Çetintemel,
pp. 63–74.
Mitch Cherniack, Christian
[6] Mohammad Alizadeh, Albert G. Greenberg,
Convey, Sangdon Lee, Michael Stonebraker, Nesime
David A. Maltz, Jitendra Padhye,
Tatbul, Stan Zdonik,
Parveen Patel, Balaji Prabhakar, Sudipta Sengupta,
Aurora: a new model and architecture for data stream
Murari Sridharan, Data
management, VLDB J.
center TCP (DCTCP), in: Shivkumar Kalyanaraman,
12 (2) (2003) 120–139.
Venkata N. Padmanabhan,
[2] Azza Abouzeid, Kamil Bajda-Pawlikowski,
K.K. Ramakrishnan, Rajeev Shorey, Geoffrey M.
Daniel Abadi, Alexander Rasin, Avi
Voelker (Eds.), SIGCOMM, ACM,
Silberschatz, HadoopDB: an architectural hybrid of
2010, pp. 63–74.
MapReduce and DBMS
[7] David G. Andersen, Jason Franklin, Michael
technologies for analytical workloads, in: VLDB,
Kaminsky, Amar Phanishayee,
2009.
Lawrence Tan, Vijay Vasudevan, FAWN: a fast array
[3] http://agbeat.com/tech-news/how-carriers-gather-
of wimpy nodes, Commun.
track-and-sell-yourprivate-
ACM 54 (7) (2011) 101–109.
data/.
[8] Rasmus Andersen, Brian Vinter, The scientific
[4] Yanif Ahmad, Bradley Berg, Uˇgur Cetintemel,
byte code virtual machine, in:
Mark Humphrey, Jeong-Hyon
GCA, 2008, pp. 175–181.
Hwang, Anjali Jhingran, Anurag Maskey, Olga
[9] H. Andrade, B. Gedik, K.L. Wu, P.S. Yu,
Papaemmanouil, Alexander
Processing high data rate streams in
Rasin, Nesime Tatbul, Wenjuan Xing, Ying Xing,
system S, J. Parallel Distrib. Comput. 71 (2) (2011)
Stan Zdonik, Distributed
145–156.
operation in the borealis stream processing engine,
[10] Luiz André Barroso, Urs Hölzle, The Datacenter
in: Proceedings of the 2005
as a Computer: An Introduction
ACM
SIGMOD
International
Conference
on
to the Design of Warehouse-Scale Machines, in:
Management of Data, SIGMOD’05,
Synthesis Lectures on
ACM, New York, NY, USA, 2005, pp. 882–884.
Computer
[5] Mohammad Al-Fares, Alexander Loukissas,
Publishers, 2009.
Amin Vahdat, A scalable, commodity
[11] Krste Asanovic, Ras Bodik, Bryan Christopher
data center network architecture, in: Victor Bahl,
Catanzaro, Joseph James Gebis,
Architecture,
Morgan
&
Claypool
David Wetherall, Stefan
264
ISSN 2395-695X (Print) ISSN 2395-695X (Online) Available online at www.ijarbest.com International Journal of Advanced Research in Biology Ecology Science and Technology (IJARBEST) Vol. I, Special Issue I, August 2015 in association with VEL TECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE, CHENNAI th National Conference on Recent Technologies for Sustainable Development 2015 [RECHZIG’15] - 28 August 2015
Parry Husbands, Kurt Keutzer, David A. Patterson,
Symposium on Priniciples of Distributed Computing,
William Lester Plishker, John
PODC, 2000, pp. 7–10.
Shalf, Samuel Webb Williams, Katherine A. Yelick,
[18] Mike Burrows, The chubby lock service for
The Landscape of Parallel
loosely-coupled distributed
Computing Research: A View from Berkeley.
systems, in: Proceedings of the 7th Symposium on
Technical Report UCB/EECS-
Operating Systems Design
2006-183,
EECS
Department,
University
of
and Implementation, OSDI’06, USENIX Association,
California, Berkeley, 2006.
Berkeley, CA, USA, 2006,
[12] M. Athanassoulis, A. Ailamaki, S. Chen, P.
pp. 335–350.
Gibbons, R. Stoica, Flash in a DBMS: where and how? Bull. IEEE Comput. Soc. Tech. Committee Data Eng. (2010). [13] Jason Baker, Chris Bond, James C. Corbett, J.J. Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh, Megastore: Providing Scalable, Highly Available Storage for Interactive Services, in: CIDR’11, 2011. [14] J. Baliga, R.W.A. Ayre, K. Hinton, R.S. Tucker, Green cloud computing: balancing energy in processing, storage, and transport, Proc. IEEE 99 (1) (2011) 149–167. [15] Costas Bekas, Alessandro Curioni, A new energy aware performance metric, Comput. Sci.-Res. Dev. 25 (3–4) (2010) 187–195. [16] K. Birman, D. Freedman, Qi Huang, P. Dowell, Overcoming cap with consistent soft-state replication, Computer 45 (2) (2012) 50–58. [17] E.A. Brewer, Towards robust distributed systems, in: Proc. 19th Annual ACM
265