The Apache Spark Big Data Analytics Tool: Fast and Interactive
Apache Spark is an open source data analytics cluster computing framework that is extremely fast. It is versatile and can be used in tandem with other frameworks like Hadoop. Spark became an Apache Top Level Project in February 2014.
We have already witnessed the power of the Hadoop framework in Big Data analytics. It has proven its enormous capability to solve the most critical data-intensive challenges. But while the Hadoop framework is well suited to batch processing of stored Big Data, on its own it proves inadequate when it comes to real-time and interactive Big Data analytics.
In this article, we are going to explore the power of Apache Spark, which was developed by researchers at UC Berkeley and is now maintained by the open source community. It is built on in-memory data processing, which can make it up to 100 times faster than the Hadoop MapReduce framework. It works very effectively for querying very large data sets and can deliver sub-second latency on query results.
What is interactive Big Data analytics?
Interactive Big Data analytics refers to running ad-hoc queries, machine learning and graph processing algorithms over massive volumes of data, with results returned fast enough for real-time, exploratory use.
Since the advent of the Hadoop MapReduce framework, this has become a very active area of research, with much of the current work focused on highly interactive and fast Big Data processing.
Here are some iterative, interactive and streaming Big Data tools.
Dremel: A Google product launched in 2010, it works on a novel approach. It offers a highly interactive, real-time and ad-hoc query interface, which is not possible with MapReduce.
Cloudera Impala: This is an SQL query engine that runs on Apache Hadoop, a leading enterprise product that combines scalable parallel database technology with the strengths of Hadoop. It allows users to directly query data stored in HDFS (the Hadoop distributed file system) and Apache HBase, with no data movement required.
The Apache Spark approach
The defining approach of Apache Spark is in-memory computing, which means data gets stored or cached in the cluster's distributed main memory instead of on disk.
In-memory computing: Apache Spark introduces a new data primitive called the RDD (resilient distributed dataset), with the help of which data is stored across the cluster's memory. Because an RDD can be recomputed from the lineage of transformations that built it, fault tolerance is provided automatically, without replicating the data sets. RDDs provide read and write speeds up to 40 times faster than HDFS or any other distributed file system.
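To make this concrete, here is a minimal sketch of caching an RDD, written for the Scala Spark shell (spark-shell), in which the SparkContext sc is already defined; the numbers are illustrative only:

// Build an RDD from a local collection and pin it in cluster memory.
val numbers = sc.parallelize(1 to 1000000).cache()
// The first action materialises and caches the partitions...
println(numbers.count())
// ...and later actions re-read them from RAM instead of recomputing them.
println(numbers.map(_ * 2).reduce(_ + _))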
Hadoop interoperability: Apache Spark is also fully interoperable with Hadoop and is, therefore, being adopted by many research projects. Spark can read from and write to any storage system supported by Hadoop, such as HDFS, HBase and S3 (Simple Storage Service), and can even use Hadoop's input/output APIs. Therefore, Apache Spark can be used very effectively for non-interactive applications as well. Apart from this, Spark can also be integrated with MLlib (a machine learning library), Spark Streaming, Shark and GraphX.
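As a hedged illustration of this interoperability, the following spark-shell fragment reads a text file from HDFS, filters it and writes the result back; the namenode address and paths are placeholders for your own cluster:

// Read an existing HDFS file through Hadoop's input format support.
val logs = sc.textFile("hdfs://namenode:9000/data/access.log")
// Keep only the error lines and write them back to HDFS.
val errors = logs.filter(line => line.contains("ERROR"))
errors.saveAsTextFile("hdfs://namenode:9000/output/errors")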
Advantages of Apache Spark
The following are the key advantages of Apache Spark.

Fastest data processing: Initially, the main purpose of deploying Spark was to improve the efficiency of existing MapReduce applications. MapReduce is a general computational model, not just the specific implementation found in core Hadoop. Spark supports the MapReduce model too, and because it uses memory optimally (even when recovering from failures), some workloads run faster on Spark's MapReduce than on Hadoop's, even without making efficient use of caches across iterations.

The iterative algorithm: Through the cache() function, Spark lets users and applications cache datasets explicitly, so that subsequent passes read the data from RAM rather than from disk. This dramatically improves the performance of iterative algorithms, which scan the same dataset again and again; a sketch of such a loop appears after this list of advantages.

Two of the world's leading companies have leveraged Spark's efficient iterative execution for content recommendation and ad targeting. The speed comparisons are telling: machine learning algorithms such as logistic regression run up to 100x faster on Spark than earlier Hadoop-based implementations, while others, such as collaborative filtering with alternating least squares, run about 15x faster.

Real-time streaming: Spark offers exactly-once semantics and full recovery of stateful operators. It also lets you process streams with the same Spark APIs, reusing regular Spark application code. Its streaming API targets low-latency data analysis and can easily be extended to process and analyse live data streams.

Quick decision making: Reduced latency and improved quality and efficiency are among Spark's advantages. Many companies use Big Data for recommendation systems, ad targeting or predictive analytics, and their effectiveness increases as decision latency falls, leading to dramatic improvements in the company's ROI (return on investment). Spark deployments are well suited to ad targeting and improved video delivery over the Web.
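The iterative pattern described above looks roughly like the following spark-shell sketch: the dataset is cached once, and every pass of the loop scans it from memory rather than from HDFS. The input path and the convergence rule are purely illustrative:

// Load and cache the working set once.
val data = sc.textFile("hdfs://namenode:9000/data/values.txt").map(_.toDouble).cache()

var threshold = 0.0
for (i <- 1 to 10) {
  // Each iteration re-reads the cached RDD from RAM, not from disk.
  val mean = data.filter(v => v > threshold).mean()
  threshold = mean / 2
}
println("Final threshold: " + threshold)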
Applications of Apache Spark
Apache Spark is currently used by many research projects at Berkeley related to machine learning, and it is also being adopted at several Internet companies.

Interactive queries on data streams: Quantifind (a startup specialising in predictive analytics on time series data) used Spark to explore time series data and to build a user-friendly interactive interface. The workflow has three steps: new data is uploaded from external sources (e.g., Facebook or Twitter streams) at fixed intervals, an entity extraction algorithm is run over it, and an in-memory table of each mentioned entity is built. Users can then raise queries in a user-friendly manner through a Web application.

Spam detection on Twitter: In this research project at Berkeley, Apache Spark was used to investigate spam links in Twitter posts. A regression classifier was developed on top of Spark, and with its help the research team could build an accurate and extremely fast classifier for spam detection.

Hive on Spark: Spark offers full support for the HiveQL language and can run over existing Hive data warehouses. It extends Hive's capabilities by allowing tables to be loaded in-memory for much faster access and, apart from SQL support, also provides machine learning functions like clustering and classification.

Spark streaming: Spark Streaming is a sub-project of Spark. It extends Spark's map, reduce and filter functionality to online (stream) processing; a minimal sketch follows.
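As a hedged sketch of what a Spark Streaming job looks like (using the current org.apache.spark.streaming API; the host and port of the text source are placeholders), a word count over one-second batches can be written as:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
// Treat each line arriving on the socket as part of a 1-second micro-batch.
val lines = ssc.socketTextStream("localhost", 9999)
// The familiar map/reduce operators from batch Spark apply directly to the stream.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()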
Setting up Spark on Amazon EC2
To set up a Spark cluster on Amazon EC2, you need to follow the simple steps listed below.
Prerequisites: To access any Amazon cloud service, you first have to sign up with Amazon and then navigate to the EC2 service. Next, click ‘Create key pair’ (in the key pairs section); you will be given the option of creating and downloading a key pair.
Downloading Spark: You can download and extract Spark on your local machine by running the following commands:
$ wget http://spark-project.org/files/spark-0.7.2-prebuilt-hadoop1.tgz
$ tar xvfz spark-0.7.2-prebuilt-hadoop1.tgz
Extracting the archive creates a spark-0.7.2 directory, whose ec2 sub-directory contains the cluster set-up script:
$ spark-0.7.2/ec2/spark-ec2 -k <keypair-name> -i <key-file> -s <num-slaves> launch <cluster-name>
Here <keypair-name> is the name of the EC2 key pair and <key-file> is the private key.
You can give the number of slaves that you want to launch in the <num-slaves> option, e.g., 1, 2, 3, 4, etc.
You can give any suitable name to your cluster in the <cluster-name> option.
Logging in to the Spark cluster: Once the cluster is launched, just log in to the master instance with the following command:
$ ./spark-ec2 -k key -i key.pem login <cluster-name>
For specific configurations, you can edit the /root/spark/conf/spark-env.sh file on each machine.
Terminating a cluster: If you want to terminate the cluster, run the following command:
$ ./spark-ec2 destroy <cluster-name>
By: Rishabh Sharma
The author is a solutions delivery analyst with an MNC, involved in research projects in cloud computing and Big Data. He has in-depth knowledge of distributed systems and cloud computing research issues in the industry. He has authored three textbooks: ‘Advanced Computing Technology’ for Gujarat Technical University, and ‘Software Quality Engineering’ and ‘Mobile Computing’ for Uttar Pradesh Technical University, apart from many research papers in international journals and IEEE publications. Reach him via email at er.rishabh.sharma@gmail.com.