Cost-Effective Cloud Server Provisioning for Predictable Performance of Big Data Analytics
Abstract: Cloud datacenters are underutilized due to server over-provisioning. To increase datacenter utilization, cloud providers offer users an option to run workloads such as big data analytics on the underutilized resources, in the form of cheap yet revocable transient servers (e.g., EC2 spot instances, GCE preemptible instances). Though offered at highly reduced prices, deploying big data analytics on unstable cloud transient servers can severely degrade job performance due to instance revocations. To tackle this issue, this paper proposes iSpot, a cost-effective transient server provisioning framework for achieving predictable performance in the cloud, focusing on Spark as a representative Directed Acyclic Graph (DAG)-style big data analytics workload. iSpot first identifies the stable cloud transient servers during job execution by devising an accurate Long Short-Term Memory (LSTM)-based price prediction method. Leveraging automatic job profiling and the acquired DAG information of stages, we further build an analytical performance model and present a lightweight critical data checkpointing mechanism for Spark, which together enable our iSpot provisioning strategy to guarantee job performance on stable transient servers. Extensive prototype
experiments on both EC2 spot instances and GCE preemptible instances demonstrate that iSpot is able to guarantee the performance of big data analytics running on cloud transient servers while reducing the job budget by up to 83.8% in comparison to state-of-the-art server provisioning strategies, yet with acceptable runtime overhead. Existing system: Also, iSpot is able to cut the monetary cost of job executions by up to 83.8% compared to state-of-the-art instance provisioning strategies (e.g., OptEx, Ernest), while guaranteeing the performance of Spark jobs with practically acceptable runtime overhead. Furthermore, we demonstrate that iSpot is effective and flexible enough to replace our critical data checkpointing with existing fault tolerance mechanisms for big data analytics, to gracefully handle the revocations of cloud instances. The rest of the paper is organized as follows. Sec. 2 experimentally illustrates the severe variation of instance prices and the performance degradation of big data analytics on cloud transient servers. Sec. 3 and Sec. 4 devise an LSTM-based prediction model for spot prices and a DAG-based performance model for Spark, respectively, complementary to our preliminary work. Proposed system: There have been investigations on deploying EC2 spot instances to meet the deadlines of HPC applications and matrix-based workflows. Recent efforts are devoted to guaranteeing performance goals for big data analytics running on on-demand servers by performance-aware job scheduling or machine learning-based job performance modeling. Nevertheless, these techniques all rely on the quality of the training dataset and thus bring heavy training overhead (e.g., training data collection) to model construction. Moreover, coarse-grained machine learning performance models imprecisely predict the performance of data analytics jobs with complex dataflow execution graphs, e.g., the Directed Acyclic Graph (DAG) in Spark.
As a result, there has been scant research devoted to predicting the job performance in a lightweight manner by explicitly considering the job execution graph, and to providing predictable performance for big data analytics particularly using cloud transient servers.
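To make the contrast with coarse-grained models concrete, a DAG-aware performance model can estimate a job's completion time as the longest (critical) path through its stage graph. The sketch below is illustrative only, not iSpot's actual analytical model: the stage names, per-stage durations, and dependencies are hypothetical inputs standing in for profiled values.

```python
# Minimal sketch: estimate a DAG-style job's completion time as the
# length of the critical path through its stage graph, assuming the
# per-stage durations were obtained from profiling (values are
# illustrative, not iSpot's real model).

def critical_path_time(durations, deps):
    """durations: {stage: seconds}; deps: {stage: [parent stages]}."""
    finish = {}  # memoized earliest finish time per stage

    def finish_time(stage):
        if stage not in finish:
            # A stage can start only after all of its parents finish.
            start = max((finish_time(p) for p in deps.get(stage, [])),
                        default=0.0)
            finish[stage] = start + durations[stage]
        return finish[stage]

    return max(finish_time(s) for s in durations)

# Illustrative diamond-shaped DAG: s0 fans out to s1 and s2,
# which join at the final stage s3.
durations = {"s0": 10.0, "s1": 25.0, "s2": 15.0, "s3": 5.0}
deps = {"s1": ["s0"], "s2": ["s0"], "s3": ["s1", "s2"]}
print(critical_path_time(durations, deps))  # 10 + 25 + 5 = 40.0
```

Because the estimate follows the stage dependencies explicitly, shortening an off-critical-path stage (s2 here) leaves the predicted completion time unchanged, which is exactly the structure a coarse-grained model cannot capture.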
Advantages: We first characterize the price variation of different types of EC2 spot instances, and then illustrate how the revocations of transient servers cause the performance degradation of big data analytics in the cloud. We focus on Spark as a representative DAG-style big data analytics workload in the cloud, because the execution of Spark jobs follows the DAG information of stages, similar to several other big data frameworks, e.g., MapReduce, Dryad, and TensorFlow. Disadvantages: In particular, cloud server provisioning with different instance types is an NP-complete problem. To obtain an efficient and feasible solution in polynomial time, we practically simplify our cloud server provisioning for big data analytics by using the same type of cloud transient servers. Modules: Transient resource provisioning for predictable performance: Recent research has been devoted to guaranteeing the performance of cloud applications on EC2 spot instances using fault tolerance mechanisms. For example, checkpointing and execution replication techniques are deployed and compared when running MPI applications on EC2 spot instances. By formulating the resource provisioning problem as a Markov decision process, Cumulon-D achieves predictable performance for matrix-based data analysis with EC2 spot instances. A more recent work named Tributary guarantees the latency SLOs of elastic Web services by allocating an appropriate number and type of spot instances. It further adopts resource over-provisioning to handle instance revocations and workload bursts. In contrast, iSpot achieves the predictable
performance especially for DAG-style big data analytics through designing a delicate job performance model and a critical data checkpointing mechanism, while reducing the monetary cost of job executions. Fault tolerance mechanisms of big data analytics: To mitigate the performance impact of cloud instance revocations, traditional fault tolerance mechanisms such as proactive periodic checkpointing, proactive or reactive migration, and computation replication are effective for batched data processing jobs. To avoid frequent checkpointing of jobs, Pado identifies and assigns the critical computation part of DAG-style data processing workloads to reliable resources. In addition, checkpointing the data partitions computed on the revoked instances and the frontier RDDs on the lineage graph can also alleviate the checkpointing overhead for Spark jobs. Orthogonal to the prior work, iSpot adopts a critical data checkpointing mechanism in Sec. 5.1, which selectively checkpoints the RDD data that causes heavy recomputation overhead to remote stable storage. Accordingly, to deal with the revocations of cloud instances, iSpot can work with the techniques above in a complementary manner, as demonstrated.
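The selective checkpointing idea can be illustrated with a simple cost comparison. The sketch below is a hypothetical simplification, not iSpot's actual Sec. 5.1 policy: it marks an RDD for checkpointing only when the expected recomputation cost after a revocation exceeds the one-time cost of writing it to remote stable storage. All names, costs, and the revocation probability are made-up inputs.

```python
# Illustrative sketch (not iSpot's exact policy): checkpoint only the
# RDDs whose expected recomputation cost after an instance revocation
# outweighs the one-time cost of writing them to remote stable storage.
# The costs and revocation probability below are hypothetical inputs.

def select_checkpoints(rdds, revocation_prob):
    """rdds: list of (name, recompute_cost_s, checkpoint_cost_s)."""
    selected = []
    for name, recompute_cost, checkpoint_cost in rdds:
        # Expected recomputation penalty if this RDD is NOT checkpointed.
        expected_penalty = revocation_prob * recompute_cost
        if expected_penalty > checkpoint_cost:
            selected.append(name)
    return selected

rdds = [
    ("shuffle_rdd", 600.0, 20.0),  # heavy to recompute, cheap to save
    ("map_rdd", 30.0, 25.0),       # cheap to recompute from lineage
]
print(select_checkpoints(rdds, revocation_prob=0.2))  # ['shuffle_rdd']
```

Under this rule, RDDs that are cheap to rebuild from the lineage graph are left to Spark's normal recomputation, so the checkpointing overhead stays proportional to the truly critical data.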