ETPL BD - 001
H2Hadoop: Improving Hadoop Performance using the Metadata of Related Jobs
Cloud Computing leverages the Hadoop framework for processing BigData in parallel. Hadoop has certain limitations that could be exploited to execute jobs more efficiently. These limitations mostly stem from data locality in the cluster, job and task scheduling, and resource allocation in Hadoop. Efficient resource allocation remains a challenge in Cloud Computing MapReduce platforms. We propose H2Hadoop, an enhanced Hadoop architecture that reduces the computation cost associated with BigData analysis. The proposed architecture also addresses the issue of resource allocation in native Hadoop. H2Hadoop provides a better solution for "text data", such as finding a DNA sequence and the motif of a DNA sequence. H2Hadoop also provides an efficient Data Mining approach for Cloud Computing environments. The H2Hadoop architecture leverages the NameNode's ability to assign jobs to the TaskTrackers (DataNodes) within the cluster. By adding control features to the NameNode, H2Hadoop can intelligently direct and assign tasks to the DataNodes that contain the required data without sending the job to the whole cluster. Compared with native Hadoop, H2Hadoop reduces CPU time, the number of read operations, and other Hadoop metrics.
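As an illustration only (the paper does not publish code, and all class and field names below are hypothetical), the sketch shows the general idea behind the NameNode-side metadata H2Hadoop relies on: a cache mapping previously executed job features (e.g., a searched DNA subsequence) to the blocks that produced matches, so a related job can be routed to just those DataNodes instead of the whole cluster.

```java
import java.util.*;

/** Hypothetical sketch of H2Hadoop-style related-job metadata kept by the NameNode. */
public class RelatedJobIndex {
    // Maps a job "feature" (e.g., the DNA subsequence searched for)
    // to the set of block IDs whose data previously produced matches.
    private final Map<String, Set<String>> featureToBlocks = new HashMap<>();

    /** Record which blocks were relevant after a job finishes. */
    public void recordJob(String feature, Collection<String> matchingBlocks) {
        featureToBlocks.computeIfAbsent(feature, f -> new HashSet<>())
                       .addAll(matchingBlocks);
    }

    /**
     * For a new, related job, return only the blocks worth scanning;
     * an empty result means the job must fall back to the whole cluster.
     */
    public Set<String> blocksForRelatedJob(String feature) {
        return featureToBlocks.getOrDefault(feature, Collections.emptySet());
    }

    public static void main(String[] args) {
        RelatedJobIndex index = new RelatedJobIndex();
        index.recordJob("ACGTACGT", Arrays.asList("blk_1001", "blk_1007"));
        // A later job searching the same subsequence is sent to two blocks,
        // not to every DataNode in the cluster.
        System.out.println(index.blocksForRelatedJob("ACGTACGT"));
    }
}
```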
ETPL BD - 002
Privacy Preserving Rack-Based Dynamic Workload Balancing for Hadoop Map Reduce
Hadoop has two components, namely HDFS and MapReduce. Hadoop stores user data based on the space utilization of the datanodes in the cluster rather than on their processing capability. Furthermore, Hadoop runs in a heterogeneous environment, as not all datanodes may be homogeneous. For these reasons, workload imbalances occur when jobs run in a Hadoop cluster, resulting in poor performance. In this paper, we propose a dynamic algorithm to balance the workload between different racks of a Hadoop cluster based on Hadoop's log files. However, if the tasks are executing on critical or sensitive data in a secured rack, transferring the data to an unsecured cluster compromises privacy. We propose an approach to transfer data between racks without disclosing private information. Moving tasks from the most overloaded rack to another rack improves the performance of MapReduce jobs. Our simulations indicate that the proposed algorithm decreases the running time of a job running on the most overloaded rack by more than 50%.
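The following toy sketch is not the authors' algorithm, just one plausible reading of the idea: per-rack load is aggregated from task log records, and tasks are greedily moved off the most overloaded rack. The record fields and the balancing criterion are assumptions.

```java
import java.util.*;

/** Toy rack-balancing heuristic driven by per-task runtimes parsed from logs. */
public class RackBalancer {
    record TaskLog(String taskId, String rack, long runtimeMs) {}

    public static void main(String[] args) {
        List<TaskLog> logs = List.of(
            new TaskLog("t1", "rackA", 9000), new TaskLog("t2", "rackA", 8000),
            new TaskLog("t3", "rackA", 7000), new TaskLog("t4", "rackB", 3000));

        // Aggregate load (total task runtime) per rack.
        Map<String, Long> load = new HashMap<>();
        for (TaskLog t : logs) load.merge(t.rack(), t.runtimeMs(), Long::sum);

        String busiest = Collections.max(load.entrySet(), Map.Entry.comparingByValue()).getKey();
        String idlest  = Collections.min(load.entrySet(), Map.Entry.comparingByValue()).getKey();
        double avg = load.values().stream().mapToLong(Long::longValue).average().orElse(0);

        // Greedily move tasks from the busiest rack while it stays above the average load.
        for (TaskLog t : logs) {
            if (!t.rack().equals(busiest) || load.get(busiest) <= avg) continue;
            load.merge(busiest, -t.runtimeMs(), Long::sum);
            load.merge(idlest, t.runtimeMs(), Long::sum);
            System.out.println("move " + t.taskId() + " from " + busiest + " to " + idlest);
        }
    }
}
```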
ETPL BD - 003
BeTL: Map Reduce Checkpoint Tactics Beneath the Task Level
Big data analysis has gained significant popularity within the last few years. The MapReduce framework presented by Google makes it easier to write applications that process vast amounts of data. MapReduce targets large commodity clusters where failures are not exceptions. However, Hadoop, the most popular implementation of MapReduce, performs poorly under failures. Hadoop implements its fault tolerance strategy at the task level; as a result, a task failure requires re-execution of the whole task regardless of how much input has already been processed. In this paper, we present BeTL, which introduces slight changes to the execution flow of MapReduce and makes finer-grained fault tolerance possible. Map tasks can create checkpoints so that a retried task does not have to start from scratch, saving much time. The speculation strategy can also benefit from this. The new execution flow involves fewer I/O operations and performs better than Hadoop even under no failures. In our experiments, BeTL outperforms Hadoop by 6.6 percent on average under no failures and by 4.6 to 51.0 percent under different failure densities.
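Purely as an illustration of checkpointing beneath the task level (not BeTL's actual implementation), this sketch has a map-like loop persist the index of the last processed input record, so that a re-executed attempt can skip input that was already processed. The file name and checkpoint interval are assumptions.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

/** Toy sub-task checkpointing: a retried "map task" resumes from the saved offset. */
public class CheckpointedMapTask {
    private static final Path CHECKPOINT = Paths.get("map-task-0.ckpt");

    static long loadCheckpoint() throws IOException {
        return Files.exists(CHECKPOINT)
                ? Long.parseLong(Files.readString(CHECKPOINT).trim()) : 0L;
    }

    static void saveCheckpoint(long recordIndex) throws IOException {
        Files.writeString(CHECKPOINT, Long.toString(recordIndex));
    }

    public static void main(String[] args) throws IOException {
        List<String> input = List.of("r0", "r1", "r2", "r3", "r4");
        long start = loadCheckpoint();              // 0 on the first attempt
        for (long i = start; i < input.size(); i++) {
            System.out.println("processing " + input.get((int) i));
            if (i % 2 == 1) saveCheckpoint(i + 1);  // checkpoint every 2 records
        }
        Files.deleteIfExists(CHECKPOINT);           // task finished cleanly
    }
}
```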
ETPL BD - 004
An efficient key partitioning scheme for heterogeneous Map Reduce clusters
Hadoop is a standard implementation of the MapReduce framework for running data-intensive applications on clusters of commodity servers. By thoroughly studying the framework, we find that the shuffle phase, the all-to-all input data fetching phase of the reduce task, significantly affects application performance. Hadoop's MapReduce system suffers from variance in both the intermediate keys' frequencies and their distribution among data nodes throughout the cluster. This variance causes network overhead and leads to unfairness in the reduce input among different data nodes in the cluster. As a result, applications experience performance degradation during the shuffle phase. We develop a novel algorithm that, unlike previous systems, uses a node's capabilities as heuristics to decide a better available trade-off between locality and fairness in the system. Compared with Hadoop's default partitioning algorithm and the Leen algorithm, our approach achieves average performance gains of 29% and 17%, respectively.
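A minimal, hypothetical sketch of capability-weighted key partitioning follows (the paper's algorithm also accounts for key frequencies and locality; here only node capability weights are modelled): keys are assigned greedily, heaviest first, to the reducer whose capability-normalized load is currently lowest.

```java
import java.util.*;

/** Greedy capability-aware assignment of intermediate keys to reduce nodes. */
public class CapabilityAwarePartitioner {
    public static void main(String[] args) {
        // Relative processing capability of each reduce node (assumed known).
        double[] capability = {1.0, 2.0, 4.0};
        double[] assignedLoad = new double[capability.length];

        // Intermediate keys with their total record counts (frequencies).
        Map<String, Long> keyFrequency = Map.of("k1", 900L, "k2", 400L, "k3", 300L, "k4", 50L);

        // Heaviest keys first, each to the node with the lowest normalized load.
        keyFrequency.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .forEach(e -> {
                int best = 0;
                for (int i = 1; i < capability.length; i++) {
                    if (assignedLoad[i] / capability[i] < assignedLoad[best] / capability[best]) {
                        best = i;
                    }
                }
                assignedLoad[best] += e.getValue();
                System.out.println(e.getKey() + " -> reducer " + best);
            });
    }
}
```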
ETPL BD - 005
Hybrid Job-Driven Scheduling for Virtual Map Reduce Clusters
It is cost-efficient for a tenant with a limited budget to establish a virtual MapReduce cluster by renting multiple virtual private servers (VPSs) from a VPS provider. To provide an appropriate scheduling scheme for this type of computing environment, we propose in this paper a hybrid job-driven scheduling scheme (JoSS for short) from a tenant's perspective. JoSS provides not only job-level scheduling, but also map-task level scheduling and reduce-task level scheduling. JoSS classifies MapReduce jobs based on job scale and job type and designs an appropriate scheduling policy to schedule each class of jobs. The goal is to improve data locality for both map tasks and reduce tasks, avoid job starvation, and improve job execution performance. Two variations of JoSS are further introduced to separately achieve a better map-data locality and a faster task assignment. We conduct extensive experiments to evaluate and compare the two variations with current scheduling algorithms supported by Hadoop. The results show that the two variations outperform the other tested algorithms in terms of map-data locality, reduce-data locality, and network overhead without incurring significant overhead. In addition, the two variations are separately suitable for different MapReduce-workload scenarios and provide the best job performance among all tested algorithms.
ETPL BD - 006
Attribute Based Access Control for Big Data Applications by Query Modification
We present concepts that can be used for the efficient implementation of Attribute Based Access Control (ABAC) in large applications using possibly several data storage technologies, including Hadoop, NoSQL, and relational database systems. The ABAC authorization process takes place in two main stages. Firstly, a sequence of permissions is derived which specifies the permitted data to be retrieved for the user's transaction. Secondly, query modification is used to augment the user's transaction with code which implements the ABAC controls. This requires the storage technologies to support a high-level language such as SQL or similar. The modified user transactions are then optimized and processed using the full functionality of the underlying storage systems. We use an extended ABAC model (TCM2) which handles negative permissions and overrides in a single permissions-processing mechanism. We illustrate these concepts using a compelling electronic health records scenario.
ETPL BD - 007
Distributed storage design for encrypted personal health record data
DSePHR is proposed in this work in order to manage encrypted PHR data on cloud storage. HBase and Hadoop are utilized in this work. The objective is to provide an API for any PHR system to upload or download encrypted PHR data to or from cloud storage. DSePHR resolves the NameNode memory issues of HDFS when storing a large number of small files by classifying the encrypted PHR data into small and large files. The small files are handled by the HBase schema proposed in this work. The memory consumption and the processing time of the proposed DSePHR are evaluated using real data sets collected from various healthcare communities.
ETPL BD - 008
FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent items ultrametric tree rather than conventional FP-trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, and the reducers perform combination operations by constructing small ultrametric trees and then mine these trees separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets with different lengths have different decomposition and construction costs. To improve FiDoop's performance, we develop a workload balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD, an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable.
ETPL BD - 009
A Map Reduce-Based Nearest Neighbour Approach for Big-Data-Driven Traffic Flow Prediction
In big-data-driven traffic flow prediction systems, the robustness of prediction performance depends on accuracy and timeliness. This paper presents a new MapReduce-based nearest neighbor (NN) approach for traffic flow prediction using correlation analysis (TFPC) on a Hadoop platform. In particular, we develop a real-time prediction system including two key modules, i.e., offline distributed training (ODT) and online parallel prediction (OPP). Moreover, we build a parallel k-nearest neighbor optimization classifier, which incorporates correlation information among traffic flows into the classification process. Finally, we propose a novel prediction calculation method, combining the current data observed in OPP and the classification results obtained from large-scale historical data in ODT, to generate traffic flow predictions in real time. The empirical study on real-world traffic flow big data using the leave-one-out cross-validation method shows that TFPC significantly outperforms four state-of-the-art prediction approaches, i.e., autoregressive integrated moving average, Naïve Bayes, multilayer perceptron neural networks, and NN regression, in terms of accuracy, which is improved by up to 90.07% in the best case, with an average mean absolute percent error of 5.53%. In addition, it displays excellent speedup, scaleup, and sizeup.
ETPL BD - 010
Toward a cloud-based security intelligence with big data processing
As the adoption of Cloud Computing grows exponentially, a huge amount of data is generated that needs to be processed in order to efficiently control what is going on within the infrastructure and to respond effectively and promptly to security threats. Herein, we provide a highly scalable, plugin-based, and comprehensive solution for real-time monitoring that reduces the impact of an attack or a particular issue on the overall distributed infrastructure. This work covers a broader scope of infrastructure security by monitoring all devices that generate log files or network traffic. By applying different Big Data techniques for data analysis, we can ensure a responsive solution to any problem (security or other) within the infrastructure and act accordingly.
ETPL BD - 011
eHSim: An Efficient Hybrid Similarity Search with Map Reduce
In this paper, we study the problems of scalability and performance for similarity search by proposing eHSim, an efficient hybrid similarity search with MapReduce. More specifically, we introduce clustering schemes that partition objects into different groups by their length. Additionally, we equip our proposed schemes with pruning strategies that quickly discard irrelevant objects before truly computing their similarity. Moreover, we design a hybrid MapReduce architecture that deals with challenges from big data. Furthermore, we implement our proposed methods with MapReduce and make them compatible with the hybrid MapReduce architecture. Last but not least, we evaluate the proposed methods with real datasets. Empirical experiments show that our approach is considerably more efficient than state-of-the-art methods in terms of query processing, batch processing, and data storage.
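To make the length-based pruning idea concrete (a generic sketch, not eHSim's exact scheme), the snippet below skips any candidate whose set size alone already rules out reaching a Jaccard threshold, before computing the true similarity; the threshold and data are illustrative.

```java
import java.util.*;

/** Length-based pruning before exact Jaccard similarity computation. */
public class LengthPrunedSimilarity {
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / (a.size() + b.size() - inter.size());
    }

    public static void main(String[] args) {
        double threshold = 0.6;
        Set<String> query = Set.of("a", "b", "c", "d");
        List<Set<String>> candidates = List.of(
            Set.of("a", "b", "c"), Set.of("a"), Set.of("a", "b", "c", "d", "e", "f", "g", "h"));

        for (Set<String> cand : candidates) {
            int min = Math.min(query.size(), cand.size());
            int max = Math.max(query.size(), cand.size());
            // Jaccard(query, cand) can never exceed min/max, so prune cheaply.
            if ((double) min / max < threshold) {
                System.out.println("pruned by length: " + cand);
                continue;
            }
            System.out.println(cand + " -> " + jaccard(query, cand));
        }
    }
}
```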
ETPL BD - 012
Phurti: Application and Network-Aware Flow Scheduling for Multitenant Map Reduce Clusters
Traffic for a typical MapReduce job in a data center consists of multiple network flows. Traditionally, network resources have been allocated to optimize network-level metrics such as flow completion time or throughput. Some recent schemes propose application-aware scheduling, which can shorten the average job completion time. However, most of them treat the core network as a black box with sufficient capacity. Even if only one network link in the core network becomes a bottleneck, it can hurt application performance. We design and implement a centralized flow-scheduling framework called Phurti with the goal of improving the completion time for jobs in a cluster shared among multiple Hadoop jobs (multi-tenant). Phurti communicates both with the Hadoop framework to retrieve job-level network traffic information and with the OpenFlow-based switches to learn the network topology. Phurti implements a novel heuristic called Smallest Maximum Sequential-traffic First (SMSF) that uses the collected application and network information to perform traffic scheduling for MapReduce jobs. Our evaluation with real Hadoop workloads shows that, compared to application- and network-agnostic scheduling strategies, Phurti improves job completion time for 95% of the jobs, decreases average job completion time by 20% and tail job completion time by 13%, and scales well with the cluster size and number of jobs.
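A rough sketch of the SMSF-style ordering as we read it from the description above (hypothetical data structures; the real heuristic also handles network topology): jobs are prioritized by the largest sequential traffic any of their flows must transfer, smallest first.

```java
import java.util.*;

/** Order jobs by their smallest maximum sequential flow traffic (SMSF-style). */
public class SmsfOrdering {
    record Job(String id, List<Long> flowBytes) {
        long maxSequentialTraffic() {
            return flowBytes.stream().mapToLong(Long::longValue).max().orElse(0L);
        }
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>(List.of(
            new Job("jobA", List.of(500L, 900L)),
            new Job("jobB", List.of(100L, 200L, 150L)),
            new Job("jobC", List.of(2_000L))));

        // Jobs whose largest flow is smallest get their flows scheduled first.
        jobs.sort(Comparator.comparingLong(Job::maxSequentialTraffic));
        jobs.forEach(j -> System.out.println(j.id() + " priority, max flow = "
                + j.maxSequentialTraffic() + " bytes"));
    }
}
```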
ETPL BD - 013
On improving recovery performance in erasure code based geo-diverse storage clusters
Erasure code based distributed storage systems are increasingly being used by storage providers for big data storage, since they offer the same reliability as replication with a significant decrease in the amount of storage required. But when it comes to a storage system with data nodes spread across a very large geographical area, the code's recovery performance is affected by various factors, both network and computation related. In this paper, we propose an XOR-based code supplemented with the ideas of parity duplication and rack awareness that could be adopted in such storage clusters to improve the recovery performance during node failures. We have implemented them on the erasure code module of the XORBAS version of the Hadoop Distributed File System (HDFS). For evaluating the performance of the proposed ideas, we employ a geo-diverse cluster on the NeCTAR research cloud. The experimental results show that the techniques bring down the data read for repair by 85% and the repair duration by 57% during node failures, though they result in an increased storage requirement of 21% as compared to the traditional Reed-Solomon codes used in HDFS. Together, these ideas offer a better solution for a code-based storage system spanning a wide geographical area whose storage constraints make a triple-replicated system unaffordable while strict requirements on minimal recovery time must still be met.
ETPL BD - 014
CaCo: An Efficient Cauchy Coding Approach for Cloud Storage Systems
Users of cloud storage usually assign different redundancy configurations of erasure codes, depending on the desired balance between performance and fault tolerance. Our study finds that a coding scheme chosen by rules of thumb for a given redundancy configuration performs best only with very low probability. In this paper, we propose CaCo, an efficient Cauchy coding approach for data storage in the cloud. First, CaCo uses Cauchy matrix heuristics to produce a matrix set. Second, for each matrix in this set, CaCo uses XOR schedule heuristics to generate a series of schedules. Finally, CaCo selects the shortest one from all the produced schedules. In this way, CaCo is able to identify an optimal coding scheme, within the capability of the current state of the art, for an arbitrary given redundancy configuration. By leveraging CaCo's ease of parallelization, we significantly boost the performance of the selection process using the abundant computational resources in the cloud. We implement CaCo in the Hadoop distributed file system and evaluate its performance by comparing it with "Hadoop-EC" developed by Microsoft Research. Our experimental results indicate that CaCo can obtain an optimal coding scheme within acceptable time. Furthermore, CaCo outperforms Hadoop-EC by 26.68-40.18 percent in encoding time and by 38.4-52.83 percent in decoding time.
ETPL BD - 015
PerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds
Infrastructure-as-a-service clouds are becoming widely adopted. However, resource sharing and multi-tenancy have made performance anomalies a top concern for users. Timely debugging of those anomalies is paramount for minimizing the performance penalty for users. Unfortunately, this debugging often takes a long time due to the inherent complexity and sharing nature of cloud infrastructures. When an application experiences a performance anomaly, it is important to distinguish between faults with a global impact and faults with a local impact, as the diagnosis and recovery steps for the two cases are quite different. In this paper, we present PerfCompass, an online performance anomaly fault debugging tool that can quantify whether a production-run performance anomaly has a global impact or a local impact. PerfCompass can use this information to suggest the root cause as either an external fault (e.g., environment-based) or an internal fault (e.g., software bugs). Furthermore, PerfCompass can identify the top affected system calls to provide useful diagnostic hints for detailed performance debugging. PerfCompass does not require source code or runtime application instrumentation, which makes it practical for production systems. We have tested PerfCompass by running five common open source systems (e.g., Apache, MySQL, Tomcat, Hadoop, Cassandra) inside a virtualized cloud testbed. Our experiments use a range of common infrastructure sharing issues and real software bugs. The results show that PerfCompass accurately classifies 23 out of the 24 tested cases without calibration and achieves 100 percent accuracy with calibration. PerfCompass provides useful diagnosis hints within several minutes and imposes negligible runtime overhead on the production system during normal execution time.
ETPL BD - 016
Protection of Big Data Privacy
In recent years, big data have become a hot research topic. The increasing amount of big data also increases the chance of breaching the privacy of individuals. Since big data require high computational power and large storage, distributed systems are used. As multiple parties are involved in these systems, the risk of privacy violation is increased. There have been a number of privacy-preserving mechanisms developed for privacy protection at different stages (e.g., data generation, data storage, and data processing) of a big data life cycle. The goal of this paper is to provide a comprehensive overview of the privacy preservation mechanisms in big data and present the challenges for existing mechanisms. In particular, in this paper, we illustrate the infrastructure of big data and the state-of-the-art privacy-preserving mechanisms in each stage of the big data life cycle. Furthermore, we discuss the challenges and future research directions related to privacy preservation in big data.
ETPL BD - 017
Collaboration- and Fairness-Aware Big Data Management in Distributed Clouds
With the advancement of information and communication technology, data are being generated at an exponential rate via various instruments and collected at an unprecedented scale. Such large volumes of generated data are referred to as big data, which are now revolutionizing all aspects of our life, ranging from enterprises to individuals and from science communities to governments, as they exhibit great potential to improve the efficiency of enterprises and the quality of life. To obtain nontrivial patterns and derive valuable information from big data, a fundamental problem is how to properly place the data collected by different users in distributed clouds and how to efficiently analyze it so as to reduce user costs in data storage and processing, particularly for users who share data. This requires close collaboration among the users in sharing and utilizing big data in distributed clouds, due to the complexity and volume of big data. Since computing, storage, and bandwidth resources in a distributed cloud are usually limited and such resource provisioning is typically expensive, the collaborating users are required to make use of the resources fairly. In this paper, we study a novel collaboration- and fairness-aware big data management problem in distributed cloud environments that aims to maximize the system throughput while minimizing the operational cost of service providers needed to achieve that throughput, subject to resource capacity and user fairness constraints. We first propose a novel optimization framework for the problem. We then devise a fast yet scalable approximation algorithm based on the built optimization framework. We also analyze the time complexity and approximation ratio of the proposed algorithm. We finally conduct experiments through simulations to evaluate the performance of the proposed algorithm. Experimental results demonstrate that the proposed algorithm is promising and outperforms other heuristics.
ETPL BD - 018
Big Data Meet Green Challenges: Big Data toward Green Applications
Big data are widely recognized as being one of the most powerful drivers to promote productivity, improve efficiency, and support innovation. It is highly expected to explore the power of big data and turn big data into big values. To answer the interesting question of whether there are inherent correlations between the two tendencies of big data and green challenges, a recent study has investigated the issues of greening the whole life cycle of big data systems. This paper aims to discover the relations between the trend of the big data era and that of the new-generation green revolution through a comprehensive and panoramic literature survey of big data technologies toward various green objectives, along with a discussion of relevant challenges and future directions.
ETPL BD - 019
Privacy Preserving Deep Computation Model on Cloud for Big Data Feature Learning
To improve the efficiency of big data feature learning, this paper proposes a privacy-preserving deep computation model that offloads the expensive operations to the cloud. Privacy concerns arise because a large amount of private data is produced by various applications in the smart city, such as sensitive government data or proprietary enterprise information. To protect the private data, the proposed model uses the BGV encryption scheme to encrypt the private data and employs cloud servers to perform the high-order back-propagation algorithm on the encrypted data efficiently for deep computation model training. Furthermore, the proposed scheme approximates the sigmoid function by a polynomial function to support the secure computation of the activation function with BGV encryption. In our scheme, only the encryption and decryption operations are performed by the client, while all the computation tasks are performed in the cloud. Experimental results show that, using a cloud of ten nodes, our scheme improves training efficiency by approximately 2.5 times compared to the conventional deep computation model, without disclosing the private data. More importantly, our scheme is highly scalable by employing more cloud servers, which is particularly suitable for big data.
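For concreteness, one standard low-degree polynomial replacement for the sigmoid is its Taylor expansion around zero (the paper's exact approximation may differ); it keeps the activation compatible with the addition-and-multiplication-only arithmetic that BGV supports:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
\;\approx\; \frac{1}{2} + \frac{x}{4} - \frac{x^{3}}{48},
\qquad \text{accurate for small } |x|.
```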
ETPL BD - 020
Efficient Cloud-Based Framework for Big Data Classification
Big Data is a term that describes the large volume of data, both structured and unstructured, that is difficult to process using traditional database and software techniques. Cloud computing is a technology that offers a solution to this problem. We have designed a cloud-based framework for unstructured data analysis that is motivated by the goal of efficient image data analysis. The framework consists of two general stages, feature extraction and machine learning, that can be used in training mode and classification mode. The framework uses sampling and feedback mechanisms whereby the system learns which image features are most important and which algorithm(s) are best (under user criteria of accuracy and speed). This information allows the system to be more efficient by automatically reducing the number of features captured and the number of algorithms used to evaluate the data. While there is an overhead to the auto-adjusting mechanisms, they do lead to a more efficient solution for big data sets. The solution is evaluated using a leaf image classification problem, and the results show the improvements in efficiency.
ETPL BD - 021
Policy-Based QoS Enforcement for Adaptive Big Data Distribution on the Cloud
Big Data distribution has benefited from Cloud resources to accommodate applications' QoS requirements. In this paper, we propose a Big Data distribution scheme that matches the available Cloud resources to guarantee the application's QoS given the continuously dynamic and varying resources of the Cloud infrastructure. We developed Two-Level QoS Policies (TLPS) for selecting clusters and nodes while satisfying the client application's QoS. We also propose an adaptive data distribution algorithm to cope with changing QoS requirements. Experiments have been conducted to evaluate both the effectiveness and the communication overhead of our proposed distribution scheme, and the reported results are convincing. Other experiments evaluated our TLPS algorithm against other single-based QoS data distribution algorithms, and the results show that the TLPS algorithm adapts to the customer's QoS requirements.
ETPL BD - 022
On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applications
The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks, which, however, is not traffic-efficient because network topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce the network traffic cost for a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce the merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost under both offline and online cases.
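As a toy version of traffic-aware partitioning (not the paper's decomposition-based algorithm), the sketch below assigns each intermediate key to the reducer that minimizes the cross-rack bytes it would add, given where the key's data was produced; the topology and sizes are hypothetical inputs.

```java
import java.util.*;

/** Greedy traffic-aware assignment of keys to reducers using map-side data placement. */
public class TrafficAwarePartition {
    public static void main(String[] args) {
        // Bytes of intermediate data produced for each key on each rack.
        Map<String, Map<String, Long>> keyBytesPerRack = Map.of(
            "k1", Map.of("rack0", 800L, "rack1", 100L),
            "k2", Map.of("rack0", 50L,  "rack1", 600L),
            "k3", Map.of("rack0", 300L, "rack1", 300L));
        // Rack hosting each candidate reducer.
        Map<Integer, String> reducerRack = Map.of(0, "rack0", 1, "rack1");

        for (var keyEntry : keyBytesPerRack.entrySet()) {
            int bestReducer = -1;
            long bestCrossRack = Long.MAX_VALUE;
            for (var r : reducerRack.entrySet()) {
                // Cross-rack traffic = all of this key's bytes not already on the reducer's rack.
                long cross = keyEntry.getValue().entrySet().stream()
                        .filter(e -> !e.getKey().equals(r.getValue()))
                        .mapToLong(Map.Entry::getValue).sum();
                if (cross < bestCrossRack) { bestCrossRack = cross; bestReducer = r.getKey(); }
            }
            System.out.println(keyEntry.getKey() + " -> reducer " + bestReducer
                    + " (cross-rack bytes: " + bestCrossRack + ")");
        }
    }
}
```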
ETPL BD - 023
An Efficient and Fine-grained Big Data Access Control Scheme with Privacy-preserving Policy
Controlling access to the huge amount of big data becomes a very challenging issue, especially when big data are stored in the cloud. Ciphertext-Policy Attribute-Based Encryption (CP-ABE) is a promising encryption technique that enables end-users to encrypt their data under access policies defined over some attributes of data consumers and only allows data consumers whose attributes satisfy the access policies to decrypt the data. In CP-ABE, the access policy is attached to the ciphertext in plaintext form, which may also leak some private information about end-users. Existing methods only partially hide the attribute values in the access policies, while the attribute names are still unprotected. In this paper, we propose an efficient and fine-grained big data access control scheme with a privacy-preserving policy. Specifically, we hide the whole attribute (rather than only its values) in the access policies. To assist data decryption, we also design a novel Attribute Bloom Filter to evaluate whether an attribute is in the access policy and to locate its exact position in the access policy if it is present. Security analysis and performance evaluation show that our scheme can preserve the privacy of any LSSS access policy without incurring much overhead.
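To illustrate the Bloom-filter idea only (the paper's Attribute Bloom Filter additionally encodes the attribute's position in the policy; this minimal sketch checks membership alone, with a crude double-hashing scheme), the following class is self-contained:

```java
import java.util.BitSet;

/** Minimal Bloom filter used to test whether an attribute appears in an access policy. */
public class AttributeBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public AttributeBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    private int index(String attribute, int i) {
        int h1 = attribute.hashCode();
        int h2 = attribute.hashCode() * 31 + 17;   // crude second hash for double hashing
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String attribute) {
        for (int i = 0; i < hashes; i++) bits.set(index(attribute, i));
    }

    /** True means "possibly in the policy" (false positives possible), false means "definitely not". */
    public boolean mightContain(String attribute) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(attribute, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        AttributeBloomFilter abf = new AttributeBloomFilter(1 << 12, 4);
        abf.add("department:cardiology");
        abf.add("role:physician");
        System.out.println(abf.mightContain("role:physician"));   // true
        System.out.println(abf.mightContain("role:intern"));      // very likely false
    }
}
```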
ETPL BD - 024
A Novel Pipeline Approach for Efficient Big Data Broadcasting
Big-data computing is a new critical challenge for the ICT industry. Engineers and researchers are dealing with data sets of petabyte scale in the cloud computing paradigm. Thus, the demand for building a service stack to distribute, manage, and process massive data sets has risen drastically. In this paper, we investigate the Big Data Broadcasting problem, in which a single source node broadcasts a big chunk of data to a set of nodes with the objective of minimizing the maximum completion time. These nodes may be located in the same datacenter or across geo-distributed datacenters. This problem is one of the fundamental problems in distributed computing and is known to be NP-hard in heterogeneous environments. We model the Big-data broadcasting problem as a LockStep Broadcast Tree (LSBT) problem. The main idea of the LSBT model is to define a basic unit of upload bandwidth, r, such that a node with capacity c broadcasts data to a set of [c/r] children at the rate r. Note that r is a parameter to be optimized as part of the LSBT problem. We further divide the broadcast data into m chunks. These data chunks can then be broadcast down the LSBT in a pipelined manner. In a homogeneous network environment in which each node has the same upload capacity c, we show that the optimal uplink rate r* of LSBT is either c/2 or c/3, whichever gives the smaller maximum completion time. For heterogeneous environments, we present an O(n log^2 n) algorithm to select an optimal uplink rate r* and to construct an optimal LSBT. Numerical results show that our approach performs well, with a smaller maximum completion time and lower computational complexity than other efficient solutions in the literature.
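The toy calculation below is not the paper's analysis, just a simplified pipelined-broadcast model that shows why an intermediate uplink rate can win: a smaller r gives a wider, shallower tree, but each chunk is forwarded more slowly. Assumed model: a tree with fan-out floor(c/r), m equal chunks, and completion time equal to (tree depth + m - 1) chunk transmissions of duration (D/m)/r.

```java
/** Toy comparison of candidate uplink rates r for pipelined big-data broadcast. */
public class LsbtRateToy {
    // Depth of a complete tree with the given fan-out that reaches all receivers.
    static int depth(int receivers, int fanout) {
        int covered = 0, levelNodes = 1, d = 0;
        while (covered < receivers) {
            levelNodes *= fanout;
            covered += levelNodes;
            d++;
        }
        return d;
    }

    public static void main(String[] args) {
        double c = 100.0;          // uplink capacity of every node (homogeneous case)
        double dataSize = 8_000.0; // total data to broadcast
        int receivers = 1_000;
        int chunks = 64;           // data is pipelined as m = 64 chunks

        for (double divisor : new double[] {1.5, 2.0, 3.0, 4.0}) {
            double r = c / divisor;
            int fanout = (int) Math.floor(c / r);
            double chunkTime = (dataSize / chunks) / r;
            // Pipelined broadcast: (depth + m - 1) chunk transmissions in total.
            double completion = (depth(receivers, fanout) + chunks - 1) * chunkTime;
            System.out.printf("r = c/%.1f  fanout=%d  completion=%.1f%n", divisor, fanout, completion);
        }
    }
}
```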
ETPL BD - 025
K Nearest Neighbour Joins for Big Data on Map Reduce: a Theoretical and Experimental Analysis
Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a computationally intensive task with a large range of applications such as knowledge discovery or data mining. However, as the volume and the dimension of data increase, only distributed approaches can perform such a costly operation in a reasonable time. Recent works have focused on implementing efficient solutions using the MapReduce programming model because it is suitable for distributed large-scale data processing. Although these works provide different solutions to the same problem, each one has particular constraints and properties. In this paper, we compare the different existing approaches for computing kNN on MapReduce, first theoretically and then by performing an extensive experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN computation on MapReduce: data pre-processing, data partitioning, and computation. We then analyze each step from the load balancing, accuracy, and complexity aspects. Experiments in this paper use a variety of datasets and analyze the impact of data volume, data dimension, and the value of k from many perspectives, such as time and space complexity and accuracy. The experimental part brings out new advantages and shortcomings that are discussed for each algorithm. To the best of our knowledge, this is the first paper that compares kNN computing methods on MapReduce both theoretically and experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based practical problems in the context of big data.
ETPL BD - 026
HadoopViz: A Map Reduce framework for extensible visualization of big spatial data
This paper introduces HadoopViz, a MapReduce-based framework for visualizing big spatial data. HadoopViz has three unique features that distinguish it from other techniques. (1) It exposes an extensible interface which allows users to define new visualization types, e.g., scatter plot, road network, or heat map, by defining five abstract functions, without delving into the implementation details of the MapReduce algorithms. As it is open source, HadoopViz allows algorithm designers to focus on how the data should be visualized rather than on performance or scalability issues. (2) HadoopViz is capable of generating big images with giga-pixel resolution by employing a three-phase technique, partition-plot-merge. (3) HadoopViz provides a smoothing functionality which can fuse nearby records together as the image is plotted. This makes it capable of generating more types of images with higher quality compared to existing work. Experimental results on real datasets of up to 14 billion points show the extensibility, scalability, and efficiency of HadoopViz in handling different visualization types of spatial big data.
ETPL BD - 027
OverFlow: Multi-Site Aware Big Data Management for Scientific Workflows on Clouds
The global deployment of cloud datacenters is enabling large-scale scientific workflows to improve performance and deliver fast responses. This unprecedented geographical distribution of the computation is coupled with an increase in the scale of the data handled by such applications, bringing new challenges related to efficient data management across sites. High throughput, low latencies, or cost-related trade-offs are just a few concerns for both cloud providers and users when it comes to handling data across datacenters. Existing solutions are limited to cloud-provided storage, which offers low performance based on rigid cost schemes. In turn, workflow engines need to improvise substitutes, achieving performance at the cost of complex system configurations, maintenance overheads, reduced reliability and reusability. In this paper, we introduce OverFlow, a uniform data management system for scientific workflows running across geographically distributed sites, aiming to reap economic benefits from this geo-diversity. Our solution is environment-aware, as it monitors and models the global cloud infrastructure, offering high and predictable data handling performance for transfer cost and time, within and across sites. OverFlow proposes a set of pluggable services, grouped in a data scientist cloud kit. They provide the applications with the possibility to monitor the underlying infrastructure, to exploit smart data compression, deduplication and geo-replication, to evaluate data management costs, to set a trade-off between money and time, and to optimize the transfer strategy accordingly. The system was validated on the Microsoft Azure cloud across its 6 EU and US datacenters. The experiments were conducted on hundreds of nodes using synthetic benchmarks and real-life bio-informatics applications (A-Brain, BLAST). The results show that our system is able to model the cloud performance accurately and to leverage this for efficient data dissemination, reducing the monetary costs and transfer time by up to three times.
ETPL BD - 028
Scalable Algorithms for Nearest-Neighbour Joins on Big Trajectory Data
Trajectory data are prevalent in systems that monitor the locations of moving objects. In a location-based service, for instance, the positions of vehicles are continuously monitored through GPS; the trajectory of each vehicle describes its movement history. We study joins on two sets of trajectories, generated by two sets M and R of moving objects. For each entity in M, a join returns its k nearest neighbors from R. We examine how this query can be evaluated in cloud environments. This problem is not trivial, due to the complexity of the trajectory, and the fact that both the spatial and temporal dimensions of the data have to be handled. To facilitate this operation, we propose a parallel solution framework based on MapReduce. We also develop a novel bounding technique, which enables trajectories to be pruned in parallel. Our approach can be used to parallelize existing single-machine trajectory join algorithms. To evaluate the efficiency and the scalability of our solutions, we have performed extensive experiments on a real dataset.
ETPL BD - 029
Crowdsourcing based Description of Urban Emergency Events using Social Media Big Data
Crowdsourcing is a process of acquisition, integration, and analysis of big and heterogeneous data generated by a diversity of sources in urban spaces, such as sensors, devices, vehicles, buildings, and humans. Especially nowadays, no country, no community, and no person is immune to urban emergency events. Detection of urban emergency events, e.g., fires, storms, and traffic jams, is of great importance for protecting the security of humans. Recently, social media feeds have been rapidly emerging as a novel platform for providing and disseminating information that is often geographic. The content from social media usually includes references to urban emergency events occurring at, or affecting, specific locations. In this paper, in order to detect and describe real-time urban emergency events, the 5W (What, Where, When, Who, and Why) model is proposed. Firstly, users of social media are set as the target of crowdsourcing. Secondly, the spatial and temporal information from the social media is extracted to detect the real-time event. Thirdly, a GIS-based annotation of the detected urban emergency event is shown. The proposed method is evaluated with extensive case studies based on real urban emergency events. The results show the accuracy and efficiency of the proposed method.
ETPL BD - 030
i2MapReduce: Incremental MapReduce for Mining Evolving Big Data
As new data and updates are constantly arriving, the results of data mining applications become stale and obsolete over time. Incremental processing is a promising approach to refresh mining results. It utilizes previously saved states to avoid the expense of re-computation from scratch. In this paper, we propose i2MapReduce, a novel incremental processing extension to MapReduce. Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs key-value pair level incremental processing rather than task level re-computation, (ii) supports not only one-step computation but also more sophisticated iterative computation, and (iii) incorporates a set of novel techniques to reduce I/O overhead for accessing preserved fine-grain computation states. Experimental results on Amazon EC2 show significant performance improvements of i2MapReduce compared to both plain and iterative MapReduce performing re-computation.
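As a simplified illustration of key-value-pair level incremental processing (i2MapReduce itself also handles iterative jobs and fine-grain state I/O, which are omitted here), the sketch caches the map output for each input pair and recomputes only the pairs whose input changed.

```java
import java.util.*;
import java.util.function.Function;

/** Memoized "map" over key-value pairs: only changed inputs are recomputed. */
public class IncrementalMap {
    private final Map<String, String> lastInput = new HashMap<>();
    private final Map<String, Integer> lastOutput = new HashMap<>();
    private final Function<String, Integer> mapFn;

    IncrementalMap(Function<String, Integer> mapFn) { this.mapFn = mapFn; }

    Map<String, Integer> run(Map<String, String> input) {
        for (var e : input.entrySet()) {
            if (!e.getValue().equals(lastInput.get(e.getKey()))) {
                System.out.println("recomputing " + e.getKey());
                lastOutput.put(e.getKey(), mapFn.apply(e.getValue()));  // changed pair only
                lastInput.put(e.getKey(), e.getValue());
            }
        }
        lastInput.keySet().retainAll(input.keySet());   // drop deleted pairs
        lastOutput.keySet().retainAll(input.keySet());
        return new HashMap<>(lastOutput);
    }

    public static void main(String[] args) {
        IncrementalMap m = new IncrementalMap(v -> v.split("\\s+").length);  // word count per line
        System.out.println(m.run(Map.of("p1", "a b c", "p2", "d e")));
        // Only p2 changed, so only p2's map function is re-invoked.
        System.out.println(m.run(Map.of("p1", "a b c", "p2", "d e f")));
    }
}
```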
ETPL BD - 032
An Efficient Privacy-Preserving Ranked Keyword Search Method
Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preservation. Therefore, it is essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationship between documents is normally concealed in the process of encryption, which leads to significant degradation of search accuracy. Also, the volume of data in data centers has experienced dramatic growth. This makes it even more challenging to design ciphertext search schemes that can provide efficient and reliable online information retrieval over large volumes of encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and to meet the demand for fast ciphertext search within a big data environment. The proposed hierarchical approach clusters the documents based on a minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum cluster size is reached. In the search phase, this approach achieves a linear computational complexity against an exponential increase in the size of the document collection. In order to verify the authenticity of search results, a structure called the minimum hash sub-tree is designed in this paper. Experiments have been conducted using a collection set built from IEEE Xplore. The results show that with a sharp increase of documents in the dataset, the search time of the proposed method increases linearly, whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method has an advantage over the traditional method in the rank privacy and relevance of retrieved documents.
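As a highly simplified sketch of the clustering rule only (the document representation, the relevance measure, and the splitting rule below are all assumptions; the actual scheme operates on encrypted indexes), the method greedily groups documents whose relevance to a cluster seed meets the minimum threshold and keeps splitting any cluster that exceeds the maximum allowed size.

```java
import java.util.*;

/** Greedy hierarchical clustering: min relevance to join a cluster, max cluster size to split. */
public class HierarchicalDocClustering {
    static double relevance(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / (a.size() + b.size() - inter.size());  // Jaccard
    }

    static List<List<Set<String>>> cluster(List<Set<String>> docs, double minRel, int maxSize) {
        List<List<Set<String>>> clusters = new ArrayList<>();
        for (Set<String> doc : docs) {
            List<Set<String>> target = null;
            for (List<Set<String>> c : clusters) {
                if (relevance(doc, c.get(0)) >= minRel) { target = c; break; }  // c.get(0) is the seed
            }
            if (target == null) { target = new ArrayList<>(); clusters.add(target); }
            target.add(doc);
        }
        // Split oversized clusters into sub-clusters with a stricter threshold.
        List<List<Set<String>>> result = new ArrayList<>();
        for (List<Set<String>> c : clusters) {
            if (c.size() > maxSize && minRel < 0.95) result.addAll(cluster(c, minRel + 0.1, maxSize));
            else result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
            Set.of("cloud", "search"), Set.of("cloud", "search", "rank"),
            Set.of("cloud", "privacy"), Set.of("graph", "mining"));
        System.out.println(cluster(docs, 0.2, 2));
    }
}
```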