

INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY

ISBN: 378 - 26 - 138420 - 5

Secured and Efficient Data Scheduling of Intermediate Data Sets in Cloud

D. TEJASWINI, M.Tech.    C. RAJENDRA, M.Tech., M.E., Ph.D.

AUDISANKARA COLLEGE OF ENGINEERING & TECHNOLOGY

ABSTRACT

Cloud computing is an emerging field in the development of business and organizational environments. Because it provides large computation power and storage space, users can process many applications, and as a result large numbers of intermediate data sets are generated. Encryption and decryption techniques are used to preserve these intermediate data sets in the cloud. An upper bound constraint-based approach is used to identify sensitive intermediate data sets, and a suppression technique is applied to the sensitive data sets in order to reduce time and cost. The Value Generalization Hierarchy protocol is used to achieve stronger security, so that many users can access the data with privacy. Along with that, Optimized Balanced Scheduling is used to find the best mapping solution that meets the system load balance to the greatest extent, or reduces the load-balancing cost. Privacy preservation is also ensured under dynamic data size and access frequency values. Storage space and computational requirements are optimally utilized in the privacy preservation process, and data distribution complexity is handled in the scheduling process.

Keywords: cloud computing, privacy upper bound, intermediate data sets, optimized balanced scheduling, value generalization hierarchy protocol.

1. INTRODUCTION

Cloud computing mainly relies on sharing of resources to achieve coherence and economies of scale, similar to a utility over a network. The foundation of cloud computing is the broader concept of converged infrastructure and shared services. The cloud mainly focuses on maximizing the effectiveness of shared resources. Cloud resources are not only shared by multiple users but also dynamically reallocated on demand. The privacy issues [12] caused by retaining intermediate data sets in the cloud are important, but they have been paid little attention. For preserving the privacy [9] of multiple data sets, we should anonymize all data sets first and then encrypt them before storing or sharing them in the cloud. Usually, the volume of intermediate data sets [11] is huge. Users will store only

INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT

347

www.iaetsd.in



important data sets in the cloud when processing original data sets in data-intensive applications such as medical diagnosis [16], in order to reduce the overall expenses by avoiding frequent recomputation of these data sets. Such practices are quite common because data users often re-analyse results, conduct new analyses on intermediate data sets, and share some intermediate results with others for collaboration. Data provenance is employed to manage the intermediate data sets: a number of tools for capturing provenance have been developed in workflow systems, and a standard for provenance representation, the Open Provenance Model (OPM), has been designed.

2. RELATED WORK

Encryption is usually integrated with other methods to achieve cost reduction, high data usability, and privacy protection. Roy et al. [9] investigated the data privacy problem caused by MapReduce and presented a system named Airavat, which incorporates mandatory access control with differential privacy. Puttaswamy et al. [8] described a set of tools called Silverline which identifies all encryptable data and then encrypts it to protect privacy. Encrypted data on the cloud prevent privacy leakage to compromised or malicious clouds, while users can easily access the data by decrypting it locally with keys from a trusted organization. Using dynamic program analysis techniques, Silverline automatically identifies the encryptable application data that can be safely encrypted without negatively affecting the application's functionality. By modifying the application runtime, e.g., the PHP interpreter, Silverline can determine an assignment of encryption keys that minimizes key management overhead and the impact of key compromise, so that applications running on the cloud can protect their data from security breaches or compromises in the cloud. Zhang et al. [5] proposed a system named Sedic, which partitions MapReduce computing jobs in terms of the security labels of the data they work on and then assigns the computation without sensitive data to a public cloud. The sensitivity of data must be labelled in advance to make these approaches usable. Ciriani et al. proposed an approach that combines encryption and data fragmentation to achieve privacy protection for distributed data storage, encrypting only part of the data sets.

3. SYSTEM ARCHITECTURE

Fig. 1: System Architecture for Secure Transaction Using the Cloud

Our approach works by automatically identifying the subsets of an application's data that are not directly used in computation, and exposing them to the cloud only in encrypted form.
• We present a technique to partition encrypted data into parts that are accessed by different sets of users (groups). Intelligent key assignment limits the damage possible from a given key compromise, and strikes a good trade-off between robustness and key management complexity.
• We present a technique that enables clients to store and use their keys safely while preventing a cloud-based service from stealing the keys. Our solution works

INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT

348

www.iaetsd.in



today on unmodified web browsers. There are many privacy threats caused by the intermediate data sets, so we need to encrypt these data sets to provide privacy and make them secure.
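As an illustration of keeping intermediate data sets in the cloud only in encrypted form, here is a minimal sketch. The XOR keystream derived from SHA-256 is just a stand-in for a real cipher such as AES, and the key source is hypothetical; this only demonstrates the store-encrypted, decrypt-locally idea, not a production scheme.

```python
# Sketch: store an intermediate data set encrypted; the user decrypts
# locally with a key obtained from a trusted organization.
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from `key` (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    # XOR with the keystream; a stand-in for a real cipher such as AES
    return bytes(p ^ k for p, k in zip(plaintext, keystream(key, len(plaintext))))

decrypt = encrypt  # XOR is its own inverse

key = b"key-from-trusted-organization"   # hypothetical key source
record = b"patient-id:1234,diagnosis:..."
stored = encrypt(key, record)            # what the cloud sees
assert decrypt(key, stored) == record    # the user recovers it locally
```

In this arrangement the cloud never holds the key, so a compromised cloud only exposes ciphertext.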


Fig. 2: A Scenario Showing Privacy Threats Due to Intermediate Data Sets

4. IMPLEMENTATION

4.1 Requirements

The problem of managing the intermediate data generated during dataflow computations deserves deeper study as a first-class problem. There are two major requirements that any effective intermediate storage system needs to satisfy: availability of intermediate data, and minimal interference with the foreground network traffic generated by the dataflow computation.

Data Availability: A task in a dataflow stage cannot be executed if its intermediate input data is unavailable. A system that provides higher availability for intermediate data suffers fewer delays from re-executing tasks in case of failure. In multi-stage computations, high availability is critical because it minimizes the effect of cascaded re-execution.

Minimal Interference: At the same time, data availability cannot be pursued over-aggressively. In particular, since intermediate data is used immediately, there is high network contention from the foreground traffic of intermediate data being transferred to the next stage, so an intermediate data management system needs to minimize interference with it.

4.2 Privacy Preserved Data Scheduling Scheme

Here the multiple intermediate data set privacy model is combined with the data scheduling mechanism. Privacy preservation is ensured under dynamic data size and access frequency values; storage space and computational requirements are optimally utilized in the privacy preservation process; and data distribution complexity is handled in the scheduling process. Data sensitivity is considered in the intermediate data security process, and resource requirement levels are monitored and controlled by the security operations. The system is divided into five major modules: data center, data provider, intermediate data privacy, security analysis, and data scheduling. The data center maintains the encrypted data values for the providers. Shared data uploading is managed by the data provider module. The intermediate data privacy module protects intermediate results. The security analysis module estimates resource and access levels. Original data and intermediate data distribution is planned by the data scheduling module. Dynamic privacy management and the scheduling mechanism are integrated to improve data sharing with security. Privacy-preserving cost is reduced by the joint verification mechanism.
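The upper bound constraint-based selection of sensitive intermediate data sets, mentioned in the abstract and in the proposed framework, can be sketched as follows. The leakage and cost numbers and the brute-force search are purely illustrative; the cited upper-bound work [10] develops a practical selection method rather than exhaustive enumeration.

```python
# Sketch: encrypt the cheapest subset of intermediate data sets such that
# the total privacy leakage of the remaining plaintext data sets stays
# below a given upper bound. Brute force for illustration only.
from itertools import combinations

def select_to_encrypt(datasets, leakage_bound):
    """datasets: list of (name, leakage, cost) tuples. Returns the
    cheapest set of names to encrypt so that the total leakage of the
    unencrypted rest is <= leakage_bound, together with its cost."""
    best = None
    for r in range(len(datasets) + 1):
        for subset in combinations(datasets, r):
            chosen = {name for name, _, _ in subset}
            rest_leak = sum(l for n, l, _ in datasets if n not in chosen)
            if rest_leak <= leakage_bound:
                cost = sum(c for _, _, c in subset)
                if best is None or cost < best[1]:
                    best = (chosen, cost)
    return best

# Illustrative values: (name, leakage if left unencrypted, encryption cost)
datasets = [("d1", 5, 10), ("d2", 3, 4), ("d3", 1, 6)]
print(select_to_encrypt(datasets, 4))  # -> ({'d1'}, 10)
```

Encrypting only d1 leaves a residual leakage of 4, which meets the bound at the lowest encryption cost; encrypting everything would also satisfy the bound but at twice the cost.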

4.3 Analysis of the Cost Problem

A cloud service provides various pricing models to support the pay-as-you-go model, e.g., the Amazon Web Services pricing model [4]. The privacy-preserving cost of intermediate data sets arises from frequent encryption and decryption with charged cloud services, which need more computation power, data storage, and other cloud services. To avoid the pricing




details and to keep the focus on the cost problem itself, we combine the prices of the various services required by encryption or decryption into one.

4.4 Proposed Framework

The protocol we use for privacy protection here is the Value Generalization Hierarchy (VGH) protocol, which assigns common values to unknown and original data values for generalized identification. On top of this we add full suppression for the most important data sets, which completely encodes the entire data set. We investigate privacy-aware and efficient scheduling of intermediate data sets for minimum cost and fast computation. Suppression of data is done to reduce the overall computation time and cost, and the VGH protocol is also proposed to achieve this. Less critical sensitive data sets are secured through semi suppression only, while full suppression achieves high privacy for the original data sets, which can be viewed only by the owner. In this way many users can access the data securely while privacy leakage is avoided. The privacy protection cost covers the intermediate data sets that need to be encoded, and an upper bound constraint-based approach is used to select the necessary subset of intermediate data sets. The privacy concerns caused by retaining intermediate data sets in the cloud are important. Storage and computation services in the cloud are equivalent from an economic perspective because they are charged in proportion to their usage. Existing technical approaches for preserving the privacy of data sets stored in the cloud mainly include encryption and anonymization. On one hand, encrypting all data sets, a straightforward and effective approach, is widely adopted in current research. However, processing encrypted data sets efficiently is quite challenging, because most existing applications run only on unencrypted data sets. Thus, for preserving the privacy of multiple data sets, it is promising to anonymize all data sets first and then encrypt them before storing or sharing them in the cloud. Usually, the volume of intermediate data sets is huge. Data sets are divided into two sets: the sensitive intermediate data set, denoted SD, and the non-sensitive intermediate data set, denoted NSD. The equations SD ∪ NSD = D and SD ∩ NSD = Ф hold, and the pair (SD, NSD) acts as a global privacy-preserving view of the cloud data. The suppression technique is applied only to sensitive data sets, in two ways: semi suppression and full suppression. Full suppression is applied to the most important sensitive intermediate data sets, whose individual values are fully encoded, while semi suppression is applied to selectively sensitive data sets, where half of the data set value is encoded. We also propose the Value Generalization Hierarchy (VGH) protocol to reduce the cost of data processing.
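The full and semi suppression rules described above can be sketched as follows. The masking conventions (asterisks, masking the trailing half) and the sensitivity labels are illustrative placeholders, not part of any stated specification.

```python
# Sketch: suppress sensitive intermediate data sets (SD) before storage,
# leaving non-sensitive ones (NSD) intact.

def full_suppression(value: str) -> str:
    """Fully encode the value: every character is masked."""
    return "*" * len(value)

def semi_suppression(value: str) -> str:
    """Encode roughly half of the value, keeping the rest readable."""
    half = len(value) // 2
    return value[:half] + "*" * (len(value) - half)

def suppress(datasets):
    """datasets: list of (value, sensitivity), sensitivity in
    {'high', 'low', 'none'}. Returns the stored representations."""
    out = []
    for value, sensitivity in datasets:
        if sensitivity == "high":    # most important sensitive data (SD)
            out.append(full_suppression(value))
        elif sensitivity == "low":   # selectively sensitive data (SD)
            out.append(semi_suppression(value))
        else:                        # non-sensitive data (NSD)
            out.append(value)
    return out

print(suppress([("9876543210", "high"), ("diabetes", "low"), ("2014", "none")]))
# -> ['**********', 'diab****', '2014']
```

Only the owner, who knows the original values, can recover a fully suppressed data set; semi-suppressed values remain partially usable for analysis.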

4.5 Optimized Balanced Scheduling

Optimized balanced scheduling finds the best mapping solution that meets the system load balance to the greatest extent, or lowers the load-balancing cost. The best scheduling solution for the current scheduling process can be found with a genetic algorithm. First we compute the cost as the ratio of the current scheduling solution to the best scheduling solution, and then we form the scheduling strategy according to this cost, so that it has the least influence on the load of the system after scheduling and the lowest cost to reach load balance. In this way we form the best scheduling strategy.
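As an illustration of scoring scheduling solutions by load balance, the following sketch enumerates candidate mappings of data sets to nodes by brute force, scores each by load imbalance, and computes the cost ratio of a current solution to the best one. The imbalance measure and the example sizes are assumptions for illustration; a genetic algorithm, as described above, would search this space instead of enumerating it.

```python
# Sketch: evaluate scheduling solutions by their load-balancing cost.
from itertools import product

def load_cost(mapping, sizes, n_nodes):
    """Imbalance cost: spread between the heaviest and lightest node."""
    loads = [0] * n_nodes
    for dataset, node in enumerate(mapping):
        loads[node] += sizes[dataset]
    return max(loads) - min(loads)

def best_mapping(sizes, n_nodes):
    """Exhaustively find the mapping with the lowest imbalance cost."""
    candidates = product(range(n_nodes), repeat=len(sizes))
    return min(candidates, key=lambda m: load_cost(m, sizes, n_nodes))

sizes = [5, 3, 2, 4]          # illustrative intermediate data set sizes
best = best_mapping(sizes, 2)
best_cost = load_cost(best, sizes, 2)
current = (0, 0, 1, 1)        # some current scheduling solution
# cost ratio of the current solution to the best one (guard against 0)
ratio = load_cost(current, sizes, 2) / max(best_cost, 1)
print(best_cost, ratio)       # -> 0 2.0
```

Here the best mapping splits the loads 7/7 (cost 0), while the current solution splits them 8/6 (cost 2), so the ratio signals how much rescheduling could improve balance.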

5. CONCLUSION

In this paper, the focus is mainly on identifying where the most sensitive intermediate data sets are present in the cloud. An upper bound constraint-based approach is used to decide which data sets need to be


encoded, in order to reduce the privacy-preserving cost. We investigate privacy-aware, efficient scheduling of intermediate data sets in the cloud by taking privacy preservation as a metric together with other metrics such as storage and computation. Optimized balanced scheduling strategies are expected to be developed toward overall highly efficient privacy-aware data set scheduling, mainly for overall time reduction, and data delivery overhead is reduced by the load-balancing-based scheduling mechanism. A dynamic privacy preservation model is supported by the system, and high security provisioning is achieved with the help of full suppression, semi suppression, and the Value Generalization Hierarchy protocol. This protocol is used to assign a common attribute for different attributes, and resource consumption is also controlled with the support of the sensitive data information graph.

6. REFERENCES

[1] L. Wang, J. Zhan, W. Shi, and Y. Liang, "In Cloud, Can Scientific Communities Benefit from the Economies of Scale?," IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 2, pp. 296-303, Feb. 2012.
[2] X. Zhang, C. Liu, S. Nepal, S. Pandey, and J. Chen, "A Privacy Leakage Upper Bound Constraint-Based Approach for Cost-Effective Privacy Preserving of Intermediate Data Sets in Cloud," IEEE Trans. Parallel and Distributed Systems, vol. 24, no. 6, June 2013.
[3] D. Zissis and D. Lekkas, "Addressing Cloud Computing Security Issues," Future Generation Computer Systems, vol. 28, no. 3, pp. 583-592, 2012.
[4] D. Yuan, Y. Yang, X. Liu, and J. Chen, "On-Demand Minimum Cost Benchmarking for Intermediate Data Set Storage in Scientific Cloud Workflow Systems," J. Parallel and Distributed Computing, vol. 71, no. 2, pp. 316-332, 2011.
[5] K. Zhang, X. Zhou, Y. Chen, X. Wang, and Y. Ruan, "Sedic: Privacy-Aware Data Intensive Computing on Hybrid Clouds," Proc. 18th ACM Conf. Computer and Comm. Security (CCS '11), pp. 515-526, 2011.
[6] H. Lin and W. Tzeng, "A Secure Erasure Code-Based Cloud Storage System with Secure Data Forwarding," IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 6, pp. 995-1003, June 2012.
[7] G. Wang, Z. Zutao, D. Wenliang, and T. Zhouxuan, "Inference Analysis in Privacy-Preserving Data Re-Publishing," Proc. Eighth IEEE Int'l Conf. Data Mining (ICDM '08), pp. 1079-1084, 2008.
[8] K.P.N. Puttaswamy, C. Kruegel, and B.Y. Zhao, "Silverline: Toward Data Confidentiality in Storage-Intensive Cloud Applications," Proc. Second ACM Symp. Cloud Computing (SoCC '11), 2011.
[9] I. Roy, S.T.V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel, "Airavat: Security and Privacy for MapReduce," Proc. Seventh USENIX Conf. Networked Systems Design and Implementation (NSDI '10), p. 20, 2010.
[10] X. Zhang, C. Liu, J. Chen, and W. Dou, "An Upper-Bound Control Approach for Cost-Effective Privacy Protection of Intermediate Data Set Storage in Cloud," Proc. Ninth IEEE Int'l Conf. Dependable, Autonomic and Secure Computing (DASC '11), pp. 518-525, 2011.
[11] B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys, vol. 42, no. 4, pp. 1-53, 2010.
[12] H. Lin and W. Tzeng, "A Secure Erasure Code-Based Cloud Storage System with Secure Data Forwarding," IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 6, pp. 995-1003, June 2012.


