IDL - International Digital Library Of Technology & Research Volume 1, Issue 6, June 2017

Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

Two-Phase TDS Approach for Data Anonymization to Preserve Big Data Privacy

1. Ambika M Patil, M.Tech Computer Science Engineering, Center for P G Studies, Jnana Sangama, VTU Belagavi, Belagavi, INDIA, Ambika702@gmail.com
2. Assistant Prof. Ranjana B Nadagoudar, Computer Science Engineering Department, Center for P G Studies, Jnana Sangama, VTU Belagavi, Belagavi, INDIA
3. Dhananjay A Potdar, Dhananjay.potdar@gmail.com

ABSTRACT - While Big Data has gradually become a hot topic of research and business and is now used widely across many industries, Big Data security and privacy have drawn increasing concern. There is, however, an obvious tension between Big Data security and privacy and the widespread use of Big Data. A variety of privacy-preserving mechanisms have been developed to protect privacy at different stages of the big data life cycle (e.g., data generation, data storage, data processing). The goal of this paper is to provide a complete overview of privacy-preservation mechanisms in big data, to present the challenges facing existing mechanisms, and to illustrate the infrastructure of big data together with state-of-the-art privacy-preserving mechanisms at each stage of the big data life cycle. The paper focuses on the anonymization process, which significantly improves the scalability and efficiency of top-down specialization (TDS) for data anonymization over existing approaches. We also discuss the challenges and future research directions related to preserving privacy in big data.

KEYWORDS - Big data, privacy, big data storage, big data processing, data anonymization, top-down specialization, MapReduce, cloud, privacy preservation.

I. INTRODUCTION

As a result of recent technological developments, the amount of data generated by social networking sites, sensor networks, the Internet, healthcare applications, and many other sources is increasing significantly day by day. The term "Big Data" reflects the trend and the salient features of the data being produced from these sources. Basically, Big Data can be described by the "3Vs": Volume, Velocity, and Variety. Volume denotes the huge amount of data being produced from multiple sources. Velocity concerns both how fast data are produced and collected and how fast some of the collected data change. Variety denotes their highly distributed and varied nature. The data generation rate is growing so rapidly that it is becoming very difficult to handle the data using traditional methods or systems [1]. In the "3Vs" model, Variety indicates the various types of data, which include structured, semi-structured, and unstructured data; Volume means the data scale is large; and Velocity indicates that all processing of Big Data must be quick and timely in order to maximize its value, as shown in Fig. 1. These features, namely handling huge amounts of data and using various types of data (including unstructured data and attributes that were never used in the past), distinguish Big Data from traditional data mining. In 2011, IDC defined big data as follows: "big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high-velocity capture, discovery, and/or analysis" [2]. In this definition, the features of big data may be abridged as the 4Vs, i.e., Variety, Velocity, Volume, and Value, where the implications of Variety, Velocity, and Volume are the same as in the 3Vs model, and Value means that big data have great social value. The 4Vs model was widely recognized because it points to the most critical problem, which is how to discover value from enormous, varied, and rapidly generated datasets in big data.

FIGURE 1. Illustration of the 3 V's of big data.

Although big data can be used effectively to better understand the world and to innovate in various aspects of human activity, the exploding amount of data has increased the potential for privacy breaches of individuals. For example, Amazon and Google can learn our shopping preferences and browsing habits. Social networking sites such as Facebook store all the information about our personal lives and social relationships. Popular video sharing websites such as YouTube recommend videos to us based on our search history. With all the power conferred by big data, the gathering, storing, and reusing of our personal information for commercial profit has become a threat to our privacy and security. In 2006, AOL released 20 million search queries of about 650,000 users, with AOL IDs and IP addresses removed, for research purposes. However, it took researchers only a couple of days to re-identify users. Users' privacy may be breached under the following circumstances [3]:

I. Personal information, when combined with external datasets, may lead to the inference of new facts about a user. Those details may be private and not supposed to be exposed to others.
II. Personal data are sometimes collected and used to add value to a business. For example, an individual's shopping habits may disclose a lot of personal information.
III. Sensitive data are stored and processed in a location that is not secured properly, and data leakage may occur during the storage and processing phases.

In order to safeguard big data privacy, numerous mechanisms have been developed in recent years. These mechanisms can be grouped by the stages of the big data life cycle: data generation, data storage, and data processing. In the data generation phase, access restriction and data falsification techniques are used to protect privacy. While access restriction techniques try to limit access to individuals' private data, data falsification techniques alter the original data before they are released to a non-trusted party. The approaches to privacy protection in the data storage phase are mainly based on encryption techniques, which can be further divided into attribute-based encryption (ABE), identity-based encryption (IBE), and storage path encryption. In addition, to protect sensitive information, hybrid clouds are used, where sensitive data are stored in a private cloud. The data processing phase includes privacy-preserving data publishing (PPDP) and knowledge extraction from the data. In PPDP, anonymization techniques such as generalization and suppression are used to protect the privacy of the data; ensuring the utility of the data while preserving privacy is a great challenge in PPDP. In the knowledge extraction process, there exist several mechanisms to extract useful information from large-scale and complex data. These mechanisms can be further divided into clustering, classification, and association rule mining based techniques. While clustering and classification split the input data into different groups, association rule mining based techniques find useful relationships and trends in the input data.
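To make the two anonymization primitives just mentioned concrete, the following minimal Java sketch (not from the paper; the record layout and the coarsening rule are assumptions) shows generalization coarsening an exact age into an interval and suppression masking the trailing digits of a ZIP code:

public class AnonymizationPrimitives {

    // Generalization: map an exact age to a coarser 10-year interval.
    static String generalizeAge(int age) {
        int lower = (age / 10) * 10;
        return "[" + lower + "-" + (lower + 10) + ")";
    }

    // Suppression: keep only the first three digits of a 5-digit ZIP code.
    static String suppressZip(String zip) {
        return zip.substring(0, 3) + "**";
    }

    public static void main(String[] args) {
        // A quasi-identifier pair (age, ZIP) after anonymization.
        System.out.println(generalizeAge(34) + " " + suppressZip("57013"));
        // Prints: [30-40) 570**
    }
}

Both operations trade precision for privacy: the anonymized record matches more individuals, which is exactly what k-anonymity-style guarantees require.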

FIGURE 2. Illustration of big data life cycle.

Protecting privacy in big data is a fast-growing research area. Although a number of related papers have been published, only a few of them are survey/review papers [4], [5]. Moreover, while these papers introduce the basic concepts of privacy protection in big data, they fail to cover several important aspects of the area. For example, neither [4] nor [5] provides a detailed discussion of big data privacy with respect to cloud computing, and neither paper discusses future challenges in detail. In this paper, we give a comprehensive overview of state-of-the-art technologies for preserving the privacy of big data at each stage of the big data life cycle. The paper focuses on the anonymization process, which significantly improves the scalability and efficiency of top-down specialization (TDS) for data anonymization over existing approaches. The major contributions of our research are threefold. First, we creatively apply MapReduce on cloud to TDS for data anonymization and deliberately design a group of innovative MapReduce jobs to concretely accomplish the specializations in a highly scalable fashion. Second, we propose a two-phase TDS approach that gains high scalability by allowing specializations to be conducted on multiple data partitions in parallel during the first phase.

Third, implementation results show that our approach can significantly improve the scalability and efficiency of TDS for data anonymization over existing approaches. The remainder of this paper is organized as follows. Section II describes the infrastructure of big data and the privacy issues and challenges that arise from the underlying cloud computing structure. Section III reviews traditional data privacy preservation methods. Section IV addresses privacy preservation for big data: it formulates the two-phase TDS approach and elaborates the algorithmic details of the MapReduce jobs. Section V presents the implementation of our approach. Finally, we conclude the paper and discuss future work in Section VI.

II. INFRASTRUCTURE OF BIG DATA

To handle the different dimensions of big data in terms of volume, velocity, and variety, we need to design efficient and effective systems that process large amounts of data arriving at very high speed from different sources. Big data has to go through multiple phases during its life cycle, as shown in Fig. 2. Data are distributed nowadays, and new technologies are being developed to store and process large repositories of data. For example, cloud computing technologies such as Hadoop MapReduce are explored for big data storage and processing. In this section, we explain the life cycle of big data. In addition, we discuss how big data leverages cloud computing technologies and the challenges that arise when cloud computing is used for the storage and processing of big data.

A. LIFE CYCLE OF BIG DATA

Data generation: Data can be generated from many distributed sources. The amount of data generated by humans and machines has blown up in the past few years. For example, 2.5 quintillion bytes of data are generated on the web every day, and 90 percent of the data in the world was generated in the past few years. Facebook, a single social networking site, generates 25 TB of new data every day. The data generated are usually large, diverse, and complex, so it is hard for traditional systems to handle them. The data generated are normally associated with a specific domain such as business, the Internet, or research.

Data storage: This phase refers to storing and managing large-scale data sets. A data storage system consists of two parts, i.e., hardware infrastructure and data management [6]. Hardware infrastructure refers to the information and communications technology (ICT) resources used for several tasks (such as distributed storage). Data management refers to the set of software deployed on top of the hardware infrastructure to manage and query large-scale data sets. It should also provide several interfaces to interact with and analyze the stored data.

Data processing: The data processing phase basically covers data collection, data transmission, pre-processing, and the extraction of useful information. Data collection is needed because data may come from various sources, i.e., sites that contain text, images, and videos. In the data collection phase, data are acquired from a specific data production environment using dedicated data collection technology. In the data transmission phase, after collecting raw data from a specific data production environment, we need a high-speed transmission mechanism to transmit the data into proper storage for various analytic applications. Finally, the pre-processing phase aims at removing meaningless and redundant parts of the data so that storage space can be saved. Various applications then apply domain-specific analytical methods to derive significant information from the data. Although different fields of data analytics require different data characteristics, some of these fields may leverage similar underlying technology to inspect, transform, and model data to extract value from it. Emerging data analytics research can be categorized into the following six technical areas: structured data analytics, text analytics, multimedia analytics, web analytics, network analytics, and mobile analytics [6].

B. CHALLENGES OF BIG DATA

The application of Big Data leads to a set of new challenges, since Big Data sets are so large and complex that acquisition, storage, management, and analysis become difficult. The main challenges are the following [7], [8]:

1. Data preparation. An important basis of big data analysis and management is the availability of high-quality, precise, and trustworthy data. Data preparation is therefore paramount for increasing the value of big data.

2. Efficient distributed storage and search. Timeliness of data collection is fundamental to offering fast analysis of big data. Therefore, there is an increasing need to provide efficient distributed storage with faster memories and enhanced search algorithms.

3. Effective online data analysis. Online analysis of multidimensional data has become a must and a potential source of information for decision making. This requires adapting existing OLAP approaches to big data.

4. Effective machine learning techniques for big data mining. Machine learning and data mining should be adapted to big data to unleash the full potential of the collected data.

5. Efficient handling of big data streams. Some specific scenarios (e.g., stock exchanges) require analysis of data in the form of streams. Fast and optimized solutions should be developed to make inferences on big data streams.

6. Semantic lifting techniques. The semantics of collected big data is an important aspect for the future development of big data applications. Future approaches to big data analysis should be able to cope with this semantics.

7. Programming models. Many programming models for big data infrastructures are available; examples include MapReduce and Hadoop. We should consider different approaches for storing and managing data.

8. Social analytics. The ability to distinguish data that can be trusted and that comply with users' needs and preferences is as important as it is difficult to achieve. Social analytics should address this problem by providing correct and sound approaches to social data analysis.

9. Security and privacy. Big data are a priceless source of information. However, they often contain sensitive information that needs to be protected from unauthorized access and release.

III. TRADITIONAL DATA PRIVACY PRESERVATION METHODS

Cryptography refers to a set of techniques and algorithms for protecting data. In cryptography, plaintext is transformed into ciphertext using various encryption schemes; numerous methods, such as public key cryptography and digital signatures, are based on this idea. Cryptography alone cannot enforce the privacy demanded by common cloud computing and big data services [9]. This is because big data differs from traditional large data sets on the basis of the three V's (velocity, variety, volume) [10], [11]. It is these features that make big data architecture different from traditional information architectures, and these architectural changes, together with the complex nature of big data, make cryptography and traditional encryption schemes unable to scale to the privacy needs of big data. A further challenge with cryptography is the all-or-nothing retrieval policy for encrypted data [12]: less sensitive data that could be useful in big data analytics are also encrypted, and the data become unreachable to anyone who does not hold the decryption key. Privacy may also be breached if data are stolen before encryption or if cryptographic keys are misused. Attribute-based encryption can also be used for big data privacy [13], [14]. This method of securing big data is based on the relationships among the attributes present in the big data; the attributes that need to be protected are identified based on the type of big data and company policies. In a nutshell, encryption or cryptography alone cannot stand as a big data privacy preservation method. It can help with data anonymization but cannot be used directly for big data privacy.

IV. PRIVACY PRESERVATION FOR BIG DATA: TWO-PHASE TOP-DOWN SPECIALIZATION (TPTDS)

A sketch of the TPTDS approach is shown in Fig. 3. The TPTDS approach has three components, namely, data partition, anonymization level (AL) merging, and data specialization.

FIGURE 3. Execution framework overview of MRTDS.

The TPTDS approach conducts the computation required in TDS in a highly scalable and efficient fashion. The two phases of our approach are based on the two levels of parallelization provided by MapReduce on cloud. Essentially, MapReduce on cloud has two levels of parallelization, i.e., the job level and the task level. Job level parallelization means that multiple MapReduce jobs can be executed simultaneously to make full use of cloud infrastructure resources. Combined with cloud, MapReduce becomes more powerful and elastic, as cloud can offer infrastructure resources on demand; the Amazon Elastic MapReduce service is one example. Task level parallelization means that multiple mapper/reducer tasks in a MapReduce job are executed simultaneously over data splits. To achieve high scalability, we parallelize multiple jobs on data partitions in the first phase, but the resultant intermediate anonymization levels are not identical.
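As a hedged illustration of job level parallelization, the following Java fragment shows one way to submit several Hadoop MapReduce jobs concurrently and then wait for them all. The job configuration (input paths, mapper and reducer classes) is assumed to be done elsewhere, and the polling loop is a simplification, not the paper's actual driver:

import java.util.List;
import org.apache.hadoop.mapreduce.Job;

public class JobLevelParallelism {
    // Submit all jobs first (non-blocking), then wait for every one to finish.
    public static void runConcurrently(List<Job> jobs) throws Exception {
        for (Job job : jobs) {
            job.submit(); // returns immediately; the cluster runs the jobs side by side
        }
        for (Job job : jobs) {
            while (!job.isComplete()) {
                Thread.sleep(2000); // poll; task level parallelism happens inside each job
            }
        }
    }
}

Using submit() instead of waitForCompletion() is what turns a serial chain of jobs into job level parallelism; task level parallelism inside each job is handled by Hadoop itself.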

To obtain a finally consistent anonymous data set, the second phase is essential to integrate the intermediate results and further anonymize the entire data set. In the first phase, we run a subroutine over each of the partitioned data sets in parallel to make full use of the job level parallelization of MapReduce. The subroutine is a MapReduce version of centralized TDS (MRTDS), which concretely conducts the computation required in TPTDS. MRTDS anonymizes data partitions to generate intermediate anonymization levels. An intermediate anonymization level means that further specialization can still be performed without violating k-anonymity. MRTDS only leverages the task level parallelization of MapReduce.

ALGORITHM 1. SKETCH OF TWO-PHASE TDS (TPTDS).
Input: data set D, anonymity parameters k and kI, and the number of partitions p.
Output: anonymized data set D*.
1: Partition D into Di, 1 ≤ i ≤ p.
2: Execute MRTDS(Di, kI, AL0) → AL0i, 1 ≤ i ≤ p, in parallel as multiple MapReduce jobs.
3: Merge all intermediate anonymization levels into one: Merge(AL01, AL02, ..., AL0p) → ALI.
4: Execute MRTDS(D, k, ALI) → AL* to achieve k-anonymity.
5: Specialize D according to AL*, and output D*.
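The control flow of Algorithm 1 can be sketched as a plain Java driver. This is an illustrative skeleton only: MRTDS is stubbed as a local method, whereas in the actual approach each call is a chain of IGPL MapReduce jobs, and the anonymization level is modeled here as an opaque string; all names are hypothetical:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TpTdsDriver {

    // Stand-in for the MRTDS MapReduce subroutine of the paper.
    static String mrtds(List<String> dataSet, int k, String al) {
        return al + "->specialized(k=" + k + ")";
    }

    // Pick the more general of the intermediate levels so the merged
    // level ALI never violates the privacy requirement (simplified).
    static String merge(List<String> intermediateLevels) {
        return intermediateLevels.get(0);
    }

    public static String anonymize(List<List<String>> partitions, int k, int kI)
            throws Exception {
        final String al0 = "AL0"; // initial (top-most) anonymization level
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());

        // Phase 1: run MRTDS(Di, kI, AL0) on each partition in parallel,
        // mirroring the job level parallelization of step 2.
        List<Future<String>> phase1 = new ArrayList<>();
        for (List<String> di : partitions) {
            Callable<String> task = () -> mrtds(di, kI, al0);
            phase1.add(pool.submit(task));
        }
        List<String> intermediate = new ArrayList<>();
        for (Future<String> f : phase1) intermediate.add(f.get());
        pool.shutdown();

        String ali = merge(intermediate); // step 3: merged level ALI

        // Phase 2 (step 4): anonymize the whole data set starting from ALI.
        List<String> wholeDataSet = new ArrayList<>();
        for (List<String> di : partitions) wholeDataSet.addAll(di);
        return mrtds(wholeDataSet, k, ali); // final level AL*, used to output D*
    }
}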

Modules Description:

Data Partition:
o In this module, the data partition is performed on the cloud.
o Here we collect a large number of data sets.
o We split the large data set into small data sets.
o Then we assign a random number to each data set (see the sketch after this module description).

Anonymization:
o After obtaining the individual data sets, we apply anonymization. Anonymization means hiding or removing the sensitive fields in the data sets.
o We then get intermediate results for the small data sets; these intermediate results are used for the specialization process.
o All intermediate anonymization levels are merged into one in the second phase. The merging of anonymization levels is completed by merging cuts. To ensure that the merged intermediate anonymization level ALI never violates the privacy requirements, the more general cut is selected as the merged one.
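As a hedged illustration of the Data Partition module above, the following Java sketch assigns each record a random partition number so that the p partitions end up statistically similar; the method and type names are hypothetical, not from the paper:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomDataPartition {
    // Split the input records into p partitions by drawing a random
    // partition number for each record.
    public static List<List<String>> partition(List<String> records, int p) {
        Random rnd = new Random();
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < p; i++) parts.add(new ArrayList<>());
        for (String record : records) {
            parts.get(rnd.nextInt(p)).add(record); // the random number picks the partition
        }
        return parts;
    }
}

Random assignment matters here: if partitions had skewed distributions, the intermediate anonymization levels produced in phase one would diverge more and the merge step would lose more utility.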

Merging:
o The intermediate results of the small data sets are merged here.
o The MRTDS driver is used to organize the small intermediate results for merging; the merged data sets are collected on the cloud.
o The merged result goes through anonymization again, which is called specialization.

Specialization:
o After the intermediate results are obtained, they are merged into one.
o Then we apply anonymization again on the merged data; this is called specialization.
o Here we use two kinds of jobs, namely IGPL Update and IGPL Initialization.
o The jobs are coordinated by the driver.

OBS:
o OBS stands for optimized balancing scheduling.
o Here we focus on two scheduling criteria, namely time and size.
o The data sets are split into the specified size, and anonymization is applied within the specified time.
o The OBS approach delivers high capability in handling large data sets.

V. IMPLEMENTATION AND IMPROVEMENT

To elaborate how data sets are processed in MRTDS, the execution framework based on standard MapReduce is depicted in Fig. 3. The solid arrow lines represent the data flows in the canonical MapReduce framework. From Fig. 3, we can see that the iteration of MapReduce jobs is controlled by the anonymization level AL in the Driver. The data flows for handling iterations are represented by dotted arrow lines. AL is sent from the Driver to all workers, including Mappers and Reducers, via the distributed cache mechanism. The value of AL is updated in the Driver according to the output of the IGPL Initialization or IGPL Update jobs. Because the amount of such data is extremely small compared with the data sets to be anonymized, it can be efficiently transmitted between the Driver and the workers. We adopt Hadoop, an open-source implementation of MapReduce, to implement MRTDS. Since most Map and Reduce functions need to access the current anonymization level AL, we use the distributed cache mechanism to pass the content of AL to each Mapper or Reducer node, as shown in Fig. 3. Also, Hadoop provides a mechanism to set simple global variables for Mappers and Reducers; the best specialization is passed into the Map function of the IGPL Update job in this way. The partition hash function in the shuffle phase is modified, because the two jobs require that key-value pairs with the same key.p field, rather than the same entire key, go to the same Reducer.
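The distributed cache usage described above can be sketched against the standard Hadoop 2 MapReduce API. The file name, the AL serialization, and the emitted key format below are assumptions rather than the paper's actual implementation; the sketch only shows how each Mapper reloads AL in setup():

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IgplUpdateMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String currentAL; // current anonymization level, reloaded per task

    @Override
    protected void setup(Context context) throws IOException {
        // The Driver registered the small AL file with job.addCacheFile(...);
        // Hadoop localizes it next to the task, so it can be read by name.
        URI[] cacheFiles = context.getCacheFiles();
        Path alFile = new Path(cacheFiles[0].getPath());
        StringBuilder al = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new FileReader(alFile.getName()))) {
            for (String line; (line = in.readLine()) != null; ) {
                al.append(line).append('\n');
            }
        }
        currentAL = al.toString();
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Emit statistics keyed by candidate specialization (details omitted;
        // the real IGPL jobs compute information gain and privacy loss here).
        context.write(new Text("candidate#" + currentAL.hashCode()), record);
    }
}

On the Driver side, a single call such as job.addCacheFile(new URI("hdfs:///tmp/al-current.txt")) (path illustrative) is enough; this works because AL is tiny compared with the data being anonymized.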

To reduce communication traffic, MRTDS exploits the combiner mechanism, which aggregates key-value pairs with the same key into one pair on the nodes running the Map functions. The following is a snapshot of the implementation of the two-phase TDS approach for data anonymization for preserving big data privacy.
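A minimal sketch of the modified partition hash function, assuming a composite key serialized as "p#rest": routing on the p field alone guarantees that all pairs sharing key.p reach the same Reducer. The key layout and the combiner wiring in the trailing comments are illustrative assumptions, not the paper's code:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyPFieldPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Hash only the key.p component, not the entire key.
        String pField = key.toString().split("#", 2)[0];
        return (pField.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Driver wiring (illustrative): the combiner pre-aggregates same-key pairs on
// the map side, cutting shuffle traffic exactly as the text describes.
//   job.setPartitionerClass(KeyPFieldPartitioner.class);
//   job.setCombinerClass(StatAggregatingReducer.class); // hypothetical reducer class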

VI. CONCLUSION AND FUTURE RESEARCH CHALLENGES

In this paper, we have examined the scalability problem of large-scale data anonymization by TDS and proposed a highly scalable two-phase TDS approach using MapReduce on cloud. Data sets are partitioned and anonymized in parallel in the first phase, producing intermediate results. Then, the intermediate results are merged and further anonymized to produce consistent k-anonymous data sets in the second phase. We have creatively applied MapReduce on cloud to data anonymization and deliberately designed a group of innovative MapReduce jobs to concretely achieve the specialization computation in a highly scalable way. Experimental results on real-world data sets have demonstrated that our approach significantly improves the scalability and efficiency of TDS over existing approaches. In cloud environments, privacy preservation for data analysis, sharing, and mining is a challenging research issue due to the increasingly large volumes of data sets, and it therefore requires intensive investigation. We will investigate the adoption of our approach in bottom-up generalization algorithms for data anonymization. Based on the contributions herein, we plan to further explore scalable privacy-preservation-aware analysis and scheduling on large-scale data sets. Optimized balanced scheduling strategies are expected to be developed towards overall scalable privacy-preservation-aware data set scheduling.

REFERENCES
[1] J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. Zürich, Switzerland: McKinsey Global Inst., Jun. 2011, pp. 1-137.
[2] J. Gantz and D. Reinsel, "Extracting value from chaos," IDC iView, pp. 1-12, 2011.
[3] A. Katal, M. Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and good practices," in Proc. IEEE Int. Conf. Contemp. Comput., Aug. 2013, pp. 404-409.
[4] B. Matturdi, X. Zhou, S. Li, and F. Lin, "Big data security and privacy: A review," China Commun., vol. 11, no. 14, pp. 135-145, Apr. 2014.

[5] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, "Information security in big data: Privacy and data mining," IEEE Access, vol. 2, pp. 1149-1176, Oct. 2014.
[6] H. Hu, Y. Wen, T.-S. Chua, and X. Li, "Toward scalable systems for big data analytics: A technology tutorial," IEEE Access, vol. 2, pp. 652-687, Jul. 2014.
[7] C. A. Ardagna and E. Damiani, "Business intelligence meets big data: An overview on security and privacy."
[8] A. Labrinidis and H. V. Jagadish, "Challenges and opportunities with big data," Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.
[9] M. van Dijk and A. Juels, "On the impossibility of cryptography alone for privacy-preserving cloud computing," in Proc. 5th USENIX Conf. Hot Topics in Security, Aug. 2010, pp. 1-8.
[10] S. Sagiroglu and D. Sinanc, "Big data: A review," in Proc. Int. Conf. Collaboration Technologies and Systems, 2013, pp. 42-47.
[11] Y. Demchenko, P. Grosso, C. de Laat, and P. Membrey, "Addressing big data issues in scientific data infrastructure," in Proc. Int. Conf. Collaboration Technologies and Systems, 2013, pp. 48-55.
[12] Cloud Security Alliance, "Top ten big data security and privacy challenges," Technical report, Nov. 2012.
[13] S. H. Kim, N. U. Kim, and T. M. Chung, "Attribute relationship evaluation methodology for big data security," in Proc. Int. Conf. IT Convergence and Security (ICITCS), 2013, pp. 1-4.
[14] S. H. Kim, J. H. Eom, and T. M. Chung, "Big data security hardening methodology using attributes relationship," in Proc. Int. Conf. Information Science and Applications (ICISA), 2013, pp. 1-2.
[15] H. Takabi, J. B. D. Joshi, and G. Ahn, "Security and privacy challenges in cloud computing environments," IEEE Security and Privacy, vol. 8, no. 6, pp. 24-31, Nov. 2010.
[16] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, "Incognito: Efficient full-domain k-anonymity," in Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), 2005, pp. 49-60.
[17] A. Mehmood, I. Natgunanathan, Y. Xiang, G. Hua, and S. Guo, "Protection of big data privacy," IEEE Access, 2016.
