
Poster Paper, Proc. of Int. Conf. on Advances in Computer Engineering 2011

Effective Information Preservation by Identifying Victim Items

Rahul P. Mirajkar (1), Prof. Santaji K. Shinde (2)

(1) Department of Computer Science & Technology, Shivaji University, Kolhapur, India. Email: rahulmirajkar982@gmail.com
(2) Department of Information Technology, Bharati Vidyapeeth’s College of Engineering, Kolhapur, India. Email: santaji@rediffmail.com

Abstract— Data mining mechanisms have been widely applied in businesses and manufacturing companies across many industry sectors. Sharing data, or sharing mined rules, has become a trend among business partners, and it has also increased the risk of unexpected information leaks when data is released. To conceal restrictive itemsets (patterns) contained in the source database, a sanitization process is needed to transform the source database into a released database. At the same time, concealing non-restrictive information is an unwanted side effect that the transformation should avoid. The core problem is that sensitive information, including personal details, facts, or patterns that are not supposed to be disclosed, can be inferred from non-sensitive or unclassified data. This scenario reveals the need for techniques that ensure privacy protection while still supporting accurate information and mining results.

Index Terms— data mining, knowledge hiding, privacy preservation

I. INTRODUCTION

Advances in data collection, processing, and analysis, along with privacy concerns regarding the misuse of the knowledge induced from these data, soon brought into existence the field of privacy preserving data mining. Ref. [1], [2]: Simple de-identification of the data prior to mining is insufficient to guarantee a privacy-aware outcome, since intelligent analysis of the data through inference-based attacks may reveal sensitive patterns that were unknown to the database owner before the data was mined. Ref. [3], [4], [5], [6], [7]: Since its introduction, privacy preserving data mining has received considerable attention from the data mining research community. This paper concentrates on a subfield of privacy preserving data mining known as knowledge hiding.

Consider the following scenario. Two or more companies each hold a very large dataset of records of their customers' buying activities. These companies decide to cooperatively conduct association rule mining on their datasets for their mutual benefit, since this collaboration brings them an advantage over other competitors. However, some of these companies may not want to share certain strategic patterns hidden within their own data (also called restrictive association rules) with the other parties. They would like to transform their data in such a way that these restrictive association rules cannot be discovered. Ref. [8]: Is it possible for these companies to benefit from such collaboration by sharing their data while still preserving some restrictive association rules?

In this paper, we address the problem of transforming a database into a new one that conceals some strategic patterns (restrictive association rules) while preserving the general patterns and trends of the original database. The procedure of transforming an original database into a sanitized one is called data sanitization. The sanitization process modifies the data to remove or hide a group of restrictive association rules that contain sensitive knowledge. This approach slightly modifies some data, which is perfectly acceptable in some real applications. The hiding process is guided by the need to maximize the utility of the sanitized database by introducing the fewest possible side effects.

We aim at the security issue of data mining: providing a data and knowledge protection technique that perturbs datasets in order to protect both privacy and the clustering knowledge in the datasets. Algorithms need to be devised that can effectively protect the sensitive knowledge. Using mining tools to analyze the collected data can lead to efficient strategies that improve the production process and expose unusual steps within it. It is therefore important to hide the data so that it cannot be attacked by an intruder or hacker. We also need a methodology that is capable of identifying an ideal solution whenever one exists, and of approximating the exact solution otherwise. This makes the topic important to work on.

II. RELATED WORK

Ref. [9]: The methodology lies between the fields of frequent itemset hiding and synthetic database generation (examined in the context of privacy preservation). Extending the original database to accommodate knowledge hiding can be considered a bridge between the itemset hiding and the synthetic database generation approaches. Fundamental related work in both research directions is as follows.

A. Frequent Itemset Hiding:

There has been a lot of active research in the field of frequent itemset and association rule hiding. Ref. [10]: The first work is the proposal of a greedy algorithm for selecting items to sanitize among the transactions that support the sensitive itemsets.


The greedy algorithm searches through the ancestors of the itemset, selecting at each level the parent with maximum support and setting the selected parent as the new itemset that needs to be hidden. Ref. [11]: C. M. Chiang presents a limited-side-effect approach that modifies the original database to hide sensitive rules. Ref. [12], [13]: X. Sun introduces a border-based approach (BBA) for frequent itemset hiding. The approach is greedy in nature and focuses on preserving the quality of the border constructed by the non-sensitive frequent itemsets in the itemset lattice. Ref. [14]: S. Menon presents an integer programming approach for hiding sensitive itemsets. The algorithm treats the hiding process as a constraint satisfaction problem (CSP) that identifies the minimum number of transactions to be sanitized. The authors first reduce the size of the CSP by using constraints involving only the sensitive itemsets and then solve it with integer programming. A heuristic is then applied to identify the actual transactions and sanitize them. Ref. [15]: A distortion approach that is also based on integer programming is presented in the work of Gkoulalas-Divanis and Verykios. The authors present an exact methodology that relies on the process of border revision to identify the smallest set of candidate items for sanitization. As a consequence, the provided hiding solution is guaranteed to minimally distort the original transactions.
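To make the greedy selection of Ref. [10] concrete, the following Python sketch is our illustrative reading of the lattice walk described at the start of this subsection, not the published algorithm; the helper names support and select_victim_item are our own.

from itertools import combinations

def support(db, itemset):
    # Number of transactions that contain every item of the itemset.
    s = set(itemset)
    return sum(1 for t in db if s.issubset(t))

def select_victim_item(db, sensitive_itemset):
    # Walk down the itemset lattice: at each level replace the current
    # itemset by its (k-1)-subset (the "parent" in the text above) with
    # maximum support, until a single item, the victim, remains.
    current = tuple(sensitive_itemset)
    while len(current) > 1:
        current = max(combinations(current, len(current) - 1),
                      key=lambda sub: support(db, sub))
    return current[0]

db = [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}]
print(select_victim_item(db, {1, 2, 3}))  # prints 2 for this toy database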


B. Synthetic Database Generation:

Synthetic database generation approaches aim at the construction of data sets that adhere to specific rules and exhibit desirable statistical properties. These approaches achieve privacy preservation when the constructed data sets resemble the behaviour of the original ones but avoid the disclosure of sensitive knowledge. Ref. [16]: Calders provides an approach that attaches a frequency interval to each itemset and generates a database in which all itemset counts lie within these intervals. Ref. [17]: X. Chen introduces a privacy preserving, reconstruction-based framework for data sharing. This framework operates on knowledge rather than on transaction modification and uses constraint-based inverse itemset mining to generate data sets that can be safely released for sharing. It puts the original data aside and instead sanitizes the knowledge base itself. Ref. [18]: Y. Wang applies graph-theoretical results to divide the original itemsets into components and then uses an iterative proportional fitting method on each component. Ref. [19], [20], [21]: The tasks range from privacy-aware dataset regeneration, to privacy-aware publication of data mining results, to privacy-aware database sharing. Ref. [22]: The author uses a heuristic approach to hide sensitive information.
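As a small illustration of the frequency-interval idea attributed to Ref. [16] above, the sketch below only checks whether a candidate synthetic database respects a set of per-itemset intervals; the generation step itself, which is the hard part, is not shown, and all names are our own.

def frequency(db, itemset):
    # Relative frequency of an itemset in a database of set-valued transactions.
    s = set(itemset)
    return sum(1 for t in db if s.issubset(t)) / len(db)

def satisfies_intervals(db, constraints):
    # True if every itemset count (expressed as a frequency) lies within
    # its attached [low, high] interval.
    return all(low <= frequency(db, itemset) <= high
               for itemset, (low, high) in constraints)

synthetic = [{1, 2}, {1, 2, 3}, {2, 3}, {3}]
constraints = [(frozenset({1, 2}), (0.25, 0.75)),
               (frozenset({3}), (0.50, 1.00))]
print(satisfies_intervals(synthetic, constraints))  # True for this toy example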

C. Main Issues Pertaining to the Hiding Methodology:

a) Size of the Database Extension: If an original database DO is extended by DX (the applied extension) to construct database D, an initial and very important step in the hiding process is the computation of the size of DX. A lower bound on this value can be established based on the sensitive itemset in S (the set of sensitive itemsets) that has the highest support, breaking ties arbitrarily. Here Q is the size of the applied extension, sup(Im, DO) is the number of transactions T in DO that support the itemset Im, DO is the original database, and an itemset I is called large or frequent in a database D if and only if its frequency in D is at least equal to a minimum threshold mfreq. Assuming the transactions of DX add no support to Im, the itemset Im is hidden exactly when sup(Im, DO) < mfreq (|DO| + Q), which yields the lower bound Q > sup(Im, DO) / mfreq - |DO|.
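Under the stated assumption that DX adds no support to the sensitive itemsets, this lower bound can be computed directly. The sketch below is a minimal illustration of that calculation, not code from any of the cited papers.

import math

def extension_lower_bound(db_size, sensitive_supports, mfreq):
    # Smallest Q such that every sensitive itemset becomes infrequent:
    # we need sup / (db_size + Q) < mfreq for the highest-support itemset,
    # i.e. Q > sup / mfreq - db_size.
    max_sup = max(sensitive_supports)
    return max(0, math.floor(max_sup / mfreq - db_size) + 1)

# 1000 original transactions, sensitive supports 80 and 120, mfreq = 0.1:
# 120 / 0.1 - 1000 = 200, so at least Q = 201 extension transactions are needed.
print(extension_lower_bound(1000, [80, 120], 0.1))  # prints 201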

b) Exact and Ideal Solutions: A solution to the hiding of the sensitive knowledge in DO is considered feasible if it succeeds in hiding the sensitive patterns. Any feasible solution that introduces no side effects in the hiding process is called exact. Any non-exact feasible solution is called approximate. Given the sanitized database D, its original version DO, and the produced extension DX, the quality of database D is measured both by the size of DX and by the number of binary variables set to "1" in the transactions of DX. In both cases, lower values correspond to better solutions. A solution to the hiding of the sensitive itemsets is considered ideal if it has minimum distance among the existing exact solutions.

c) Handling of Suboptimality: Since an exact solution may not always be feasible, the hiding algorithm should be capable of identifying a good approximate solution. It is crucial that the hiding algorithm holds for all sensitive itemsets in D and properly protects them from disclosure, while keeping the impact of the sanitization process on DO as small as possible. The hiding algorithm should also use a safety-margin threshold that decides which transactions are the strongest victim candidates.

III. ALGORITHM USED

Step 1: Take any data set.
Step 2: Calculate the support of each transaction.
Step 3: Identify the sensitive information.
Step 4: Sanitize the dataset by applying a random sequence.
Step 5: Calculate the support of each transaction again.
Step 6: Identify the victim items among all transactions.
Step 7: Find the total number of transactions to be sanitized.
Step 8: Rearrange the transactions to be sanitized in a particular order (e.g. ascending).
Step 9: If removing a transaction has no side effect on the original data set, remove it.
Step 10: Otherwise, sanitize it again.

Algorithm 1: Algorithm for effective information preservation

In Algorithm 1, after taking an input data set (e.g. the chess dataset), we first calculated the support of each transaction. Then we identified the sensitive information and sanitized the dataset by applying a random pattern. We calculated the support of each transaction again, and after identifying the victim items, we found the total number of transactions to be sanitized.




Then we rearranged the transactions. If removing a transaction had no side effect on the original data set, we removed it; otherwise we sanitized it again.
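The following Python sketch is one possible reading of Algorithm 1 under assumptions we make explicit in the comments: support is counted per itemset, the sensitive itemsets are given as input, and sanitizing a transaction means deleting the victim item from it. It reuses the support helper defined earlier.

def sanitize(db, sensitive_itemsets, mfreq):
    # Steps 2-10, read as: while a sensitive itemset is still frequent,
    # pick its highest-support member item as the victim (step 6) and
    # delete it from one supporting transaction, shortest transaction
    # first (step 8), then recompute support (step 5) and repeat.
    db = [set(t) for t in db]  # work on a copy; the original stays intact
    for sensitive in sensitive_itemsets:
        while support(db, sensitive) >= mfreq * len(db):
            victim = max(sensitive, key=lambda i: support(db, {i}))
            supporting = sorted((t for t in db if set(sensitive).issubset(t)), key=len)
            supporting[0].discard(victim)  # sanitize one transaction, then re-check
    return db

db = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 4}, {3, 4}]
print(sanitize(db, [{1, 2}], mfreq=0.4))  # {1, 2} is no longer frequent afterwards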

IV. METHODS OF DATA COLLECTION

We applied Algorithm 1 to data sets that are publicly available on the internet through the FIMI repository located at http://fimi.cs.helsinki.fi/. The data sets BMS-WebView-1 and BMS-WebView-2 contain clickstream data from Blue Martini Software and were used for the KDD Cup 2000. The mushroom and chess data sets were prepared by Bayardo from the UCI (University of California, Irvine) data sets and PUMSB. All these data sets exhibit varying characteristics in terms of the number of transactions, the number of items, and the average transaction length.
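FIMI-format files store one transaction per line as whitespace-separated integer item IDs. A minimal loader under that assumption could look like this (the file name is a placeholder):

def load_fimi(path):
    # One transaction per line, items as whitespace-separated integer IDs.
    with open(path) as f:
        return [set(map(int, line.split())) for line in f if line.strip()]

# transactions = load_fimi("chess.dat")  # placeholder path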

V. METHODS OF DATA ANALYSIS

The algorithm is tested on real-world data sets using different parameters:
a. the minimum support threshold
b. the number of sensitive itemsets
c. the size of the sensitive itemsets
All these data sets are publicly available through the FIMI repository located at http://fimi.cs.helsinki.fi/. Fig. 1 depicts the steps to be implemented.

A. Process Flow Diagram:

Figure 1. Process flow of the system

We first took a data set, from which we identified the sensitive information after calculating the support of each transaction. Then we sanitized the dataset by applying a random sequence and identified the victim items among all transactions. Next we found the total number of transactions to be sanitized and rearranged those transactions in ascending order. We removed a transaction if its removal had no side effect on the original data set; otherwise we sanitized that transaction once again.
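One simple way to organize the analysis over the three parameters listed above is a nested sweep. The sketch below assumes the support and sanitize helpers from the earlier sketches; the lost-pattern count is a side-effect measure we define here only for illustration.

from itertools import combinations

def frequent_itemsets(db, mfreq, max_len=2):
    # Brute-force enumeration of all itemsets with up to max_len items
    # whose support is at least mfreq * |db|.
    items = sorted(set().union(*db))
    found = set()
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            if support(db, cand) >= mfreq * len(db):
                found.add(frozenset(cand))
    return found

db = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 4}, {3, 4}]
for mfreq in (0.2, 0.3, 0.4):                       # a. minimum support threshold
    for sensitive in ([{1, 2}], [{1, 2}, {2, 3}]):  # b./c. number and size of sensitive itemsets
        before = frequent_itemsets(db, mfreq)
        after = frequent_itemsets(sanitize(db, sensitive, mfreq), mfreq)
        lost = before - after - {frozenset(s) for s in sensitive}
        print(mfreq, len(sensitive), len(lost))     # non-sensitive patterns lost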

VI. CONCLUSION

In this paper, we used an efficient algorithm that improves the balance between the protection of sensitive knowledge and pattern discovery. The algorithm is useful for sanitizing large transactional databases based on a disclosure threshold (or a set of thresholds) controlled by the database owner. There is no way to reproduce the original database from the sanitized one. Our performance study compares the effectiveness and scalability of the algorithm and analyzes the fraction of association rules preserved after sanitizing a database.

ACKNOWLEDGEMENT

I would like to express my sincere thanks to my guide, Prof. S. K. Shinde. This work has also been supported by Dr. P. C. Bhaskar, Coordinator, Department of Computer Science & Technology, Shivaji University, Kolhapur.

REFERENCES

[1] A. Gkoulalas-Divanis and V. S. Verykios, "Exact Knowledge Hiding Through Database Extension," 2009.
[2] Y. Saygin, V. S. Verykios, and C. Clifton, "Using Unknowns to Prevent Discovery of Association Rules," 2002.
[3] R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pages 439-450, 2000.
[4] M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi, "GeoPKDD: Alignment Report on Privacy-Preserving Data Mining," technical report, Pisa KDD Laboratory, ISTI-CNR and University of Pisa, Jan. 2006.
[5] C. Clifton and D. Marks, "Security and Privacy Implications of Data Mining," in Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD '96), pages 15-19, Feb. 1996.
[6] S. R. M. Oliveira and O. R. Zaiane, "A Framework for Enforcing Privacy in Mining Frequent Patterns," technical report, Computer Science Department, University of Alberta, Canada, June 2002.
[7] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y. Theodoridis, "State-of-the-Art in Privacy Preserving Data Mining," ACM SIGMOD Record, 33(1):50-57, 2004.
[8] V. S. Verykios, A. K. Elmagarmid, E. Bertino, and E. Dasseni, "Association Rule Hiding," 2004.
[9] G. V. Moustakides and V. S. Verykios, personal communication, 2006.
[10] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. S. Verykios, "Disclosure Limitation of Sensitive Rules," 1999.
[11] Y. H. Wu, C. M. Chiang, and A. L. P. Chen, "Hiding Sensitive Association Rules With Limited Side Effects," 2007.




[12] X. Sun and P. S. Yu, "A Border-Based Approach for Hiding Sensitive Frequent Itemsets," 2005.
[13] X. Sun and P. S. Yu, "Hiding Sensitive Frequent Itemsets by a Border-Based Approach," 2007.
[14] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing Accuracy of Shared Databases When Concealing Sensitive Patterns," 2005.
[15] A. Gkoulalas-Divanis and V. S. Verykios, "An Integer Programming Approach for Frequent Itemset Hiding," 2006.
[16] T. Calders, "Computational Complexity of Itemset Frequency Satisfiability," 2004.
[17] X. Chen, M. Orlowska, and X. Li, "A New Framework of Privacy Preserving Data Sharing," 2004.
[18] X. Wu, Y. Wu, Y. Wang, and Y. Li, "Privacy-Aware Market Basket Data Set Generation: A Feasible Approach for Inverse Frequent Set Mining," 2005.

[19] X. Wu, Y. Wu, Y. Wang, and Y. Li, "Privacy-Aware Market Basket Data Set Generation: A Feasible Approach for Inverse Frequent Set Mining," in Proceedings of the 2005 SIAM International Conference on Data Mining (SDM 2005), 2005.
[20] M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi, "Blocking Anonymity Threats Raised by Frequent Itemset Mining," in Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM 2005), pages 561-564, 2005.
[21] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. S. Verykios, "Disclosure Limitation of Sensitive Rules," in Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX '99), pages 45-52, 1999.
[22] S. R. M. Oliveira and O. R. Zaiane, "Protecting Sensitive Knowledge by Data Sanitization," 2003.


