
Full Paper Proc. of Int. Conf. on Advances in Communication and Information Technology 2011

Record Ordering Heuristics for Disclosure Control through Microaggregation

Brook Heaton1 and Sumitra Mukherjee2

1 Graduate School of Computer and Information Sciences, Nova Southeastern University, Fort Lauderdale, FL, USA. Email: willheat@nova.edu
2 Graduate School of Computer and Information Sciences, Nova Southeastern University, Fort Lauderdale, FL, USA. Email: sumitra@nova.edu

Replacing actual data values with group means leads to information loss and renders the data less valuable to users. Optimal microaggregation aims to minimize information loss by grouping together records that are very similar, thereby maximizing homogeneity among the records of each group. The optimal microaggregation problem can be formally defined as follows: Given a data set with n records and p numerical attributes per record, partition the records into groups containing at least k records each, such that the sum squared error (SSE) within groups is minimized. SSE is defined as

Abstract— Statistical disclosure control (SDC) methods reconcile the need to release information to researchers with the need to protect the privacy of individual records. Microaggregation is an SDC method that protects data subjects by guaranteeing k-anonymity: records are partitioned into groups of size at least k, and actual data values are replaced by the group means so that each record in the group is indistinguishable from at least k-1 other records. The goal is to create groups of similar records such that information loss due to data modification is minimized, where information loss is measured by the sum of squared deviations between the actual data values and their group means. Since optimal multivariate microaggregation is NP-hard, heuristics have been developed for microaggregation. It has been shown that, for a given ordering of records, the optimal partition consistent with that ordering can be computed efficiently, and some of the best existing microaggregation methods are based on this approach. This paper improves on previous heuristics by adapting tour construction and tour improvement heuristics for the traveling salesman problem (TSP) to microaggregation. Specifically, the Greedy heuristic and the Quick Boruvka heuristic are investigated for tour construction, and the 2-opt, 3-opt, and Lin-Kernighan heuristics are used for tour improvement. Computational experiments using benchmark datasets indicate that our method results in lower information loss than extant microaggregation heuristics.
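The claim that the optimal partition consistent with a given ordering can be computed efficiently can be illustrated with a small dynamic program, sketched here in Python. It treats positions 0 to n in the ordered list as nodes and each admissible group of consecutive records as an arc weighted by that group's sum of squared deviations, then finds a cheapest path from node 0 to node n. The function names (`group_sse`, `best_partition`) and the toy data are assumptions made for this example. Capping groups at 2k-1 records is a standard restriction in the microaggregation literature that the abstract does not state, and the sketch should not be read as the authors' implementation.

```python
from typing import List, Sequence, Tuple

Record = Sequence[float]


def group_sse(records: Sequence[Record]) -> float:
    """Sum of squared deviations of the records from their attribute-wise mean."""
    n = len(records)
    p = len(records[0])
    means = [sum(r[a] for r in records) / n for a in range(p)]
    return sum((r[a] - means[a]) ** 2 for r in records for a in range(p))


def best_partition(ordered: Sequence[Record], k: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Cheapest split of an ordered list into consecutive groups of size k to 2k-1.

    Returns the total SSE and the groups as half-open index ranges (start, end).
    """
    n = len(ordered)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k records")
    inf = float("inf")
    cost = [inf] * (n + 1)   # cost[j]: best SSE for the first j records
    prev = [-1] * (n + 1)    # prev[j]: start index of the last group in that solution
    cost[0] = 0.0
    for j in range(k, n + 1):
        # The last group covers records i..j-1, so its size j - i lies in [k, 2k-1].
        for i in range(max(0, j - 2 * k + 1), j - k + 1):
            if cost[i] == inf:
                continue  # the first i records cannot be split into valid groups
            candidate = cost[i] + group_sse(ordered[i:j])
            if candidate < cost[j]:
                cost[j] = candidate
                prev[j] = i
    # Walk the predecessor links back from n to recover the groups.
    groups: List[Tuple[int, int]] = []
    j = n
    while j > 0:
        groups.append((prev[j], j))
        j = prev[j]
    groups.reverse()
    return cost[n], groups


# Toy univariate data, already sorted, with k = 3.
ordered = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (13.0,)]
total_sse, groups = best_partition(ordered, k=3)
```

On this toy input the sketch groups the first three records together and the last four together, for a total SSE of 2 + 5 = 7. Splitting after the fourth record instead would cost 50 + 2 = 52.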

 i 1 j 1

X ij  X i , where g is the number of groups, X ij

is the jth record of the ith group, and X i is the mean for the ith group. This problem has been shown to be NP-hard when p>1 [21]. Since microaggregation is extensively used for SDC, several heuristics have been proposed that lead to low information loss (see e.g. [4], [6], [7], [9], [10], [11], [13], [16], [17], [18], [20], [24], and [25]). There is a need to develop new heuristics that perform better than those currently known, either by producing groups with lower information loss, or by achieving the same information loss at lower computational cost. Optimal univariate microaggregation (i.e., involving a single attribute) has been shown to be solvable in polynomial time by taking an ordered list of records and using a shortest-path algorithm to compute the lowest information loss k-partition for the given ordered list [14]. A k-partition is a set of groups such that every group contains at least k elements. Optimality is guaranteed, since univariate data can be strictly ordered. Practical applications, however, require microaggregation to be performed on records containing multiple attributes. While multivariate data cannot be strictly ordered, Domingo-Ferrer et al. [9] show that, for a given ordering of records, the best partition consistent with that ordering can be identified efficiently as a shortest path in a network using Hansen and Mukherjee’s method [14]. Exploiting this idea, they develop the Multivariate HansenMukherjee (MHM) algorithm and propose several heuristics for ordering the records. Empirical results indicate that MHM, when used with good ordering heuristics, outperforms extant microaggregation methods. This paper builds on the work of [9] by investigating new methods for ordering records as a first step in multivariate microaggregation. The techniques for ordering records are based on the tour construction and

Index Terms—Disclosure Control, Microaggregation, Privacy protection, Tour construction heuristics, Tour improvement heuristics, Shortest path.

I. INTRODUCTION

Many databases, such as health information systems and U.S. census data, contain information that is valuable to researchers for statistical purposes. At the same time, these databases contain private information about individuals. There is a tradeoff between providing information for the benefit of society and restricting information for the benefit of individuals. Statistical disclosure control (SDC) is a set of techniques for providing access to aggregate information in statistical databases while protecting the privacy of individual data subjects. See [1], [12], and [27] for good overviews of statistical disclosure control methods. Microaggregation is a popular SDC method that protects data subjects by guaranteeing k-anonymity (see [2], [23], and [26]). Under microaggregation, records are partitioned into groups of size at least k, and actual data values are replaced by the group means so that each record in the group is indistinguishable from at least k-1 other records.

© 2011 ACEEE DOI: 02.CIT.2011.01.11
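As an illustration of the procedure described in the introduction, the following Python sketch replaces each record with its group mean and measures the resulting information loss as the sum of squared deviations between the actual values and their group means. The function names (`group_mean`, `microaggregate`, `sse`) and the toy data are assumptions made for this example, and the groups are supplied by hand rather than produced by any heuristic from the paper.

```python
from typing import List, Sequence

Record = Sequence[float]


def group_mean(group: Sequence[Record]) -> List[float]:
    """Attribute-wise mean of the records in one group."""
    size = len(group)
    return [sum(rec[a] for rec in group) / size for a in range(len(group[0]))]


def microaggregate(groups: Sequence[Sequence[Record]], k: int) -> List[List[List[float]]]:
    """Replace every record with its group mean, enforcing the minimum group size k."""
    for group in groups:
        if len(group) < k:
            raise ValueError("every group needs at least k records")
    # Each record in a group is released as the same mean vector.
    return [[group_mean(group) for _ in group] for group in groups]


def sse(groups: Sequence[Sequence[Record]]) -> float:
    """Sum of squared deviations between actual values and their group means."""
    total = 0.0
    for group in groups:
        mean = group_mean(group)
        for rec in group:
            total += sum((x - m) ** 2 for x, m in zip(rec, mean))
    return total


# Hand-made partition of five two-attribute records into groups of size >= 2.
groups = [
    [(1.0, 2.0), (3.0, 4.0)],
    [(10.0, 10.0), (12.0, 14.0), (14.0, 12.0)],
]
released = microaggregate(groups, k=2)
loss = sse(groups)
```

With k = 2, the first group is released as two copies of (2, 3) and the second as three copies of (12, 12), so records within a group cannot be told apart. The information loss for this partition is 4 + 16 = 20.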