DAIPEX Project 2012-5-111-R
Deliverable D4.1
Survey on methods for data correction, data clustering, and data obfuscation
30 June 2014
Public Document

Project acronym: DAIPEX
Project full title: Data and Algorithms for Integrated Transportation Planning and Execution
Work package: 4
Document number: D4.1.1 and D4.1.2
Document title: Survey on methods for data correction, data clustering, and data obfuscation
Version: 1.0
Editor(s) / lead beneficiary: Remco Dijkman (TU/e)
Author(s): Noor Rijk (TU/e), Huiye Ma (TU/e), Remco Dijkman (TU/e)
Contents

1 Introduction
2 Methods to process incorrect data
  2.1 General assumptions
  2.2 Framework
  2.3 General methods
    2.3.1 Methods to detect missing data values
    2.3.2 Methods to correct missing data values
    2.3.3 Methods to detect incorrect data values
    2.3.4 Correction of incorrect data values
  2.4 Specific methods
    2.4.1 Methods to detect missing data values
    2.4.2 Methods to correct missing data values
    2.4.3 Methods to detect incorrect data values
    2.4.4 Correction of incorrect data values
  2.5 Method overview
  2.6 Summary
  2.7 Discussion
3 Data clustering techniques
  3.1 Hierarchical clustering algorithms
  3.2 Partitioning clustering algorithms
    3.2.1 Probabilistic clustering
    3.2.2 K-medoids methods
    3.2.3 K-means methods
  3.3 Large dataset clustering algorithms
  3.4 High dimensional data clustering algorithms
  3.5 Sequential data clustering algorithms
  3.6 Text clustering algorithms
  3.7 Imbalanced data clustering methods
  3.8 Evaluation of clustering results
  3.9 Discussions
4 Data obfuscation
  4.1 Data randomization
  4.2 Data swapping
  4.3 Partial suppression
  4.4 Linear transformation
  4.5 Discussion
5 Conclusions
1 Introduction

Large databases are common nowadays. The data that is stored in those databases can be of great help for explanatory research and for making decisions that bring the owner or user of the database closer to the ideal situation. In order to arrive at explanations of behaviour and at decisions that lead to improvement, analysis of the data is required. Poor data quality can lead to misleading data analysis and incorrect decision making (J. Chen, W. Li, A. Lau, J. Cao, K. Wang, 2010). Therefore, it is important that data are complete and give a correct representation of reality. Unfortunately, this is often not the case (J.R. Carpenter, M.G. Kenward, S. Vansteelandt, 2006); databases contain incorrect data values or miss certain data. Incorrect values can be entered into a database on purpose (e.g. when the real value of the observation is unknown but the observer is forced to fill in a value) or accidentally (i.e. typing or measurement errors). Observations can be missing by design or because, for one reason or another, the intended observations were not made (J.R. Carpenter, M.G. Kenward, S. Vansteelandt, 2006). Missing and incorrect data values should be detected and corrected in order to obtain a more useful data set to analyse and draw conclusions from. Formerly, the common practice has been to ignore observations with missing data, but this can result in biased prediction models (I. Myrtveit, E. Stensrud, U.H. Olsson, 2001). Over the years, different researchers have come up with new methods and algorithms to detect and correct those missing and incorrect values in databases. Reliability, ease of use, and the effort and time needed to execute those methods are important performance indicators for these methods.

This report provides an overview of the methods to detect and correct missing and incorrect data, along with examples of their possibilities for application. It shows in which areas some extra effort is needed to develop better methods for the detection and/or correction of missing and/or incorrect data values. This report can be used as a source of information one can refer to when searching for a method to detect and/or correct missing and/or incorrect data values in a data set. As mentioned before, detection and correction of missing and incorrect data values will enhance the usefulness of a data set. Hence, more informed decisions can be made, which can be considered progress.

After the data have been corrected, one of the vital means of analysing these data is to classify or group them into a set of categories or clusters. Clustering is a data mining technique that consists of discovering interesting data distributions in (large) databases. Given k data points in a d-dimensional metric space, the problem of clustering is to partition the data points into n clusters such that the data points within a cluster are closer (more similar) to each other than data points in different clusters. The applications of clustering cover customer segmentation, catalogue design, store layout, stock market segmentation, etc. Clustering has a long history, along which a huge number of clustering algorithms dealing with various specific problems have been proposed, for example in Zhang et al. (1996), S. Guha et al. (1998), G. Karypis et al. (1999), R. Agrawal et al. (1998), S. Goil et al. (1999), C. Cheng et al. (1999), Y. Cai et al. (2009), and E. Buyko et al. (2009). Important survey papers on clustering techniques also exist in the literature, e.g. P. Berkhin (2006), R. Xu et al. (2005), and C. C. Aggarwal et al. (2012). In contrast to the above, the purpose of this report is to provide a concise description of the influential and important clustering algorithms which can address events or texts, with emphasis on data with high dimensional, large scale, sequential, and imbalanced features.

In a practical context, an important concern when doing data analysis is data confidentiality. This is also an important concern in the DAIPEX project, in particular because the project has some non-standard requirements concerning data confidentiality. When analysing data about companies or people in a practical context, techniques for anonymization and obfuscation are often used to protect the identities of the companies or people, or to protect the specific values of the variables that are being analysed. Section 4 provides an overview of the techniques that are used for those purposes. However, the techniques that are used are specifically meant to preserve statistical properties, while this may be undesirable in the logistics sector. For example, if statistical properties are preserved about how busy transportation is on certain routes, this may drive down the price of transportation on that route.
Therefore, if the analysis techniques that are developed preserve such properties, transportation companies will not want to use them. Consequently, techniques must be developed that preserve the statistical properties that are necessary for planning, while obfuscating other properties.

Against this background, the remainder of this report is structured as follows. The literature review on methods for handling incorrect or missing data is presented in Section 2. This section discusses general assumptions and both general and specific methods to detect and correct missing and incorrect data values. In addition, it answers the research question and draws conclusions. Section 3 surveys data clustering techniques from several perspectives; evaluation criteria and a discussion of the various clustering techniques are given at the end of that section. Section 4 discusses data obfuscation techniques. Finally, Section 5 presents the conclusions.
2 Methods to process incorrect data

This section presents the literature review that was conducted. Section 2.1 states some general assumptions that underlie all of the methods found for detecting or correcting missing or incorrect data values. Section 2.2 describes the framework that was used to distinguish different groups of methods in the subsequent two sections. Section 2.3 explains the general methods for detection and correction of missing and incorrect data values. Specific methods for detection and correction of missing and incorrect data values are discussed in section 2.4. Finally, section 2.5 gives an overview of all methods and the articles in which they were encountered.
2.1 General assumptions For all methods, certain assumptions have to be made. Practically all methods assume data to be missing completely at random (MCAR) or missing at random (MAR). Data are MCAR when the missing values on variable X are independent of other observed variables as well as the values of X itself. Data are MAR when the probability that an observation is missing on variable X can depend on another observed variable in the data set, but not on the values of X itself (Enders, 2001). Another assumption concerns distribution of data. For most algorithms, a (multivariate) normal distribution of data is assumed. These assumptions are required in order to justify a method to be used on that data. However, it is possible that these assumptions are not realistic for data in a certain database. This can make the results of a method questionable. Some researchers try to justify the use of certain assumptions in order to gain reliability. For example, M. Zhong et al. (2004) state that using a normal distribution and data that are MAR leads to the same parameter estimates as using an unknown distribution and data that are MCAR.
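The difference between the two missingness assumptions can be made concrete with a small simulation. The sketch below is a minimal illustration with hypothetical variables: under MAR the probability that y is missing depends only on the observed variable x, whereas under MCAR it depends on nothing at all.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

# MAR: the chance that y is missing depends only on the observed x, not on y itself.
p_missing = 1 / (1 + np.exp(-(x - 1)))
df = pd.DataFrame({"x": x, "y_mar": np.where(rng.random(n) < p_missing, np.nan, y)})

# MCAR: a fixed 20% of the values are missing, independent of both x and y.
df["y_mcar"] = np.where(rng.random(n) < 0.2, np.nan, y)
```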
2.2 Framework

Different topics will be distinguished in this section:
1. Methods to detect missing data values
2. Methods to correct missing data values
3. Methods to detect incorrect data values
4. Correction of incorrect data values

This distinction follows from the natural difference between missing data values and incorrect data values. Missing data values are empty fields in a data set or database which one would expect to contain data. In contrast with missing data values, cells that contain incorrect data values are non-empty fields. This makes it harder to distinguish incorrect data values from correct data values than it is to distinguish missing data values from correct data values. Furthermore, the distinction between detecting and correcting the missing and incorrect data values is made because different approaches are used to execute the tasks of detection and correction for both missing and incorrect data values. For detection, the only similarity between missing and incorrect data values is the fact that one wants to find them. The way in which they can be found is not the same. To find missing data values, one can simply look up the empty cells. For detection of incorrect data values, more sophisticated methods are needed. This justifies the distinction between detection of missing and incorrect data values. For correction, there is more similarity: for both missing and incorrect data, the target is to find a reliable value, and in some cases the methods to do this can (partly) intersect. However, different methods exist for correcting missing and incorrect data values. For example, when correcting incorrect data, some methods use the incorrect data value to calculate a value with a better fit. Obviously, this is not possible for missing data values. Therefore, the distinction between correction of missing and incorrect data values is justified as well.
2.3 General methods

2.3.1 Methods to detect missing data values

Detecting missing data values in databases is relatively easy. A missing data value in a database is often represented with an empty field, a dot, or the code ‘999’.
Most of the programs used to store or analyse data, e.g. Excel or SPSS, employ simple ways to find missing values in a data set. Therefore, this topic will not be discussed elaborately. Nevertheless, to be complete on the subject of detecting and correcting missing and incorrect data values, it should at least be mentioned.

2.3.2 Methods to correct missing data values

Extensive research has been conducted in the area of correcting missing data. Articles found on this subject mention complete-case analysis, weighting methods, multiple imputation, and maximum likelihood as methods to correct missing data values. The most common methods and their advantages and disadvantages are discussed below.

Complete-case analysis. Complete-case analysis is also known as ‘case deletion’ or ‘ignore missing’. In this method the analysis is restricted to cases with complete data. This means that cases with missing data values are excluded from the analysis, and hence no real effort is put into correcting for the missing values. Therefore, this method requires only little time, which can be considered an advantage. After removing the cases with missing data values, the data set can be considered complete, which implies that any statistical method can be used to analyse the data further. J. Ibrahim et al. (2005) present complete-case analysis as a simple way to avoid the problem of missing data. They state that this method is still the default method in most software packages, although more appropriate methods have already been developed. Complete-case analysis can be biased when the missing data are not MCAR. When the MCAR assumption is not satisfied, the deletion of cases with missing data is inefficient and wasteful. The claims made by J. Ibrahim et al. (2005) are strengthened in the article of S. Greenland et al. (1995). They state that complete-case analysis is commonly used for handling missing data but can be biased under reasonable circumstances. There are many other, more sophisticated, methods that are superior to simple methods like complete-case analysis. However, due to complexity and a lack of application software in some fields, these superior methods are not used as much as the simpler methods. Bias in complete-case analysis can also occur when, for example, the cases with complete data are a biased subsample of all cases. Complete-case analysis is inefficient because it produces estimates of missing values with a high variance compared to other methods. R. Young et al. (2013) agree with the previous claims in a general sense. In the article of J. Luengo et al. (2011), complete-case analysis was tested among 13 other methods. For three different systems and 21 different data sets, the best imputation method was chosen. Complete-case analysis scored best on 11 out of 63 tests. A mentioned shortcoming of the method is low mean accuracy; imputation methods that do fill in missing values outperform it. However, in contrast to claims in other articles, this article states that, for systems with good generalization abilities, complete-case analysis is an affordable option.

Variants of complete-case analysis are ‘listwise deletion’, ‘pairwise deletion’ and ‘dummy variable adjustment’. In listwise deletion, cases are only deleted if they have missing data on variables that will be used in the subsequent analysis. This means that cases with missing data can be left in the data set if the missing data are on variables that will not be included in the analysis. The assumptions and problems are the same as with complete-case analysis. Pairwise deletion, also known as ‘available case analysis’, is a simple alternative to listwise deletion that preserves more data. This method calculates means and (co)variances for each pair of variables with available data from all cases. The resulting estimates are then used as input for analysis. Problems with this method are that sometimes not all parameters can be estimated, estimates can have a larger variance than with listwise deletion, and the standard errors found are inconsistent. The latter questions the validity of confidence intervals and hypothesis tests. Furthermore, the sample size cannot be specified, since a different number of cases is used to calculate the moments for almost every pair of variables. Lastly, dummy variable adjustment was mentioned as a variant of complete-case analysis. This method creates dummy variables for each of the variables with missing data. Both the dummy variable and the real variable are then included as explanatory variables in a regression model. The disadvantage of this method is that it produces biased estimates (Allison, 2003) (I. Myrtveit, E. Stensrud, U.H. Olsson, 2001).
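As a minimal sketch of how these simple approaches look in practice (assuming a pandas DataFrame, hypothetical column names, and the sentinel code ‘999’ for missing values):

```python
import numpy as np
import pandas as pd

# Hypothetical example data; 999 is used as a missing-value sentinel.
df = pd.DataFrame({
    "speed":  [50.0, 999, 47.5, np.nan, 52.1],
    "volume": [120, 115, np.nan, 130, 999],
})

# Detection: treat the sentinel code 999 as missing, then locate empty cells.
df = df.replace(999, np.nan)
print(df.isna().sum())          # number of missing values per variable

# Complete-case (listwise) deletion: keep only rows without any missing value.
complete_cases = df.dropna()

# Pairwise deletion: moments are computed per pair of variables,
# using all cases that are observed for that pair.
pairwise_cov = df.cov()         # pandas excludes missing values pairwise by default
```

Complete-case deletion simply drops the incomplete rows, whereas the pairwise covariance uses, for each pair of variables, all cases that are observed for that pair.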
The biggest disadvantage of complete-case analysis is that it does not use all the available information. This makes complete-case analysis a less reliable and less accurate method.

Maximum likelihood. In maximum likelihood, data from both complete and incomplete cases are used to produce a matrix with covariances between the variables (R. Young, D. Johnson, 2013). Information from complete cases is borrowed to estimate missing values (Enders, 2001). Hence, maximum likelihood can be considered an incomplete-case analysis, model-based method (I. Myrtveit, E. Stensrud, U.H. Olsson, 2001). P. Allison (2003) states that there are several popular software packages that make maximum likelihood analysis available. The method can be split up into three sub-variants: factoring likelihood, expectation maximization, and direct maximum likelihood, the last of which is also referred to as full information maximum likelihood (Allison, 2003). The sub-variant ‘full information maximum likelihood’ assumes multivariate normality and maximizes the likelihood of the theoretical model given the observed data (I. Myrtveit, E. Stensrud, U.H. Olsson, 2001). Data are assumed to be MAR. This method allows incomplete cases to be used in the analysis (R. Young, D. Johnson, 2013). The method is conceptually analogous to pairwise deletion; all available data are used for the estimations. Standard errors are obtained directly from the analysis, and missing values are not imputed but only estimated (Enders, 2001). These estimates can then be used for imputation by means of sampling (J. Luengo, J.A. Sáez, F. Herrera, 2011). A claimed disadvantage of this method is that current software restricts the types of analysis that can be applied with it (R. Young, D. Johnson, 2013). Furthermore, relatively large data sets, of more than 100 samples, are needed for maximum likelihood methods (I. Myrtveit, E. Stensrud, U.H. Olsson, 2001). Maximum likelihood algorithms may be superior to other techniques in many cases (Enders, 2001) (S. Greenland, W.D. Finkle, 1995) (Allison, 2003). They have nearly optimal statistical properties (Allison, 2003). However, these methods are complex, and some researchers claim that application software is currently lacking (S. Greenland, W.D. Finkle, 1995) (R. Young, D. Johnson, 2013). Other researchers state that software packages to perform maximum likelihood methods do exist (Allison, 2003) (I. Myrtveit, E. Stensrud, U.H. Olsson, 2001).

Weighting methods. Weighting methods can be considered extensions of pseudo-likelihood methods. A simple weighting method is the ‘corrected complete-case analysis’, which uses complete cases in a weighted regression. The weights vary inversely with the estimated probability of being complete (S. Greenland, W.D. Finkle, 1995). According to R. Young et al. (2013), weighting procedures can also be seen as modifications of complete-case analysis. A weight is a value assigned to each case to indicate how much that case will count in the analysis. Corrected complete-case analysis is here referred to as ‘inverse probability weighting’ and explained in more detail. With inverse probability weighting, for each observed value a probability of having been observed is estimated with logistic regression. The inverses of these scores are then used to weight the observed data to match the distribution of all cases (R. Young, D. Johnson, 2013). In other words: weights are used to rebalance the set of complete cases to be representative of the full sample (S.R. Seaman, I.R. White, A.J. Copas, L. Li, 2012). The idea behind inverse probability weighting is straightforward and intuitively attractive. However, there are questions about biased and inefficient estimators if an incomplete or incorrect model is used. A limitation of this method is that cases with missing values on the variables that were used to calculate the weights must be excluded from the analysis. This means that not all available information can be used. In addition, inverse probability weighting produces relatively large standard errors and reduces statistical power. In general, estimates are unbiased, but the method is found to be inefficient. Advice for future research concerns a hybrid inverse probability weighting - multiple imputation method to increase sample size and make standard errors more efficient (R. Young, D. Johnson, 2013). Both J. Carpenter et al. (2006) and S. Seaman et al. (2011) agree on the disadvantages mentioned for inverse probability weighting. Estimates are consistent, but also inefficient and biased (J.R. Carpenter, M.G. Kenward, S. Vansteelandt, 2006). Inverse probability weighting only includes complete cases in the analysis and is not efficient. However, sometimes analysts may feel more confident using inverse probability weighting instead of other methods if whole blocks of data are missing on some individuals (S.R. Seaman, I.R. White, A.J. Copas, L. Li, 2012).
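As an illustration of the inverse probability weighting idea, the sketch below estimates the probability of being a complete case with a logistic regression and then reweights the complete cases in a weighted regression. The data, column names, and model choices are hypothetical; this is a minimal sketch, not the exact procedure of any of the cited papers.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data set: 'y' has missing values, 'x1' and 'x2' are fully observed.
df = pd.DataFrame({
    "y":  [2.1, np.nan, 3.4, 4.0, np.nan, 5.2, 4.8, 3.9],
    "x1": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
    "x2": [0, 1, 0, 1, 1, 0, 1, 0],
})
df["observed"] = df["y"].notna().astype(int)

# Step 1: model the probability of being a complete case with logistic regression.
X_miss = sm.add_constant(df[["x1", "x2"]])
p_obs = sm.Logit(df["observed"], X_miss).fit(disp=0).predict(X_miss)

# Step 2: weight each complete case by the inverse of that probability,
# so the complete cases are rebalanced to represent the full sample.
complete = df[df["observed"] == 1].copy()
complete["w"] = 1.0 / p_obs[df["observed"] == 1]

# Step 3: run the analysis model as a weighted regression on the complete cases.
X = sm.add_constant(complete[["x1", "x2"]])
ipw_fit = sm.WLS(complete["y"], X, weights=complete["w"]).fit()
print(ipw_fit.params)
```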
Multiple imputation. Imputation is the process of replacing missing data with substituted values. In multiple imputation, missing values are replaced by a set of plausible values that are found with an appropriate model that takes into account uncertainty about the unknown values. Multiple imputed data sets are created and analysed using standard complete-data methods. Multiple imputation is available in software from SAS and SPSS, amongst others. It is assumed that missing data are MAR. Nevertheless, the multiple imputation approach seems to be robust to violations of this assumption. Because it retains all cases in the analysis, which means all available information is used, multiple imputation has a similar advantage over complete-case analysis and weighting methods as maximum likelihood does. Compared to maximum likelihood, multiple imputation has the advantage of creating a complete data matrix, which reduces complication and facilitates analysis of the data by multiple people. A disadvantage of multiple imputation relative to maximum likelihood is the need to generate several imputed data sets, which is not required for maximum likelihood. The estimates that are generated by multiple imputation and maximum likelihood are equivalent when a sufficiently large number of data sets is imputed in multiple imputation. From this, the conclusion can be drawn that multiple imputation does not simply ‘make up data’, as some researchers fear, because the results of multiple imputation are similar to those of maximum likelihood, which does not use imputed data (R. Young, D. Johnson, 2013). Multiple imputation can be used to restore variability that is lost during the imputation process of expectation maximization. Variability that is lost when performing expectation maximization can be divided into residual variability and variability in estimates caused by the error that follows from missing data. It is stated that multiple imputation can obtain adequate results using a minimum of five imputed data sets (Enders, 2001).

Overview. There are many methods that aim to correct for missing data values in data sets. The range of methods goes from simple methods like complete-case analysis to more sophisticated methods like maximum likelihood and multiple imputation. Ad hoc methods like mean imputation and listwise and pairwise deletion are practical to implement and can be found in most well-known statistical program packages. These methods are appropriate when only a small amount of data is missing. However, drawbacks like inefficient and biased estimates do exist for these methods (M. Zhong, P. Lingras, S. Sharma, 2004). Therefore, more sophisticated methods like maximum likelihood, weighting methods, and multiple imputation were developed. These methods are more efficient, but also more complex. The statistical properties of multiple imputation are nearly as good as those of maximum likelihood. Multiple imputation can easily be implemented and is very flexible, but does not produce a determinate result, as opposed to maximum likelihood (Allison, 2003).
If one is unable to carry out one of the more sophisticated approaches, one may be best served by complete-case analysis, avoiding intuitively appealing but potentially disastrous ad hoc approaches (S. Greenland, W.D. Finkle, 1995). J. Luengo et al. (2011) claim that there is no universal imputation method which performs best for all types of systems that were tested in their research. Indeed, the methods that were found seem to have different characteristics which make them better suited for certain situations. Table 1 summarizes the characteristics of the methods that were discussed in this paragraph and gives indicators for the use of a certain method.
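To make the multiple imputation procedure concrete, the sketch below creates m imputed data sets, analyses each, and averages the point estimates (the full pooling rules of Rubin also combine within- and between-imputation variances, which is omitted here). The data, variable names, and the choice of scikit-learn's IterativeImputer are illustrative assumptions, not the setup of any of the cited studies.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import statsmodels.api as sm

# Hypothetical data with missing values in 'y' and 'x2'.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["y", "x1", "x2"])
df.loc[rng.choice(50, 10, replace=False), "y"] = np.nan
df.loc[rng.choice(50, 8, replace=False), "x2"] = np.nan

# Create m imputed data sets, analyse each, and average the point estimates.
m = 5
estimates = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    X = sm.add_constant(completed[["x1", "x2"]])
    estimates.append(sm.OLS(completed["y"], X).fit().params)

pooled = pd.concat(estimates, axis=1).mean(axis=1)  # average over the m analyses
print(pooled)
```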
Table 1 - Characteristics of general methods to correct missing data values

| Method | All available info used? | Biased? | Complexity | Efficiency | Use if.. |
|---|---|---|---|---|---|
| Complete-case analysis | No | Yes | Low | Low | ..quick results are required |
| Maximum likelihood | Yes | No | High | High | ..sample > 100 |
| Weighting methods | No | Yes | Low | Low | ..whole blocks of data are missing on 1 case |
| Multiple imputation | Yes | Yes | Low | Low | ..MAR assumption is violated |
2.3.3 Methods to detect incorrect data values

Information found on this subject is quite scarce. Information found on missing data is more theoretical and general, while information concerning the detection of incorrect data values is more specific and often focussed on a few practical examples. However, several general methods to detect outliers and duplicates, the most well-known examples of incorrect data values, were found and are described in this section.

Outliers. Outliers are obvious examples of possible incorrect data values. Outliers can be distance- or density-based. A distance-based outlier is an observation that is sufficiently far from most other observations in the data set. Efficient distance-based outlier detection is possible if the user is confident that the threshold can be specified accurately (S. Subramaniam, T. Palanas, D. Papadopoulos, 2006). Well-known methods to find these kinds of outliers are clustering methods like k-nearest neighbour, k-means clustering, and neural network methods. However, these methods are not designed for time series, which is considered a disadvantage (J. Chen, W. Li, A. Lau, J. Cao, K. Wang, 2010). Statistical tests like z-values, box plots, scatterplots, and the Mahalanobis distance are designed for outlier detection in time series. These methods work under the assumption of normality (J. Hair, W. Black, B. Babin, R. Anderson, 2010). A density-based outlier is a value with a multi granularity deviation factor that is significantly different from that of the local averages. This type of outlier can easily be detected if data points exhibit different densities in different regions of the data space or across time (S. Subramaniam, T. Palanas, D. Papadopoulos, 2006). According to an example from practice, examination of data projections with low density proves to bear fruit (C.C. Aggarwal, P.S. Yu, 2001).

Duplicates. Duplicates are a subset of the category of incorrect, or better put, unwanted data. Sung et al. (2002) describe how to find duplicates in a data set. First, a detection method should be used to determine which records need to be compared. General detection methods concern sorted neighbourhood (i.e. pairwise comparisons) and priority queues (i.e. clusters and members). Second, a comparison method is needed to actually compare the records. General comparison methods concern similarity and edit distance (i.e. the number of changes needed to transform one string into the other).

Overview. Incorrect data values can be divided into different subgroups. The most general subgroups are outliers and duplicates. Existing approaches to detect those incorrect data values were discussed in this section. Table 2 summarizes the most common types of incorrect data values, general methods that can be used to detect them, and indicators for detection of a certain type of incorrect data value.
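The simplest of the statistical tests listed in the table below can be illustrated with a few lines of code. The sketch flags distance-based outliers with z-scores and with the Mahalanobis distance; the data, thresholds, and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical traffic-style data: 40 regular observations plus one clear outlier.
rng = np.random.default_rng(1)
regular = rng.multivariate_normal([50, 100], [[4, 2], [2, 9]], size=40)
data = np.vstack([regular, [[120.0, 99.0]]])
df = pd.DataFrame(data, columns=["speed", "volume"])

# Univariate test: flag values whose absolute z-score exceeds 3.
z = np.abs(stats.zscore(df["speed"]))
univariate_outliers = df[z > 3]

# Multivariate test: squared Mahalanobis distance of each row from the mean,
# compared against a chi-squared quantile (2 variables -> 2 degrees of freedom).
diff = df.to_numpy() - df.mean().to_numpy()
cov_inv = np.linalg.inv(np.cov(df.to_numpy(), rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
multivariate_outliers = df[d2 > stats.chi2.ppf(0.975, df=2)]

print(univariate_outliers)
print(multivariate_outliers)
```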
Table 2 - Types of incorrect data values and general methods to detect them

| Type of incorrect data values | Methods | Use if... |
|---|---|---|
| Distance-based outliers* | No time series: k-nearest neighbour, k-means clustering, neural networks. Time series: z-value, box plot, scatterplot, Mahalanobis distance | ..thresholds can easily be specified |
| Density-based outliers | Data projections | ..data points show different densities across space/time |
| Duplicates** | Detection: sorted neighbourhood (pairwise), priority queues (cluster members). Comparison: similarity, edit distance | ..one wants to exclude duplicate records |
* For distance-based outliers, one has the choice between methods that are suited for time series or methods that are not meant for time series.
** For duplicates, both detection and comparison methods are needed.

2.3.4 Correction of incorrect data values

Little information could be found on the correction of incorrect data values, compared to the amount of information that was found on correcting missing data values. The greater part of the information that was found in this area is a continuation of the papers in which detection methods for incorrect data values are explained and evaluated. The following subparagraphs describe general methods to perform transformations, which correct for outliers or discrepancies, and to remove duplicates.

Transformations. Transformations are needed when data show inconsistencies in schema, formats, and adherence to constraints, due to, for example, data entry errors or the merging of multiple sources (i.e. outliers or discrepancies). A general approach for transforming data is to use one of the available transformation tools after an auditing tool has been used. These transformation tools lack user feedback and require much user effort.

Removal of duplicates. After duplicates are detected and compared, pruning is used to eliminate false positives that were found in the detection phase. False positives can be present if the detection and comparison methods that are used do not take the order of characters into consideration. In pruning, candidate duplicates are compared again and false positives are filtered out. This results in the real duplicates, which are not substituted, but can easily be discarded (S.Y. Sung, Z. Li, P. Sun, 2002).
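The comparison step for duplicates can be illustrated with a plain edit-distance check over candidate record pairs. The records and the threshold below are hypothetical, and for brevity all pairs are compared instead of only the neighbours produced by a sorted-neighbourhood or priority-queue detection step.

```python
import itertools

def edit_distance(a: str, b: str) -> int:
    """Number of insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Hypothetical customer records; pairs within a small edit distance are candidate duplicates.
records = ["ACME Logistics BV", "ACME Logistcs BV", "Jansen Transport", "Janssen Transport"]
candidates = [(a, b) for a, b in itertools.combinations(sorted(records), 2)
              if edit_distance(a, b) <= 2]
print(candidates)
```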
2.4 Specific methods

2.4.1 Methods to detect missing data values

No specific methods to detect missing data values were encountered. This section is mentioned purely for the sake of completeness.

2.4.2 Methods to correct missing data values

Next to complete-case analysis, weighting methods, multiple imputation, and maximum likelihood, two more specific methods were described in three of the articles. These methods combine general methods from section 2.3.2 and are described in this section.

Multiple imputation and inverse probability weighting. In the article of S. Seaman et al. (2012) it is proposed to combine multiple imputation and inverse probability weighting. In this method, a case is included in the analysis if more than a certain percentage of its data is observed. Missing values are multiply imputed and each resulting data set is analysed with inverse probability weighting to account for the exclusion of individuals. By imputing cases with only a few missing values and excluding cases with more missing data, the combined method could inherit the efficiency advantage of multiple imputation and avoid the bias resulting from incorrectly imputing larger blocks of data. Next to that, the combined method only needs a single set of weights, while inverse probability weighting needs a different set of weights for each analysis. In this way, the combined method can have advantages over both multiple imputation and inverse probability weighting alone. Results from a practical application of multiple imputation and of the method that combines inverse probability weighting with multiple imputation show no substantial differences between the standard errors of both methods. The combined method is found to be consistent, but complicated. Furthermore, biases are found to be small. One has to keep in mind that when imputation models are correctly specified, multiple imputation on its own is more efficient than the combined method. The combined method can however be used as a check or diagnostic for multiple imputation. The combined method will be most appealing when the model for the weights is simple compared to the imputation model (S.R. Seaman, I.R. White, A.J. Copas, L. Li, 2012).

Doubly robust estimators. J. Carpenter et al. (2006) propose doubly robust estimators as an improvement of inverse probability weighting and an alternative to multiple imputation. Inverse probability weighting is inefficient and sensitive to the choice of weighting model, but simple. To gain efficiency, information from cases with missing values should contribute to the estimation. The result, a semi-parametric estimator, is not as efficient as the maximum likelihood estimator, but remains consistent, contrary to the maximum likelihood estimator. Multiple imputation is not robust to misspecification of the distribution of the observed and partially observed parts of the data. Hence, it is crucial to obtain a correct imputation model for multiple imputation to give valid results. Doubly robust inverse probability weighting is an attractive alternative to multiple imputation since this method requires simpler estimates. The advantage of doubly robust estimators, under the condition of a correct model that relates the response to the explanatory variables, is that either an incorrect model for the probability of observing the data or an incorrect model for the joint distribution of the partially and fully observed data will still keep the estimators of the first model consistent. However, multiple imputation has a potential advantage over doubly robust inverse probability weighting if one does not wish to condition on variables that are necessary to ensure MAR. The reason for this is that the imputation model can include all terms necessary for the MAR assumption, but the model of interest only needs to include the subset of covariates of interest. Testing shows that double robustness works: unbiased estimators are obtained. Doubly robust estimators have the potential to be a competitive and attractive tool for the analysis of social science data with missing observations (J.R. Carpenter, M.G. Kenward, S. Vansteelandt, 2006). J. Ibrahim et al. (2005) agree that doubly robust estimators are a good alternative to other methods if distributional assumptions are violated. Although the doubly robust estimates may be less efficient than likelihood-based estimates under the condition of correct distributional assumptions, they are robust and consistent.
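As an impression of what a doubly robust (augmented inverse probability weighting) estimator looks like for the simple case of estimating a mean with a missing outcome, consider the sketch below. The data, the two working models, and the variable names are hypothetical assumptions for illustration only.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: outcome 'y' is missing for some cases, 'x' is fully observed.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
observed = rng.random(n) < 1 / (1 + np.exp(-(0.5 + x)))   # MAR: depends on x only

X = sm.add_constant(x)

# Working model 1: probability of being observed (for the weights).
pi = sm.Logit(observed.astype(int), X).fit(disp=0).predict(X)

# Working model 2: outcome regression fitted on the complete cases (for the augmentation).
m_hat = sm.OLS(y[observed], X[observed]).fit().predict(X)

# Doubly robust (augmented IPW) estimate of the mean of y.
r = observed.astype(float)
y_filled = np.where(observed, y, 0.0)
mu_dr = np.mean(r * y_filled / pi - (r - pi) / pi * m_hat)
print(mu_dr)
```

The estimate stays consistent if either the logistic model for the weights or the linear outcome model is correctly specified, which is exactly the double-robustness property described above.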
Overview. As claimed in this section, combining general methods to correct missing data values can result in added value. Ideally the advantages of one general method are inherited, while its disadvantages are compensated by the other general method that is used. However, one should take into account that the use of these combined methods is more complicated than the use of general methods. Table 3 summarizes the characteristics of the methods that were discussed in this paragraph and gives indicators for the use of a certain method.

Table 3 - Characteristics of specific methods to correct missing data values
| Method | All available info used? | Biased? | Complexity | Efficiency | Use if.. |
|---|---|---|---|---|---|
| Multiple imputation + inverse probability weighting | No | No | High | High | ..imputation model is complex |
| Doubly robust estimators | Yes | No | Low | Medium | ..distributional assumptions are violated |
2.4.3 Methods to detect incorrect data values

Apart from the general methods that were discussed in section 2.3.3, more specific methods to detect outliers and duplicates exist. The first part of this section describes specific methods to detect these well-known types of incorrect data values. In addition, specific practical methods that are not directly aimed at finding outliers or duplicates are discussed.

Outliers. The problem of finding outliers can be solved efficiently if an accurate approximation of the data distribution can be found. In order to do this, kernel estimators can be used. For the detection of distance-based outliers, S. Subramaniam et al. (2006) propose the distributed deviation detection algorithm, which makes use of kernel estimators. This method has links to the general method k-nearest neighbour that was mentioned in section 2.3.3. In order to detect density-based outliers, the multi granular deviation detection algorithm is proposed. Since the latter is based on local conditions, it is more accurate than the distance-based outlier detection method. The multi granular deviation detection algorithm is a practical example of examining data projections and is found to be highly precise and effective (S. Subramaniam, T. Palanas, D. Papadopoulos, 2006). J. Chen et al. (2010) propose B-spline smoothing and kernel smoothing as methods to detect corrupted data. Confidence intervals are used to decide whether an observation should be considered normal or corrupted. The challenge is to determine the best smoothness level, which requires user involvement. The combination of automatic detection and user input improves the overall performance of this method (J. Chen, W. Li, A. Lau, J. Cao, K. Wang, 2010). Raman et al. (2001) discuss discrepancy detection. Discrepancies can be considered indicators of outliers. Formerly, data were audited with an existing auditing tool to find discrepancies. These existing methods require much effort from the user. Discrepancy detection methods must match the data domain. Unfortunately, data values are often composite structures of values from different domains. A new tool is proposed to find and solve discrepancies more efficiently: Potter’s Wheel. This software package detects discrepancies automatically by using the latest transformed view of the data and thereby makes it possible to find more discrepancies (V. Raman, J.M. Hellerstein, 2001). It is not clear to which of the general methods to detect outliers this practical method can be linked.

Duplicates. Sung et al. (2002) propose new methods for detection and comparison. In the proposed detection method, RAR (reduction using anchor record), a certain record is used as an anchor to compare with other records. The new comparison method that is proposed is called TI-similarity.
This method computes the degree of similarity of two records and uses field weights and a threshold to decide whether the records are duplicates. These methods are similar to record similarity, and thus can be linked to the general method ‘sorted neighbourhood’. Running the two proposed new methods reduces the number of comparisons and can therefore save significant time compared to the sorted neighbourhood methods that were mentioned in section 2.3.3. The results are the same as with the general methods (S.Y. Sung, Z. Li, P. Sun, 2002).

Bayesian classification I. X. Zhu et al. (2005) discuss Bayesian classification as a practical example. In Bayesian classification, a distinction is made between attribute noise and class noise. Attribute noise concerns erroneous attribute values, which are comparable to outliers. Class noise refers to contradictory examples and misclassifications, which can be loosely linked to duplicates. It is hard to distinguish instances with erroneous attribute values from normal instances, because the erroneous value can easily be seen as a new training example with valuable information. A partitioning filter is proposed to identify noise. This filter is most probably based on general methods concerning detection of outliers and duplicates. Experiments and comparative studies have shown that the proposed approach is effective and robust in identifying noise and improving classification accuracy.

Genetic linkage I. S. Lincoln et al. (1992) describe a systematic method to detect errors in laboratory typing results. The method is based on the traditional maximum likelihood approach and compares typing results to categorical variables. In genetics, random errors tend to produce events that are statistically unlikely. The proposed model can recognize when an event is more likely to be the result of such an error than of recombination of genes (S.E. Lincoln, E.S. Lander, 1992).

Traffic loops I. In California, single loop detectors are used as sources for traffic data. C. Chen et al. (2002) state that it is hard to tell whether a single sample from such a loop that is not an outlier is good or bad. The Washington algorithm can be used to declare samples good or bad. This algorithm derives boundaries for the parameters of an acceptable sample from historical data or theory. The disadvantage of this method is that it generates both false positives and false negatives. This method is based on general methods for detection of distance-based outliers without time series. C. Chen et al. (2002) propose a new detector diagnostics algorithm that is based on the empirical observation that good and bad detectors behave very differently over time rather than on the judgement of individual samples. Therefore, this method shows a connection to the general methods for detecting distance-based outliers in time series, as discussed in section 2.3.3. Several thresholds are used to determine whether a loop is good or bad. This method gives fewer false negatives and no false positives, which suggests that the algorithm performs well (C. Chen, J. Kwon, J. Rice, A. Skabardonis, P. Varaiya, 2002).

VANETs I. A VANET is a vehicular ad hoc network in which cars are used as mobile nodes that can connect to each other. Vehicles can use these connections to communicate with each other, for example for safety purposes. It is desirable to detect misbehaviour (i.e. when a person tries to gain access to a particular lane) in such a network. The focus is on distinguishing between correct and false information received from a vehicle. Algorithms are proposed to detect false messages and misbehaving nodes by observing actions after the messages are sent out. It is assumed that most misbehaviour in the network arises from selfish reasons. However, the model can also handle misbehaviour from malicious nodes. The algorithm uses the consistency of recent messages and new messages with reported and estimated vehicle positions. Messages contain information about the type, time, and location of the alert and about the location of the vehicle. This information is checked and compared with other available information to come to a judgement about the correctness of new messages. This method is based on a general method for detection of outliers in time series. Comparing the proposed method with existing misbehaviour detection schemes shows that false location information is found, while communication overhead is greatly reduced and location privacy is guaranteed with this new method (S. Ruj, M.A. Cavenaghi, Z. Huang, A. Nayak, I. Stojmenovic, 2011).
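The kernel-estimator idea behind the distributed deviation detection and smoothing approaches discussed above can be imitated in a few lines: approximate the data distribution with a kernel density estimate and flag observations that fall in very low-density regions. The data and the 1% density threshold are hypothetical choices for illustration only.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical one-dimensional sensor readings with a few corrupted values.
rng = np.random.default_rng(3)
values = np.concatenate([rng.normal(50, 2, size=200), [80.0, 12.0]])

# Approximate the data distribution with a kernel density estimator and flag
# observations whose estimated density falls in the lowest 1%.
kde = gaussian_kde(values)
density = kde(values)
threshold = np.quantile(density, 0.01)
outliers = values[density <= threshold]
print(outliers)
```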
Table 4 - Types of incorrect data values and specific methods to detect them

| Type of incorrect data values | Existing methods (general and specific) | Proposed method (specific) | Advantages of proposed method | Use if... |
|---|---|---|---|---|
| Distance-based outliers* | No time series: k-nearest neighbour, k-means clustering, neural networks. Time series: z-value, box plot, scatterplot, Mahalanobis distance | No time series: distributed deviation detection. Time series: B-spline smoothing, kernel smoothing | More efficient through the use of kernel estimators | ..thresholds can easily be specified |
| Density-based outliers | Data projections | Multi granular deviation detection | Highly precise; effective | ..data points show different densities across space/time |
| Discrepancies | Auditing tool | Potter's Wheel | Less effort; higher efficiency; more discrepancies found | ..outliers are not easy to find |
| Duplicates** | Detection: sorted neighbourhood (pairwise), priority queues (cluster members). Comparison: similarity, edit distance | Detection: RAR. Comparison: TI-similarity | Same results in less time | ..one wants to exclude duplicate records |
| Bayesian misclassification | n/a | Partitioning filter | Effective; robust; improved classification accuracy | ..attribute and class noise are present |
| Typing errors in genetic linkage data | n/a | Maximum likelihood approach | Recognition of events that are more likely to be the result of an error than of gene recombination | ..data are from the genetic field |
| Erroneous traffic loop data | Washington algorithm | Detector diagnostics algorithm | Fewer false positives; fewer false negatives | ..behaviour of good and bad detectors differs over time |
| Misbehaving nodes in VANETs | n/a | Misbehaviour detection algorithm | Detection of false location information; reduced communication overhead; location privacy is guaranteed | ..misbehaviour is not always associated with the same node |

* For distance-based outliers, one has the choice between methods that are suited for time series or methods that are not meant for time series.
** For duplicates, both detection and comparison methods are needed.
2.4.4 Correction of incorrect data values

This section explains specific methods to perform transformations and to correct incorrect data values in the practical examples that were discussed in the previous paragraph. Concerning duplicates, no specific methods were found, since the general method suffices for the removal of duplicates.

Transformations. As mentioned before, transformations are needed when data show inconsistencies in schema, formats, and adherence to constraints, due to, for example, data entry errors or the merging of multiple sources (i.e. outliers or discrepancies). Existing transformation tools show a lack of interactivity (i.e. feedback) and require much user effort. Potter’s Wheel integrates transformation and discrepancy detection in a single interface. Users need to specify desired results on example values. The software then automatically infers a suitable transformation, which can be specified or cancelled by the user. This iterative way of working makes it possible to make better transformations (V. Raman, J.M. Hellerstein, 2001). L. Berti-Équille et al. (2011) propose a framework that identifies and exploits complex glitch patterns (i.e. incorrect data values) for data cleaning. The framework is called DEC, which stands for Detect-Explore-Clean, and attempts to eliminate the root cause of glitches. First, different detection methods are used and combined to produce a glitch score for each case. Second, statistical tests are used to identify interactions between glitch categories in order to formulate data-driven multivariate cleaning strategies. Finally, user-defined ideals and cost and effectiveness constraints are used to reduce the set of candidate cleaning strategies. The DEC framework links detection and cleaning through iteration and expert feedback and by treating inter-related glitches simultaneously instead of individually. The framework proves to be more effective than traditional strategies. Clearly, both Potter’s Wheel and the DEC framework originate from the general transformation methods mentioned in section 2.3.4.

Bayesian classification II. Correction for noise that is found in classification comes down to discarding the identified noise. This means that bad cases are not substituted, but simply deleted (X. Zhu, X. Wu, Q. Chen, 2006). In this practical example, the method used is similar to the general method for the removal of duplicates.

Genetic linkage II. For candidate errors that were detected with a maximum likelihood algorithm, the typing results can simply be rechecked against the data of interest. The correct values are available but were fed into the system incorrectly because of typing errors (S.E. Lincoln, E.S. Lander, 1992). The method that is used in this practical example can be considered a simple specific version of the general transformation method.

Traffic loops II. If single loop traffic detectors are found to be bad, the samples originating from those loops are discarded. This leaves holes in the data, which can be filled again using different approaches. One approach is prediction using time series analysis. However, this method is inappropriate for errors that do not occur randomly. Unfortunately, this is the case with traffic detectors. Another method, linear interpolation imputation, is intuitive but makes naïve assumptions about the data. C. Chen et al. (2002) propose an algorithm that uses historical data and information from good neighbour loops. This method is less optimal than multiple regression, but more robust because it uses a pairwise model; estimates are generated as long as there is one good neighbour. This means an alternative imputation scheme is required for samples which have no good neighbours at a certain time. The neighbourhood method produces estimates with lower error and performs better than linear interpolation (J. Chen, W. Li, A. Lau, J. Cao, K. Wang, 2010). The method that is used in this practical example finds its roots in the imputation methods mentioned in section 2.3.2.
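The neighbour-based idea can be sketched as a pairwise linear relation fitted on historical data from periods in which both loops were good, and then used to fill the holes left by a bad loop. The loop names, counts, and the simple linear form are hypothetical; this is an illustration of the general idea, not the algorithm of C. Chen et al. (2002).

```python
import numpy as np
import pandas as pd

# Hypothetical historical counts for two neighbouring loops (good periods only).
hist = pd.DataFrame({
    "bad_loop":  [110, 120, 131, 98, 105, 142, 118],
    "neighbour": [108, 119, 128, 101, 104, 140, 121],
})

# Fit a simple pairwise linear relation bad_loop ~ a + b * neighbour on the history.
b, a = np.polyfit(hist["neighbour"], hist["bad_loop"], deg=1)

# Current samples: the bad loop's readings are discarded, the neighbour's are good.
neighbour_now = np.array([115.0, 99.0, 134.0])
imputed = a + b * neighbour_now     # estimates exist as long as one good neighbour exists
print(imputed)
```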
Concerning traffic, H. Haj-Salem et al. (2002) focus on nonlinear phenomena in traffic, such as congestion. Former algorithms do not perform well in nonlinear traffic conditions. They propose the PROPAGE algorithm, which uses a first-order model to reconstruct data, because first-order models are claimed to be well suited to describing shock waves (H. Haj-Salem, J.P. Lebacque, 2002).

VANETs II. Any misbehaviour that is encountered in VANETs will not be revoked, which saves communication overhead. The providers of incorrect information will simply be fined. The incorrect data will not be corrected, since the data are only relevant at the moment of creation (S. Ruj, M.A. Cavenaghi, Z. Huang, A. Nayak, I. Stojmenovic, 2011). The method that is used in this practical example links to the general method for the removal of duplicates, as discussed in section 2.3.4.

Overview. This paragraph describes methods to correct the different types of incorrect data values that were mentioned in the previous paragraph. The immaturity of this area of research causes this paragraph to contain only a few propositions for new methods and even fewer evaluations of the methods that were mentioned. Table 5 shows existing general and specific methods to correct the different types of incorrect data values that were discussed in the previous paragraph. Proposed new methods and their advantages are described where applicable.

Table 5 - Specific methods to correct different types of incorrect data values
| Type of incorrect data values | Existing method (general and specific) | Proposed method | Advantages of proposed method |
|---|---|---|---|
| Distance-based outliers, density-based outliers, discrepancies | Transformation tools | Potter's Wheel; DEC framework | User feedback; less user effort; integration of detection and correction phase |
| Duplicates | Prune to eliminate false positives that were found in the detection phase and discard the real duplicates | n/a | n/a |
| Bayesian misclassification | Delete cases that were identified as noise in the detection phase | n/a | n/a |
| Typing errors in genetic linkage data | Recheck candidate errors and correct if necessary | n/a | n/a |
| Erroneous traffic loop data | Prediction with time series; linear interpolation imputation | PROPAGE algorithm; use historical data and information from good neighbours | Uses a model that suits well for describing nonlinear phenomena; lower error in estimates; better performance |
| Misbehaving nodes in VANETs | Impose fines for misbehaviour; data are not corrected | n/a | n/a |
2.5 Method overview

The overview below lists the methods encountered in the literature, grouped into methods to correct missing data values and methods to detect and correct incorrect data values, together with the articles in which these methods were encountered.

Methods to correct missing data values: complete-case analysis, listwise deletion (LD), pairwise deletion (PD), weighting methods, dummy variable adjustment, multiple imputation (MI), maximum likelihood (ML), MI + weighting, and doubly robust estimators.

Methods to detect and correct incorrect data values: clustering methods, statistical tests, kernel smoothing, multi granular deviation detection, data projections, Potter's Wheel, the DEC framework, the PROPAGE algorithm, sorted neighbourhood methods and priority queues, similarity, edit distance, RAR + TI-similarity, the partitioning filter, the Washington and detector diagnostics algorithms, and misbehaviour detection.

Articles: (C.C. Aggarwal, P.S. Yu, 2001); (Allison, 2003); (L. Berti-Équille, T. Dasu, D. Srivastava, 2011); (J.R. Carpenter, M.G. Kenward, S. Vansteelandt, 2006); (C. Chen, J. Kwon, J. Rice, A. Skabardonis, P. Varaiya, 2002); (J. Chen, W. Li, A. Lau, J. Cao, K. Wang, 2010); (Enders, 2001); (S. Greenland, W.D. Finkle, 1995); (H. Haj-Salem, J.P. Lebacque, 2002); (J.G. Ibrahim, M. Chen, S.R. Lipsitz, A.H. Herring, 2005); (S.E. Lincoln, E.S. Lander, 1992); (J. Luengo, J.A. Sáez, F. Herrera, 2011); (I. Myrtveit, E. Stensrud, U.H. Olsson, 2001); (V. Raman, J.M. Hellerstein, 2001); (S. Ruj, M.A. Cavenaghi, Z. Huang, A. Nayak, I. Stojmenovic, 2011); (S.R. Seaman, I.R. White, A.J. Copas, L. Li, 2012); (S. Subramaniam, T. Palanas, D. Papadopoulos, 2006); (S.Y. Sung, Z. Li, P. Sun, 2002); (R. Young, D. Johnson, 2013); (M. Zhong, P. Lingras, S. Sharma, 2004); (X. Zhu, X. Wu, Q. Chen, 2006).
2.6 Summary
Summarizing the detailed results above, the various methods for detecting and correcting data values can be classified as follows:
1. General methods
   a. Methods to detect missing data values
      i. Empty fields, dots or '999'
   b. Methods to correct missing data values
      i. Complete-case analysis, listwise deletion, pairwise deletion, dummy variable adjustment
      ii. Maximum likelihood
      iii. Weighting
      iv. Multiple imputation
   c. Methods to detect incorrect data values
      i. k-nearest neighbour, k-means clustering, neural network methods
      ii. Statistical tests
      iii. Examination of data projections
      iv. Sorted neighbourhood or priority queues, followed by similarity or edit distance
   d. Methods to correct incorrect data values
      i. Transformation tool
      ii. Pruning and discarding
2. Specific methods
   a. Methods to detect missing data values
      n/a
   b. Methods to correct missing data values
      i. Multiple imputation + inverse probability weighting
      ii. Doubly robust estimators
   c. Methods to detect incorrect data values
      i. Kernel smoothing
      ii. Multi granular deviation detection algorithm
      iii. Potter's Wheel
      iv. RAR, followed by TI-similarity
      v. Partitioning filter
      vi. Maximum likelihood, followed by comparison
      vii. Washington algorithm, detector diagnostics algorithm
      viii. Misbehaviour detection
   d. Methods to correct incorrect data values
      i. Potter's Wheel
      ii. DEC framework
      iii. Discarding
      iv. PROPAGE algorithm
      v. Linear interpolation imputation
As mentioned before, the context is important when choosing a method. It is therefore hard to point out a single best method for each category. The following subparagraphs describe the best methods and considerations for choosing a certain method.
Methods to detect missing data values. Depending on the software one works with, finding empty fields, dots or cells containing '999' is the best method in this category.
Methods to correct missing data values. A trade-off should be made between complexity and efficiency; methods with lower complexity generally also score lower on efficiency. Complete-case analysis and weighting methods would be acceptable in a situation with little time and/or little
missing data values. If one wants higher efficiency, maximum likelihood, multiple imputation or a combined method should be chosen. If imputation models are complex, the method that combines multiple imputation and inverse probability weighting serves best. In case of violation of distributional assumptions, one should choose the method of doubly robust estimators. In other situations, a trade-off between complexity and efficiency should be made; the method of doubly robust estimators is less complex, but also less efficient, than multiple imputation + inverse probability weighting.
Methods to detect incorrect data values. If one wants to find distance-based outliers, several methods are available for situations with and without time series. For situations without time series, distributed deviation detection is pointed out as a good choice. In case the situation does not allow for this method, one of the more general methods can be used. For situations with time series, one can use smoothing for the sake of efficiency. If the situation does not allow for smoothing, the simpler general statistical tests can be carried out to check for outliers. If one wants to find density-based outliers, multi granular deviation detection should be used if the situation allows it. Another option would be the examination of data projections. For the detection of duplicates, it is claimed that the combination of the detection method RAR and the comparison method TI-similarity gives the same results as current methods, but in less time. In addition, Potter's Wheel is mentioned as a way to find more discrepancies (i.e. outliers and duplicates) with less effort than the auditing tools that are currently used. In a situation with Bayesian classification, a partitioning filter is claimed to be a robust and effective method to find errors. Approaches mentioned for very specific practical examples are considered too specific to mention as best methods here, since their area of applicability is very narrow.
Methods to correct incorrect data values. Correcting for outliers can be done by means of Potter's Wheel or the DEC framework, which is comparable to the first method. Removal of duplicates concerns pruning and discarding. In this subparagraph the approaches for very specific practical examples are left out as well, for the same reason as mentioned in the previous subparagraph.
Concluding, both general and specific methods are currently available for the detection and correction of missing and incorrect data values. Specific methods tend to perform better for specific practical examples than general methods. However, the applicability of those specific methods is not always high. Therefore, general methods can sometimes be preferred over specific methods.
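To make the general missing-data methods above concrete, the following is a minimal sketch, assuming a small pandas data frame with invented column names and '999' used as a missing-value code, of how placeholder detection, complete-case analysis (listwise deletion), and a simple single imputation could look in practice. It illustrates the general ideas only and is not an implementation from any of the surveyed publications.

```python
# Minimal sketch: detect placeholder codes for missing values and apply two
# simple corrections (listwise deletion and single mean imputation).
# Column names and the '999' placeholder are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "travel_time": [42.0, 999, 38.5, np.nan, 41.2],   # 999 used as a missing-value code
    "load_weight": [12.1, 11.8, 999, 10.9, np.nan],
})

# Detection: treat empty fields (NaN) and the code 999 as missing.
df = df.replace(999, np.nan)
print(df.isna().sum())                 # number of missing values per column

# Correction 1: listwise deletion (complete-case analysis).
complete_cases = df.dropna()

# Correction 2: single mean imputation, a simple stand-in for the more
# sophisticated maximum-likelihood or multiple-imputation methods.
imputed = df.fillna(df.mean())
print(complete_cases)
print(imputed)
```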
2.7 Discussion
This section discusses several striking findings and answers the third research question as stated in the introduction. As becomes clear from section 2, the extensive research that has been conducted on the detection and correction of missing data values is in a more mature stage than the research on the detection and correction of incorrect data values. Methods to detect missing data values are the least complicated, in the sense that standard software packages employ simple algorithms to find the missing values in a data set. Publications in the area of correction of missing data values focus on a group of common and well-known methods and criticise and compare those methods. Complete-case analysis, maximum likelihood, weighting, multiple imputation and combinations of these are the most common methods in this area. Maximum likelihood and doubly robust estimators receive the best evaluations.
Concerning incorrect data values, a distinction between the different types of incorrect data values is made. Outliers, duplicates and misclassification are the most common types of incorrect data values. In this section, certain practical examples from genetics and transportation are discussed as well. The publications on detecting incorrect data values do propose new methods, such as the use of kernel estimators, but are quite specific, which makes it harder to apply the methods in a general way, in contrast to the more generally applicable methods that are mentioned in the field of missing
data value correction. The publications in the area of correction of incorrect data values are still in the exploratory phase, which is evident from the predominantly descriptive texts on existing methods. Only a few researchers come up with new methods and foundations for these new approaches. The lack of general applicability that was found in the area of detection of incorrect data values also holds for both the existing and the proposed models in this area.
Data quality problems are becoming more complex. The current state of affairs regarding quantitative data cleaning has several limitations which can impair detection and correction:
- approaches are often one-dimensional, which is not the case for real-world data;
- assumptions like MAR are unrealistic;
- detection and correction are not linked.
As mentioned in many of the publications that were used for this literature review, future research on the subject of detection and correction of missing and incorrect data values is desirable. In subsequent research, attention should be given to the acceptability of assumptions and the general applicability of methods. In addition, the integration of detection and correction would lead to a more ideal situation, since it enables fast and efficient data cleaning. Considering the current state of affairs concerning this subject, there is still a long way to go. However, the DEC framework of L. Berti-Équille et al. (2011) and Potter's Wheel of V. Raman et al. (2001) can be considered steps in the right direction.
3 Data clustering techniques
Clustering aims to divide data into groups of similar objects, while some details are disregarded in exchange for data simplification (P. Berkhin, 2006; R. Xu et al., 2005). Clustering has been applied in a wide variety of fields, e.g., computer science (web mining, spatial database analysis, textual document collections, image segmentation, speech recognition, transportation data), life and medical sciences (genetics, biology, microbiology, palaeontology, psychiatry, clinical medicine, pathology), earth sciences (geography, geology, remote sensing), social sciences (sociology, psychology, archaeology, education), and economics (business, marketing). There is no clustering algorithm that can be universally used to solve all problems (R. Xu et al., 2005). Hence, clustering is highly application dependent and, to a certain extent, subjective (personal preferences).
Traditional clustering techniques are broadly divided into hierarchical and partitioning techniques. Hierarchical clustering is further subdivided into agglomerative and divisive clustering. A tree representing this hierarchy of clusters is known as a dendrogram, where individual data objects are the leaves of the tree and the interior nodes are nonempty clusters. We survey these algorithms in subsection 3.1. While hierarchical algorithms gradually assemble points into clusters, partitioning algorithms learn clusters directly. Partitioning algorithms try to discover clusters either by iteratively relocating points between subsets or by identifying areas heavily populated with data. These algorithms are surveyed in subsection 3.2.
Due to the abundance and diversity of existing clustering algorithms, we furthermore focus on several features of datasets and the corresponding algorithms, as follows. Clustering of large datasets, which presents scalability problems, is reviewed in subsection 3.3. High-dimensional data and their clustering methods are surveyed in subsection 3.4. Clustering of sequential data, such as protein sequences, retail transactions, intrusion detection logs, web logs, and transportation data, is discussed in subsection 3.5. Text data clustering is a broad area focusing on the features of texts and documents, which is presented in subsection 3.6. Imbalanced data are difficult to cluster due to their imbalanced distribution; subsection 3.7 includes a brief discussion of imbalanced data clustering techniques. Finally, evaluation criteria for these algorithms and a discussion are given in subsections 3.8 and 3.9.
3.1 Hierarchical clustering algorithms
A divisive clustering starts with a single cluster containing all data points and recursively splits that cluster into appropriate subclusters. For a cluster with N objects, there are 2^(N-1) - 1 possible divisions into two subsets, which is very expensive in computation. Therefore, divisive clustering is not commonly used in practice and hence not covered in this subsection. On the contrary, an agglomerative clustering starts with one-point clusters and recursively merges two or more of the most similar clusters. The process continues until a stopping criterion is achieved. Agglomerative clustering algorithms are widely used in practice and are therefore discussed in more detail.
Hierarchical clustering works with the NxN matrix of distances (dissimilarities) or similarities between training points, called the connectivity matrix. An important example of a dissimilarity between two points is the distance between them. To merge or split subsets of points, the distance between individual points has to be generalized to the distance between subsets. Such a derived proximity measure is called a linkage metric, and linkage metrics are constructed from elements of this matrix. The type of linkage metric used has a significant impact on hierarchical algorithms, because it reflects a particular concept of closeness and connectivity. Important intercluster linkage metrics include single link, average link, and complete link (F. Murtagh, 1985). The underlying dissimilarity measure is computed for every pair of nodes with one node in the first set and another node in the second set. A specific operation such as the minimum (single link), average (average link), or maximum (complete link) is applied to these pairwise dissimilarity measures.
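As an illustration of how the choice of linkage metric affects the merging behaviour, the following is a small sketch using SciPy's agglomerative clustering routines on invented two-dimensional data; the data, the number of clusters, and the selected linkage methods are assumptions made for the example only.

```python
# Sketch: compare single, average, and complete linkage on toy data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose point clouds in 2-D as toy data (assumed, not from the survey).
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

for method in ("single", "average", "complete"):
    Z = linkage(X, method=method)                     # build the dendrogram bottom-up
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes per linkage metric
```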
There are various hierarchical clustering algorithms. The Lance-Williams formula provides a general recurrence that covers many of these linkage metrics. SLINK (R. Sibson, 1973) implements single link. Voorhees' method (E.M. Voorhees, 1986) implements average link. The algorithm CLINK (D. Defays, 1977) implements complete link. Moreover, a few newer representatives are also available, such as BIRCH and CURE, which are described in subsection 3.3, and CHAMELEON, which is described below.
The hierarchical agglomerative algorithm CHAMELEON was developed by G. Karypis et al. (1999). It uses the connectivity graph that corresponds to the K-nearest-neighbour sparsification of the connectivity matrix: the edges to the K nearest neighbours of any given point are preserved, while the rest are pruned. CHAMELEON finds the clusters in the data set using a two-phase algorithm. During the first phase, CHAMELEON uses a graph partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters. During the second phase, it uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.
Hierarchical agglomerative algorithms have certain attractive features: they are flexible regarding the level of granularity, they can easily handle any form of similarity or distance, and they can be applied to any attribute type. However, the disadvantages of hierarchical clustering are that it is difficult to choose the right stopping criteria and that most hierarchical algorithms do not revisit (intermediate) clusters once they are constructed.
3.2 Partitioning clustering algorithms
These algorithms are called partitioning relocation clustering algorithms because they discover clusters by iteratively relocating points between subsets. They can be further classified into probabilistic clustering, K-medoids methods, and the various K-means methods, which are discussed individually in the following subsections. Such methods concentrate on how well points fit into their clusters and tend to build clusters of proper convex shapes. There is another kind of partitioning algorithms, which identify areas heavily populated with data and attempt to discover dense connected components of data. This density-based approach allows the discovered clusters to be flexible in terms of their shape. Furthermore, these algorithms are less sensitive to outliers and can discover clusters of irregular shape.
3.2.1 Probabilistic clustering
The Expectation-Maximization (EM) method is a two-step iterative optimization that adopts the log-likelihood as its objective function. Step (E) estimates probabilities and is equivalent to a soft reassignment. Step (M) finds an approximation to the mixture model given the current assignments, i.e., the mixture model parameters that maximize the log-likelihood. The process continues until the log-likelihood converges. Some methods have been proposed to facilitate finding better local optima. For example, A. Moore (1999) suggested an acceleration of the EM method based on a special data index, the KD-tree. Data are divided at each node into two descendants by splitting the widest attribute at the center of its range. Each node stores sufficient statistics so as to allow reconsideration of point assignment decisions. In the end, EM iterations can be accelerated by approximate computing over a pruned tree. Probabilistic clustering has some important features: it can be modified to handle points with a complex structure; it can be stopped and resumed with consecutive batches of data, because clusters have a representation that is totally independent from sets of points; at any stage of the iterative process the intermediate mixture model can be used to assign points to clusters; and it leads to easily interpretable cluster systems.
3.2.2 K-medoids methods
In k-medoid clustering algorithms (Shu-Chuan Chu et al., 2003), a set of points from the original data are used as the medoids around which the clusters are built. The key aim of the algorithm is to determine an optimal set of representative points from the original corpus around which the clusters
are built. Each point is assigned to its closest representative from the collection. This creates a running set of clusters from the corpus, which are successively improved by a randomized process. The algorithm works with an iterative approach in which the set of k representatives is successively improved with the use of randomized interchanges. Specifically, the averaged distance or another dissimilarity measure between a point and its closest representative is used as the objective function, and the objective function needs to be improved during this interchange process. In each iteration, a randomly picked representative in the current set of medoids is replaced with a randomly picked representative from the collection, if this improves the clustering objective function. This approach is applied until convergence is achieved. K-medoid methods in the literature are algorithms such as PAM (Partitioning Around Medoids), CLARA (Clustering LARge Applications), and CLARANS (Clustering Large Applications based upon RANdomized Search) (L. Kaufman et al., 1990; R. Ng, 1994). There are two main disadvantages of the use of k-medoids based clustering algorithms. On the one hand, they require a large number of iterations in order to achieve convergence and are therefore quite slow. On the other hand, they do not work very well for sparse data such as text (C.C. Aggarwal et al., 2012).
3.2.3 K-means methods
The K-means algorithm is the best-known squared-error-based clustering algorithm (J. MacQueen, 1967). The K-means algorithm is very simple and can be easily implemented in solving many practical problems. It can work very well for compact and hyperspherical clusters. The time complexity of K-means is O(Nkd), where k and d are usually much less than N. Hence, K-means can be used to cluster large data sets. Parallel techniques for K-means have been developed which can largely accelerate the algorithm (K. Stoffel et al., 1999). The K-means algorithm can be summarized in four steps:
Step 1: Initialize a K-partition randomly or based on some prior knowledge. Calculate the cluster prototype matrix.
Step 2: Assign each object in the data set to the nearest cluster.
Step 3: Recalculate the cluster prototype matrix based on the current partition.
Step 4: Repeat steps 2-3 until there is no change for each cluster.
The drawbacks of K-means are also well studied, which has motivated many variants of K-means that try to overcome these obstacles. The major disadvantages are the following. There is no efficient and universal method for identifying the initial partitions and the number of clusters, and the converged centroids vary with different initial points; a general solution for this problem is to run the algorithm many times with random initial partitions. Moreover, the iterative optimization procedure of K-means cannot guarantee convergence to a global optimum. K-means is sensitive to outliers and noise: even if an object is quite far away from the cluster centroid, it is still forced into a cluster and thus distorts the cluster shapes. Finally, the method itself limits its application to numerical variables.
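The four steps above map almost directly onto code. The following is a minimal NumPy sketch of the procedure on invented two-dimensional data; the data, the value of k, and the stopping rule are assumptions for illustration, not a reference implementation from the cited literature.

```python
# Minimal K-means sketch following the four steps described above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize the K-partition by picking k random points as prototypes.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each object to the nearest cluster prototype.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the cluster prototype matrix from the current partition.
        # (Empty clusters are not handled in this toy sketch.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the prototypes no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in (0, 3, 6)])  # three toy blobs
labels, centroids = kmeans(X, k=3)
print(centroids)
```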
3.3 Large dataset clustering algorithms For a large data set, it is not practical to keep a connectivity matrix in memory. Hence different techniques are used to sparsify (introduce zeroes into) the connectivity matrix, for example, omitting entries smaller than a certain threshold, using only a certain subset of data representatives, or keeping with each point only a certain number of its nearest neighbours. One of the most striking developments in hierarchical clustering is the BIRCH algorithm. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) has been demonstrated to be especially suitable for very large databases, Zhang et al. (1996). BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources, i.e., available memory and time constraints. BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a
few additional scans. BIRCH was also the first clustering algorithm to handle noise (outliers) effectively. The concepts of the Clustering Feature (CF) and the CF tree are at the core of BIRCH's incremental clustering. A Clustering Feature is a triple summarizing the information about a cluster: the CF vector of a cluster keeps the number of data points, their linear sum, and their square sum. CF vectors of clusters can be stored and calculated incrementally and accurately as clusters are merged. CF vectors provide summaries of clusters, which are not only efficient, because they store much less than all the data points in the cluster, but also accurate, because they are sufficient for calculating all the measurements that are needed for making clustering decisions in BIRCH. The CF tree is built dynamically as new data objects are inserted. The CF tree is a very compact representation of the dataset, because each entry in a leaf node is not a single data point but a subcluster. The CF tree structure captures the important clustering structure of the original data while reducing the required storage. Outliers are eliminated from the summaries by identifying the objects that are sparsely distributed in the feature space. After the CF tree is built, an agglomerative hierarchical clustering is applied to the set of summaries to perform global clustering. BIRCH can achieve a computational complexity of O(N).
S. Guha et al. (1998) introduced the hierarchical agglomerative clustering algorithm CURE (Clustering Using REpresentatives). In CURE, instead of using a single centroid to represent a cluster, a constant number of representative points are chosen to represent a cluster. The similarity between two clusters is measured by the similarity of the closest pair of representative points belonging to different clusters. New representative points for merged clusters are determined by selecting a constant number of well-scattered points from all the data points and shrinking them towards the centroid of the cluster according to a shrinking factor. Unlike centroid/medoid based methods, CURE is capable of finding clusters of arbitrary shapes and sizes, as it represents each cluster via multiple representative points. Shrinking the representative points towards the centroid helps CURE avoid the problem of noise and outliers in the single link method. The desirable value of the shrinking factor in CURE depends upon the cluster shapes and sizes and the amount of noise in the data.
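For readers who want to experiment with CF-tree based clustering, the following is a hedged sketch using scikit-learn's Birch implementation on invented data; the threshold, branching factor, and number of clusters are illustrative parameter choices, not values prescribed by Zhang et al. (1996).

```python
# Sketch: CF-tree based clustering with scikit-learn's Birch on toy data.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (500, 2)) for c in (0.0, 2.0, 4.0)])

# threshold bounds the radius of a leaf subcluster; branching_factor bounds
# the number of CF entries per node, mirroring the CF-tree parameters.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # sizes of the three recovered clusters
```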
3.4 High dimensional data clustering algorithms
Real-life data often have high dimensionality. For dimensions larger than 20, algorithm performance gradually degrades to the level of sequential search (P. Berkhin, 2006). In order to address high-dimensional data, corresponding developments have been suggested in the literature and are surveyed in this subsection. The problem with high dimensionality is that metric separation decreases as the dimension grows.
One approach to dimensionality reduction uses attribute transformations (e.g., principal component analysis; I.T. Jolliffe, 1986). Dimension reduction is important in cluster analysis: it not only makes the high-dimensional data addressable and reduces the computational cost, but also provides users with a clearer picture and visual examination of the data of interest. However, dimensionality reduction methods inevitably cause some loss of information, may damage the interpretability of the results, and may even distort the real clusters.
Another way to address the problem is through subspace clustering, as in the algorithms CLIQUE (R. Agrawal et al., 1998), MAFIA (S. Goil et al., 1999), and ENCLUS (C. Cheng et al., 1999). Subspace-based clustering addresses the challenge by exploring the relations of data objects under different combinations of features. For example, CLIQUE employs a bottom-up scheme to seek dense rectangular cells in all subspaces with a high density of points. Clusters are generated as the connected components in a graph whose vertices stand for the dense units. The resulting minimal description of the clusters is obtained through the merge of these rectangles.
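As a simple illustration of the attribute-transformation approach, the following sketch projects invented 50-dimensional data onto a small number of principal components before clustering; the dimensions, the number of retained components, and the number of clusters are assumptions for the example, not recommendations from the surveyed work.

```python
# Sketch: PCA-based dimensionality reduction followed by clustering.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                      # 50-dimensional toy data

X_reduced = PCA(n_components=10).fit_transform(X)    # project to 10 dimensions
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(X_reduced.shape, np.bincount(labels))
```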
3.5 Sequential data clustering algorithms
Sequential data come with many distinct characteristics, e.g., variable length, dynamic behaviours, time constraints, and large volume (D.R. Cutting et al., 1992). Sequential data can be generated from DNA sequencing, speech processing, text mining, medical diagnosis, stock markets, customer transactions, web data mining, transportation data, robot sensor analysis, and so on. Potential patterns hidden in the large volume of sequential data are explored during cluster analysis. Generally, strategies for sequential clustering mostly fall into the following categories.
The sequence similarity approach is based on a measure of the distance between each pair of sequences. Based on this distance measure, clustering algorithms, which can be either hierarchical or partitional, can group the sequences. Since many sequential data are expressed in an alphabetic form, conventional distance measures are inappropriate. If a sequence comparison is regarded as a process of transforming a given sequence into another with a series of substitution, insertion, and deletion operations, the distance between the two sequences can be defined by the minimum number of required operations. The defined distance is known as the edit distance or Levenshtein distance (D. Sankoff et al., 1999; C.C. Aggarwal et al., 2012).
Another approach begins with the extraction of a set of features from the sequences. All the sequences are then mapped into the transformed feature space. As a result, classical vector space-based clustering algorithms can be used to form clusters. This approach is called indirect sequence clustering; feature extraction is the essential factor, because it determines the effectiveness of these algorithms. Morzy et al. utilized sequential patterns as the basic element in agglomerative hierarchical clustering. They also defined a co-occurrence measure as the criterion for fusing smaller clusters (T. Morzy et al., 1999). Guralnik and Karypis discussed the potential dependency between two sequential patterns and proposed both a global and a local approach to prune the initial feature sets, so that sequences are better represented in the new feature space (V. Guralnik et al., 2001). These methods greatly reduce the computational complexity and can be applied to large-scale sequence databases. However, the process of feature selection inevitably causes the loss of some information in the original sequences and needs extra attention.
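The edit distance described above can be computed with a straightforward dynamic program; the short sketch below is a generic textbook-style implementation, and the two example strings are invented for illustration.

```python
# Dynamic-programming computation of the edit (Levenshtein) distance,
# counting substitutions, insertions, and deletions.
def edit_distance(a: str, b: str) -> int:
    m, n = len(a), len(b)
    # dp[i][j] = minimum number of operations to turn a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("ACGTT", "AGTC"))   # distance between two short toy sequences
```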
3.6 Text clustering algorithms
The sparse and high-dimensional representation of documents necessitates the design of text-specific algorithms for document representation and processing (C.C. Aggarwal et al., 2012). In order to enable an effective clustering process, the word frequencies need to be normalized in terms of their relative frequency of presence in the document and over the entire collection. In general, a common representation used for text processing is the vector-space based TF-IDF representation (P. Jacqumart et al., 2003). In the TF-IDF representation, the term frequency for each word is normalized by the inverse document frequency (IDF). The inverse document frequency normalization reduces the weight of terms which occur more frequently in the collection. This reduces the importance of common terms in the document collection, ensuring that the matching of documents is influenced more by discriminative words, which have relatively low frequencies in the collection.
Text clustering algorithms are divided into a wide variety of different types, such as agglomerative clustering algorithms, partitioning algorithms, and standard parametric-modelling based methods such as the EM algorithm. Furthermore, text representations may also be treated as strings (rather than bags of words). These different representations necessitate the design of different classes of clustering algorithms. Different clustering algorithms have different trade-offs in terms of effectiveness and efficiency. Distance-based clustering algorithms are designed by using a similarity function to measure the closeness between the text objects. The most well-known similarity function which is commonly used in the text domain is the cosine similarity function. Effective methods for implementing single-linkage clustering for the case of document data may be found in (E. Buyko et al., 2009). Hierarchical clustering algorithms have also been designed in the context of text data streams. The Scatter-Gather method in (Y. Cai et al., 2009) uses both hierarchical and partitional clustering algorithms to good effect. Specifically, it uses a hierarchical clustering algorithm on a sample of the corpus in order to find a robust initial set of seeds. This robust set of seeds is used in conjunction with a standard k-means clustering algorithm in order to determine good clusters. The size of the sample in the initial phase is carefully tailored so as to provide the best possible effectiveness without this phase becoming a bottleneck in algorithm execution. Frequent pattern mining (B. Alex et al., 2007) is a technique which has been widely used in the data mining literature
in order to determine the most relevant patterns in transactional data. The clustering approach in (M. Ashburner et al., 2000) is designed on the basis of such frequent pattern mining algorithms. The main idea of the approach is not to cluster the high-dimensional document data set directly, but to consider the low-dimensional frequent term sets as cluster candidates.
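To illustrate the TF-IDF representation and cosine similarity discussed above, the following is a small sketch using scikit-learn on a handful of invented documents; the documents and the number of clusters are assumptions made purely for the example.

```python
# Sketch: TF-IDF representation, cosine similarity, and simple k-means grouping.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

docs = [
    "truck route planning and travel time",
    "travel time estimation for trucks",
    "gene expression clustering in biology",
    "biological sequence clustering methods",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # rows are documents, columns are terms
print(cosine_similarity(tfidf))                 # pairwise cosine similarities

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)                                   # the two topical groups separate
```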
3.7 Imbalanced data clustering methods
The literature only provides a vague definition of imbalanced data. L. Cao et al. (2008) describe that impact activities, which have a significant effect on the cluster analysis, are normally rare and dispersed in a large activity population. The sharp difference between the number of impact activities and the rest of the activities makes the data imbalanced. H. He et al. (2009) present a definition of imbalanced data with an example in which a small part of the data is labelled "positive" while the rest is labelled "negative". Due to the nature of imbalanced data, when standard clustering algorithms are applied to these data, the clustered results are often biased towards the majority of the data, and hence there is a higher misclassification rate for the minority data, because minority data are often both outnumbered and underrepresented. Some work has been done on learning from imbalanced data efficiently (N. Chawla et al., 2004; H. Guo et al., 2004; N. Japkowicz, 2000; J. Zhang et al., 2004). These works pre-assume that the clustering results are already available, which then allows rules etc. to be learned from these given clusters of imbalanced data. However, clustering techniques for imbalanced data themselves have not been well addressed in the literature so far.
3.8 Evaluation of clustering results The properties of clustering algorithms of concern in data mining include:
- Type of attributes an algorithm can handle
- Scalability to large datasets
- Ability to handle high-dimensional data
- Sensitivity to outliers/noise
- Time complexity
- Ability to deal with sequential data
- Reliance on a priori knowledge and user-defined parameters
- Capability of working with imbalanced data
As the previous subsections of Section 3 show, for every algorithm we have discussed only some of these properties. The following table compares representative clustering algorithms on several chosen properties.

Cluster algorithms | Complexity (time) | Large dataset | High dimensional data | Imbalanced data | Sensitive to outlier
K-means | O(Nkd) | Yes | No | No | Yes
CLARA | O(k(40+k)^2 + k(N-k)) | Yes | No | No | No
CLARANS | quadratic in N | Yes | No | No | No
Hierarchical clustering | O(N^2) | No | No | No | Yes
CHAMELEON | O(Nm + N log N + m^2 log m) | No | No | No | Yes
BIRCH | O(N) | Yes | Yes | No | No
CURE | O(N^2 log N) | Yes | No | No | No
CLIQUE | O(N) | No | Yes | No | No
3.9 Discussions
As an important tool for data exploration, cluster analysis includes a series of steps, ranging from preprocessing (feature selection or extraction) and clustering algorithm design and development to solution validation and evaluation. These steps are tightly related to each other and pose great challenges to the scientific disciplines involved. In this section, we place the focus on the clustering algorithms and review a wide variety of approaches appearing in the literature. Our survey shows that algorithms are designed with certain assumptions and favour certain types of bias. Hence, there is no clustering algorithm that can be universally used to solve all problems. In this sense, it is possible to compare these algorithms, but it is not accurate to speak of a "best" clustering algorithm. Such comparisons are mostly based on specific applications, under certain conditions, and the results may become quite different if the conditions change.
At the preprocessing phase, feature selection/extraction is as important as the clustering algorithm itself. Choosing appropriate and meaningful features can greatly reduce the burden of subsequent designs. However, feature selection/extraction lacks universal guidance and is still dependent on the applications themselves.
Although there is a huge number of different clustering algorithms, new clustering algorithms are still needed when new challenges arise from the nature of the data. Several features of datasets will continue to motivate the design of new clustering algorithms that can:
- handle large volumes of data as well as high-dimensional features with acceptable time and storage complexities;
- handle imbalanced features together with sequential and high-dimensional features in the dataset;
- detect and remove possible outliers and noise;
- decrease the reliance of algorithms on user-dependent parameters;
- provide some insight into the number of potential clusters without prior knowledge;
- provide users with results that can simplify further analysis.
4 Data obfuscation
Data obfuscation (Bakken et al. 2004) or anonymization (Cormode and Srivastava 2009) is the area of research that involves the development of techniques for making it impossible to derive certain information from a data set. Anonymization is the area in which techniques are developed for making it impossible to derive identities from a data set. Obfuscation is more general and concerns the development of techniques for making it impossible to derive other information, while preserving probability distributions, such that the data can still be used in (statistical) analysis. Data obfuscation can, for example, be used to obfuscate sales data for specific products of specific companies, while preserving their probability distributions.
Data obfuscation and anonymization have seen much use in healthcare (Lin, Owen and Altman 2004; Kalam et al. 2004). However, they have other applications as well. The area of logistics, on which this project focuses, has particular requirements. It should not only be impossible to derive the identity of companies and truck drivers, but it should also be impossible to derive whether certain routes are busy with transportation, as this is important competitive information. It could be used to detect important customers or to drive down the price of transportation on certain routes. If such information could be derived, transportation companies would not want to participate in information sharing. To the best of our knowledge, there exist no data obfuscation techniques for this particular purpose, in particular because it is the explicit goal of obfuscation to preserve statistical properties. Common data obfuscation techniques are: adding random noise, data swapping, partial suppression, and linear transformation.
4.1 Data randomization
The data randomization technique can be used to obfuscate specific values of a given variable. The technique works by adding random noise to that variable (Agrawal and Ramakrishnan 2000). For example, if the variable takes the values x1, x2, …, xn in successive observations, values y1, y2, …, yn can be taken from a random distribution to transform the original values into x1 + y1, x2 + y2, …, xn + yn. As the added values are drawn from a known random distribution, statistical properties like the mean, median and standard deviation of the original data can still be estimated, and if the added values are sufficiently large, the original values are obfuscated. While the benefit of this technique is that it is simple to apply, a drawback is that the original values x1, x2, …, xn can be approximated.
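A minimal sketch of the idea, assuming an invented variable (travel times in minutes) and an arbitrary noise scale, is shown below; it only illustrates that the individual values are hidden while an aggregate such as the mean stays close to the original.

```python
# Sketch: obfuscate a variable by adding zero-mean random noise.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=60.0, scale=5.0, size=10_000)       # original values (assumed travel times)

noise = rng.normal(loc=0.0, scale=15.0, size=x.size)   # zero-mean random noise
x_obfuscated = x + noise

# Individual values are hidden, but the mean of the data remains close.
print(round(x.mean(), 2), round(x_obfuscated.mean(), 2))
```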
4.2 Data swapping
The data swapping technique can be used to obfuscate particular properties of a given partner, potentially including multiple variables. The technique works by swapping the values of certain variables between different rows in a table (Dalenius and Reiss 1982), in which each row represents the properties of a specific person or company. In this way, the properties cannot be traced to a person or company, or at least one cannot be sure that the properties given to a person or company are indeed theirs. The technique has the added benefit that statistical measures can be given about the 'amount of swapping' that is done, such that an indication can be given about the (un)certainty that the data that is associated with a person or company is indeed theirs. In addition, it can easily be seen that, like the data randomization technique, the data swapping technique preserves statistical properties. Note, however, that the data swapping technique only preserves statistical properties between variables insofar as these variables are swapped together. For example, when swapping the values belonging to only one variable, it is not possible to compute the correlation between that variable and another variable.
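The following is a small sketch of data swapping on a toy table with invented column names: the values of one variable are permuted across rows so that they can no longer be traced to a specific company, while the column's own distribution stays unchanged.

```python
# Sketch: swap (permute) the values of one variable across rows.
import pandas as pd

df = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "shipments_per_week": [120, 45, 300, 80],
})

swapped = df.copy()
swapped["shipments_per_week"] = (
    df["shipments_per_week"].sample(frac=1.0, random_state=0).to_numpy()
)
print(swapped)
# Note: correlations between the swapped column and other columns are lost
# unless related columns are swapped together, as remarked above.
```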
4.3 Partial suppression The partial suppression technique can be used to ensure that it is hard to derive data for a specific person. The technique works by removing variables from a dataset that can be used to identify people directly (Sweeney 2002), but also by grouping variable values to further hinder identification of specific individuals. For example, the combination of initials, surname, address, and date of birth can be used to identify a person uniquely in most cases. Just providing a date of birth is far less identifying. Similarly, using specific ages (51, 53, 54 years old) is more identifying than using age groups (50-59 years old). The k-anonymity method (Sweeney 2002) is based on the elimination and grouping of identifying variables, such that the remaining identifying variables point to at least k individuals. The benefit of this technique is that it preserves the properties of the remaining variables. However, the drawback is that it is only feasible in large data sets, in which sufficiently large values of k can be achieved without removing too many variables.
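The following is a hedged sketch of partial suppression and grouping, assuming a toy table with invented columns and k = 2; it drops the direct identifier, generalizes the age into ten-year bands, and then checks whether every remaining combination of quasi-identifiers occurs at least k times.

```python
# Sketch: partial suppression and generalization towards k-anonymity.
import pandas as pd

k = 2
df = pd.DataFrame({
    "name":      ["Ann", "Bob", "Cas", "Dee", "Eli"],
    "age":       [51, 53, 54, 34, 38],
    "post_code": ["5611", "5611", "5611", "5612", "5612"],
})

anonymized = df.drop(columns=["name"])                  # remove the direct identifier
band_low = (anonymized["age"] // 10) * 10               # generalize age into 10-year bands
anonymized["age"] = band_low.astype(str) + "-" + (band_low + 9).astype(str)

# Check the k-anonymity condition on the remaining quasi-identifiers.
group_sizes = anonymized.groupby(["age", "post_code"]).size()
print(anonymized)
print("k-anonymous:", bool((group_sizes >= k).all()))
```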
4.4 Linear transformation
The linear transformation technique can be used to transform the values of variables into values that cannot reveal the original values. The technique works by performing (linear) mathematical operations on pairs of variables (Oliveira and Zaane 2004). While the technique leads to relatively high security, in the sense that it is very hard to derive the original values, it does not preserve statistical properties. Rather, it is specifically meant to preserve the distances between points (individuals) in a Euclidean space, such that the individuals can still be clustered and information about the clusters can be derived.
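The sketch below illustrates one possible distance-preserving transformation, a random rotation plus translation of invented coordinates; it is an example of the general idea, not the specific method of Oliveira and Zaane (2004).

```python
# Sketch: hide coordinates with a random orthogonal transformation while
# preserving pairwise Euclidean distances (so clustering still works).
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # e.g. geographical coordinates (assumed)

Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))        # random orthogonal (rotation/reflection) matrix
X_obf = X @ Q + np.array([500.0, -250.0])           # rotate and translate the points

# Pairwise distances are unchanged, so clusters can still be recovered.
print(np.allclose(pdist(X), pdist(X_obf)))          # True
```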
4.5 Discussion
While there exist many data obfuscation techniques, it is clear that these techniques are specifically meant to preserve statistical properties, such as means, medians, standard deviations and even correlations. This partly fulfils the goal of this project, by facilitating the anonymization of transportation resources. Thus, the techniques discussed above enable transportation service providers to share information without revealing their identities and, therewith, specific properties such as timeliness and cost. However, the literature does not provide the means to selectively disable the possibility of deriving certain statistical properties, such as busy transportation routes. Therefore, there is room for improving on the current literature, by developing data obfuscation techniques that can disable the derivation of certain statistical properties, while preserving other properties. The literature does provide pointers as to how such techniques can be developed. In particular, the linear transformation approach shows how data can be completely transformed, while preserving only those properties that are necessary for the problem at hand. In light of this approach, a geographical location can be interpreted as a location in a Euclidean space and obfuscated accordingly. In other words, the exact location does not have to be known; it is more important to know the relative distance to a pick-up location.
5 Conclusions
This report provides an overview of data mining and analysis literature in three specific areas:
1. Detection and correction of missing and incorrect data
2. Data clustering
3. Data obfuscation
The focus on these three topics is driven by three research goals from the logistics domain, for which the primary concern is sharing data in order to improve transportation plans. The three topics contribute to this goal in the following manner.
First, in order to improve transportation plans, it is necessary to obtain statistical properties of transportation and service times. While such statistical properties can readily be derived from databases with historical information, incorrect data and missing values can pollute these properties. For example, the well-known phenomenon of people entering 09-09-99 when required to enter a date can significantly reduce the accuracy of statistical properties.
Second, the data in the logistics domain may be, and often is, provided at a very low level, i.e., in more detail than is needed for planning. For example, instead of providing travel times on a certain route, GPS data is available of when each truck was where on that route. This requires that the lower-level data is aggregated to information that can be used for planning. For this purpose, clustering algorithms can be used. In particular, we aim to use clustering algorithms to derive higher-level activities from a log of event data of trucks, for which we can subsequently determine the average duration.
Third, before data in the logistics domain can be shared between transportation partners, this data must be anonymized and obfuscated to some extent. This is necessary because transportation partners may not want to share certain information with each other or even with their clients. Should that information be provided or even derivable from the data, then transportation partners will not want to participate in the sharing of information.
The literature study shows a large body of literature in all three areas. While there exists literature in both the area of data clustering and the area of data obfuscation, this literature is not directly applicable to the practical problems at hand. Clustering of events in a log could lead to the identification of activities (clusters) and associated throughput times. However, it is important that we establish what the unit is that is clustered: it is not logical to cluster individual events, nor is it logical to cluster the entire log. Data obfuscation facilitates the anonymization of data. However, in addition to that, our usage scenario specifically requires certain statistical properties to not be derivable, while most data obfuscation techniques are aimed at preserving statistical properties. These findings provide a clear agenda for further research.
Bibliography
C.C. Aggarwal and P.S. Yu. (2001). Outlier Detection for High Dimensional Data. ACM SIGMOD International Conference on Management of Data (pp. 37-46). New York, USA: ACM.
C.C. Aggarwal and C. Zhai. (2012). A Survey of Text Clustering Algorithms. Mining Text Data, pp. 77-128.
R. Agrawal and S. Ramakrishnan. (2000). Privacy-Preserving Data Mining. In ACM Special Interest Group on Management of Data, pp. 439-450.
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. (1998). Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of ACM SIGMOD, pp. 94-105, Seattle, Washington, USA.
P. Allison. (2003). Missing Data Techniques for Structural Equation Modeling. Journal of Abnormal Psychology, 545-557.
B. Alex, B. Haddow, and C. Grover. (2007). Recognising nested named entities in biomedical text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pp. 65-72.
M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, and G. Sherlock. (2000). Gene ontology: Tool for the unification of biology. Nature Genetics, 25(1):25-29.
D.E. Bakken, R. Parameswaran, D.M. Blough, A.A. Franz, and T.J. Palmer. (2004). Data obfuscation: Anonymity and desensitization of usable data sets. IEEE Security and Privacy, 2(6), 34-41.
P. Berkhin. (2006). A Survey of Clustering Data Mining Techniques. Grouping Multidimensional Data, 25-71.
L. Berti-Équille, T. Dasu, and D. Srivastava. (2011). Discovery of Complex Glitch Patterns: A Novel Approach to Quantitative Data Cleaning. ICDE Conference 2011 (pp. 733-744). Rennes, France: IEEE.
E. Buyko, E. Faessler, J. Wermter, and U. Hahn. (2009). Event extraction from trimmed dependency graphs. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pp. 19-27.
Y. Cai and X. Cheng. (2009). Biomedical named entity recognition with tri-training learning. In Proceedings of the 2009 2nd International Conference on Biomedical Engineering and Informatics, pp. 1-5.
L. Cao, Y. Zhao, and C. Zhang. (2008). Mining Impact-Targeted Activity Patterns in Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 20(8).
J.R. Carpenter, M.G. Kenward, and S. Vansteelandt. (2006). A Comparison of Multiple Imputation and Doubly Robust Estimation for Analyses with Missing Data. Journal of the Royal Statistical Society, 571-584.
N. Chawla, N. Japkowicz, and A. Kotcz. (2004). Editorial: Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter, 6(1).
C. Chen, J. Kwon, J. Rice, A. Skabardonis, and P. Varaiya. (2007). Detecting errors and imputing missing data for single loop surveillance systems. Journal of the Transportation Research Board, 160-167.
J. Chen, W. Li, A. Lau, J. Cao, and K. Wang. (2010). Automated Load Curve Data Cleansing in Power Systems. Transactions on Smart Grid, 213-221.
C. Cheng, A. Fu, and Y. Zhang. (1999). Entropy-based subspace clustering for mining numerical data. In Proceedings of the 5th ACM SIGKDD, pp. 84-93, San Diego, CA, USA.
G. Cormode and D. Srivastava. (2009). Anonymized data: generation, models, usage. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 1015-1018.
D.R. Cutting, D.R. Karger, J.O. Pedersen, and J.W. Tukey. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318-329.
T. Dalenius and S.P. Reiss. (1982). Data-swapping: A technique for disclosure control. Journal of Statistical Planning and Inference, 6(1), 73-85.
D. Defays. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20:364-366.
C. Enders. (2001). A Primer on Maximum Likelihood Algorithms Available for Use With Missing Data. Structural Equation Modeling, 128-141.
S. Goil, H. Nagesh, and A. Choudhary. (1999). MAFIA: efficient and scalable subspace clustering for very large data sets. Technical Report CPDC-TR-9906-010, Northwestern University.
S. Greenland and W.D. Finkle. (1995). A Critical Look at Methods for Handling Missing Covariates in Epidemiologic Regression Analyses. American Journal of Epidemiology, 1255-1264.
H. Guo and H. Viktor. (2004). Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach. ACM SIGKDD Explorations Newsletter, special issue on learning from imbalanced datasets, 6(1), pp. 30-39.
V. Guralnik and G. Karypis. (2001). A scalable algorithm for clustering sequential data. In Proc. 1st IEEE Int. Conf. Data Mining (ICDM'01), pp. 179-186.
J. Hair, W. Black, B. Babin, and R. Anderson. (2010). Multivariate Data Analysis. Pearson Prentice Hall.
H. Haj-Salem and J.P. Lebacque. (2002). Reconstruction of False and Missing Data with First-Order Traffic Flow Model. Transportation Research Record, 155-165.
H. He and E.A. Garcia. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9).
J.G. Ibrahim, M. Chen, S.R. Lipsitz, and A.H. Herring. (2005). Missing-Data Methods for Generalized Linear Models. Journal of the American Statistical Association, 332-346.
P. Jacqumart and P. Zweigenbaum. (2003). Towards a medical question-answering system: A feasibility study. Studies in Health Technology and Informatics, 95:463-468.
N. Japkowicz. (2000). Learning from Imbalanced Data Sets: A Comparison of Various Strategies. In Proc. AAAI Workshop on Learning from Imbalanced Data Sets.
I.T. Jolliffe. (1986). Principal Component Analysis. Springer, Berlin Heidelberg New York.
E. Kalam, A. Abou, Y. Deswarte, G. Trouessin, and E. Cordonnier. (2004). A generic approach for healthcare data anonymization. In Proceedings of the 2004 ACM Workshop on Privacy in the Electronic Society, pp. 31-32.
G. Karypis, E.-H. Han, and V. Kumar. (1999). CHAMELEON: hierarchical clustering using dynamic modeling. IEEE Computer, 32(8):68-75.
L. Kaufman and P.J. Rousseeuw. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
Z. Lin, A.B. Owen, and R.B. Altman. (2004). Genomic research and human subject privacy. Science, 305, 183.
S.E. Lincoln and E.S. Lander. (1992). Systematic Detection of Errors in Genetic Linkage Data. Genomics, 604-610.
J. Luengo, J.A. Sáez, and F. Herrera. (2011). Missing data imputation for fuzzy rule-based classification systems. Soft Computing, 863-881.
J. MacQueen. (1967). Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp., vol. 1, pp. 281-297.
A. Moore. (1999). Very fast EM-based mixture model clustering using multiresolution kd-trees. Advances in Neural Information Processing Systems, 11.
T. Morzy, M. Wojciechowski, and M. Zakrzewicz. (1999). Pattern-oriented hierarchical clustering. In Proc. 3rd East Eur. Conf. Advances in Databases and Information Systems, pp. 179-190.
F. Murtagh. (1985). Multidimensional Clustering Algorithms. Physica-Verlag, Vienna, Austria.
I. Myrtveit, E. Stensrud, and U.H. Olsson. (2001). Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods. IEEE Transactions on Software Engineering, 999-1013.
R. Ng and J. Han. (1994). Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pp. 144-155, Santiago, Chile.
S. Oliveira and O. Zaane. (2004). Achieving Privacy Preservation When Sharing Data for Clustering. In Workshop on Secure Data Management in conjunction with VLDB 2004, Toronto, Canada.
V. Raman and J.M. Hellerstein. (2001). Potter's Wheel: An Interactive Data Cleaning System. Very Large Data Bases. Roma, Italy.
S. Ruj, M.A. Cavenaghi, Z. Huang, A. Nayak, and I. Stojmenovic. (2011). On Data-centric Misbehavior Detection in VANETs. Ottawa, Canada: Crown.
D. Sankoff and J. Kruskal. (1999). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Stanford, CA: CSLI Publications.
S.R. Seaman, I.R. White, A.J. Copas, and L. Li. (2012). Combining Multiple Imputation and Inverse-Probability Weighting. Biometrics, 129-137.
L. Sweeney. (2002). k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570.
C. Shu-Chuan, J.F. Roddick, Tsong-Yi Chen, and Jeng-Shyang Pan. (2003). Efficient search approaches for k-medoids-based algorithms. TENCON '02: 2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering.
R. Sibson. (1973). SLINK: an optimally efficient algorithm for the single link cluster method. Computer Journal, 16:30-34.
K. Stoffel and A. Belkoniene. (1999). Parallel K-means clustering for large data sets. In Proc. EuroPar'99 Parallel Processing, pp. 1451-1454.
S. Subramaniam, T. Palanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. (2006). Online Outlier Detection in Sensor Data Using Non-Parametric Models. Seoul, Korea: Very Large Data Base Endowment.
S.Y. Sung, Z. Li, and P. Sun. (2002). A Fast Filtering Scheme for Large Database Cleansing. CIKM (pp. 76-83). McLean, Virginia, USA: ACM.
E.M. Voorhees. (1986). Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing and Management, 22(6):465-476.
R. Xu and D. Wunsch II. (2005). Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, 16(3), 645-678.
R. Young and D. Johnson. (2013). Methods for Handling Missing Secondary Respondent Data. Journal of Marriage and Family, 221-234.
J. Zhang, E. Bloedorn, L. Rosen, and D. Venese. (2004). Learning Rules from Highly Unbalanced Data Sets. In Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM '04), pp. 571-574.
T. Zhang, R. Ramakrishnan, and M. Livny. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD. Montreal, Canada.
M. Zhong, P. Lingras, and S. Sharma. (2004). Estimation of Missing Traffic Counts Using Factor, Genetic, Neural, and Regression Techniques. Transportation Research Part C: Emerging Technologies, 139-166.
X. Zhu, X. Wu, and Q. Chen. (2006). Bridging Local and Global Data Cleansing: Identifying Class Noise in Large, Distributed Datasets. Data Mining and Knowledge Discovery, 275-308.