J. Comp. & Math. Sci. Vol.2 (2), 384-390 (2011)
Improved Closest Fit Techniques to Handle Missing Attribute Values

SANJAY GAUR and M S DULAWAT
Department of Mathematics and Statistics, Maharana Bhupal Campus, Mohanlal Sukhadia University, Udaipur-313002, India
sanjay.since@gmai.com & dulawat_ms@rediffmail.com

ABSTRACT
Data preparation for data mining is a fundamental stage of data analysis. Complete, high-quality preparation of real-world data is a key prerequisite of successful data mining, whose aim is to discover something new from the facts already recorded in a database. Data with missing values complicate both the analysis and the application of a solution to new data. To overcome this, certain statistical techniques should be employed during data preparation. With the help of statistical methods, we can recover missing data and reduce ambiguity. In this paper, we introduce two sequential methods by which missing attribute values are replaced. Both methods are based on the moving average method for numerical variables of time series data, and a comparative study of the two is given.

Keywords: Missing Values, Attribute, Data Incompleteness, Data preparation, Moving average, Chronological.
MSC (2010) Subject Classification: 62-07, 62N02, 62Q99.
Journal of Computer and Mathematical Sciences Vol. 2, Issue 2, 30 April, 2011 Pages (170-398)
1. INTRODUCTION
Missing values in a database are one of the biggest problems faced in data analysis and data mining applications. Missing values make a database imbalanced, and their effects are reflected in the final results. Our prime goal is to obtain the final result in the consolidated form on which decisions are taken. In this study, two statistical methods are introduced and discussed; they provide an approach for finding a pattern to recover or generate missing values in a real, imbalanced database. The objective of this comparative study is therefore to find the best-fitting method to recover missing values and to obtain completely filled records for further applications. The study is based on bivariate analysis.
The utility of statistical methods has gained importance in exploring estimation and prediction techniques. Buck [2] suggested a method for estimating missing values that is suitable for use with an electronic computer. Kim and Curry [8] considered the treatment of missing data in their analysis. Rubin [10] explored inference with missing data and multiple imputation for non-response in surveys. Allison [1] investigated the estimation of linear models with incomplete data. Smyth [11] and Zhang et al. [12] regarded data preparation as a fundamental stage of data analysis. Chen et al. [3] studied multiple imputation for missing ordinal data. Qin [9] considered semi-parametric optimization for missing data imputation. Gaur and Dulawat [4, 5] discussed various algorithms useful for the estimation of missing values and also gave a univariate analysis that uses the mean value in place of missing values for data preparation. Grzymala-Busse [7] proposed that every missing attribute value be replaced by all possible known values. They also provided global closest fit and
concept closest fit methods for missing attribute values. The objective of the proposed study is to determine the statistical technique which may be significant in the handling of missing attribute values.
2. FORMULATION OF PROBLEM
The proposed methods are based on replacing missing attribute values by values generated from a moving average. These methods are useful for numerical attributes and fall under chronological (time series) analysis. In general, they search for a value that is close to the central tendency of the attribute and closest to the values just preceding and succeeding the missing value.
2.1 Average Fit Approach
This is one of the simplest approaches for generating a close-fit value for a missing value. We first read the complete attribute, including the missing value cases. The values of the attribute are divided into two sections, observed and missing. The search for missing cases in the attribute then starts. A missing value case is pointed to by the subscript of the attribute, denoted by the variable $i$. After locating a missing value case, we record the preceding value ($x_p$) and the succeeding value ($x_s$) of the missing value subscript ($i$):
$x_p = \mathrm{value}(x_{i-1})$    (2.1.1)
$x_s = \mathrm{value}(x_{i+1})$    (2.1.2)
where $x_p \neq$ NULL and $x_s \neq$ NULL. At the next stage, after recording the values just preceding and succeeding the missing value subscript, we compute the average of both values ($\bar{x}$):
$\bar{x} = (x_p + x_s)/2$    (2.1.3)
The average obtained from equation (2.1.3) is treated as the estimated value for the current missing value subscript. This estimated value may be written as follows:
$x_{est} = (x_p + x_s)/2$    (2.1.4)
The value of $x_{est}$ is computed separately for every missing value subscript.
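The average fit rule of equations (2.1.1) to (2.1.4) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function name `average_fit` and the use of `None` for a missing value are assumptions made here. The paper does not say what to do when a neighbour is itself missing or when the gap lies at either end of the series, so this sketch simply leaves such entries unfilled.

```python
from typing import List, Optional


def average_fit(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each None by the mean of its two immediate neighbours."""
    filled = list(values)
    # The first and last positions have only one neighbour and are skipped.
    for i in range(1, len(values) - 1):
        if values[i] is not None:
            continue
        # Neighbours are read from the original series, so an estimate
        # is never built on top of another estimate.
        x_p, x_s = values[i - 1], values[i + 1]
        if x_p is not None and x_s is not None:
            filled[i] = (x_p + x_s) / 2
    return filled


# Oil for 1962 to 1964, with the 1963 value removed as in Table-B.
print(average_fit([980, None, 1137]))  # [980, 1058.5, 1137]
```

The unrounded estimate 1058.5 corresponds to the entry 1,058 listed for Oil in 1963 in Table-C.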
2.1.1 Algorithm (Average Fit Approach)
Read X = {x1, ..., xn}                // Attribute with observed and missing values
  where Xobs ⊂ X                      // Attribute values observed
        Xmis ⊂ X                      // Attribute values missing
For i = 1 to n do
    If (value(xi) == NULL) then
        xp = value(xi-1)              // Value preceding xi
        xs = value(xi+1)              // Value succeeding xi
        x̄ = (xp + xs) / 2             // Average of preceding and succeeding
        xest = x̄                      // Estimated value
        value(xi) = xest              // Assigning estimated value to missing value place
    End If
End For
Stop
2.2 Moving Average Fitting Approach
The moving average fitting method is based on the moving average concept. This approach is also useful for numerical attributes. It searches for a close-fitting value that is close to the true mean of the attribute and close to the values just preceding and succeeding the missing value, in association with the central tendency of the attribute.
In the proposed method, we first find the range of the moving average for the attribute. The proposed range is at least 10% of the dataset used. The preceding range is therefore half (50%) of the moving average range, and the same holds for the succeeding range.
The search for missing cases in the attribute then starts. A missing value case is pointed to by the subscript of the attribute, denoted by the variable $i$. After locating a missing value case, we record the preceding values ($x_{p1}, x_{p2}, \ldots, x_{pm}$) and the succeeding values ($x_{s1}, x_{s2}, \ldots, x_{sm}$) of the missing value subscript ($i$).
The preceding values are recorded as $x_{p1} = \mathrm{value}(x_{i-1})$, $x_{p2} = \mathrm{value}(x_{i-2})$, ..., $x_{pm} = \mathrm{value}(x_{i-m})$, and the succeeding values as $x_{s1} = \mathrm{value}(x_{i+1})$, $x_{s2} = \mathrm{value}(x_{i+2})$, ..., $x_{sm} = \mathrm{value}(x_{i+m})$. At the next stage, after recording the preceding and succeeding values, we calculate the average of the preceding values ($\bar{x}_p$) and of the succeeding values ($\bar{x}_s$):
$\bar{x}_p = (x_{p1} + x_{p2} + \ldots + x_{pm})/m$    (2.2.1)
$\bar{x}_s = (x_{s1} + x_{s2} + \ldots + x_{sm})/m$    (2.2.2)
The average of the preceding ($\bar{x}_p$) and succeeding ($\bar{x}_s$) averages is the moving average, or estimated value, for the missing data cell:
$\bar{x} = (\bar{x}_p + \bar{x}_s)/2$    (2.2.3)
The estimated value is this moving average, which may be represented as follows:
$x_{est} = \bar{x}$    (2.2.4)
The estimated value is placed at the position of the missing value:
$x_i = x_{est}$    (2.2.5)
The search for missing values continues until the last element of the attribute.
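The moving average fit described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code. The names `window_half_width` and `moving_average_fit` are hypothetical, and `None` stands for a missing value. The half-width m is taken as half of 10% of the series length, rounded up. The paper does not specify how to treat a window that runs past either end of the series or that contains another missing value, so this sketch leaves such entries unfilled.

```python
from typing import List, Optional


def window_half_width(n: int, percent: int = 10) -> int:
    """Half of the moving average range, rounded up when the range is odd."""
    x_t = (n * percent) // 100
    return x_t // 2 if x_t % 2 == 0 else (x_t + 1) // 2


def moving_average_fit(
    values: List[Optional[float]], m: Optional[int] = None
) -> List[Optional[float]]:
    """Replace each None by the mean of the averages of m values on each side."""
    if m is None:
        m = window_half_width(len(values))
    if m < 1:
        raise ValueError("series too short for the chosen moving average range")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        before = values[max(0, i - m):i]
        after = values[i + 1:i + 1 + m]
        # Skip windows that are truncated or that contain a missing value.
        if len(before) < m or len(after) < m:
            continue
        if any(x is None for x in before + after):
            continue
        mean_p = sum(before) / m   # average of the preceding values
        mean_s = sum(after) / m    # average of the succeeding values
        filled[i] = (mean_p + mean_s) / 2
    return filled


# Oil for 1960 to 1966, with the 1963 value removed as in Table-B.
oil = [849, 904, 980, None, 1137, 1219, 1323]
print(window_half_width(50))                  # 3
print(round(moving_average_fit(oil, m=3)[3]))  # 1069
```

With m = 3, the rounded estimate agrees with the entry 1,069 listed for Oil in 1963 in Table-D.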
2.2.1 Algorithm (Moving Average Fitting Approach)
Read X = {x1, ..., xn}                // Attribute with observed and missing values
  where Xobs ⊂ X                      // Attribute values observed
        Xmis ⊂ X                      // Attribute values missing
xt = int((count(X) * 10) / 100)       // Set the range of moving average. Here it is 10% of the dataset.
N = xt % 2                            // Find the remainder
If (N == 0) then m = xt / 2 else m = (xt + 1) / 2
Read {x1, ..., xn}
For i = 1 to n do
    If (value(xi) == NULL) then
        xp1 = value(xi-1)             // Preceding value at xi-1
        xp2 = value(xi-2)             // Preceding value at xi-2
        ...
        xpm = value(xi-m)             // Preceding value at xi-m
        xs1 = value(xi+1)             // Succeeding value at xi+1
        xs2 = value(xi+2)             // Succeeding value at xi+2
        ...
        xsm = value(xi+m)             // Succeeding value at xi+m
        x̄p = (xp1 + xp2 + ... + xpm) / m
        x̄s = (xs1 + xs2 + ... + xsm) / m
        x̄ = (x̄p + x̄s) / 2            // Average of preceding and succeeding
        xest = x̄                      // Estimated value
        value(xi) = xest              // Assigning estimated value to missing value place
    End If
End For
Stop
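As a worked illustration of the algorithm, consider the Oil value for 1963, one of the values missing in Table-B. The series has n = 50 yearly observations, and the neighbouring values are read from Table-A of the appendix. Rounding the final estimate to the nearest integer is an assumption, since the paper does not state a rounding rule.

```latex
\begin{align*}
x_t &= \operatorname{int}\!\left(\frac{50 \times 10}{100}\right) = 5,
  \qquad m = \frac{x_t + 1}{2} = 3 \quad (x_t \text{ is odd}) \\
\bar{x}_p &= \frac{980 + 904 + 849}{3} = 911 \\
\bar{x}_s &= \frac{1137 + 1219 + 1323}{3} \approx 1226.33 \\
x_{est} &= \frac{\bar{x}_p + \bar{x}_s}{2} \approx 1068.67 \approx 1069
\end{align*}
```

This agrees with the value 1,069 listed for 1963 in Table-D. For comparison, the average fit approach of Section 2.1 gives (980 + 1137)/2 = 1058.5, which corresponds to the entry 1,058 in Table-C.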
3. DISCUSSION OF RESULTS
Table-A in the appendix shows the worldwide emission of carbon dioxide (CO2) from the consumption of Oil and Natural Gas for the years 1960 to 2009. The mean emissions of CO2 due to Oil and Natural Gas are 2262 and 879 respectively. Table-B shows the variables with observed and missing values. About 20% of the values of each variable in Table-A were deliberately removed at random. The means calculated from the incomplete data sets are 2259 for Oil and 874 for Natural Gas. The mean values of the incomplete data sets of Table-B are thus slightly lower than the mean values of both variables in Table-A.
The proposed simple average fit method is applied to the data sets of Table-B to fill in the missing values. The values recovered or generated by this approach are shown in Table-C for both variables, at the positions that are missing in Table-B. The mean values obtained after replacing the missing values by the closest fit values in Table-C are quite close to the actual means given in Table-A. The proposed moving average approach (Table-D) gives results similar to those of the simple average fit method, again near the original means given in Table-A.
4. CONCLUSION
It is well known that no technique for handling missing attribute values is 100% efficient. The proposed moving average fit methods are useful for numerical attributes and produce only a minor deviation from the mean. They are appropriate for consolidated reports and are also suitable for fitting individual missing values. The estimated value follows the order of the preceding and succeeding values. Consequently, techniques for handling missing attribute values should be chosen individually, based on the nature and type of the data.
5. REFERENCES
1. Allison, P.D., Estimation of linear models with incomplete data, Social Methodology, San Francisco: Jossey Bass, pp. 71-103 (1987).
2. Buck, S.F., A method of estimation of missing values in multivariate data suitable for use with an electronic computer, J. Royal Statistical Society, Series B, Vol-2, pp. 302-306 (1960).
3. Chen, L., Drane, M.T., Valois, R.F., and Drane, J.W., Multiple imputation for missing ordinal data, Journal of Modern Applied Statistical Methods, Vol.-4, No. 1, pp. 288-299 (2005).
4. Gaur, Sanjay and Dulawat, M.S., A perception of statistical inference in data mining, International Journal of Computer Science and Communication, Vol.-1, No. 2, pp. 653-658 (2010).
5. Gaur, Sanjay and Dulawat, M.S., Univariate Analysis for Data Preparation in context of Missing Values, Journal of Computer and Mathematical Sciences, Vol.-1, No. 5, pp. 628-635 (2010).
6. Gaur, Sanjay and Dulawat, M.S., A Closest Fit Approach to Missing Attribute Values in Data Mining, "Communicated to publishing", (2011).
7. Grzymala-Busse, J. W., Data with missing attribute values: Generalization of in-discernibility relation and rules induction, Transactions of Rough Sets, Lecture Notes in Computer Science Journal Subline, Springer-Verlag, Vol-1, pp. 78-95 (2004).
8. Kim, J. O., and Curry, J., The treatment of missing data in multivariate analysis, Social Methods and Research, Vol.-6, pp. 215-240 (1977).
9. Qin, Y. S., Semi-parametric optimization for missing data imputation, Applied Intelligence, Vol.-27, No. 1, pp. 79-88 (2007).
10. Rubin, D.B., Inference and missing data, Biometrika, 63, pp. 581-592 (1976).
11. Smyth, P., Data mining at the interface of computer Science and Statistics, Data mining for scientific and engineering applications, Department of Information and Computer Science, University of California, CA, 92697-3425, Chapter-1, pp. 1-20 (2001).
12. Zhang, S., Zhang, C., and Young, Q., Data preparation for data mining, Applied Artificial Intelligence, Vol. 17, pp. 375-381 (2003).
Appendix: Global Carbon Dioxide Emissions from Fossil Fuel Burning by Fuel Type, 1960-2009
All values are in Million Tonnes. A dash (-) marks a missing value in Table-B; the corresponding entries in Table-C and Table-D are the estimated values.

Year    | Table-A (Original Table) | Table-B (Missing Values) | Table-C (Simple Average) | Table-D (Moving Average)
        | Oil     Natural Gas      | Oil     Natural Gas      | Oil     Natural Gas      | Oil     Natural Gas
1960    | 849     235              | 849     235              | 849     235              | 849     235
1961    | 904     254              | 904     254              | 904     254              | 904     254
1962    | 980     277              | 980     277              | 980     277              | 980     277
1963    | 1,052   300              | -       300              | 1,058   300              | 1,069   300
1964    | 1,137   328              | 1,137   -                | 1,137   326              | 1,137   329
1965    | 1,219   351              | 1,219   351              | 1,219   351              | 1,219   351
1966    | 1,323   380              | 1,323   380              | 1,323   380              | 1,323   380
1967    | 1,423   410              | 1,423   410              | 1,423   410              | 1,423   410
1968    | 1,551   446              | -       446              | 1,548   446              | 1,571   446
1969    | 1,673   487              | 1,673   487              | 1,673   487              | 1,673   487
1970    | 1,839   516              | 1,839   -                | 1,839   521              | 1,839   515
1971    | 1,946   554              | 1,946   554              | 1,946   554              | 1,946   554
1972    | 2,055   583              | -       583              | 2,093   583              | 2,012   583
1973    | 2,240   608              | 2,240   608              | 2,240   608              | 2,240   608
1974    | 2,244   618              | 2,244   -                | 2,244   615              | 2,244   611
1975    | 2,131   623              | 2,131   623              | 2,131   623              | 2,131   623
1976    | 2,313   650              | 2,313   650              | 2,313   650              | 2,313   650
1977    | 2,395   649              | -       649              | 2,353   649              | 2,341   649
1978    | 2,392   677              | 2,392   677              | 2,392   677              | 2,392   677
1979    | 2,544   719              | 2,544   719              | 2,544   719              | 2,544   719
1980    | 2,422   740              | 2,422   -                | 2,422   738              | 2,422   715
1981    | 2,289   756              | 2,289   756              | 2,289   756              | 2,289   756
1982    | 2,196   746              | -       746              | 2,233   746              | 2,303   746
1983    | 2,177   745              | 2,177   745              | 2,177   745              | 2,177   745
1984    | 2,202   808              | 2,202   808              | 2,202   808              | 2,202   808
1985    | 2,182   836              | 2,182   -                | 2,182   819              | 2,182   826
1986    | 2,290   830              | 2,290   830              | 2,290   830              | 2,290   830
1987    | 2,302   893              | -       893              | 2,349   893              | 2,342   893
1988    | 2,408   936              | 2,408   936              | 2,408   936              | 2,408   936
1989    | 2,455   972              | 2,455   -                | 2,455   981              | 2,455   976
1990    | 2,517   1,026            | 2,517   1,026            | 2,517   1,026            | 2,517   1,026
1991    | 2,627   1,069            | -       1,069            | 2,511   1,069            | 2,498   1,069
1992    | 2,506   1,101            | 2,506   1,101            | 2,506   1,101            | 2,506   1,101
1993    | 2,537   1,119            | 2,537   1,119            | 2,537   1,119            | 2,537   1,119
1994    | 2,562   1,132            | 2,562   1,132            | 2,562   1,132            | 2,562   1,132
1995    | 2,586   1,153            | 2,586   1,153            | 2,586   1,153            | 2,586   1,153
1996    | 2,624   1,208            | -       1,208            | 2,634   1,208            | 2,645   1,208
1997    | 2,707   1,211            | 2,707   -                | 2,707   1,227            | 2,707   1,217
1998    | 2,763   1,245            | 2,763   1,245            | 2,763   1,245            | 2,763   1,245
1999    | 2,716   1,272            | 2,716   1,272            | 2,716   1,272            | 2,716   1,272
2000    | 2,831   1,291            | -       1,291            | 2,779   1,291            | 2,796   1,291
2001    | 2,842   1,314            | 2,842   1,314            | 2,842   1,314            | 2,842   1,314
2002    | 2,819   1,349            | 2,819   -                | 2,819   1,357            | 2,819   1,365
2003    | 2,928   1,399            | 2,928   1,399            | 2,928   1,399            | 2,928   1,399
2004    | 3,032   1,436            | 3,032   1,436            | 3,032   1,436            | 3,032   1,436
2005    | 3,079   1,479            | -       1,479            | 3,062   1,479            | 3,006   1,479
2006    | 3,092   1,527            | 3,092   -                | 3,092   1,515            | 3,092   1,501
2007    | 3,087   1,551            | 3,087   1,551            | 3,087   1,551            | 3,087   1,551
2008    | 3,079   1,589            | 3,079   1,589            | 3,079   1,589            | 3,079   1,589
2009    | 3,019   1,552            | 3,019   1,552            | 3,019   1,552            | 3,019   1,552
Average | 2,262   879              | 2,259   874              | 2,260   879              | 2,259   878
Source: www.earth-policy.org