
J. Comp. & Math. Sci. Vol.2 (2), 384-390 (2011)

Improved Closest Fit Techniques to Handle Missing Attribute Values

SANJAY GAUR and M S DULAWAT

Department of Mathematics and Statistics, Maharana Bhupal Campus, Mohanlal Sukhadia University, Udaipur-313002, India
sanjay.since@gmai.com & dulawat_ms@rediffmail.com

ABSTRACT

Data preparation for data mining is a fundamental stage of data analysis. Complete, high-quality preparation of real-world data is a key prerequisite of successful data mining, which aims to discover something new from the facts already recorded in a database. Data with missing values complicate both the analysis and the application of a solution to new data. To overcome this, statistical techniques are employed during data preparation; with their help we can recover missing data and reduce ambiguities. In this paper, we introduce two sequential methods by which missing attribute values are replaced. Both methods are based on the moving average method for numerical variables of time series data, and a comparative study of the two is given.

Keywords: Missing Values, Attribute, Data preparation, Incompleteness, Moving average, Chronological.

MSC (2010) Subject Classification: 62-07, 62N02, 62Q99.

1. INTRODUCTION

Missing values in databases are one of the biggest problems faced in data analysis and in data mining applications. Missing values produce imbalanced databases, and their effects are reflected in the final results. Our prime goal is to obtain the final result in the consolidated form on which decisions are taken. In this study, two statistical methods are introduced and discussed which provide an approach to find a pattern to recover or generate missing values from a real imbalanced database with missing values. The objective of this comparative study is therefore to find the best-fitting method to recover missing values and to select completely filled records for further applications. This is based on bivariate analysis.

The utility of statistical methods has gained importance in exploring estimation and prediction techniques. Buck [2] suggested a method of estimating missing values suitable for use with an electronic computer. Kim and Curry [8] considered the treatment of missing data in their analysis. Rubin [10] explored inference with missing data and multiple imputation for non-response in surveys. Allison [1] investigated the estimation of linear models with incomplete data. Smyth [11] and Zhang et al. [12] considered data preparation to be a fundamental stage of data analysis. Chen et al. [3] studied multiple imputation for missing ordinal data. Qin [9] considered semi-parametric optimization for missing data imputation. Gaur and Dulawat [4, 5] discussed various algorithms which are useful for the estimation of missing values and also gave a univariate analysis which uses the mean value in place of missing values for data preparation. Grzymala-Busse [7] proposed that every missing attribute value be replaced by all possible known values. They also provided the global closest fit and

concept closest fit methods for missing attribute values. The objective of the proposed study is to determine the statistical technique which may be significant in the handling of missing attribute values.

2. FORMULATION OF PROBLEM

The proposed methods are based on replacing missing attribute values by values generated from a moving average. These methods are most useful for numerical attributes and fall under chronological (time series) analysis. In general, they search for a value which is close to the central tendency of the attribute and closest to the values immediately preceding and succeeding the missing value.

2.1 Average Fit Approach

This is one of the simplest approaches for generating close-fit values for missing value places. We first read the complete attribute, including the missing value cases. The values of the attribute fall into two sections, observed and missing. The search for missing cases in the attribute then starts. A missing value case is pointed to by the subscript of the attribute, denoted by the variable $i$. After locating a missing value case, we record the preceding value ($x_p$) and the succeeding value ($x_s$) relative to the missing value subscript ($i$).

$$x_p = x_{i-1} \qquad (2.1.1)$$

$$x_s = x_{i+1} \qquad (2.1.2)$$

where $x_p \neq \text{NULL}$ and $x_s \neq \text{NULL}$. At the next stage, after recording the values immediately preceding and succeeding the missing value subscript, we compute the average of both values ($\bar{x}$).

$$\bar{x} = \frac{x_p + x_s}{2} \qquad (2.1.3)$$

The average obtained from equation (2.1.3) is treated as the estimated value for the current missing value subscript. This estimated value may be written as follows:

$$x_{est} = \bar{x} = \frac{x_p + x_s}{2} \qquad (2.1.4)$$

The value of $x_{est}$ is computed separately for every missing value subscript.
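The average fit rule of equations (2.1.1) to (2.1.4) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `average_fit` is hypothetical, missing values are represented by `None`, and an entry whose preceding or succeeding neighbour is absent or itself missing is left unfilled, since the text requires both neighbours to be non-NULL and does not say what to do otherwise.

```python
from typing import List, Optional


def average_fit(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing entry (None) by the mean of its two neighbours."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue  # observed value, nothing to estimate
        # Assumption: entries at either end of the series have only one
        # neighbour and are therefore left as None.
        if 0 < i < len(values) - 1:
            xp = values[i - 1]  # preceding value, eq. (2.1.1)
            xs = values[i + 1]  # succeeding value, eq. (2.1.2)
            # Assumption: skip the entry if a neighbour is itself missing.
            if xp is not None and xs is not None:
                filled[i] = (xp + xs) / 2  # eqs. (2.1.3) and (2.1.4)
    return filled


# Oil values for 1962-1964 as listed in Table-B (1963 is missing).
oil = [980, None, 1137]
print(average_fit(oil))  # [980, 1058.5, 1137]
```

Neighbours are read from the original series rather than from the partially filled copy, so one estimate never feeds into another.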

2.1.1 Algorithm (Average Fit Approach)

    Read X = {x1, …, xn}               // Attribute with observed and missing values
      where Xobs = {…}                 // Attribute values observed
            Xmis = {…}                 // Attribute values missing
    For i = 1 to n do
        If (value(xi) == NULL) then
            xp = value(xi-1)           // Value preceding xi
            xs = value(xi+1)           // Value succeeding xi
            x̄ = (xp + xs) / 2          // Average of preceding and succeeding
            xest = x̄                   // Estimated value
            value(xi) = xest           // Assigning estimated value to missing value place
        End If
    End For
    Stop

2.2 Moving Average Fitting Approach

The moving average fitting method is based on the moving average concept. This approach is also most useful for numerical attributes. It searches for a close-fitting value which is close to the true mean of the attribute and close to the values immediately preceding and succeeding the missing value, in association with the central tendency of the attribute.

In the proposed method, we first determine the range of the moving average of the attribute. The proposed range is at least 10% of the dataset used. The preceding range is therefore half (50%) of the moving average range, and the same holds for the succeeding range.

The search for missing cases in the attribute then starts. A missing value case is pointed to by the subscript of the attribute, denoted by the variable $i$. After locating a missing value case, we record the preceding values ($x_{p1}, x_{p2}, \dots, x_{pm}$) and the succeeding values ($x_{s1}, x_{s2}, \dots, x_{sm}$) relative to the missing value subscript ($i$).


The preceding values are recorded as $x_{p1} = \mathrm{value}(x_{i-1})$, $x_{p2} = \mathrm{value}(x_{i-2})$, up to $x_{pm} = \mathrm{value}(x_{i-m})$, and the succeeding values as $x_{s1} = \mathrm{value}(x_{i+1})$, $x_{s2} = \mathrm{value}(x_{i+2})$, up to $x_{sm} = \mathrm{value}(x_{i+m})$. At the next stage, after recording the preceding and succeeding values, we calculate the average of the preceding values ($\bar{x}_p$) and likewise the average of the succeeding values ($\bar{x}_s$).

$$\bar{x}_p = \frac{x_{p1} + x_{p2} + \dots + x_{pm}}{m} \qquad (2.2.1)$$

$$\bar{x}_s = \frac{x_{s1} + x_{s2} + \dots + x_{sm}}{m} \qquad (2.2.2)$$

The average of the preceding mean ($\bar{x}_p$) and the succeeding mean ($\bar{x}_s$) is the moving average, that is, the estimated value for the missing data cell.

$$\bar{x} = \frac{\bar{x}_p + \bar{x}_s}{2} \qquad (2.2.3)$$

The estimated value is this moving average, which may be represented as follows:

$$x_{est} = \bar{x} \qquad (2.2.4)$$

The estimated value is placed at the position of the missing value.

$$x_i = x_{est} \qquad (2.2.5)$$

The search for missing values continues until the last element of the attribute.
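As a worked illustration, consider the Oil series of the appendix, which has $n = 50$ yearly values, and the missing entry for 1963. Here $\bar{x}_p$ and $\bar{x}_s$ denote the means of the $m$ preceding and $m$ succeeding values, and the input figures are read from Table-B. The final rounding to the nearest integer is an assumption, since the rounding rule is not stated in the text.

```latex
\begin{align*}
x_t &= \mathrm{int}\!\left(\frac{50 \times 10}{100}\right) = 5,
  \qquad m = \frac{x_t + 1}{2} = 3 \quad (x_t \text{ is odd}),\\
\bar{x}_p &= \frac{980 + 904 + 849}{3} = 911,\\
\bar{x}_s &= \frac{1137 + 1219 + 1323}{3} \approx 1226.33,\\
x_{est} &= \frac{\bar{x}_p + \bar{x}_s}{2} \approx 1068.67 .
\end{align*}
```

Rounded to the nearest integer this gives 1,069, which agrees with the 1963 Oil entry listed in Table-D.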

2.2.1 Algorithm (Moving Average Fitting Approach)

    Read X = {x1, …, xn}
      where Xobs = {…}                     // Attribute values observed
            Xmis = {…}                     // Attribute values missing
    xt = int((count(X) * 10) / 100)        // Set the range of moving average. Here it is 10% of the dataset.
    N = xt % 2                             // Find the remainder
    If (N == 0) then
        m = xt / 2
    else
        m = (xt + 1) / 2
    Read X = {x1, …, xn}                   // Attribute with observed and missing values
    For i = 1 to n do
        If (value(xi) == NULL) then
            xp1 = value(xi-1)              // Value at position i-1 (preceding)
            xp2 = value(xi-2)              // Value at position i-2 (preceding)
            …
            xpm = value(xi-m)              // Value at position i-m (preceding)
            xs1 = value(xi+1)              // Value at position i+1 (succeeding)
            xs2 = value(xi+2)              // Value at position i+2 (succeeding)
            …
            xsm = value(xi+m)              // Value at position i+m (succeeding)
            x̄p = (xp1 + xp2 + … + xpm) / m
            x̄s = (xs1 + xs2 + … + xsm) / m
            x̄ = (x̄p + x̄s) / 2             // Average of preceding and succeeding
            xest = x̄                       // Estimated value
            value(xi) = xest               // Assigning estimated value to missing value place
        End If
    End For
    Stop
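The moving average fitting procedure can likewise be sketched in Python. This is a hedged illustration rather than the authors' code: the names `half_window` and `moving_average_fit` are hypothetical, missing values are represented by `None`, and two behaviours that the algorithm leaves open are filled in as assumptions, namely that windows are clipped at the ends of the series and that missing neighbours inside a window are skipped when averaging.

```python
from typing import List, Optional


def half_window(n: int, percent: int = 10) -> int:
    """Half-width m of the moving average range (10% of n, split in two)."""
    xt = (n * percent) // 100  # range of the moving average
    return xt // 2 if xt % 2 == 0 else (xt + 1) // 2


def moving_average_fit(
    values: List[Optional[float]], m: Optional[int] = None
) -> List[Optional[float]]:
    """Replace each None by the mean of the preceding and succeeding means."""
    if m is None:
        m = half_window(len(values))
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Assumption: clip windows at the series ends, skip missing neighbours.
        before = [x for x in values[max(0, i - m):i] if x is not None]
        after = [x for x in values[i + 1:i + 1 + m] if x is not None]
        if before and after:
            xp_bar = sum(before) / len(before)  # eq. (2.2.1)
            xs_bar = sum(after) / len(after)    # eq. (2.2.2)
            filled[i] = (xp_bar + xs_bar) / 2   # eqs. (2.2.3) to (2.2.5)
    return filled


print(half_window(50))  # 3, the half-width for the 50-year series

# Oil values for 1960-1966 as listed in Table-B (1963 is missing).
oil = [849, 904, 980, None, 1137, 1219, 1323]
print(round(moving_average_fit(oil, m=3)[3]))  # 1069
```

For very short series the 10% rule yields `m = 0`, in which case nothing is filled; the example therefore passes `m=3` explicitly, the value obtained for the full series of 50 years.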

3. DISCUSSION OF RESULTS

Table-A in the appendix shows the worldwide emission of carbon dioxide (CO2) from the consumption of Oil and Natural Gas for the years 1960 to 2009. The mean emissions of carbon dioxide due to Oil and Natural Gas are 2262 and 879 respectively. Table-B shows the variables with observed and missing values; about 20% of the values of each variable of Table-A were deliberately removed at random. The means calculated from the incomplete data sets are 2259 for Oil and 874 for Natural Gas. The mean values of the incomplete data sets of Table-B are thus slightly lower than the mean values of both variables of Table-A.

The proposed simple average fit method is applied to the data sets of Table-B to fill in the missing values. The values recovered or generated by this approach are shown in Table-C for both variables and are highlighted by underline. The mean values obtained after replacing the missing values by the closest fit values in Table-C are quite close to the actual means given in Table-A. The proposed moving average approach (Table-D) gives results similar to those of the simple average fit method, again close to the original means given in Table-A.

4. CONCLUSION

It is well known that no technique for handling missing attribute values is 100% efficient. The proposed moving average fit methods are useful for numerical attributes and show only minor deviation from the mean. They are appropriate for consolidated reports and are also suitable for fitting individual missing values, since the estimated value follows the order of the preceding and succeeding values. Consequently, techniques for handling missing attribute values should be chosen individually, based on the nature and type of the data.


5. REFERENCES

1. Allison, P.D., Estimation of linear models with incomplete data, Social Methodology, San Francisco: Jossey Bass, pp. 71-103 (1987).
2. Buck, S.F., A method of estimation of missing values in multivariate data suitable for use with an electronic computer, J. Royal Statistical Society, Series B, Vol-2, pp. 302-306 (1960).
3. Chen, L., Drane, M.T., Valois, R.F., and Drane, J.W., Multiple imputation for missing ordinal data, Journal of Modern Applied Statistical Methods, Vol.-4, No. 1, pp. 288-299 (2005).
4. Gaur, Sanjay and Dulawat, M.S., A perception of statistical inference in data mining, International Journal of Computer Science and Communication, Vol.-1, No. 2, pp. 653-658 (2010).
5. Gaur, Sanjay and Dulawat, M.S., Univariate Analysis for Data Preparation in context of Missing Values, Journal of Computer and Mathematical Sciences, Vol.-1, No. 5, pp. 628-635 (2010).
6. Gaur, Sanjay and Dulawat, M.S., A Closest Fit Approach to Missing Attribute Values in Data Mining, "Communicated to publishing", (2011).
7. Grzymala-Busse, J. W., Data with missing attribute values: Generalization of indiscernibility relation and rules induction, Transactions of Rough Sets, Lecture Notes in Computer Science Journal Subline, Springer-Verlag, Vol-1, pp. 78-95 (2004).
8. Kim, J. O., and Curry, J., The treatment of missing data in multivariate analysis, Social Methods and Research, Vol.-6, pp. 215-240 (1977).
9. Qin, Y. S., Semi-parametric optimization for missing data imputation, Applied Intelligence, Vol.-27, No. 1, pp. 79-88 (2007).
10. Rubin, D.B., Inference and missing data, Biometrika, 63, pp. 581-592 (1976).
11. Smyth, P., Data mining at the interface of computer Science and Statistics, Data mining for scientific and engineering applications, Department of Information and Computer Science, University of California, CA, 92697-3425, Chapter-1, pp. 1-20 (2001).
12. Zhang, S., Zhang, C., and Young, Q., Data preparation for data mining, Applied Artificial Intelligence, Vol. 17, pp. 375-381 (2003).

Appendix: Global Carbon Dioxide Emissions from Fossil Fuel Burning by Fuel Type, 1960-2009

All values are in Million Tonnes. A dash (-) marks a missing value.

         Table-A           Table-B           Table-C           Table-D
         (Original)        (Missing Values)  (Simple Average)  (Moving Average)
Year     Oil     Nat. Gas  Oil     Nat. Gas  Oil     Nat. Gas  Oil     Nat. Gas
1960     849     235       849     235       849     235       849     235
1961     904     254       904     254       904     254       904     254
1962     980     277       980     277       980     277       980     277
1963     1,052   300       -       300       1,058   300       1,069   300
1964     1,137   328       1,137   -         1,137   326       1,137   329
1965     1,219   351       1,219   351       1,219   351       1,219   351
1966     1,323   380       1,323   380       1,323   380       1,323   380
1967     1,423   410       1,423   410       1,423   410       1,423   410
1968     1,551   446       -       446       1,548   446       1,571   446
1969     1,673   487       1,673   487       1,673   487       1,673   487
1970     1,839   516       1,839   -         1,839   521       1,839   515
1971     1,946   554       1,946   554       1,946   554       1,946   554
1972     2,055   583       -       583       2,093   583       2,012   583
1973     2,240   608       2,240   608       2,240   608       2,240   608
1974     2,244   618       2,244   -         2,244   615       2,244   611
1975     2,131   623       2,131   623       2,131   623       2,131   623
1976     2,313   650       2,313   650       2,313   650       2,313   650
1977     2,395   649       -       649       2,353   649       2,341   649
1978     2,392   677       2,392   677       2,392   677       2,392   677
1979     2,544   719       2,544   719       2,544   719       2,544   719
1980     2,422   740       2,422   -         2,422   738       2,422   715
1981     2,289   756       2,289   756       2,289   756       2,289   756
1982     2,196   746       -       746       2,233   746       2,303   746
1983     2,177   745       2,177   745       2,177   745       2,177   745
1984     2,202   808       2,202   808       2,202   808       2,202   808
1985     2,182   836       2,182   -         2,182   819       2,182   826
1986     2,290   830       2,290   830       2,290   830       2,290   830
1987     2,302   893       -       893       2,349   893       2,342   893
1988     2,408   936       2,408   936       2,408   936       2,408   936
1989     2,455   972       2,455   -         2,455   981       2,455   976
1990     2,517   1,026     2,517   1,026     2,517   1,026     2,517   1,026
1991     2,627   1,069     -       1,069     2,511   1,069     2,498   1,069
1992     2,506   1,101     2,506   1,101     2,506   1,101     2,506   1,101
1993     2,537   1,119     2,537   1,119     2,537   1,119     2,537   1,119
1994     2,562   1,132     2,562   1,132     2,562   1,132     2,562   1,132
1995     2,586   1,153     2,586   1,153     2,586   1,153     2,586   1,153
1996     2,624   1,208     -       1,208     2,634   1,208     2,645   1,208
1997     2,707   1,211     2,707   -         2,707   1,227     2,707   1,217
1998     2,763   1,245     2,763   1,245     2,763   1,245     2,763   1,245
1999     2,716   1,272     2,716   1,272     2,716   1,272     2,716   1,272
2000     2,831   1,291     -       1,291     2,779   1,291     2,796   1,291
2001     2,842   1,314     2,842   1,314     2,842   1,314     2,842   1,314
2002     2,819   1,349     2,819   -         2,819   1,357     2,819   1,365
2003     2,928   1,399     2,928   1,399     2,928   1,399     2,928   1,399
2004     3,032   1,436     3,032   1,436     3,032   1,436     3,032   1,436
2005     3,079   1,479     -       1,479     3,062   1,479     3,006   1,479
2006     3,092   1,527     3,092   -         3,092   1,515     3,092   1,501
2007     3,087   1,551     3,087   1,551     3,087   1,551     3,087   1,551
2008     3,079   1,589     3,079   1,589     3,079   1,589     3,079   1,589
2009     3,019   1,552     3,019   1,552     3,019   1,552     3,019   1,552
Average  2,262   879       2,259   874       2,260   879       2,259   878

Source: www.earth-policy.org
