Mathematical Computation June 2013, Volume 2, Issue 2, PP.19-23
Influence Analysis of the Missing Observations Model
Baoguang Tian, Chunyan Liang
College of Mathematics & Physics, Qingdao University of Science and Technology, Qingdao, 266041, China
#Email: tianbaoguangqd@163.com
Abstract
The influence of deleting data on the estimator of the missing observations is studied, from the points of view of the fitted value and of estimation efficiency, in a linear model with several missing observations. An inequality relation between the W-K statistic and the generalized correlation coefficient is established, and an equality relation between the ratio of variance and Zhang's generalized correlation coefficient is obtained. In addition, the relationship between the ratio of generalized variance and the generalized coefficient of correlation defined by Hotelling is derived.
Keywords: Missing Observation Linear Model; W-K Statistics; Ratio of Variance; Generalized Coefficient of Correlation; Estimate of Missing Observations; Ratio of Generalized Variance
1 INTRODUCTION
The influence of experimental data on the linear model is a problem of both applied and theoretical value. In [1-5], Cook and Weisberg and many other authors discussed the influence of experimental data on the linear regression model. In this paper, the influence of data on the missing observations model is discussed from the points of view of the fitted value and of estimation efficiency. The W-K statistic, the ratio of variance and the ratio of generalized variance are established, and their relations to the generalized coefficient of correlation are derived. The general linear model with several missing observations is
$$Y_1 = X_1\beta + e_1 \tag{1}$$
$$Y_2 = X_2\beta + e_2 \tag{2}$$
$$Y_3 = X_3\beta + e_3 \tag{3}$$
where
$$\mathcal{M}(X_2') \subset \mathcal{M}(X_1') \tag{4}$$
$$\mathcal{M}(X_3') \cap \mathcal{M}(X_1') = \{0\} \tag{5}$$
Here $Y_i$ is an $n_i \times 1$ vector of responses, $X_i$ is an $n_i \times p$ matrix of known constants, $e_i$ is an $n_i \times 1$ vector of unobservable errors, and $\beta$ is a $p \times 1$ vector of unknown parameters. Further, $E(e_i) = 0$ and $\mathrm{cov}(e_i, e_j) = \sigma^2\delta_{ij}I_{n_i}$, $i, j = 1, 2, 3$, where $\delta_{ij}$ is the Kronecker symbol, and $\mathcal{M}(A)$ stands for the linear space generated by the column vectors of the matrix $A$. $Y_2$ and $Y_3$ are missing observations. In this paper, it is supposed that the rank of $X_2$ is $n_2$.
Lemma 1 Let $A$, $B$, $U$, $V$ be matrices of appropriate dimensions with $AA^{-}UBV = UBV$. Then
$$(A + UBV)^{-} = A^{-} - A^{-}UB(B + BVA^{-}UB)^{-}BVA^{-} \tag{6}$$
where $A^{-}$ is a generalized inverse of $A$. The proof is given in [6].
Article [7] proved that the estimate $\hat{Y}_2$ of the missing observations $Y_2$ satisfies
$$\hat{Y}_2 = X_2(X_1'X_1)^{-}X_1'Y_1 = X_2\hat{\beta} \tag{7}$$
where $\hat{\beta} = (X_1'X_1)^{-}X_1'Y_1$.
- 19 www.ivypub.org/mc
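As a minimal numerical sketch of Lemma 1 and equation (7), the following Python code (NumPy assumed) checks identity (6) on a small example and computes $\hat{Y}_2$ from simulated data. The function names `lemma1_rhs` and `estimate_missing`, the matrix sizes and the simulated data are illustrative assumptions, not part of the paper. Ordinary inverses stand in for generalized inverses in the lemma check, and the Moore-Penrose inverse is used as one particular choice of $(X_1'X_1)^{-}$.

```python
import numpy as np

def lemma1_rhs(A, U, B, V):
    """Right-hand side of (6), with ordinary inverses standing in for g-inverses."""
    A_inv = np.linalg.inv(A)
    inner = np.linalg.inv(B + B @ V @ A_inv @ U @ B)
    return A_inv - A_inv @ U @ B @ inner @ B @ V @ A_inv

def estimate_missing(X1, Y1, X2):
    """Equation (7): Y2_hat = X2 (X1'X1)^- X1' Y1, with pinv as the g-inverse."""
    beta_hat = np.linalg.pinv(X1.T @ X1) @ X1.T @ Y1
    return X2 @ beta_hat

rng = np.random.default_rng(0)

# Lemma 1 on a well-conditioned example: A positive definite, V = U'.
M = rng.normal(size=(4, 4))
A = M @ M.T + 4.0 * np.eye(4)
U = rng.normal(size=(4, 2))
B = np.diag([1.0, 2.0])
lhs = np.linalg.inv(A + U @ B @ U.T)
print(np.allclose(lhs, lemma1_rhs(A, U, B, U.T)))

# Equation (7) on simulated data (hypothetical sizes n1 = 10, n2 = 2, p = 3).
X1 = rng.normal(size=(10, 3))
X2 = rng.normal(size=(2, 3))
beta = np.array([1.0, -2.0, 0.5])
Y1 = X1 @ beta + 0.1 * rng.normal(size=10)
print(estimate_missing(X1, Y1, X2))
```

With noise-free responses the estimate reproduces $X_2\beta$ exactly, which is a quick sanity check on the implementation.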
2 W-K STATISTICS
Let $I = \{i_1, i_2, \ldots, i_m\}$ be an index set of $m$ selected cases, and let the subscript $(I)$ mean "with the $m$ cases indexed by $I$ deleted". Then $\hat{Y}_{2(I)} = X_2\hat{\beta}_{(I)}$, where $\hat{\beta}_{(I)} = (X_{1(I)}'X_{1(I)})^{-}X_{1(I)}'Y_{1(I)}$ is the LS solution obtained from (1) after the $m$ cases indexed by $I$ are deleted, and $X_{1(I)}$, $Y_{1(I)}$ are obtained from $X_1$ and $Y_1$ by deleting rows $i_1, i_2, \ldots, i_m$, respectively. Let $X_{1I} = (x_{1i_1}, x_{1i_2}, \ldots, x_{1i_m})'$ and suppose the rank of $X_{1I}$ is $m$. Then $H_I = X_{1I}(X_1'X_1)^{-}X_{1I}'$ is a nonsingular matrix.
Definition 1
$$WK = \frac{(X_2(X_1'X_1)^{-}X_2')^{-1/2}\,(\hat{Y}_2 - \hat{Y}_{2(I)})}{\hat{\sigma}_{(I)}}$$
is called the W-K statistic of the missing observation model, where $\hat{\sigma}^2_{(I)}$ is the variance estimate of (1) after the $m$ cases indexed by $I$ are deleted. The matrix $X_2(X_1'X_1)^{-}X_2'$ is nonsingular because $\mathcal{M}(X_2') \subset \mathcal{M}(X_1')$ by (4) and the rank of $X_2$ is $n_2$. The statistical meaning of $WK$ is explicit: it measures the influence of deleting the $m$ cases indexed by $I$ from the fitted-value point of view. Because a vector is inconvenient to apply, its squared norm is used:
$$\|WK\|^2 = \frac{(\hat{Y}_2 - \hat{Y}_{2(I)})'\,(X_2(X_1'X_1)^{-}X_2')^{-1}\,(\hat{Y}_2 - \hat{Y}_{2(I)})}{\hat{\sigma}^2_{(I)}} \tag{8}$$
If the value of $\|WK\|^2$ is large, $I$ has a great influence on the estimate $\hat{Y}_2$ of the missing observations; otherwise, its influence is limited.
Theorem 1
1)
$$\|WK\|^2 \le \frac{1}{1-\lambda_m}\cdot\frac{1 - r_Z^{[4]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)})}{r_Z^{[4]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)})}\cdot\frac{\sum_{i\in I}\varepsilon_i^2}{\hat{\sigma}^2_{(I)}} \tag{9}$$
2)
$$\|WK\|^2 \ge \frac{1}{1-\lambda_1}\cdot\frac{1 - r_Z^{[3]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)})}{r_Z^{[3]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)})}\cdot\frac{\sum_{i\in I}\varepsilon_i^2}{\hat{\sigma}^2_{(I)}} \tag{10}$$
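where $\lambda_m$ and $\lambda_1$ denote the maximum and minimum eigenvalues of the matrix $H_I$, respectively, $\varepsilon_i = y_i - x_i'\hat{\beta}$ $(i \in I)$ is the $i$-th component of the residual vector $\varepsilon_I = Y_I - X_{1I}\hat{\beta}$, and $r_Z^{[3]}(\cdot,\cdot)$, $r_Z^{[4]}(\cdot,\cdot)$ are the generalized coefficients of correlation defined by Zhang (see [8]).
Proof: By Lemma 1, we have
$$\hat{Y}_2 - \hat{Y}_{2(I)} = X_2(X_1'X_1)^{-}X_{1I}'(I - H_I)^{-1}\varepsilon_I .$$
Therefore
$$\|WK\|^2 = \frac{\varepsilon_I'(I - H_I)^{-1}X_{1I}(X_1'X_1)^{-}X_2'\,(X_2(X_1'X_1)^{-}X_2')^{-1}\,X_2(X_1'X_1)^{-}X_{1I}'(I - H_I)^{-1}\varepsilon_I}{\hat{\sigma}^2_{(I)}}$$
For simplification, let $A = X_{1I}(X_1'X_1)^{-}X_2'$ and $B = X_2(X_1'X_1)^{-}X_2'$; then
$$\|WK\|^2 = \frac{\varepsilon_I'(I - H_I)^{-1}AB^{-1}A'(I - H_I)^{-1}\varepsilon_I}{\hat{\sigma}^2_{(I)}}$$
Let
$$t_I^2 = \frac{\varepsilon_I'(I - H_I)^{-1}\varepsilon_I}{\hat{\sigma}^2_{(I)}}, \qquad Q_I = \frac{\varepsilon_I'(I - H_I)^{-1}AB^{-1}A'(I - H_I)^{-1}\varepsilon_I}{\varepsilon_I'(I - H_I)^{-1}\varepsilon_I}$$
Therefore $\|WK\|^2 = t_I^2\,Q_I$. By the extremal property of eigenvalues, we obtain
$$\max_{\varepsilon_I} Q_I = \lambda_{\max}\big((I - H_I)^{-1}AB^{-1}A'\big) = \lambda_{\max}\big(A'(I - H_I)^{-1}AB^{-1}\big)$$
$$\min_{\varepsilon_I} Q_I = \lambda_{\min}\big((I - H_I)^{-1}AB^{-1}A'\big) = \lambda_{\min}\big(A'(I - H_I)^{-1}AB^{-1}\big)$$
Here $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the maximum and minimum eigenvalues of a matrix, respectively. On the other hand,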
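$$\mathrm{cov}\begin{pmatrix} X_2\hat{\beta} \\ X_2\hat{\beta}_{(I)} \end{pmatrix} = \sigma^2\begin{pmatrix} B & B \\ B & B + A'(I - H_I)^{-1}A \end{pmatrix} \tag{11}$$
Let $0 \le \rho_m^2 \le \cdots \le \rho_1^2 \le 1$ be the squared canonical correlation coefficients of $X_2\hat{\beta}$ and $X_2\hat{\beta}_{(I)}$; then $\rho_i^2$ is an eigenvalue of the matrix
$$B\,(B + A'(I - H_I)^{-1}A)^{-1} = (I + A'(I - H_I)^{-1}AB^{-1})^{-1} \tag{12}$$
Again by the definition of Zhang's generalized correlation coefficient, we get
$$\rho_m^2 = r_Z^{[4]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)}) = \frac{1}{1 + \lambda_{\max}\big(A'(I - H_I)^{-1}AB^{-1}\big)} \tag{13}$$
$$\rho_1^2 = r_Z^{[3]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)}) = \frac{1}{1 + \lambda_{\min}\big(A'(I - H_I)^{-1}AB^{-1}\big)} \tag{14}$$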
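Let $0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_m < 1$ be the eigenvalues of the matrix $H_I$; then $\frac{1}{1-\lambda_1} \le \cdots \le \frac{1}{1-\lambda_m}$ are the eigenvalues of the matrix $(I - H_I)^{-1}$. Let $P_1, P_2, \ldots, P_m$ be the corresponding orthonormal eigenvectors; the spectral decomposition of the matrix $(I - H_I)^{-1}$ is
$$(I - H_I)^{-1} = \sum_{i=1}^{m}\frac{1}{1-\lambda_i}P_iP_i'$$
Therefore,
$$\varepsilon_I'(I - H_I)^{-1}\varepsilon_I \le \frac{1}{1-\lambda_m}\sum_{i=1}^{m}\varepsilon_I'P_iP_i'\varepsilon_I = \frac{1}{1-\lambda_m}\sum_{i\in I}\varepsilon_i^2 \tag{15}$$
By the same method, we have
$$\varepsilon_I'(I - H_I)^{-1}\varepsilon_I \ge \frac{1}{1-\lambda_1}\sum_{i\in I}\varepsilon_i^2 \tag{16}$$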
According to the above, Theorem 1 has been proved.
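The bounds of Theorem 1 can be checked numerically. The sketch below is illustrative only: the function name `wk_bounds`, the simulated data, and the choice of $\hat{\sigma}^2_{(I)}$ as the residual mean square of the reduced fit with $n_1 - m - p$ degrees of freedom are assumptions, since the paper does not spell out the variance estimate. It takes $H_I = X_{1I}(X_1'X_1)^{-}X_{1I}'$ and $A = X_{1I}(X_1'X_1)^{-}X_2'$, and it also asserts the deletion identity used at the start of the proof.

```python
import numpy as np

def wk_bounds(X1, Y1, X2, idx):
    """Return (lower, wk2, upper): bound (10), the statistic (8), bound (9)."""
    n1, p = X1.shape
    idx = np.asarray(idx)
    m = len(idx)
    G = np.linalg.pinv(X1.T @ X1)          # (X1'X1)^-
    beta = G @ X1.T @ Y1                   # LS solution from all cases
    X1I = X1[idx]                          # rows of the deleted cases
    eps = Y1[idx] - X1I @ beta             # residuals of the deleted cases
    H = X1I @ G @ X1I.T                    # H_I
    A = X1I @ G @ X2.T
    B = X2 @ G @ X2.T

    # Refit after deleting the cases in idx.
    keep = np.setdiff1d(np.arange(n1), idx)
    beta_del = np.linalg.lstsq(X1[keep], Y1[keep], rcond=None)[0]
    resid = Y1[keep] - X1[keep] @ beta_del
    sigma2 = resid @ resid / (n1 - m - p)  # assumed form of the variance estimate

    d = X2 @ (beta - beta_del)             # Y2_hat - Y2_hat(I)
    IH_inv = np.linalg.inv(np.eye(m) - H)
    assert np.allclose(d, A.T @ IH_inv @ eps)  # deletion identity from the proof
    wk2 = d @ np.linalg.solve(B, d) / sigma2   # equation (8)

    lam = np.linalg.eigvals(A.T @ IH_inv @ A @ np.linalg.inv(B)).real
    lam = np.clip(lam, 0.0, None)          # guard against tiny negative round-off
    r4 = 1.0 / (1.0 + lam.max())           # equation (13)
    r3 = 1.0 / (1.0 + lam.min())           # equation (14)
    h = np.linalg.eigvalsh((H + H.T) / 2)  # eigenvalues of H_I, ascending
    s = eps @ eps / sigma2
    upper = s / (1.0 - h[-1]) * (1.0 - r4) / r4
    lower = s / (1.0 - h[0]) * (1.0 - r3) / r3
    return lower, wk2, upper

rng = np.random.default_rng(1)
X1 = rng.normal(size=(12, 3))
X2 = rng.normal(size=(2, 3))
Y1 = X1 @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=12)
lower, wk2, upper = wk_bounds(X1, Y1, X2, [0, 3])
print(lower, wk2, upper)
```

The example deletes $m = 2$ cases with $n_2 = 2$ missing observations, so the $m \times m$ and $n_2 \times n_2$ matrix products in the proof have the same eigenvalues.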
3 THE RATIO OF VARIANCE
Definition 2
$$e(\hat{\beta}_{(I)}) = \frac{\mathrm{Var}(C'X_2\hat{\beta})}{\mathrm{Var}(C'X_2\hat{\beta}_{(I)})}$$
is called the ratio of variance of the missing observation model. Here $C$ is a nonzero $n_2 \times 1$ vector (see [9]).
The ratio of variance is a measure from the estimation-efficiency point of view. A test point that makes the ratio of variance small is an influential point, which has a great effect on the estimate of the missing observations.
Theorem 2
1)
$$\inf_{C} e(\hat{\beta}_{(I)}) = r_Z^{[4]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)}) \tag{17}$$
2)
$$\sup_{C} e(\hat{\beta}_{(I)}) = r_Z^{[3]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)}) \tag{18}$$
Proof:
$$\mathrm{Var}(C'X_2\hat{\beta}) = \sigma^2 C'X_2(X_1'X_1)^{-}X_1'X_1(X_1'X_1)^{-}X_2'C = \sigma^2 C'X_2(X_1'X_1)^{-}X_2'C = \sigma^2 C'BC$$
By Lemma 1, we have
$$\mathrm{Var}(C'X_2\hat{\beta}_{(I)}) = \sigma^2 C'\,[B + A'(I - H_I)^{-1}A]\,C$$
Therefore,
$$e(\hat{\beta}_{(I)}) = \frac{C'BC}{C'(B + A'(I - H_I)^{-1}A)C}$$
By the definition of the maximum and minimum relative eigenvalues, $\sup_C e(\hat{\beta}_{(I)})$ and $\inf_C e(\hat{\beta}_{(I)})$ are, respectively, the largest and smallest roots of the equation
$$\left|(B + A'(I - H_I)^{-1}A)^{-1}B - \rho_i^2 I\right| = 0 \tag{19}$$
According to formula (12) and Zhang's definition of the generalized coefficient of correlation, Theorem 2 is proved.
The theorem establishes an inequality relation between the ratio of variance and the generalized coefficients of correlation defined by Zhang: the upper and lower bounds of the influence of the $m$ deleted cases on the estimate of the missing observations are $r_Z^{[3]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)})$ and $r_Z^{[4]}(X_2\hat{\beta}, X_2\hat{\beta}_{(I)})$, respectively.
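Theorem 2 can be illustrated numerically: for any direction $C$, the ratio of variance should lie between the smallest and largest roots of (19). The following is a minimal sketch under assumed simulated data; `variance_matrices` and `variance_ratio` are hypothetical helper names, and $\sigma^2$ cancels in the ratio so it is omitted.

```python
import numpy as np

def variance_matrices(X1, X2, idx):
    """Return B and B + A'(I - H_I)^{-1} A (covariances divided by sigma^2)."""
    idx = np.asarray(idx)
    G = np.linalg.pinv(X1.T @ X1)
    X1I = X1[idx]
    H = X1I @ G @ X1I.T
    A = X1I @ G @ X2.T
    B = X2 @ G @ X2.T
    B_del = B + A.T @ np.linalg.inv(np.eye(len(idx)) - H) @ A
    return B, B_del

def variance_ratio(C, B, B_del):
    """Ratio of variance e for a given direction C."""
    return (C @ B @ C) / (C @ B_del @ C)

rng = np.random.default_rng(2)
X1 = rng.normal(size=(12, 3))
X2 = rng.normal(size=(2, 3))
B, B_del = variance_matrices(X1, X2, [0, 3])

# Roots of (19): eigenvalues of (B + A'(I - H_I)^{-1} A)^{-1} B, sorted ascending.
rho2 = np.sort(np.linalg.eigvals(np.linalg.solve(B_del, B)).real)

# Ratios for many random directions C stay inside [rho2[0], rho2[-1]].
ratios = [variance_ratio(rng.normal(size=2), B, B_del) for _ in range(1000)]
print(rho2[0], min(ratios), max(ratios), rho2[-1])
```

The matrix `B_del` can also be computed directly as $X_2(X_{1(I)}'X_{1(I)})^{-1}X_2'$ from the reduced design, which gives an independent check of the Lemma 1 step in the proof.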
4 THE RATIO OF GENERALIZED VARIANCE
Definition 3
$$e(X_2\hat{\beta}_{(I)}) = \frac{|\mathrm{Cov}(X_2\hat{\beta})|}{|\mathrm{Cov}(X_2\hat{\beta}_{(I)})|}$$
is called the ratio of generalized variance of the missing observation model. In the following, it is proved that there is an equality relation between the ratio of generalized variance and the generalized coefficient of correlation defined by Hotelling.
Theorem 3
$$e(X_2\hat{\beta}_{(I)}) = r_H(X_2\hat{\beta}, X_2\hat{\beta}_{(I)}) \tag{20}$$
where $r_H(\cdot,\cdot)$ is the generalized coefficient of correlation defined by Hotelling.
Proof:
$$e(X_2\hat{\beta}_{(I)}) = \frac{|\mathrm{Cov}(X_2\hat{\beta})|}{|\mathrm{Cov}(X_2\hat{\beta}_{(I)})|} = \frac{|B|}{|B + A'(I - H_I)^{-1}A|} = \left|I + B^{-1}A'(I - H_I)^{-1}A\right|^{-1}$$
This determinant equals the product of the roots $\rho_i^2$ of (19). By (19) and the definition of the generalized coefficient of correlation given by Hotelling, Theorem 3 holds.
Because $0 \le e(X_2\hat{\beta}_{(I)}) \le 1$, a test point that makes the ratio of generalized variance small is an influential point, which has a great influence on the estimate of the missing observations. This reveals the inner relationship between the ratio of generalized variance and the generalized coefficient of correlation defined by Hotelling.
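The determinant identities in the proof of Theorem 3 can be checked on simulated data. This is a sketch only: `generalized_variance_ratio` is a hypothetical helper, the data are simulated, and the code verifies the algebra (the determinant ratio, the product of the roots of (19), and the inverse-determinant form) without checking the identification of that product with Hotelling's coefficient, which is taken from the paper.

```python
import numpy as np

def generalized_variance_ratio(X1, X2, idx):
    """Return (|B| / |B + M|, product of the roots of (19), |I + B^{-1} M|^{-1}),
    where M = A'(I - H_I)^{-1} A."""
    idx = np.asarray(idx)
    G = np.linalg.pinv(X1.T @ X1)
    X1I = X1[idx]
    H = X1I @ G @ X1I.T
    A = X1I @ G @ X2.T
    B = X2 @ G @ X2.T
    M = A.T @ np.linalg.inv(np.eye(len(idx)) - H) @ A
    ratio = np.linalg.det(B) / np.linalg.det(B + M)
    rho2 = np.linalg.eigvals(np.linalg.solve(B + M, B)).real
    alt = 1.0 / np.linalg.det(np.eye(B.shape[0]) + np.linalg.solve(B, M))
    return ratio, float(np.prod(rho2)), alt

rng = np.random.default_rng(3)
X1 = rng.normal(size=(12, 3))
X2 = rng.normal(size=(2, 3))
ratio, prod_rho2, alt = generalized_variance_ratio(X1, X2, [0, 3])
print(ratio, prod_rho2, alt)
```

All three returned numbers should agree up to round-off and lie in $(0, 1]$; a value close to 0 flags the deleted cases as influential in the sense of Section 4.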
REFERENCES
[1] Cook R D, Weisberg S. Residuals and influence in regression [M]. New York: Chapman and Hall, 1982.
[2] WANG Ming, SHI Lei. Influence analysis of covariance analysis model [J]. Journal of Yunnan University (Natural Science), 2003, 25(5): 391-394.
[3] Cook R D. Assessment of local influence [J]. J Roy Statist Soc B, 1986, 48(2): 133-169.
[4] Ferrari S L P, Cribari-Neto F. Beta regression for modelling rates and proportions [J]. J Applied Statistics, 2004, 31(7): 799-815.
[5] LI Ai-ping, XIE Feng-chang, LIU Ying-an. Influence diagnostics in Beta regression model [J]. Applied Mathematics Journal of Chinese Universities, 2007, 22(3): 293-300.
[6] Henderson H V, Searle S R. On deriving the inverse of a sum of matrices [J]. SIAM Review, 1981, 23(1): 53-59.
[7] Tian Baoguang, Ji Qingzhong. The influence of deleting data on missing observations model [J]. Journal of Nanjing University Mathematical Biquarterly, 2005, 22(2): 349-354.
[8] Zhang Y T. Generalized coefficient of correlation and its application [J]. Acta Math. Appl. Sin., 1978, 1(3): 312-320.
[9] Wang J R, Gao D D. The relative efficiency of GLES under Euclidean norm [J]. Chinese Journal of Applied Probability and Statistics, 1991, 7(4): 361-366.
AUTHORS
1 Tian Baoguang was born in Shandong, China, in 1962. Professor. He received his M.S. degree in Applied Mathematics from Shanxi Normal University. His research interests are in probability and mathematical statistics.
2 Liang Chunyan was born in Shandong, China. Currently, she is a postgraduate. Her research interests are in probability and mathematical statistics.