After verifying that the linear correlation between two variables is significant, we next determine the equation of the line that can be used to predict the value of y for a given value of x.
For a given x-value, d = (observed y-value) – (predicted y-value)
For each data point, di is the difference between the observed y-value and the predicted y-value on the line at that x-value. These differences are called residuals.
[Figure: scatter plot with a fitted line; the residuals d1, d2, d3 are the vertical distances between the observed and predicted y-values.]
Pharmaceutical Biostatistics: Correlation & Regression
• Simple regression analysis predicts one variable, Y, in terms of another, X.
• The variable being predicted (Y) is called the dependent variable.
• The variable used to make the prediction (X) is called the independent variable.
• The aim is to estimate the relationship between the two variables by expressing one as a linear (or more complex) function of the other.
① Values of the independent variable X are said to be “fixed”. The Y values are statistically independent.
② The variable X is measured without error.
③ For each value of X there is a subpopulation of Y values. Residuals must be normally distributed.
④ The variances of the subpopulations of Y are all equal (homoscedasticity) and are denoted by σ².
⑤ The means of the subpopulations of Y all lie on the same straight line (linearity assumption).
⑥ When the data are time series, the errors are often correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated. For linear regression, the data should not be autocorrelated.
– Method of least squares: the method commonly employed for obtaining the best-fitting line.
– The least-squares line: the resulting line.
– General equation for a straight line: y = mx + b
① y is a value on the vertical axis
② x is a value on the horizontal axis
③ b is the point where the line crosses the vertical axis (y-intercept)
④ m is the amount by which y changes for each unit change in x (slope)
• GOAL: minimize the sum of the squares of the errors, that is, of the differences between the actual and predicted values of the data points.
This also minimizes the mean square error, MSE = ∑(y − ŷ)² / n.
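The least-squares goal can be illustrated numerically. The sketch below assumes the five data points and the fitted line ŷ = 1.2x – 3.8 from the worked example later in these slides; the two other candidate lines and the function names are hypothetical, chosen only for comparison. It computes the sum of squared errors (SSE) and the mean square error for each candidate.

```python
# Data from the five-point worked example in these slides.
xs = [1, 2, 3, 4, 5]
ys = [-3, -1, 0, 1, 2]


def sum_squared_error(m, b, xs, ys):
    """Sum of squared residuals (observed y minus predicted y) for y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def mean_squared_error(m, b, xs, ys):
    """Sum of squared residuals divided by the number of points n."""
    return sum_squared_error(m, b, xs, ys) / len(xs)


# The least-squares line from the example and two hypothetical alternatives.
candidates = {
    "least squares (m=1.2, b=-3.8)": (1.2, -3.8),
    "hypothetical (m=1.0, b=-3.0)": (1.0, -3.0),
    "hypothetical (m=1.5, b=-4.5)": (1.5, -4.5),
}

for label, (m, b) in candidates.items():
    sse = sum_squared_error(m, b, xs, ys)
    mse = mean_squared_error(m, b, xs, ys)
    print(f"{label}: SSE = {sse:.2f}, MSE = {mse:.3f}")
```

Of these three candidates, the least-squares line gives the smallest SSE. Because n is the same for every candidate, the line that minimizes the SSE also minimizes the mean square error.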
A regression line, also called a line of best fit, is the line for which the sum of the squares of the residuals is a minimum.
The Equation of a Regression Line
The equation of a regression line for an independent variable x and a dependent variable y is
ŷ = mx + b
where ŷ is the predicted y-value for a given x-value. The slope m and y-intercept b are given by
m = [n∑xy − (∑x)(∑y)] / [n∑x² − (∑x)²]
b = ȳ − m·x̄ = (∑y)/n − m·(∑x)/n
where ȳ is the mean of the y-values and x̄ is the mean of the x-values. The regression line always passes through (x̄, ȳ).
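The slope and intercept formulas translate directly into code. The following is a minimal sketch, not a reference implementation; the function name regression_line and the three test points (which lie exactly on y = 2x + 1) are hypothetical and are not taken from the slides.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b.

    Uses the sums formulas:
        m = [n*Sxy - Sx*Sy] / [n*Sxx - Sx**2]
        b = Sy/n - m*Sx/n
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # All x-values are identical, so the slope is undefined.
        raise ValueError("x-values must not all be equal")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = sum_y / n - m * sum_x / n
    return m, b


# Hypothetical check: points that lie exactly on y = 2x + 1.
m, b = regression_line([0, 1, 2], [1, 3, 5])
print(f"slope = {m}, intercept = {b}")
```

The guard on the denominator reflects that the formula for m breaks down when every x-value is the same, since no unique line can then be fitted.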
Example: Find the equation of the regression line.
x      y      xy     x²     y²
1      –3     –3     1      9
2      –1     –2     4      1
3      0      0      9      0
4      1      4      16     1
5      2      10     25     4
∑x = 15    ∑y = −1    ∑xy = 9    ∑x² = 55    ∑y² = 15
m = [n∑xy − (∑x)(∑y)] / [n∑x² − (∑x)²] = [5(9) − (15)(−1)] / [5(55) − (15)²] = 60/50 = 1.2
b = ȳ − m·x̄ = (∑y)/n − m·(∑x)/n = −1/5 − (1.2)(15/5) = −3.8
The equation of the regression line is ŷ = 1.2x – 3.8.
[Figure: scatter plot of the five data points with the regression line ŷ = 1.2x – 3.8, which passes through (x̄, ȳ) = (3, −1/5).]
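As a cross-check on this example, the slope can also be computed from deviations about the means, m = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)², which is algebraically equivalent to the sums formula. This deviation form is not shown in the slides and is used here only for verification; the variable names are illustrative. The sketch also checks that the line passes through (x̄, ȳ) and that the residuals sum to zero (up to rounding), which holds for a least-squares line with an intercept.

```python
xs = [1, 2, 3, 4, 5]
ys = [-3, -1, 0, 1, 2]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Deviation form of the slope (equivalent to the sums formula).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual for each point: observed y minus predicted y.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(f"mean point = ({x_bar}, {y_bar})")
print(f"slope = {slope:.1f}, intercept = {intercept:.1f}")
print(f"prediction at x_bar = {slope * x_bar + intercept:.1f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```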
Exercise: The following data represents the number of hours 12 different students watched television during the weekend and the scores of each student who took a test the following Monday. a) Find the equation of the regression line. b) Use the equation to find the expected test score for a student who watches 9 hours of TV.
Hours, x    Test score, y
0           96
1           85
2           82
3           74
3           95
5           68
5           76
5           84
6           58
7           65
7           75
10          50
Example: Using the television and test-score data for the 12 students in the exercise, a) find the equation of the regression line, and b) use the equation to find the expected test score for a student who watches 9 hours of TV.
Hours, x    Test score, y    xy      x²      y²
0           96               0       0       9216
1           85               85      1       7225
2           82               164     4       6724
3           74               222     9       5476
3           95               285     9       9025
5           68               340     25      4624
5           76               380     25      5776
5           84               420     25      7056
6           58               348     36      3364
7           65               455     49      4225
7           75               525     49      5625
10          50               500     100     2500
∑x = 54    ∑y = 908    ∑xy = 3724    ∑x² = 332    ∑y² = 70836
Example continued:
m = [n∑xy − (∑x)(∑y)] / [n∑x² − (∑x)²] = [12(3724) − (54)(908)] / [12(332) − (54)²] ≈ −4.067
b = ȳ − m·x̄ = (∑y)/n − m·(∑x)/n = 908/12 − (−4.067)(54/12) ≈ 93.97
ŷ = –4.07x + 93.97
(x̄, ȳ) = (54/12, 908/12) ≈ (4.5, 75.7)
[Figure: scatter plot of test score (y) against hours watching TV (x) with the regression line ŷ = –4.07x + 93.97.]
Example continued: Using the equation ŷ = –4.07x + 93.97, we can predict the test score for a student who watches 9 hours of TV. ŷ = –4.07x + 93.97 = –4.07(9) + 93.97 = 57.34
A student who watches 9 hours of TV over the weekend can expect to receive a score of about 57.34 on Monday’s test.
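The whole example can be reproduced in a few lines. The sketch below assumes the 12 data pairs from the table and applies the sums formulas for m and b; the variable names are illustrative. It predicts the score at 9 hours twice: once with the unrounded coefficients and once with the coefficients rounded to two decimals, as in the hand calculation.

```python
hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)

# Slope and intercept from the sums formulas.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n

# Prediction with full-precision coefficients.
predicted = slope * 9 + intercept

# Prediction with coefficients rounded to two decimals, as done by hand.
predicted_rounded = round(slope, 2) * 9 + round(intercept, 2)

print(f"sums: x={sum_x}, y={sum_y}, xy={sum_xy}, x2={sum_x2}")
print(f"slope = {slope:.3f}, intercept = {intercept:.2f}")
print(f"prediction at 9 hours (unrounded coefficients) = {predicted:.2f}")
print(f"prediction at 9 hours (rounded coefficients)   = {predicted_rounded:.2f}")
```

The two predictions differ slightly in the second decimal place because the hand calculation rounds the slope and intercept before substituting x = 9. Note also that 9 hours lies inside the observed range of 0 to 10 hours; predictions outside that range would be less reliable.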