Study Guides
Big Picture
Bivariate data is often represented with scatterplots and line plots. The main reason to display bivariate data this way is to find a relationship between the two variables. The strength and direction of a linear relationship are described by the correlation coefficient. When the relationship is nonlinear, the data often must be transformed before the correlation coefficient can be used.
Key Terms
• Bivariate: Two variables.
• Scatterplot: A graph where each point represents a pair of measurements (two variables).
• Correlation: The relationship between bivariate data.
• Correlation Coefficient: A number that describes the correlation (relation) between bivariate data.
• Linear Regression: Using data to calculate a line that best fits that data. The line can be used to make predictions.
• Residual: The difference between the observed value and the expected (predicted) value.
Displaying Bivariate Data
Bivariate data is primarily examined to show some sort of relationship between two variables. Bivariate data usually has an independent variable and a dependent variable. The independent variable influences the dependent variable. Time is often the independent variable. We are often interested in the change a variable exhibits over time. We can see if there is any relationship between the variables by showing the data in a scatterplot. You may often see scatterplots as a series of disconnected points. They are a good way of representing bivariate data. The independent variable is on the x-axis while the dependent variable is on the y-axis.
Probability & Statistics
Bivariate Data
Line plots are also used to show change over time. A line plot is basically a scatterplot where the dots are connected chronologically (in order by time).
Correlation
Three important characteristics of bivariate data:
• shape (linear, exponential, etc.)
• direction
• strength
We are usually most interested in finding if there is any correlation in the data. The correlation describes the direction and strength of the relationship. One way to visualize the correlation is with a scatterplot. We can describe the correlation as:
• positive correlation: positive slope
• negative correlation: negative slope
• zero correlation: points do not have a linear trend
• perfect correlation: points on a scatterplot lie on a straight line; can be positive or negative
The more linear the data is, the stronger the linear correlation. Another way to view the strength of this correlation is to draw an ellipse (oval) around all of the data. The narrower the ellipse is, the stronger the linear correlation.
This guide was created by Lizhi Fan and Jin Yu. To learn more about the student authors, visit http://www.ck12.org/about/about-us/team/interns.
v1.1.9.2012
Disclaimer: this study guide was not created to replace your textbook and is for classroom or individual use only.
Correlation Coefficient
The correlation coefficient (r) can be used to express correlation.
• Can have values between -1 and +1. Signs indicate negative (-) and positive (+) correlations
• The closer the absolute value of the coefficient (|r|) is to 1, the stronger the relationship
• Perfect negative correlation is -1; perfect positive correlation is +1
• Only describes linear relationships (a nonlinear relationship can have r close to 0)
• Two variables can still have a strong relationship even if the correlation coefficient is low
One statistic that measures the strength and direction of a linear correlation is the Pearson product-moment correlation coefficient. To calculate the correlation r of two variables X and Y, use the formula:
$r = \frac{\sum z_X z_Y}{n - 1}$, where z is the z-score (computed with the sample standard deviation) and n is the sample size
If we have the raw scores and not the standardized scores, we can use this formula:
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
Correlation only describes linearity. It does not tell us if one variable caused the other.
Transformations to Achieve Linearity
Curvilinear relationships are nonlinear relationships. Just because they are nonlinear does not mean they don’t have a strong correlation. However, the correlation coefficient r by itself will not be able to tell us about the strength of a nonlinear relationship.
There is a way to manipulate data points to make a nonlinear relationship linear. By doing this, we can use the correlation coefficient to describe the strength of the relationship. For example, if we were dealing with a power relationship: $y = ax^b$
• By taking the log of both sides, we can change the data to become a linear relationship. After doing this, we can describe the relationship with a correlation coefficient.
$\log y = \log(ax^b)$
$\log y = \log a + \log x^b$
$\log y = \log a + b \log x$
We can define two new variables:
• Y = log y
• X = log x
The new relationship is Y = log a + bX. log a and b are constants, so we have transformed the power relationship into a linear one.
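As a rough illustration of the raw-score formula for r and of the log transformation described in this section, here is a minimal Python sketch. The function name `pearson_r` and the data set (points that follow y = 2x^3 exactly) are assumptions made up for this example; they do not come from the guide.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from raw scores (hypothetical helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Made-up data that follows the power relationship y = 2 * x**3 exactly.
xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x ** 3 for x in xs]

# r on the original data: strong, but below 1 because the trend is curved.
r_raw = pearson_r(xs, ys)

# r after taking logs of both variables: the points now lie on a straight line.
log_xs = [math.log10(x) for x in xs]
log_ys = [math.log10(y) for y in ys]
r_log = pearson_r(log_xs, log_ys)

print("r before transformation:", round(r_raw, 3))
print("r after log transformation:", round(r_log, 3))
```

With this data, r on the original values is noticeably below 1 even though y is completely determined by x, while r on the logged values is 1 up to rounding. That is the sense in which a transformation lets r describe the strength of a nonlinear relationship.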
Least-Squares Regression Line
Linear regression is a mathematical way to determine the best fit line through a set of data.
• The least-squares regression line (also known as a linear regression line) is created by finding the line that minimizes the sum of the squared vertical distances from the data points to the line. Each of these vertical distances is known as a residual.
• Residual = Observed - Expected
• Generally, the smaller the residuals, the better fit the least-squares regression line is to the data. If all the residuals were added together, the sum would be zero.
• A straight line that represents the change in one variable associated with the change in the other
• Often used to predict values of future data points. This is done simply by substituting a value of a predictor variable (X) into the equation to find the outcome variable Y. (The predictor variable predicts the outcome.)
The regression line is a straight line with the form: Y = bX + a
• Y is what we are trying to predict
• b is the slope of the line (regression coefficient)
• a is the value of Y when X = 0 (regression constant)
• X is the predictor variable
To calculate the line, we need to find b and a.
• $b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$ or $b = r\frac{s_Y}{s_X}$
• $a = \bar{y} - b\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the means of X and Y
• r is the correlation between X and Y
• sY is the standard deviation of Y
• sX is the standard deviation of X
Plotting Residuals and Testing for Linearity
We can plot the residuals by plotting the x-value for each data pair on the x-axis and the residual on the y-axis.
• A residual plot with no outliers and with a linear relationship would appear to have no correlation.
• If the residual plot has an obvious pattern, you may want to try other models of the data, such as power or exponential functions, to see if they are a better fit
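To make the slope, intercept, and residual calculations concrete, here is a small Python sketch. The helper name `fit_line` and the five data points are hypothetical, chosen only for illustration; the sketch uses the raw-score slope formula and a = mean of Y minus b times mean of X.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least-squares line Y = bX + a (hypothetical helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # mean of Y minus b times mean of X
    return b, a

# Made-up sample data with a roughly linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b, a = fit_line(xs, ys)

# Residual = Observed - Expected, one per data pair.
residuals = [y - (b * x + a) for x, y in zip(xs, ys)]

# Use the line to predict Y for a new value of the predictor X.
predicted_at_6 = b * 6 + a

print("slope b:", round(b, 4))
print("intercept a:", round(a, 4))
print("residuals:", [round(e, 4) for e in residuals])
print("sum of residuals:", round(sum(residuals), 10))
print("prediction at X = 6:", round(predicted_at_6, 4))
```

The residuals printed here are what would go on the y-axis of a residual plot, paired with the x-values. Their sum comes out to zero apart from floating-point rounding, as the guide states.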
Inferences
Hypothesis Testing
The least-squares line y = a + bx is for samples. To describe the line for the entire population, we use y = α + βx, where α and β are the population intercept and slope. (The population correlation coefficient is written ρ.) CAREFUL: Here α and β are not the level of significance and the probability of a Type II error.
• Make sure that the set of data is for a random sample.
• Make sure the y values have a normal distribution for each value of x.
If these are true, we can use hypothesis testing.
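• Null hypothesis is that the regression coefficient β = some number
• Alternative hypothesis (Ha) is that β does NOT equal the given number (≠), or that β is greater than (>) or less than (<) that number
• Use the test statistic $t = \frac{b - \beta_0}{s_b}$ with n - 2 degrees of freedom, where $\beta_0$ is the hypothesized value of β, $s_b = \frac{\sqrt{SSE/(n-2)}}{\sqrt{\sum (x - \bar{x})^2}}$, and SSE = sum of the squared residuals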
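The test statistic for the slope can be sketched in Python as follows. This assumes the usual t statistic for a regression slope, t = (b - β0) / s_b with n - 2 degrees of freedom; the function name `slope_t_statistic`, the parameter `beta0`, and the data are hypothetical and used only for illustration.

```python
import math

def slope_t_statistic(xs, ys, beta0=0.0):
    """Return (t, degrees_of_freedom) for testing H0: beta = beta0 (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                     # sample slope
    a = mean_y - b * mean_x           # sample intercept
    # SSE = sum of the squared residuals
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope
    s_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return (b - beta0) / s_b, n - 2

# Made-up sample data with a roughly linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Test H0: beta = 0 (no linear relationship) against Ha: beta != 0.
t_stat, df = slope_t_statistic(xs, ys, beta0=0.0)
print("t statistic:", round(t_stat, 2), "with", df, "degrees of freedom")
```

The resulting t value would then be compared with a critical value from a t table for the chosen level of significance and n - 2 degrees of freedom. The sketch stops short of computing a p-value because that needs the t distribution, which is not in the Python standard library.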