Intro Stats, 5ERichard D De Veaux Instructor Guide

Page 1

ONLINE TEST BANK AND RESOURCE GUIDE JARED DERKSEN

I NTRO S TATS FIFTH EDITION

Richard De Veaux Williams College

Paul Velleman Cornell University

David Bock Cornell University


Contents Part I

Exploring and Understanding Data Chapter 1

Stats Starts Here

1-1

Chapter 2

Displaying and Describing Data

2-1

Chapter 3

Relationships Between Categorical Variables—Contingency Tables

3-1

Chapter 4

Understanding and Comparing Distributions

4-1

Chapter 5

The Standard Deviation as a Ruler and the Normal Model

5-1

Review of Part I: Exploring and Understanding Data Part II

Exploring Relationships Between Variables Chapter 6

Scatterplots, Association, and Correlation

6-1

Chapter 7

Linear Regression

7-1

Chapter 8

Regression Wisdom

8-1

Chapter 9

Multiple Regression

9-1

Review of Part II: Exploring Relationships Between Variables Part III

Chapter 10

Sample Surveys

10-1

Chapter 11

Experiments and Observational Studies

11-1 Part III-1

From the Data at Hand to the World at Large Chapter 12

From Randomness to Probability

12-1

Chapter 13

Sampling Distribution Models and Confidence Intervals for Proportions

13-1

Chapter 14

Confidence Intervals for Means

14-1

Chapter 15

Testing Hypotheses

15-1

Chapter 16

More About Tests and Intervals

16-1

Review of Part IV: From the Data at Hand to the World at Large Part V

Part II-1

Gathering Data

Review of Part III: Gathering Data Part IV

Part I-1

Part IV-1

Inference for Relationships Chapter 17

Comparing Groups

17-1

Chapter 18

Paired Samples and Blocks

18-1

Chapter 19

Comparing Counts

19-1

Chapter 20

Inferences for Regression

20-1

Review of Part V: Inference for Relationships .

iii

Part V-1


How to use this Test Bank and Resource Guide This guide is a supplement to be used in conjunction with the Instructor’s Edition of Intro Stats, 5th edition by De Veaux, Velleman, and Bock. The authors have integrated many instructor’s resources into the text, and these sections precede each chapter. In this Test Bank and Resource Guide, all or some of the following features may be found for each chapter and unit. Solutions to Class Examples Answers are provided to the chapter examples presented in the Instructor’s Edition of the text. Investigative Tasks Instead of a quiz, you may choose to have students do a written assignment that applies the major concepts of the chapter. Along with each classroom-tested task, there is a proposed solution to the task and a scoring rubric. Returning the completed rubric to the students will provide them the guidance needed to learn to write clear, complete, and concise statistical analyses. Chapter Quizzes You might choose to give a quiz after completing a chapter. For each chapter, there are two or three quizzes that you can choose from, along with solutions. If not used as a quiz, the questions can be used as additional class examples, homework assignments, or extra practice. Unit Tests Two or three sample exams (and solutions) are available for you at the end of each of the text’s five units. These exams include multiple-choice questions, short questions requiring some calculations or written explanations, and longer questions requiring more in-depth analysis. They are not easy. Understanding Statistics means thinking about the world. All of the problems ask for clear understanding of important statistical concepts, accurate application of statistical techniques, and proper interpretation of the results. Expecting this from the start helps students establish the habit of clear statistical thinking. Supplemental Resources We’ve tried lots of things over the years to help students understand the beauty and power of Statistics. Where applicable, we’ve included some extra materials. These might be worksheets, group assignments, or class activities.

.

iv


Chapter 1 Stats Starts Here Solutions to Class Examples: Consumer Reports Who: energy bars What: brand name, flavor, price, calories, protein, fat When: not specified Where: not specified How: not specified. Are data collected from the label? Are independent tests performed? Why: information for potential consumers Categorical variables: brand name, flavor Quantitative variables: price (US$), number of calories (calories), protein (grams), fat(grams) Boston Marathon Who: Boston Marathon runners What: gender, country, age, time When: not specified Where: Boston How: not specified. Presumably, the data were collected from registration information. Why: race result reporting Categorical variables: gender, country Quantitative variables: age (years), time (hours, minutes, seconds) Supplemental Resources: The following page contains a list of the 50 United States of America. We have found it to be helpful if you collect class data on the number of States visited. On the next page is a potential blank survey that you can pass around on the first day of class to collect some data. Some of the survey questions are left deliberately vague, so that you can discuss potential sources of bias, informally of course.

.


1-2

Part I Exploring and Understanding Data

States – Count the number you have visited Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida Georgia Hawaii Idaho Illinois

Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana

Nebraska Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota Ohio Oklahoma Oregon Pennsylvania

Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming

States – Count the number you have visited Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida Georgia Hawaii Idaho Illinois

Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana

Nebraska Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota Ohio Oklahoma Oregon Pennsylvania

Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming

States – Count the number you have visited Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida Georgia Hawaii Idaho Illinois

Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana

Nebraska Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota Ohio Oklahoma Oregon Pennsylvania

.

Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming


Chapter 1 Stats Starts Here

Statistics – Class Survey Gender (M/F)

Politics (L, M, C)

Number of Siblings

.

States Visited

Shoe Size

1-3


1-4

Part I Exploring and Understanding Data

Statistics Quiz A – Chapter 1

Name __________________________

1. One of the reasons that the Monitoring the Future (MTF) project was started was “to study changes in the beliefs, attitudes, and behavior of young people in the United States.” Data are collected from 8th, 10th, and 12th graders each year. To get a representative nationwide sample, surveys are given to a randomly selected group of students. In Spring 2016, students were asked about alcohol, illegal drug, and cigarette use. Describe the W’s, if the information is given. If the information is not given, state that it is not specified. 

Who:

What:

When:

Where:

How:

Why:

2. Consider the following part of a data set: Age Height Weight Credit Sex Only child? GPA Major (years) (inches) (pounds) Hours 21 Female Yes 67.00 140.0 16 3.60 animal science 20 Female No 62.00 130.0 18 3.86 biology 28 Female No 64.00 188.0 21 3.25 psychology 21 Male No 65.00 140.0 15 2.95 psychology 24 Female No 67.00 130.0 20 3.00 anthropology 22 Male Yes 68.00 135.0 15 2.94 journalism List the variables in the data set. Indicate whether each variable is treated as categorical or quantitative in this data set. If the variable is quantitative, state the units.

.


Chapter 1 Stats Starts Here

1-5

Statistics Quiz A – Chapter 1 – Key 1. One of the reasons that the Monitoring the Future (MTF) project was started was “to study changes in the beliefs, attitudes, and behavior of young people in the United States.” Data are collected from 8th, 10th, and 12th graders each year. To get a representative nationwide sample, surveys are given to a randomly selected group of students. In Spring 2016, students were asked about alcohol, illegal drug, and cigarette use. Describe the W’s, if the information is given. If the information is not given, state that it is not specified.  Who: 8th, 10th, and 12th graders 

What: alcohol, illegal drug, and cigarette use

When: Spring 2016

Where: United States

How: survey

Why: “to study changes in the beliefs, attitudes, and behavior of young people in the United States”

2. Consider the following part of a data set: Age Height Weight Credit Sex Only child? GPA Major (years) (inches) (pounds) Hours 21 Female Yes 67.00 140.0 16 3.60 animal science 20 Female No 62.00 130.0 18 3.86 biology 28 Female No 64.00 188.0 21 3.25 psychology 21 Male No 65.00 140.0 15 2.95 psychology 24 Female No 67.00 130.0 20 3.00 anthropology 22 Male Yes 68.00 135.0 15 2.94 journalism List the variables in the data set. Indicate whether each variable is treated as categorical or quantitative in this data set. If the variable is quantitative, state the units. Categorical: sex, only child?, major Quantitative: age (years), height (inches), weight (pounds), credit hours, GPA

.


1-6

Part I Exploring and Understanding Data

Statistics Quiz B – Chapter 1

Name ________________________

In November 2003 Discover published an article on the colonies of ants. They reported some basic information about many species of ants and the results of some discoveries found by myrmecologist Walter Tschinkel of the University of Florida. Information included the scientific name of the ant species, the geographic location, the depth of the nest (in feet), the number of chambers in the nest, and the number of ants in the colony. The article documented how new ant colonies begin, the ant-nest design, and how nests differ in shape, number, size of chambers, and how they are connected, depending on the species. It reported that nest designs include vertical, horizontal, or inclined tunnels for movement and transport of food and ants. 1. Describe the W’s, if the information is given:  Who: 

What:

When:

Where:

How:

Why:

2. List the variables. Indicate whether each variable is categorical or quantitative. If the variable is quantitative, tell the units.

.


Chapter 1 Stats Starts Here

1-7

Statistics Quiz B – Chapter 1 – Key In November 2003 Discover published an article on the colonies of ants. They reported some basic information about many species of ants and the results of some discoveries found by myrmecologist Walter Tschinkel of the University of Florida. Information included the scientific name of the ant species, the geographic location, the depth of the nest (in feet), the number of chambers in the nest, and the number of ants in the colony. The article documented how new ant colonies begin, the ant-nest design, and how nests differ in shape, number, size of chambers, and how they are connected, depending on the species. It reported that nest designs include vertical, horizontal, or inclined tunnels for movement and transport of food and ants. 1. Describe the W’s, if the information is given:  Who: Colonies of ants. “Many species of ants,” but no indication of exactly how many. 

What: scientific name, geographic location, average nest depth, average number of chambers, average

When: November 2003

Where: not specified

How: The results of some discoveries found by myrmecologist Walter Tschinkel of the University of

Why: Information of interest to readers of the magazine

colony size, how new ant colonies begin, the ant-nest design, and how nests differ in architecture.

Florida

2. List the variables. Indicate whether each variable is categorical or quantitative. If the variable is quantitative, tell the units. Categorical: species, geographic location, how new ant colonies begin, and nest design. Quantitative: nest depth (feet), number of chambers (units), and colony size (units).

.


1-8

Part I Exploring and Understanding Data

Statistics Quiz C – Chapter 1

Name __________________________

In May 2017, Wirecutter published an article entitled “The Best True Wireless Headphones So Far” (http://thewirecutter.com/reviews/best-true-wireless-headphones/). They tested 11 “of the most promising true wireless in-ear headphones.” Among other things, the article told the brand of each pair of headphones, its price, battery life, audio quality, ease of setup, and other characteristics. The article provides a number of recommendations including best for the money, best for the gym, best for Apple, and best for Android. The author, Lauren Dragan, describes herself as a voice actor with an audio production degree who has spent hundreds of hours testing headphones for Wirecutter. 1. Describe the W’s, if the information is given:  Who: 

What:

When:

Where:

How:

Why:

2. List the variables. Indicate whether each variable is categorical or quantitative. If the variable is quantitative, tell the units.

.


Chapter 1 Stats Starts Here

1-9

Statistics Quiz C – Chapter 1 – Key In May 2017, Wirecutter published an article entitled “The Best True Wireless Headphones So Far” (http://thewirecutter.com/reviews/best-true-wireless-headphones/). They tested 11 “of the most promising true wireless in-ear headphones.” Among other things, the article told the brand of each pair of headphones, its price, battery life, audio quality, ease of setup, and other characteristics. The article provides a number of recommendations including best for the money, best for the gym, best for Apple, and best for Android. The author, Lauren Dragan, describes herself as a voice actor with an audio production degree who has spent hundreds of hours testing headphones for Wirecutter. 1. Describe the W’s, if the information is given:  Who: 11wireless in-ear headphones. 

What: brand, price, battery life, audio quality, ease of setup, and other characteristics.

When: May 2017

Where: not specified, probably the United States

How: presumably lab tests of the 11 models

Why: information for potential consumers

2. List the variables. Indicate whether each variable is categorical or quantitative. If the variable is quantitative, tell the units. Categorical: brand, audio quality, ease of setup Quantitative: price (US$), battery life (probably hours)

.


1-10

Part I Exploring and Understanding Data

Statistics Quiz D – Chapter 1

Name ________________________

1. In the fall of 2007, the Pew Internet & Life Project conducted telephone interviews with a sample of American adults aged 18 and older about online shopping. American adults aged 18 and older constitute the ______ of the study. A. Who B. What C. When D. Where E. How 2. A few of the variables for which data were collected in the Pew Internet & Life Project study about online shopping include age, gender, income, and number of hours spent shopping online per month. Which of the variables is categorical? A. Age B. Gender C. Income D. Number of hours spent shopping online E. None 3. The Pew Internet & Life Project study about online shopping asked respondents to indicate their education level on the following scale: Less than High School, High School, Some College, College +. Which of the following statements is (are) true? A. Education level is a categorical variable. B. Education level is nominal scaled. C. Education level is ordinal scaled. D. Both A and B E. Both A and C 4. Consumer Reports Health routinely compares drugs in terms of effectiveness and safety. In summer 2008 they reviewed drugs used to treat arthritis. Among the information reported was convenience of use (how many pills required each day) and possible side effects (e.g., dizziness, stomach upset). Convenience of use and possible side effects constitute the ________ of the study. A. Who B. What C. When D. Where E. How

.


Chapter 1 Stats Starts Here

1-11

5. What is the “Who” in a Consumer Reports Health study on the effectiveness and safety of drugs used to treat arthritis? A. drugs to treat arthritis currently on the market B. convenience of use and possible side effects C. summer 2008 D. the United States E. testing on drugs 6. A Consumer Reports Health study on the effectiveness and safety of arthritis drugs collected data on possible side effects. This is what kind of variable? A. Quantitative B. Categorical C. Nominal D. Both A and C E. Both B and C 7. A Consumer Reports Health study on arthritis drugs takes into consideration cost. Cost is A. is a nominal variable. B. is a categorical variable. C. is a quantitative variable. D. is an ordinal variable. E. is an irrelevant variable. 8. The Human Resources Department of a large corporation maintains records on its employees. Data are maintained of the following variables: Age, Employment Category, Education, Whether or not the employee participates in a wellness program, and Paycheck benefit deductions. Which of these variables are categorical? A. Age, Employment Category, and Education B. Employment Category, Education, and Whether or not the employee participates in a wellness program C. Education, Whether or not the employee participates in a wellness program, and Paycheck benefit deductions D. All of the variables E. None of the variables

.


1-12

Part I Exploring and Understanding Data

7. A Consumer Reports study on tipping takes into consideration median amount of tipping for service providers. Tipping is A. is a nominal variable. B. is a categorical variable. C. is a quantitative variable. D. is an ordinal variable. E. is an irrelevant variable. 8. The Human Resources Department of a large corporation maintains records on its employees. Data are maintained of the following variables: Age, Employment Category, Education, and Whether or not the employee has an advanced degree. Which of these variables are categorical? A. Age, Employment Category, and Education B. Employment Category and Education C. Education and Whether or not the employee has an advanced degree D. All of the variables E. None of the variables

.


Chapter 1 Stats Starts Here

Statistics Quiz D – Chapter 1 – Key 1. A 2. B 3. E 4. B 5. A 6. E 7. C 8. B 9. C 10. B

.

1-13


1-14

Part I Exploring and Understanding Data

Statistics Quiz E – Chapter 1

Name ________________________

1. A university is interested in gauging student satisfaction in its online MBA program. A survey is designed and administered via the Internet to a sample of students currently active in the program. Which of the following would best describe the cases? A. Participants B. Respondents C. Experimental Units D. Subjects E. Variables 2. In a survey undertaken by a university to gauge student satisfaction in its online MBA program, one question asked students to indicate their employment status (unemployed, employed part-time, employed full-time). Which of the following is true? A. This variable is categorical. B. This variable is quantitative. C. This is an identifier variable. D. Both A and C. E. Both B and C. 3. In a survey undertaken by a university to gauge student satisfaction in its online MBA program, one question asked students to indicate the number of credits they had transferred into the program. Which of the following is true? A. This variable is categorical. B. This variable is transactional. C. This variable is quantitative. D. This is an identifier variable. E. This variable is nominal. 4. Researchers in e-commerce design an experiment to determine what factors are most important to online consumers when completing a transaction via the Internet. Individuals perform tasks on a set of Web sites and record their impressions about various attributes. Which of the following would best describe the cases? A. Participants B. Respondents C. Experimental Units D. Identifiers E. Variables

.


Chapter 1 Stats Starts Here

1-15

5. A popular travel magazine regularly reviews hotels worldwide. In a recent issue, it focused on hotels in Hawaii. Among the variables for which it provided data was whether or not the hotel included a spa. This is best described as a A. quantitative variable. B. identifier variable. C. ordinal variable. D. categorical variable. E. nominal variable. 6. A popular travel magazine regularly reviews hotels worldwide. In a recent issue, it focused on hotels in Hawaii. Among the variables for which it provided data was the price range for rooms with an ocean view. Which of the following statements is true? A. This variable is quantitative and has no units. B. This variable is quantitative and the units are $. C. This variable is quantitative and the units are number of rooms. D. This variable is qualitative and ordinal. E. This variable is qualitative and nominal. 7. A mid-priced chain of hotels, Hometown Suites, strives to make its guests “feel at home” by providing amenities such as microwaves in every room. Comment cards are used to get feedback on the importance of such amenities by asking guests to rate them using the scale:___ Essential ___ Important ___ Not Important. These data are A. qualitative. B. nominal. C. ordinal. D. both A and B. E. both A and C. 8. Businesses are interested in the work experience of recent graduates from a local business school. Whether or not the graduates have work experience constitutes the ________ of the study. A. Who B. What C. When D. Where E. How

.


1-16

Part I Exploring and Understanding Data

9. What is the “What” in a Consumer Reports Tipping study on the level of tipping during the current holiday season compared to the last holiday season? A. whether or not tipped B. amount of tip compared to last year C. the type of tip D. the United States 10. A Consumer Reports survey on the level of tipping for service providers. This is what kind of variable? A. Quantitative B. Ordinal C. Nominal D. Both A and C E. Both B and C

.


Chapter 1 Stats Starts Here

Statistics Quiz E – Chapter 1 – Key 1. B 2. A 3. C 4. A 5. D 6. B 7. E 8. B 9. B 10. B

.

1-17



Chapter 2 Displaying and Describing Data Solutions to Class Examples: 1. Answers will vary according to your data. 2. Answers will vary according to your data. 3. See Class Example 3. 4. Answers will vary according to your data. 5. See Class Example 5. Investigative Task The task, Dollars for Students, examines education spending in the United States.

.


2-2

Part I Exploring and Understanding Data

Statistics Quiz A – Chapter 2

Name __________________________

1. Students in a statistics course were asked to answer a variety of questions. One question asked for students to pick a favorite Superpower. Results are listed to the right.

Superpower Fly Freeze Time Invisibility Super strength Telepathy

a. Create a visual display for these data.

Num of Students 19 11 7 4 12

b. Describe the distribution of the students’ choices.

2. Which of the following variables would be appropriate to graph using a pie or bar graph? a. Annual income for 20 employees

Yes

or

No

b. The favorite baseball team of 30 students

Yes

or

No

c. The number of pages in 15 textbooks

Yes

or

No

d. The country of origin of 25 immigrants

Yes

or

No

3. Describe why the area principle is important in making a bar graph. It might be fun to ask an artist to liven up a bar graph by turning the bars into images. But include in your explanation why this might be risky.

.


Chapter 2 Displaying and Describing Data

2-3

Statistics Quiz A – Chapter 2 KEY 1. Students in a statistics course were asked to answer a variety of questions. One question asked for students to pick a favorite Superpower. Results are listed to the right.

Superpower Fly Freeze Time Invisibility Super strength Telepathy

a. Create a visual display for these data.

Num of Students 19 11 7 4 12

Superpower

b. Describe the distribution of the students’ choices. The ability to fly is the students’ most popular choice, followed by telepathy and freeze time. Super strength is the least popular, with invisibility also not very popular.

2. Which of the following variables would be appropriate to graph using a pie or bar graph? a. Annual income for 20 employees

Yes

or

No

b. The favorite baseball team of 30 students

Yes

or

No

c. The number of pages in 15 textbooks

Yes

or

No

d. The country of origin of 25 immigrants

Yes

or

No

3. Describe why the area principle is important in making a bar graph. It might be fun to ask an artist to liven up a bar graph by turning the bars into images. But include in your explanation why this might be risky. If the reader of the graphs is going to understand the relationship between the groups, it is important that the area allotted to each group match the proportion of group in the sample. If an artist turns the bars into some kind of image, he might be not make each picture the proper area, and thus distort the actual percentage breakdown of each group.

.


2-4

Part I Exploring and Understanding Data

Statistics Quiz B – Chapter 2

Name __________________________

1. A survey conducted in a college intro stats class 10 10 12 14 15 15 15 15 asked students about the number of credit hours 17 17 19 20 20 20 20 22 they were taking that quarter. The number of credit hours for a random sample of 16 students is given in the table. a. Sketch a histogram of these data b. Find the mean and standard deviation for the number of credit hours.

c. Find the median and IQR for the number of credit hours.

d. Is it more appropriate to use the mean and standard deviation or the median and IQR to summarize theses data? Explain.

2. Suppose that the student taking 22 credit hours in the data set in the previous question was actually taking 28 credit hours instead of 22 (so we would replace the 22 in the data set with 28). Indicate whether changing the number of credit hours for that student would make each of the following summary statistics increase, decrease, or stay about the same: a. mean b. median c. range d. IQR e. standard deviation

.


Chapter 2 Displaying and Describing Data

Statistics Quiz B – Chapter 2 - KEY

2-5

Name __________________________

1. A survey conducted in a college intro stats class 10 10 12 14 15 15 15 15 asked students about the number of credit hours 17 17 19 20 20 20 20 22 they were taking that quarter. The number of credit hours for a random sample of 16 students is given in the table. a. Sketch a histogram of these data b. Find the mean and standard deviation for the number of credit hours. Histogram of Credit Hours

Frequency

5

x  16.3 credit hours; s = 3.7 credit hours

4 3 2 1 0

10

12

14 16 18 20 Credit Hours

22

c. Find the median and IQR for the number of credit hours.

d. Is it more appropriate to use the mean and standard deviation or the median and IQR to summarize theses data? Explain.

The median is 16.0 credit hours. IQR = Q3 – Q1 = 20 – 14.5 = 5.5 credit hours

It is more appropriate to use the median and IQR to summarize these data, because these data are not unimodal and symmetric.

2. Suppose that the student taking 22 credit hours in the data set in the previous question was actually taking 28 credit hours instead of 22 (so we would replace the 22 in the data set with 28). Indicate whether changing the number of credit hours for that student would make each of the following summary statistics increase, decrease, or stay about the same: a. mean increase b. median

stay about the same

c. range

increase

d. IQR

stay about the same

e. standard deviation

increase

.


2-6

Part I Exploring and Understanding Data

Statistics Quiz C – Chapter 2

Name __________________________

1. The students in a biology class kept a record of the height (in centimeters) of plants for a class experiment.

49 67 38 55 62 54 36 41 56 43 48 75 44 60 48 52 48 53 59 32 b. Find the mean and standard deviation of the plant heights.

a. Sketch a histogram for these data.

c. Is it appropriate to use the mean and standard deviation to summarize these data? Explain.

d. Describe the distribution of plant heights.

2. All students in a physical education class completed a basketball free-throw shooting event and the highest number of shots made was 32. The next day a student who had just transferred into the school completed the event, making 35 shots. Indicate whether adding the new student’s score to the rest of the data made each of these summary statistics increase, decrease, or stay about the same: a. mean b. median c. range d. IQR e. standard deviation .


Chapter 2 Displaying and Describing Data

2-7

Statistics Quiz C – Chapter 2 – KEY 1. The students in a biology class kept a record of the height (in centimeters) of plants for a class experiment.

49 67 38 55 62 54 36 41 56 43 48 75 44 60 48 52 48 53 59 32 b. Find the mean and standard deviation of the plant heights.

a. Sketch a histogram for these data. Histogram of Plant Heights Frequency

4

x  51.0 cm; s  10.6 cm

3 2 1 0

30

40

50 60 Plant Heights

70

80

c. Is it appropriate to use the mean and standard deviation to summarize these data? Explain.

d. Describe the distribution of plant heights. The data are roughly symmetric with no outliers; however there is a small gap from 70 to 75 cm. The average plant height is 51.0 centimeters, with a standard deviation of 10.6 centimeters. The range of plant heights is 43 centimeters. The distribution of plant heights has a mode between 45 and 49 centimeters.

Yes, the data are roughly unimodal and symmetric with no outliers.

2. All students in a physical education class completed a basketball free-throw shooting event and the highest number of shots made was 32. The next day a student who had just transferred into the school completed the event, making 35 shots. Indicate whether adding the new student’s score to the rest of the data made each of these summary statistics increase, decrease, or stay about the same: a. mean

increase

b. median

stay about the same

c. range

increase

d. IQR

stay about the same

e. standard deviation

increase .


2-8

Part I Exploring and Understanding Data

Statistics Quiz D – Chapter 2

Name __________________________

1. A brake and muffler shop reported the repair bills, in dollars, for their customers yesterday.

88 283 312 290 172 154 400 381 346 181 203 118 143 252 227 56 192 292 213 422 b. Find the mean and standard deviation of the repair costs.

a. Sketch a histogram for these data.

c. Is it appropriate to use the mean and standard deviation to summarize these data? Explain.

d. Describe the distribution of repair costs.

2. In a survey of 1,500 millennials (ages 1834), Business Insider asked which services were used to consume video content. Use the data on the right to construct an appropriate display and describe your graph.

Service YouTube Netflix Hulu Amazon Prime HBO Crackle None Other

.

Percent 81% 79% 37% 34% 21% 12% 3% 2%


Chapter 2 Displaying and Describing Data

2-9

Statistics Quiz D – Chapter 2 – KEY 1. A brake and muffler shop reported the repair bills, in dollars, for their customers yesterday.

88 283 312 290 172 154 400 381 346 181 203 118 143 252 227 56 192 292 213 422 b. Find the mean and standard deviation of the repair costs.

a. Sketch a histogram for these data. Repair Costs Frequency

4

x  $236.25 ; s  $103.43

3 2 1 0

100

200 300 Repair Costs ($)

400

c. Is it appropriate to use the mean and standard deviation to summarize these data? Explain.

d. Describe the distribution of repair costs. The repair costs averaged $236.25, ranging from $56 to $422 with a standard deviation of $103.43. The distribution was approximately symmetric, with typical repair costs clustered between $150 and $300.

Yes, the data are roughly unimodal and symmetric with no outliers.

Service YouTube Netflix Hulu Amazon Prime HBO Crackle None Other

2. In a survey of 1,500 millennials (ages 1834), Business Insider asked which services were used to consume video content. Use the data on the right to construct an appropriate display and describe your graph.

Percent 81% 79% 37% 34% 21% 12% 3% 2%

YouTube and Netflix are clearly the dominant choices. And as these percentages add to more than 100%, it is clear that most millennials use more than one service, making the pie chart an inappropriate choice. Hulu and Amazon are used by about one-third of the millennials, while Crackle and HBO are clearly the least popular.

.


2-10

Part I Exploring and Understanding Data

Statistics - Investigative Task

Chapter 2

Dollars for Students In 2008 the U.S. Census Bureau published Public Education Finances, reporting the average amount (dollars per student) spent by public schools in each state during the 2006 school year. (The table seen here divides states according to whether they lie east or west of the Mississippi River). Write a report describing the amounts states spent to educate their children. Eastern State $ per Student Western State $ per Student Alabama 7,646 Alaska 11,460 Connecticut 12,323 Arizona 6,472 Delaware 11,633 Arkansas 7,927 D.C. 13,446 California 8,486 Florida 7,759 Colorado 8,057 Georgia 8,565 Hawaii 9,876 Illinois 9,149 Idaho 6,440 Indiana 8,793 Iowa 8,360 Kentucky 7,662 Kansas 8,392 Maine 10,586 Louisiana 8,402 Maryland 10,670 Minnesota 9,138 Massachusetts 11,981 Missouri 8,107 Michigan 9,572 Montana 8,581 Mississippi 7,221 Nebraska 8,736 New Hampshire 10,079 Nevada 7,345 New Jersey 14,630 New Mexico 8,086 New York 14,884 North Dakota 8,603 North Carolina 7,388 Oklahoma 6,961 Ohio 9,598 Oregon 8,545 Pennsylvania 11,028 South Dakota 7,651 Rhode Island 11,769 Texas 7,561 South Carolina 8,091 Utah 5,437 Tennessee 6,883 Washington 7,830 Vermont 12,614 Wyoming 11,197 Virginia 9,447 West Virginia 9,352 Wisconsin 9,970 A complete report will include a visual display (stem-and-leaf plot), appropriate statistics, and a well-written description of the expenditures (in context, of course).

.


Chapter 2 Displaying and Describing Data

Statistics - Investigative Task

Chapter 2

Components Comments Demonstrates clear understanding of statistical Think concepts, vocabulary, and procedures in analyzing and describing these data. Visual/Numerical: o stem-and-leaf plot o plot well-labeled Show o plot correctly constructed o summary statistics correct Verbal: Describes the distribution of expenditures in context, including… o shape (skewed right by higher spending in the East) o center (median) o spread (IQR) Tell The written analysis… o also interprets at least one quartile, or the max or min in context o identifies the W’s o uses statistical vocabulary correctly o avoids speculation Components are scored as Essentially correct, Partially correct, or Incorrect 1: Visual/Numerical E – Has all 4 features P – Has only 3 of the 4 features, but attempts an appropriate graph (ex: histogram) I – Graph is not appropriate (ex: bar chart), has many errors, or is missing 2: Shape E – Identifies skewness to the right, attributed to generally higher spending in the East. P – Mis-identifies skewness OR overlooks East/West spending difference I – Description of shape is missing or incorrect 3: Center and Spread E – Correctly interprets median and IQR in context P – Correctly interprets only one of median/IQR OR lacks context OR uses mean and SD I – Has more than one of the three shortcomings described in “P” 4: Written Analysis E – Has all 4 listed properties. P – Has only 2 or 3 of the listed properties. I – Has fewer than of the properties. Scoring

 E’s count 1 point, P’s are 1/2  Grade: A = 4, B = 3, etc., with +/- based on rounding (ex: 3.5 rounded to 3 is a B+) Name ___________________________________ Grade ____

.

2-11


2-12

Part I Exploring and Understanding Data

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Model Solution – Investigative Task – Dollars for Students The U.S. Census published Public Education Finances, a collection of educational information in 2008. Among the data reported was information on average educational spending by state during the 2006 school year, in dollars per student. Spending by many states was near the nationwide median of $8581 per student. The middle 50% of states spent between $7759 and $10586 per student, an interquartile range of $2827. The distribution of the average number of education dollars spent per student was skewed to the right. Many Eastern States spent in excess of $12,000 per student, including the maximum of $14884 per student in New York. Utah spent the least, $5437 per student.

.


Chapter 2 Displaying and Describing Data

Statistics Quiz E – Chapter 2

2-13

Name __________________________

1. A automobile marketing firm conducts a study to see what types of cars people owned before buying an American car. The results are shown below. Previous Ownership American Japanese Korean German Other Total

Frequency 760 375 72 37 24 1268

The relative frequency of those who owned Japanese cars previously who now bought American cars is A. 59.9 % B. 29.6% C. 5.7% D. 14.9% E. 2.9% 2. A restaurant uses comment cards to get feedback from its customers about newly added items to the menu. It recently introduced homemade organic veggie burgers. Customers who tried the new burger were asked if they would order it again. The data are summarized in the table below. What percentage of customers would definitely order the veggie burger again? Response Definitely would. Most likely would. Maybe Definitely would not. A. 10% B. 15% C. 20% D. 40% E. 77%

.

Frequency 10 40 12 3


2-14

Part I Exploring and Understanding Data

3. A restaurant uses comment cards to get feedback from its customers about newly added items to the menu. It recently introduced homemade organic veggie burgers. Customers who tried the new burger were asked if they would order it again. The data are summarized in the table below. What percentage of customers would most likely or definitely order the veggie burger again? Response Definitely would. Most likely would. Maybe Definitely would not.

Frequency 10 40 12 3

A. 10% B. 15% C. 40% D. 50% E. 77% 4. A restaurant uses comment cards to get feedback from its customers about newly added items to the menu. It recently introduced homemade organic veggie burgers. Customers who tried the new burger were asked if they would order it again. Which of the following would be an appropriate method for displaying the data shown in the table? Response Definitely would. Most likely would. Maybe Definitely would not. A. Histogram. B. Dotplot. C. Pie chart. D. Stem-and-leaf display. E. Both C and D.

.

Frequency 10 40 12 3


Chapter 2 Displaying and Describing Data

5. Below is a histogram of salaries (in $) for a sample of U.S. marketing managers. Histogram of Mktg Manager Salaries 12

Frequency

10 8 6 4 2 0

60000

80000 100000 Mktg Manager Salaries

120000

The shape of this distribution is A. symmetric. B. bimodal. C. right skewed. D. left skewed. E. normal. 6. Here is the five number summary for salaries of U.S. marketing managers. Min Q1 Median Q3 Max 46360 69693 77020 91750 129420 The IQR is A. $83,060. B. $22.057. C. $69,693. D. $77.020. E. $14,566.

.

2-15


2-16

Part I Exploring and Understanding Data

7. Below is a histogram of salaries (in $) for a sample of U.S. marketing managers. Histogram of Mktg Manager Salaries 12

Frequency

10 8 6 4 2 0

60000

80000 100000 Mktg Manager Salaries

120000

The most appropriate measure of central tendency for these data is the A. median. B. mean. C. mode. D. range. E. standard deviation. 8. Consider the five number summary for salaries of U.S. marketing managers. Min Q1 Median Q3 Max 46360 69693 77020 91750 129420 Suppose the marketing manager who was earning $129,420 got a raise and is now earning $140,000. Which of the following statement is true? A. The mean would increase. B. The median would increase. C. The range would increase. D. Both A and C. E. All of the above.

.


Chapter 2 Displaying and Describing Data

2-17

9. The following table shows data for total assets ($ billion) for a small sample of U.S. banks (late 2013). BANK State Street Bank and Trust Discover Bank BancWest Citizens Bank Northern Trust Huntington Bank Key Bank People’s United

ASSETS ($ billion) 160.5 63.9 72.8 130.0 83.8 53.8 91.8 27.9

The mean for the total assets data ($ billion) is A. $78.3. B. $56.3. C. $85.6. D. $120.5. E. $42.4. 10. The following table shows representative recent closing share prices for a small sample of companies based in India in late 2013. COMPANY 20 Microns ABC Paper Bank of MA Photoquip Saksoft Marg LTD Galaxy ENT Sonatasoft EDynamics DB Corp.

CLOSING SHARE PRICE 30.95 24.30 36.20 37.00 67.20 13.99 10.40 68.75 49.95 287.95

The standard deviation in closing share prices is A. $81.6. B. $25.8. C. $36.6. D. $62.7. E. $67.6.

.


2-18

Part I Exploring and Understanding Data

Statistics Quiz E – Chapter 2 – KEY 1. B 2. B 3. E 4. C 5. C 6. B 7. A 8. D 9. C 10. A

.


Chapter 2 Displaying and Describing Data

Statistics Quiz F – Chapter 2

2-19

Name __________________________

1. A clothing store uses comment cards to get feedback from its customers about newly added items. It recently introduced plus size fashion wear. Customers who purchased the items were asked to fill out an online comment survey giving 10% off the next purchase. The data are summarized in the table below. What percentage of customers were at least satisfied with the item(s) purchased (Satisfied or Very satisfied)? Response Very satisfied. Satisfied. Less than fully satisfied. Not satisfied.

Frequency 15 30 12 4

A. 49.2% B. 73.8% C. 24.6% D. 26.2% E. 68.9% 2. A clothing store uses comment cards to get feedback from its customers about newly added items. It recently introduced plus size fashion wear. Customers who purchased the items were asked to fill out an online comment survey giving 10% off the next purchase. The data are summarized in the table below. What percentage of customers would be less likely to purchase another item (Less or Not fully satisfied)? Response Very satisfied. Satisfied. Less than fully satisfied. Not satisfied. A. 10% B. 15% C. 40% D. 50% E. 77%

.

Frequency 15 30 12 4


2-20

Part I Exploring and Understanding Data

3. A clothing store uses comment cards to get feedback from its customers about newly added items. It recently introduced plus size fashion wear. Customers who purchased the items were asked to fill out an online comment survey giving 10% off the next purchase. The data are summarized in the table below. Which of the following would be an appropriate method for displaying the data shown in the table? Response Very satisfied. Satisfied. Less than fully satisfied. Not satisfied.

Frequency 15 30 12 4

A. Histogram. B. Dotplot. C. Pie chart. D. Stem-and-leaf display. E. Both C and D. 4. The following table shows total assets ($ billion) for a small sample of U.S. banks. BANK Bank of New York Regions Financial Fifth Third Bank State Street Bank and Trust Branch Banking and Trust Company Chase Bank Key Bank PNC Bank The mean for these data is A. $ 80.25 billion. B. $ 100.35 billion. C. $ 75.68 billion. D. $ 84 billion. E. $ 89 billion.

.

ASSETS ($ billion) 88 80 58 92 81 70 89 84


Chapter 2 Displaying and Describing Data

5. The following table shows total assets ($ billion) for a small sample of U.S. banks. BANK Bank of New York Regions Financial Fifth Third Bank State Street Bank and Trust Branch Banking and Trust Company Chase Bank Key Bank PNC Bank

ASSETS ($ billion) 88 80 58 92 81 70 89 84

The standard deviation for these data is A. $12.78 billion. B. $ 11.27 billion. C. $ 127.01 billion. D. $ 21.67 billion. E. $ 34 billion. 6. Consider the five number summary of hourly wages ($) for a sample of sales managers. Min Q1 Median Q3 Max 20.94 37.64 44.77 49.34 67.11 The range for these data is A. $11.70 B. $46.17 C. $67.11 D. $20.94 E. $44.77 7. Consider the five number summary of hourly wages ($) for a sample of sales managers. Min Q1 Median Q3 Max 20.94 37.64 44.77 49.34 67.11 The IQR for these data is A. $11.70 B. $46.17 C. $67.11 D. $20.94 E. $44.77

.

2-21


2-22

Part I Exploring and Understanding Data

8. Consider the five number summary of hourly wages ($) for a sample of sales managers. Suppose the mean hourly wage is $38.50. What can we say about the shape of the distribution? Min Q1 Median Q3 Max 20.94 37.64 44.77 49.34 67.11 A. The distribution of hourly wages for sales managers is symmetric. B. The distribution of hourly wages for sales managers is skewed right. C. The distribution of hourly wages for sales managers is skewed left. D. The distribution of hourly wages for sales managers is bimodal. E. None of the above. 9. Consider the five number summary of hourly wages ($) for a sample of advertising / promotion managers. Min Q1 Median Q3 Max 19.64 29.36 34.18 40.86 57.26 Suppose there had been an error and that the lowest hourly wage was $15.50 instead of $19.64. This would result in A. an increase in the median. B. an increase in the standard deviation. C. a decrease in the range. D. a decrease in the IQR. E. an increase in the mean. 10. Of the following stem-and-leave plots of four data sets each containing 11 observations, which represents the set of data that has the greatest standard deviation?

A. Set A. B. Set B. C. Set C. D. Set D. E. both Set C and Set D.

.


Chapter 2 Displaying and Describing Data

Statistics Quiz F – Chapter 2 – KEY 1. B 2. E 3. C 4. A 5. B 6. B 7. A 8. C 9. B 10. B

.

2-23



Chapter 3 Relationships Between Categorical Variables—Contingency Tables Solutions to Class Examples: 1. See Class Example 1. 2. Answers will vary according to your data. 3. See Class Example 3. 4. See Class Example 4. Investigative Task Race and the Death Penalty uses a three-way contingency table and requires comparing marginal and conditional distributions. Supplemental Resources After the Investigative Task, there is a worksheet on the relationship between smoking and education level.

.


3-2

Part I Exploring and Understanding Data

Statistics Quiz A – Chapter 3

Name __________________________

Has the percentage of young girls drinking milk changed over time? The following table is consistent with the results from “Beverage Choices of Young Females: Changes and Impact on Nutrient Intakes” (Shanthy A. Bowman, Journal of the American Dietetic Association, 102(9), pp. 1234-1239): Nationwide Food Survey Years 1987-1988 1989-1991 1994-1996 Total Yes 354 502 366 1222 Drinks Fluid Milk No 226 335 366 927 Total 580 837 732 2149 1. Find the following: a. What percent of the young girls reported that they drink milk? b. What percent of the young girls were in the 1989-1991 survey? c. What percent of the young girls who reported that they drink milk were in the 1989-1991 survey? d. What percent of the young girls in 1989-1991 reported they drink milk? 2. What is the marginal distribution of milk consumption? 3. Do you think that milk consumption by young girls is independent of the nationwide survey year? Use statistics to justify your reasoning. 4. Consider the following pie charts of a subset of the data above: Pie Chart of 1989-1991, 1994-1996 vs Milk? 1989-1991*Milk?

1994-1996*Milk?

C ategory Yes No

Do the pie charts above indicate that milk consumption by young girls is independent of the nationwide survey year? Explain.

.


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

3-3

Statistics Quiz A – Chapter 3 - Key Has the percentage of young girls drinking milk changed over time? The following table is consistent with the results from “Beverage Choices of Young Females: Changes and Impact on Nutrient Intakes” (Shanthy A. Bowman, Journal of the American Dietetic Association, 102(9), pp. 1234-1239): Nationwide Food Survey Years 1987-1988 1989-1991 1994-1996 Total Yes 354 502 366 1222 Drinks Fluid Milk No 226 335 366 927 Total 580 837 732 2149 1. Find the following: a. What percent of the young girls reported that they drink milk? b. What percent of the young girls were in the 1989-1991 survey? c. What percent of the young girls who reported that they drink milk were in the 1989-1991 survey? d. What percent of the young girls in 1989-1991 reported they drink milk?

56.9% 38.9% 41.1% 60.0%

2. What is the marginal distribution of milk consumption? Yes: 56.9%; No: 43.1%

3. Do you think that milk consumption by young girls is independent of the nationwide survey year? Use statistics to justify your reasoning. No. 56.9% of all young girls surveyed reported drinking milk, but 60% of the young girls reported drinking milk in the 1989-1991 survey. Since these percentages differ, milk consumption and year are not independent.

4. Consider the following pie charts of a subset of the data above: Do the pie charts above indicate that milk consumption by young girls is independent of the nationwide survey year? Explain. No. It looks like there is some sort of relationship between milk consumption and nationwide survey year, since the percentage of young girls who reported drinking milk is a larger slice of the pie chart for the 1989-1991 survey than the same response for the 1994-1996 survey.

.


3-4

Part I Exploring and Understanding Data

Statistics Quiz B – Chapter 3

Name __________________________

To determine if people’s preference in dogs had changed in the recent years, organizers of a local dog show asked people who attended the show to indicate which breed was their favorite. This information was compiled by dog breed and gender of the people who responded. The table summarizes the responses. 1. Identify the variables and tell whether each is categorical or quantitative.

2. Which of the W’s are unknown for these data?

Yorkshire Terrier Dachshund Golden Retriever Labrador Dalmatian Other breeds Total

Female 73 49 58 37 45 86 348

Male 59 47 33 41 28 67 275

Total 132 96 91 78 73 153 623

3. Find each percent. a. What percent of the responses were from males who favor Labradors? b. What percent of the male responses favor Labradors? c. What percent of the people who choose Labradors were males? 4. What is the marginal distribution of breeds?

5. Write a sentence or two about the conditional relative frequency distribution of the breeds among female respondents.

6. Do you think the breed selection is independent of gender? Give statistical evidence to support your conclusion.

.


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

3-5

Statistics Quiz B – Chapter 3 – Key To determine if people’s preference in dogs had changed in the recent years, organizers of a local dog show asked people who attended the show to indicate which breed was their favorite. This information was compiled by dog breed and gender of the people who responded. The table summarizes the responses. 1. Identify the variables and tell whether each is categorical or quantitative. Gender and Breed; both categorical.

2. Which of the W’s are unknown for these data?

Yorkshire Terrier Dachshund Golden Retriever Labrador Dalmatian Other breeds Total

Female 73 49 58 37 45 86 348

Male 59 47 33 41 28 67 275

Total 132 96 91 78 73 153 623

We do not know how or when the people were surveyed, or where the local dog show was located.

3. Find each percent. a. What percent of the responses were from males who favor Labradors?

6.6%

b. What percent of the male responses favor Labradors?

14.9%

c. What percent of the people who choose Labradors were males?

52.6%

4. What is the marginal distribution of breeds? There were 132 Yorkshire terrier responses, 96 Dachshund responses, 91 Golden Retriever responses, 78 Labrador responses, 73 Dalmatian responses, and 153 Other responses.

5. Write a sentence or two about the conditional relative frequency distribution of the breeds among female respondents. Among females, 21.0% chose Yorkshire Terriers, 14.1% Dachshunds, 16.7% Golden Retrievers, 10.6% Labs, and 12.9% Dalmatians. The remaining 24.7% of females preferred other breeds.

6. Do you think the breed selection is independent of gender? Give statistical evidence to support your conclusion. The breed selection does not appear to be independent of gender. Overall, 56% of the respondents were females, but females were over-represented among those who favored Golden Retrievers (64%) and Dalmatians (62%), yet a much lower percentage (47%) among those who chose Labradors.

.


3-6

Part I Exploring and Understanding Data

Statistics Quiz C – Chapter 3

Name __________________________

In order to plan transportation and parking needs at a private high school, administrators asked students how they get to school. Some rode a school bus, some rode in with parents or friends, and others used “personal” transportation – bikes, skateboards, or just walked. The table summarizes the responses from boys and girls.

Male

Female

Total

Bus

30

34

64

Ride

37

45

82

Personal

19

23

42

Total

86

102

188

1. Identify the variables and tell whether each is categorical or quantitative. 2. Which of the W’s are unknown for these data?

3. Find each percent. a. What percent of the students are girls who ride the bus?

_________

b. What percent of the girls ride the bus?

_________

c. What percent of the bus riders are girls?

_________

4. What is the marginal distribution of gender?

5. Do you think mode of transportation is independent of gender? Give statistical evidence to support your conclusion.

6. After the survey is conducted, a student suggests that the distance a student lives from school is a more important variable to consider. Describe how this third variable might affect the analyses.

.


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

3-7

Statistics Quiz C – Chapter 3 – Key

In order to plan transportation and parking needs at a private high school, administrators asked students how they get to school. Some rode a school bus, some rode in with parents or friends, and others used “personal” transportation – bikes, skateboards, or just walked. The table summarizes the responses from boys and girls.

Male

Female

Total

Bus

30

34

64

Ride

37

45

82

Personal

19

23

42

Total

86

102

188

1. Identify the variables and tell whether each is categorical or quantitative. Gender and mode of transportation, both categorical.

2. Which of the W’s are unknown for these data? We don’t know how or when the students were surveyed, nor where the school is.

3. Find each percent. a. What percent of the students are girls who ride the bus?

18.1%

b. What percent of the girls ride the bus?

33.3%

c. What percent of the bus riders are girls?

53.1%

4. What is the marginal distribution of gender? There are 86 males and 102 females.

5. Do you think mode of transportation is independent of gender? Give statistical evidence to support your conclusion. The way students get to school does seem to be independent of gender. Overall, 34% of students ride the bus, compared to 35% of the boys and 33% of the girls. 44% of all students caught rides with someone and 22% used personal transportation, almost the same as the percentages for boys (43% and 22%) or girls (44% and 23%) separately. These data provide little indication of a difference in mode of transportation between boys and girls at this school.

6. After the survey is conducted, a student suggests that the distance a student lives from school is a more important variable to consider. Describe how this third variable might affect the analysis. While gender is independent of mode of transportation, it might be that distance from school is a more important factor. This could further affect the Why? of the analysis. Is the student body changing over time? More students enrolling from a closer or further distance? A mosaic plot could be added to examine all three variables at once.

.


3-8

Part I Exploring and Understanding Data

Statistics – Investigative Task

Chapter 3

Race and the Death Penalty In 1976 the Supreme Court ruled that the death penalty does not violate the U.S. Constitution’s ban on “cruel and unusual punishments.” Since then many states have passed capital punishment statutes, and over 500 convicted murderers have been executed nationwide. Capital punishment may be constitutional, but there continues to be a debate about whether or not it is fair. One of the major issues in this debate involves race – the race of both the defendant and the murder victim. The central question: is justice blind? In 1998 the Death Penalty Information Center published The Death Penalty in Black and White, a study examining the sentences following 667 murder convictions in Philadelphia courts between 1983 and 1993. This 3-way table shows how many death sentences were given among all the murder convictions. DEATH SENTENCES

Black Victim

White Victim

Total

Black Defendant 76 of 422

21 of 99

97 of 521

White Defendant

17 of 121

18 of 146

1 of 25

Total 77 of 447 38 of 220 115 of 667

Is our system of justice colorblind in the administration of the death penalty? Based upon the above information, write a newspaper article discussing the association between race and death sentences in the United States. (Don’t forget: the best analyses of data usually combine visual, numerical, and verbal descriptions.) Optional: As there are three variables on this table, a mosaic plot is a visual display to consider.

.


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

Statistics Task

Think

Show

Tell

Chapter 3

Components 1. Identifies useful marginal and conditional distributions (or %’s) to make effective comparisons o Numerical – no major errors 2. Visual – includes comparative pie charts or (segmented? mosaic?) bar graphs well-labeled & fairly accurate 3. Verbal - Written article … o is clear and concise o identifies the W’s o is in the proper context o uses vocabulary correctly o avoids speculation 4. States a conclusion about the association between race and the death penalty, explaining at least two examples of statistical support

Comments

Components are scored as Essentially correct, Partially correct, or Incorrect 1: Use of marginal and conditional distributions E - Demonstrates understanding of marginal vs conditional for comparison P - Calculates useful %’s but may not understand why, or has major arithmetic errors I – Calculations are not %’s, or not useful for comparisons 2: Graphical display E – Uses comparative pie/bar graph, well-labeled and fairly accurate P – Graph is comparative but poorly constructed or explained I – Graph is not comparative or is missing 3: News article E – Article has all 5 listed properties P – Article has 3 or 4 of the listed properties I – Article has fewer than 3 of the properties 4: Conclusion E – Correct general conclusion is supported by 2 appropriate comparisons P – Conclusion is not clearly stated, or only one supporting comparison is given I – Conclusion is incorrect, unsupported by statistics, or missing. Scoring  E’s count 1 point, P’s are 1/2  Grade: A = 4, B = 3, etc., with +/- based on rounding (ex: 3.5 rounded to 3 is a B+)

Name _______________________________

.

Grade ____

3-9


3-10

Part I Exploring and Understanding Data

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Model Solution – Investigative Task – Race and the Death Penalty In the 10 years between 1983 and 1993, the city of Philadelphia saw 667 defendants convicted of murder. Of these, 17% were sentenced to death. Was the death penalty administered without regard to the race of the defendant or the victim? Is justice colorblind? Overall, blacks convicted of murder were sentenced to the death penalty in 18.6% of cases. The death penalty was the sentence for only 12.3% of whites convicted of murder. There may be evidence that black defendants fared worse in regard to death penalty sentences. At first, there doesn’t appear to be an association between the race of the victim and the rate of death sentences. Defendants convicted of murdering whites were sentenced to death about 17% of the time, as were defendants convicted of murdering blacks. But these percentages are misleading. It’s not until we look a bit deeper that we see the true picture. When race of the defendant and race of the victim are both taken into account, blacks are convicted at higher rates across the board than whites. When the victim was black, 18% of black defendants were sentenced to death, compared to only 4% for white defendants. Likewise, when the victim was white, blacks were sentenced to death 21% of the time, while only 14% of white defendants were sentenced to death. Note also that, no matter what the race of the defendant, death penalty rates for killing whites are higher than the rates for killing blacks, 21% to 18% for black defendants, and 14% to 4% for white defendants. These startling statistics present evidence that justice is not blind to color, at least not in the city of Philadelphia. Optional mosaic plot:

.


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

Statistics

3-11

Chapter 3

Smoking and Education 200 adults shopping at a supermarket were asked about the highest level of education they had completed and whether or not they smoke cigarettes. Results are summarized in the table. 1. Discuss the W’s.

Smoker

Non-smoker

Total

High school

32

61

93

2 yr college

5

17

22

4+ yr college

13

72

85

Total

50

150

200

2. Identify the variables.

3. a. What percent of the shoppers were smokers with only high school educations? ______ b. What percent of the shoppers with only high school educations were smokers? ______ c. What percent of the smokers had only high school educations?

______

4. Create a segmented bar graph comparing education level among smokers and non-smokers. Label your graph clearly

5. Do these data suggest there is an association between smoking and education level? Give statistical evidence to support your conclusion.

6. Follow-up question: Does this indicate that students who start smoking while in high school tend to give up the habit if they complete college? Explain.

.


3-12

Part I Exploring and Understanding Data

Statistics – Smoking and Education Key 1. Who: 200 adults What: education level and smoking habits When: not specified Where: shopping mall How: not specified. Was this a random sample, or were some people simply asked? Why: to examine possible links between smoking and education level 2. Categorical variables: Education level, and whether or not the person was a smoker. 3. a. 16.0% b. 34.4% c. 64.0% 4. The segmented bar graph comparing education level among smokers and nonsmokers is at the right. 5. These data provide evidence of an association between smoking and education level. 64% of smokers had only a high school diploma, while only 40.7% of non-smokers had only high school diplomas. Only 26% of smokers had four or more years of college, compared to 48% of non-smokers.

Smoking and Education 100% 90% 80%

4+-year college

70%

2-yr

4+-year college

60% 50%

2-yr

40% 30% 20%

High School

10% 0%

1

Smoker

High School 2

Nonsmoker

6. These data do not indicate that students who start smoking in high school tend to give up the habit if they complete college. These data were gathered at one time, about two different groups, smokers and non-smokers. We have no idea if smoking behavior changes over time.

.


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

Statistics Quiz D – Chapter 3

3-13

Name __________________________

1. A large national retailer of electronics conducted a survey to determine consumer preferences for various brands of digital cameras and the data are summarized in the table shown below. Female 73 49 58 35 45 37 86 383

Cannon Power Shot Nikon Cool Pix Sony Cyber Shot Panasonic Lumix Fujifilm Finepix Olympus S/V Other Brands Total

Male 59 47 33 30 28 41 67 305

Total 132 96 91 65 73 78 153 688

The percentage of consumers who are male and prefer Fujifilm is A. 44.3 % (305/688). B. 10.6% (73/688). C. 38.4% (28/73). D. 56.2% (41/73). E. 4.1% (28/688). 2. A large national retailer of electronics conducted a survey to determine consumer preferences for various brands of digital cameras and the data are summarized in the table shown below. Female 73 49 58 35 45 37 86 383

Cannon Power Shot Nikon Cool Pix Sony Cyber Shot Panasonic Lumix Fujifilm Finepix Olympus S/V Other Brands Total

Male 59 47 33 30 28 41 67 305

Of the consumers who are male, the percentage who prefer Sony is A. 44.3 % (305/688). B. 10.8% (33/305). C. 36.3% (33/91). D. 4.8% (33/688). E. 13.2% (91/688).

.

Total 132 96 91 65 73 78 153 688


3-14

Part I Exploring and Understanding Data

3. A large national retailer of electronics conducted a survey to determine consumer preferences for various brands of digital cameras and the data are summarized in the table shown below. Of the consumers who prefer Olympus, what percentage is female? Female 73 49 58 35 45 37 86 383

Cannon Power Shot Nikon Cool Pix Sony Cyber Shot Panasonic Lumix Fujifilm Finepix Olympus S/V Other Brands Total

Male 59 47 33 30 28 41 67 305

Total 132 96 91 65 73 78 153 688

A. 47.4 % (37/78). B. 6.0% (41/688). C. 52.6% (41/78). D. 11.7% (45/383). E. 11.3% (78/688). 4. Based on the side-by-side bar chart summarizing consumer preferences for various brands of digital cameras by gender, which of the following statement(s) are true?

A. It appears that camera preference and gender are at least somewhat related. B. If Other Brands are ignored, it appears that camera preference and gender are independent. C. If Other Brands are ignored, it is not obvious that camera preference and gender are independent. D. More males than females prefer Cannon. E. More females than males prefer Sony.

.


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

3-15

5. The following is a bar chart summarizing consumer preferences for various brands of digital cameras.

This bar chart shows A. the marginal distribution of brands. B. the conditional distribution of brands. C. the contingency distribution of brands. D. the distribution for a quantitative variable. E. none of the above. 6. A company interested in the health of its employees started a health program including monitoring blood pressure. Based on age, employees were categorized according to ranges of blood pressure by age intervals. Data are shown in the table below. BP Low Normal High Total

Under 30 27 48 23 98

Age 30-49 38 90 59 187

Over 50 31 92 72 195

Total 96 230 154 480

The percentage of employees who are over age 50 and have high blood pressure is A. 46.8% (72/154). B. 32.1% (154/480). C. 31.6% (59/187). D. 36.9% (72/195). E. 15.0% (72/480).

.


3-16

Part I Exploring and Understanding Data

7. A company interested in the health of its employees started a health program including monitoring blood pressure. Based on age, employees were categorized according to ranges of blood pressure by age intervals. Data are shown in the table below. BP Low Normal High Total

Under 30 27 48 23 98

Age 30-49 38 90 59 187

Over 50 31 92 72 195

Total 96 230 154 480

Of all employees, the percentage who are over 50 and have high blood pressure is A. 46.8% (72/154). B. 15.0% (72/480). C. 31.6% (59/187). D. 36.9% (72/195). E. 47.2% (92/195). 8. A company interested in the health of its employees started a health program including monitoring blood pressure. Based on age, employees were categorized according to ranges of blood pressure by age intervals. Data are shown in the table below. BP Low Normal High Total

Under 30 27 48 23 98

Age 30-49 38 90 59 187

Over 50 31 92 72 195

Of all employees, the percentage of those under 50 years old is A. 17.1% (82/480). B. 40.6% (195/480). C. 13.5% (65/480). D. 36.9% (72/195). E. 49.4% (285/480).

.

Total 96 230 154 480


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

3-17

9. A company interested in the health of its employees started a health program including monitoring blood pressure. Based on age, employees were categorized according to ranges of blood pressure by age intervals. Data are shown in the table below. BP Low Normal High Total

Under 30 27 48 23 98

Age 30-49 38 90 59 187

Over 50 31 92 72 195

The percentage of employees with normal or low blood pressure is A. 67.9% (326/480). B. 47.9% (230/480). C. 41.7% (96/230). D. 80.0% (384/480). E. 20.0% (96/480). 10. Here is a stacked bar chart for data collected about employee blood pressure.

This chart shows A. the distribution of a quantitative variable. B. the contingency distribution of blood pressure type. C. the conditional distribution of blood pressure type. D. the marginal distribution of blood pressure type. E. the joint distribution of blood pressure type.

.

Total 96 230 154 480


3-18

Part I Exploring and Understanding Data

Statistics Quiz D – Chapter 3 – Key 1. E 2. B 3. A 4. A,C,E 5. A 6. D 7. B 8. E 9. A 10. C

.


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

Statistics Quiz E – Chapter 3

3-19

Name __________________________

1. In May, 2010, the Pew Research Center for the People & the Press carried out a national survey to gauge opinion on the Arizona Immigration Law. Responses (Favor, Oppose, Don’t Know) were examined according to groups defined by political party affiliation (Democrat, Republican, Independent). Which of the following would be appropriate for displaying these data? A. Contingency table. B. Pie charts. C. Segmented bar chart. D. Side by side bar chart. E. All of the above. 2. A regional survey was carried out to gauge public opinion on the controversial Arizona Immigration Law (results shown below). How many respondents are Republican and favor the law? Response Favor Oppose Don't Know

Democrat 50 85 5

Republican 93 45 7

A. 93 B. 45 C. 145 D. 7 E. 85

.

Independent 35 60 20


3-20

Part I Exploring and Understanding Data

3. A regional survey was carried out to gauge public opinion on the controversial Arizona Immigration Law. The results are displayed in the segmented bar chart below. Which of the following statements is true? Opinion of Arizona Immigration Law by Political Party Response Don't Know Oppose Favor

100

Data

80

60

40

20

0

Democrat

Republican

Independent

Percent within variables.

A. A greater percentage of Republicans oppose the law compared to Democrats. B. A greater percentage of Republicans oppose the law compared to Independents. C. Opinion about the law appears to be independent of political party affiliation. D. A greater percentage of Democrats oppose the law compared to Republicans. E. The segmented bar chart is not appropriate for these data. 4. A regional survey was carried out to gauge public opinion on the controversial Arizona Immigration Law. Based on the results displayed in the table below, what percent of respondents is Independent? Response Democrat Republican Independent Favor 50 93 35 Oppose 85 45 60 Don't Know 5 7 20 A. 35% B. 9% C. 29% D. 45% E. 25%

.


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

3-21

5. A regional survey was carried out to gauge public opinion on the controversial Arizona Immigration Law (results shown below). What percent oppose the law? Response Favor Oppose Don't Know

Democrat 50 85 5

Republican 93 45 7

Independent 35 60 20

A. 48% B. 45% C. 32% D. 25% E. 61% 6. A regional survey was carried out to gauge public opinion on the controversial Arizona Immigration Law (results shown below). Of respondents who are Democrat, what percent oppose the law? Response Favor Oppose Don't Know

Democrat 50 85 5

Republican 93 45 7

Independent 35 60 20

A. 13% B. 35% C. 22% D. 45% E. 61% 7. A regional survey was carried out to gauge public opinion on the controversial Arizona Immigration Law (results shown below). Of respondents who oppose the law, what percent is Democrat? Response Favor Oppose Don't Know

Democrat 50 85 5

Republican 93 45 7

A. 13% B. 35% C. 22% D. 45% E. 6

.

Independent 35 60 20


3-22

Part I Exploring and Understanding Data

8. Accenture, a consulting firm, conducted an online survey of 500 US consumers from in 2013. Response Too busy to shop earlier More time to save for gifts Better discounts available Part of the holiday tradition None of the above

Male 115 50 65 15 120

Female 75 80 20 5 60

What percentage of men were felt that better discounts were available on “Black Friday”? A. 26.5% B. 65% C. 20% D. 17.8% E. 5.5% 9. Accenture, a consulting firm, conducted an online survey of 500 US consumers from in 2013. Response Too busy to shop earlier More time to save for gifts Better discounts available Part of the holiday tradition None of the above

Male 115 50 65 15 120

Female 75 80 20 5 60

What percentage of those who thought that better discounts were available on “Black Friday” were female? A. 81.3% B. 33.3% C. 11.1% D. 47.2% E. 23.5% 10. Accenture, a consulting firm, conducted an online survey of 500 US consumers in September 2013. Based on their response to the question “What is your motive for shopping late in the season?” which of the following would be appropriate method(s) for displaying the male only data shown in the table? Response Too busy to shop earlier More time to save for gifts Better discounts available Part of the holiday tradition None of the above A. Contingency table. B. Pie chart. C. Segmented bar chart. D. Side by side bar chart. E. All of the above. .

Male 115 50 65 15 120

Female 75 80 20 5 60


Chapter 3 Relationships Between Categorical Variables—Contingency Tables

Statistics Quiz E – Chapter 3 – Key 1. E 2. A 3. D 4. C 5. A 6. E 7. D 8. D 9. E 10. B,C

.

3-23



Chapter 4 Understanding and Comparing Distributions Solutions to Class Examples: 1. See Class Example 1. 2. Answers will vary depending on your data. 3. Answers will vary depending on your data. 4. Think Sally’s salary at McTofu is an outlier. The median and interquartile range are better measures of center and spread than the mean and standard deviation, since these are highly affected by the presence of outliers. Show Side-by-side boxplots of the salaries from the two restaurants are at the right. Tell The distribution of salaries at Mooseburgers is symmetric, with a typical salary of about $131.00, and fairly compact, with an interquartile range of $21. The distribution of salaries at McTofu is also symmetric, with the exception of one very high salary, Sally’s $360. A typical salary at McTofu is around $124.50, and the interquartile range of $20.50 points to a distribution of salaries that is fairly compact. In fact, with the exception of the outlier, the distribution of salaries at McTofu is similar to the distribution of salaries at Mooseburgers, but generally about $7 lower. Even though McTofu had a higher mean salary, $145.90, compared to Mooseburgers’ mean of $131.90, I would rather work at Mooseburgers. Salaries are typically higher there. 5. Answers will vary depending on the displays used. 6. See the lab activity on the next page Investigative Tasks Two similar versions of an Investigative Task are provided. Supplemental Resources A Data Desk lab activity is provided. If you want to go a bit more in-depth with the Mooseburgers-McTofu example in #4, a worksheet and key is provided after the Data Desk lab activity.

.


4-2

Part I Exploring and Understanding Data

Statistics Quiz A – Chapter 4

Name __________________________

1. The five-number summary for midterm scores (number of points; the maximum possible score was 50 points) from an intro stats class is: Min Q1 Median Q3 Max 16.5 32 39 43.5 48.5 a. Would you expect the mean midterm score of all students who took the midterm to be higher or lower than the median? Explain.

b. Based on the five-number summary, are any of the midterm scores outliers? Explain.

2. The side-by-side boxplots show the cumulative college GPAs for sophomores, juniors, and seniors taking an intro stats course in Autumn 2012. a. Which class (sophomore, junior, or senior) had the lowest cumulative college GPA? What is the approximate value of that GPA?

b. Which class has the highest median GPA, and what is that GPA?

c. Which class has the largest range for GPA, and what is it?

d. Which class has the most symmetric set of GPAs? The most skewed set of GPAs?

.


Chapter 4 Understanding and Comparing Distributions

4-3

3. On the right are two dotplots made for the heights of 200 randomly chosen students. The heights are separated by gender. Describe and compare the distributions.

4. A few of the male students are considerably taller than the rest. Describe the effect that these taller males will have on the mean and median of the male distribution.

.


4-4

Part I Exploring and Understanding Data

Statistics Quiz A – Chapter 4 – Key 1. The five-number summary for midterm scores (number of points; the maximum possible score was 50 points) from an intro stats class is: Min Q1 Median Q3 Max 16.5 32 39 43.5 48.5 a. Would you expect the mean midterm score of all students who took the midterm to be higher or lower than the median? Explain. The mean midterm scores of all students would probably be lower than the median. Using the 5-number summary, it appears that the data are skewed to the left.

b. Based on the five-number summary, are any of the midterm scores outliers? Explain. IQR = 43.5 – 32 =11.5

Q1 – 1.5IQR = 32 – 1.5(11.5) = 14.75 Q3 + 1.5IQR = 43.5 +1.5(11.5) = 60.75 Since both the maximum and minimum scores fall between these “fences,” there are not outliers in this data set.

2. The side-by-side boxplots show the cumulative college GPAs for sophomores, juniors, and seniors taking an intro stats course in Autumn 2012. a. Which class (sophomore, junior, or senior) had the lowest cumulative college GPA? What is the approximate value of that GPA? The junior class had the lowest cumulative GPA, around 1.6.

b. Which class has the highest median GPA, and what is that GPA? The sophomore class had the highest median cumulative GPA, around 3.2.

c. Which class has the largest range for GPA, and what is it? The junior class had the largest range for GPA, about 2.4.

d. Which class has the most symmetric set of GPAs? The most skewed set of GPAs? The senior class had the most symmetric set of GPAs. The sophomore class had the most skewed set of GPAs, skewed to the left.

.


Chapter 4 Understanding and Comparing Distributions

4-5

3. On the right are two dotplots made for the heights of 200 randomly chosen students. The heights are separated by gender. Describe and compare the distributions. Both distributions are unimodal and roughly symmetric. The male distribution appears to be centered around 165 cm, the female is lower at around 155 cm. The heights for the males appear to be more spread out than those for the females.

4. A few of the male students are considerably taller than the rest. Describe the effect that these taller males will have on the mean and median of the male distribution. The taller males will increase the mean of the distribution. This effect on the mean will pull the mean slightly higher than the median.

.


4-6

Part I Exploring and Understanding Data

Statistics Quiz B – Chapter 4

Name__________________________

1. The body temperature of students is taken each time a student goes to the nurse’s office. The five-number summary for the temperatures (in degrees Fahrenheit) of students on a particular day is: Min Q1 Median Q3 Max 96.6º 97.85º 98.25º 98.6º 101.8º a. Would you expect the mean temperature of all students who visited the nurse’s office to be higher or lower than the median? Explain.

b. After the data were picked up in the afternoon, three more students visited the nurse’s office with temperatures of 96.7º, 98.4º, and 99.2º. Were any of these students outliers? Explain.

Age (years)

2. The boxplots show the age of people involved in accidents according to their role in the accident. a. Which role involved the youngest person, and what is the age?

b. Which role had the lowest median age, and what is the age?

c. Which role had smallest range of ages, and what is it?

Cyclist

d. Which role had the largest IQR of ages, and what is it?

e. Which role generally involves the oldest people? Explain.

.

Driver Pass. ROLE

Pedest.


3. One thousand students from a local university were sampled to gather information such as gender, high school GPA, college GPA, and total SAT scores. The results were used to create histograms displaying high school grade point averages (GPA’s) for both males and females. Compare the grade distribution of males and females.

Count Gender

Chapter 4 Understanding and Comparing Distributions

4-7

60 M 30 0 60 F 30 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 HighSchoolGPA

.


4-8

Part I Exploring and Understanding Data

Statistics Quiz B – Chapter 4 – Key 1. The body temperature of students is taken each time a student goes to the nurse’s office. The five-number summary for the temperatures (in degrees Fahrenheit) of students on a particular day is: Min Q1 Median Q3 Max 96.6º 97.85º 98.25º 98.6º 101.8º a. Would you expect the mean temperature of all students who visited the nurse’s office to be higher or lower than the median? Explain. The five-number summary provides mixed evidence. The right whisker is longer than the left whisker, suggesting some skewness to the right. But the right box is shorter than the left box. The mean temperature of all students may be somewhat higher than the median, but the effect is not clear.

b. After the data were picked up in the afternoon, three more students visited the nurse’s office with temperatures of 96.7º, 98.4º, and 99.2º. Were any of these students outliers? Explain. IQR = 98.6º – 97.85º = 0.75º. Since 1.5(IQR) = 1.125º, the fences are 97.85º – 1.125º = 96.725º and 98.6º + 1.125º = 99.725º. The lowest temperature (96.7º) being added to the data set is smaller than the lower fence (96.725º) so it is an outlier on the low end. The highest temperature (99.2º) being added to the data set is not above the upper fence (99.725º) so it is not an outlier on the high end. All these calculations do not include the three additional observations in finding the descriptive statistics.

2. The boxplots show the age of people involved in accidents according to their role in the accident. a. Which role involved the youngest person, and what is the age? Passenger,

less than 1 year.

Passenger,

Age (years)

b. Which role had the lowest median age, and what is the age? 21 yrs

c. Which role had smallest range of ages, and what is it? Cyclist,

40 yrs Cyclist

d. Which role had the largest IQR of ages, and what is it? Pedestrian,

44 yrs

.

Driver Pass. ROLE

Pedest.


Chapter 4 Understanding and Comparing Distributions

4-9

e. Which role generally involves the oldest people? Explain.

3. One thousand students from a local university were sampled to gather information such as gender, high school GPA, college GPA, and total SAT scores. The results were used to create histograms displaying high school grade point averages (GPA’s) for both males and females. Compare the grade distribution of males and females.

Count Gender

Pedestrian. While the oldest person involved in an accident is not a pedestrian, the median age for pedestrians is almost 45 years, while the median age in the other groups are between 22 and 35 years old. The oldest 50% of the Pedestrian group, from 45 to 87 years, is generally older than the youngest 75% of two groups – Cyclist and Passenger, and only the Driver group has any of its middle 50% as old. The driver and passenger groups have a few people older than the pedestrian group.

60 M 30 0 60 F 30 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

HighSchoolGPA The distributions of high school GPA for both males and females are skewed to the left, and both distributions appear to be centered at a GPA of about 3.0. The distribution of male GPA appears slightly more spread out than the distribution of female GPA.

.


4-10

Part I Exploring and Understanding Data

Statistics Quiz C – Chapter 4 1.

Name __________________________

The five-number summary for the weights (in pounds) of fish caught in a bass tournament is: Min Q1 Median Q3 Max 2.3 2.8 3.0 3.3 4.5 a. Would you expect the mean weight of all fish caught to be higher or lower than the median? Explain.

b. You caught 3 bass weighing 2.3 pounds, 3.9 pounds, and 4.2 pounds. Were any of your fish outliers? Explain.

The boxplots show prices of used cars (in thousands of dollars) advertised for sale at three different car dealers. a. Which dealer offers the cheapest car offered, and at what price? ______________ _____________ b. Which dealer has the lowest median price, and how much is it? ______________ _____________ c. Which dealer has the smallest price range, and what is it? ______________ _____________ d. Which dealer’s prices have the smallest IQR, and what is it? ______________ _____________ e. Which dealer generally sells cars cheapest? Explain.

.

Price (thousands of $)

2.


Chapter 4 Understanding and Comparing Distributions

3.

4-11

At www.census.gov you can create a “population pyramid” for any country. These pyramids are back-to-back histograms. This pyramid shows Mexico’s 2000 female population and the census bureau’s projection for 2050. Write a few sentences summarizing the changes that are forecast.

.


4-12

Part I Exploring and Understanding Data

Statistics Quiz C – Chapter 4 – Key 1.

The five-number summary for the weights (in pounds) of fish caught in a bass tournament is: Min Q1 Median Q3 Max 2.3 2.8 3.0 3.3 4.5 a. Would you expect the mean weight of all fish caught to be higher or lower than the median? Explain. Probably higher. The data appear to be skewed to the right.

b. You caught 3 bass weighing 2.3 pounds, 3.9 pounds, and 4.2 pounds. Were any of your fish outliers? Explain. IQR = 3.3 – 2.8 = 0.5. Since 1.5(IQR) = 0.75, the fences are 2.8 – 0.75 = 2.05 and 3.3 + 0.75 = 4.05. The fish weighing 4.2 pounds is more than 1.5 IQRs outside the quartiles, so it could be considered an outlier.

The boxplots show prices of used cars (in thousands of dollars) advertised for sale at three different car dealers. a. Which dealer offers the cheapest car offered, and at what price? Car Z:$5000

b. Which dealer has the lowest median price, and how much is it? BuyIt:$10,000

c. Which dealer has the smallest price range, and what is it? Ace: $10,000

d. Which dealer’s prices have the smallest IQR, and what is it? CarZ:$3000

.

Price (thousands of $)

2.


Chapter 4 Understanding and Comparing Distributions

4-13

e. Which dealer generally sells cars cheapest? Explain. BuyIt; half of their cars are cheaper than any of the cars at Ace, and 25% of their cars are cheaper than all but one car at CarZ. The third quartile of their prices is well below the third quartile at CarZ, and below even the median price at Ace.

3.

At www.census.gov you can create a “population pyramid” for any country. These pyramids are back-to-back histograms. This pyramid shows Mexico’s 2000 female population and the census bureau’s projection for 2050. Write a few sentences summarizing the changes that are forecast.

The Census Bureau projects dramatic changes in the female population of Mexico over the next 50 years. The current distribution of ages is strongly skewed to the right with most of the women under 30 and far fewer 50 and above. By 2050 the population will become more uniform across age groups from 0 to 60, and we anticipate an unusually large number of women over 80.

.


4-14

Part I Exploring and Understanding Data

Statistics - Investigative Task A

Chapter 4

Auto Safety You work for an automobile insurance company. Your boss has assigned you the task of reviewing recent auto safety records and thinking about how that information may be relevant to your company. You found these data on the web. HIGHWAY LOSS DATA INSTITUTE (www.hwysafety.org) AUTOMOBILE INSURANCE INJURY LOSSES (2000-01) SMALL CARS Model Rating A4 Quattro 63 Audi A4 67 VW Beetle 69 Volvo S40 82 VW Golf 91 Impreza 98 Subaru G20 104 Saturn SL 104 Honda Civic 105 VW Jetta 107 Chevy Prizm 118 Ford Focus 120 Acura Integra 133 Nissan Sentra 134 Eclipse 136 Chevy Metro 138 Ford Escort 141 Civic Coupe 143 Protoge 144 Corolla 147 Dodge Neon 167 Accent 181 Elantra 190 Kia Sephia 221 Mirage 246 Esteem 247

MID-SIZE CARS Model Rating Saab 9-3 58 Buick Century 63 Toyota Avalon 66 Volvo S70 72 Lexus ES 300 75 Saturn LS 76 Acura 78 VW Passat 80 Infiniti I30 82 Maxima 94 Chevy Malibu 97 Olds Alero 97 Cirrus 98 Honda Accord 102 Toyota Camry 102 Grand Am 105 Mystique 108 Diamante 109 Millenia 114 Ford Contour 118 Dodge Stratus 122 Breeze 128 Cavalier 134 Mazda 626 135 Sunfire 140 Nissan Altima 144 Galant 164 Sonata 169

LARGE CARS Model Rating Buick LeSabre 39 Bonneville 53 Grand Marquis 53 Buick Regal 57 Olds Intrigue 58 Crown Victoria 63 Concorde 66 Chrysler LHS 66 Chrysler 300M 70 Chevy Impala 73 Chevy Lumina 73 Grand Prix 74 Mercury Sable 80 Monte Carlo 85 Ford Taurus 88 Dodge Intrepid 92

Insurance injury loss results are stated in relative terms. 100 represents average result for all cars. Lower numbers are better. For example, a rating of 122 means 22% worse than average.

Write a report to your boss, including: appropriate comparative plots and summary statistics; descriptions of the injury ratings for each group of cars; a comparison of injury ratings for the three sizes of cars; your recommendation to your boss about your company’s insurance policies.

.


Chapter 4 Understanding and Comparing Distributions

Statistics - Investigative Task A

Think

Show

Tell

4-15

Chapter 4

Components Demonstrates clear understanding of statistical concepts and techniques in comparing the three distributions o uses parallel boxplots o has consistent scale (any kind of graphs) o graphs are correct and clearly labeled o 5 # summaries and IQR’s are correct Compares centers: o numerically (probably medians) o compares groups to each other o compares each group to average (100) o discusses all three groups Compares variability: o notes differences in IQR’s o correctly interprets those differences o notes outliers o notes that groups overlap (ex: ??% of small cars safer than median of midsize) States conclusion: o in context (W’s, insurance co. memo) o interprets the ratings properly o makes a recommendation o avoids speculation (drivers, accidents, etc)

Comments

4 Components are scored as Essentially correct, Partially correct, or Incorrect 1: Graph Boxplots, on the same scale, clearly labeled with correct numerical summaries. E – All four requirements. P – Only 2 or 3. I – Fewer than 2 2: Compare the centers Correctly compares all three groups to each other and to the overall average (100), with proper use of numerical summaries (medians). E – All four requirements. P – Only 2 or 3. I – Fewer than 2 3: Compare the spreads Compares variability within groups (IQRs), noting consistent safety in large cars and greater variability elsewhere. Notes outliers. Discusses overlap between groups, probably using medians and quartiles. E – All four requirements. P – Only 2 or 3. I – Fewer than 2 4: General conclusion Clearly written, in the proper context, the general conclusion correctly interprets the ratings and recommends some course of action (perhaps lower premiums for large cars). E – All four requirements. P – Only 2 or 3. I – Fewer than 2 Scoring  E’s count 1 point, P’s are 1/2  Grade: A = 4, B = 3, etc., with +/- based on rounding (ex: 3.5 rounded to 3 is a B+)

Name ___________________________________ Grade ____ .


4-16

Part I Exploring and Understanding Data

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Chapter 4 – Investigative Task A – Model Solution – Auto Safety To: Boss

From: Employee

Re: Automobile insurance injury losses and rates.

The Highway Data Institute has collected data on automobile insurance injury losses for 2000-2001. These losses are reported in relative terms, with a rating of 100 being average for all cars. For instance, the Buick LeSabre, with a rating of 39, was 100 – 39 = 61% better than average. The Toyota Corolla, with a rating of 147, was 147 – 100 = 47% worse than average. These ratings, when viewed with the size of the car in mind, will have a great impact on the rates we charge to insure cars. The summary statistics for the insurance injury losses are organized in the table below. I have chosen to use 5-number summaries and interquartile range as a measure of variability, since one of the distributions of ratings for the small cars contained outliers. The presence of outliers makes means and standard deviations unreliable statistics. Since the ratings are measuring relative safety, parallel boxplots will be useful for comparing the ratings of the three different sizes of cars. Group Count Min Q1 Med Q3 Max IQR small 26 63 104 133.5 147 247 43 mid-size 28 58 79 102 125 169 46 large 16 39 57.5 68 77 92 19.5 The median of the distribution of large car ratings is 68, meaning the typical larger car has a rating that is 32% better than average. The median of the distribution of mid-size car ratings is 102, which means that a typical mid-size car has a rating that is about average for all cars. The median of the distribution of ratings of small cars is 133.5, which means that a typical small car is expected to be 33.5% worse than average with regards to injury losses. When comparing the cars by median rating, large cars are safest, followed by mid-size cars. Small cars are the least safe. In addition to being safer in general than mid-size and small cars, large cars are also consistently safe. The interquartile range, which measures the spread of the middle 50% of ratings, is only 19.5. Small and mid-size cars have much more variability in their ratings, with IQRs of 43 and 46, respectively. It should also be noted that several small cars, the Kia Sephia, the Mitsubishi Mirage, and the Suzuki Esteem, had extremely high injury losses, the highest of which was 147% worse than average. Using the parallel boxplots to compare the quartiles, we can see that every large car is safer than the median mid-size car, and every large car is safer than more than 75% the small cars. Also, the safest 50% of mid-size cars are rated safer than about 75% of small cars. Our company can expect to pay more in claims for cars with higher insurance injury losses. Owners of small cars should pay the highest insurance premiums, with premiums for mid-size car owners slightly lower. Large car owners should have the lowest premiums, since their cars generally have much lower insurance injury losses. Additionally some attention should be paid to individual models, since some models are very safe, while others are quite unsafe.

.


Chapter 4 Understanding and Comparing Distributions

Statistics – Data Desk Lab – Student Body

4-17

Chapter 4

Learn to use Data Desk by following these instructions. Check off each step as you go. 1. Start Data Desk and get the data:  Launch ActivStats  Open (double-click)the Resources tab on the right side of the Lesson Book.  Scroll down and choose ActivStats DataSets: to launch the List Dataset window.  Choose the file Students, then hit the Data Desk button.  Load the data by following the instructions.  Have a look at some of the data: open the Sex icon, and open HS GPA, too.  Scroll down the data to see what’s there. Understand what you have?  Close the data lists: click the stop signs at the upper right corner of the title bars. 2. Create visual and numerical displays of categorical data:  Select (single click) Sex. You’ll see a “Y” appear on the icon.  Plot a Bar Chart.  See the little triangle at the left of the bar chart’s title bar? Click there to see more options (the “hypermenu”). Ask for Frequencies of Sex. Understand what it shows?  You can drag or resize the displays for convenient viewing. Try it.  Make a pie chart of the distribution of Sex. 3. Create visual and numerical displays of quantitative data:  Select HS GPA, and then Plot a Histogram.  In the hypermenu, play with the Plot Scale. Try aligning the bars at 1.0 with a bar width of 0.25 and 4 bars per tick interval. Understand what that did? Try other plot scales to see what histogram best displays shape, center, and spread.  Create a boxplot of CUM GPA. (Use the Side by Side option.)  Let’s see the statistics. Under the Calculate menu, choose Summaries and Reports.  Nice list, but no quartiles. In the hypermenu, Select Summary Statistics and ask for the upper and lower 25th percentiles.  Select CUM GPA as Y, then (holding the shift key down) select Sex as X.  Create Boxplot Y by X. Cool, huh?  Calculate Summaries Reports by Groups. Use the hypermenu to add the quartiles.  In the Manipulation menu, Split into Variables by Groups. Make separate histograms of the cumulative GPA’s of the male and female students.  Make boxplots and find summary stats to compare SAT scores for males and females. 4. Do-it-yourself:  Create a New Datafile. Let’s examine the shoe sizes we collected the first day.  To enter our class Data, create New Blank Variables named Gender and ShoeSize. Enter the data by tabbing across each row, not down the columns as on the TI-83.  Create these displays: o a pie chart of the genders; o a histogram of the shoe sizes; o parallel boxplots of shoe sizes by gender; o summary statistics comparing shoe sizes of males and females.  Create a word processing document with your name on it. Cut and paste the 4 displays onto one page, resizing them if necessary. Print your report. .


4-18

Part I Exploring and Understanding Data

Data Desk Tips To get started: File – New Datafile. To enter data: Data – New Blank Variable.  You must create all your variables first.  Remember that you are creating a single data table with variables as columns and individuals as rows.  Different groups are identified by categorical variables. (Ex: to enter heights and weights of males and females requires three variables: a variable named gender that would have an entry of M or F in each row, and variables named height and weight that would contain all the heights and weights for both males and females.)  If you want to analyze proportions, enter success/failure as 1’s and 0’s.  Once the table structure is set up, enter the data by going across the rows using the tab key after each value is entered. (You cannot go down the columns like you do on the TI-83. For the height/weight example, you would enter M (tab) 70 (tab) 155 (tab) F (tab) 66 (tab) 128 (tab) and so on.) To access your data:  double click on the File folder, then the Data table icon. You need to see the little rectangular icons for the variables themselves.  you always must select the variables you want to work with: click for Y, shift-click for X. To make graphs: Plot any graph you want.  Note that selecting height as Y and gender as X allows you to create parallel boxplots with Boxplot Y by X.  Remember the hypermenus (little triangles in the title bars) have lots of goodies. You can change the plot scales, get a frequency distribution for categories in a pie chart, add a regression line to a scatterplot, etc. To find your statistics: Calc  Set up what you want with Calc Options – select summary statistics.  Create Calc Summaries of an entire variable or Calc Summaries by Groups for Y broken into categories of X.  Summarize associations between categorical variables - Calc Contingency table  Find the equation of the regression line in your scatterplot’s hypermenu. To change the data: Manip  Transform offers lots of reexpressions (log y, x/y, etc) or you can create your own using New derived variable.  Split into variables by groups allows you to separate, say, boys’ heights from girl’s heights so you can make comparative histograms or create a confidence interval for the difference. To do inference:  Select z-tests, t-tests, confidence intervals, etc. under Calc - Test  Find t-test for linear regression in your scatterplot’s hypermenu.  Find the chi-square test in your contingency table’s hypermenu. .


Chapter 4 Understanding and Comparing Distributions

Statistics – Random Matters

4-19

Chapter 4

It is easy to think that the mean or IQR we find is “the answer”. However, since we usually have just a portion of the entire population (hopefully a random sample!), the data we have is may not be telling us the whole story. Let’s examine how different random samples are… different! Go to http://ww2.amstat.org/CensusAtSchoolcfm. Here you will find a large set of data that has been collected from statistics students around the world. Follow these steps:  Read about the survey that the students filled out.  Click on the Random Sampler.  Download 100 randomly chosen students. You can choose if you want to restrict your random sample to a particular year, region, age group, etc. …  Find some summary statistics for your sample. And make some graphs.  For example: o Find the mean and the standard deviation of the heights of the students. o Make boxplots comparing the reaction times. Separate the boxplots by gender. o Find the 5-number summary for travel time to school. Check to see if the data set contains outliers. There is a lot of data here! There are many stories that could be told. But now comes the fun part! Compare! Now you need another random sample. Or two. Or three. You might work with a partner or a group. Or you might just do this yourself.  Acquire another sample of 100 students.  Make sure that you (or your partner(s)) follow the same process that you did the first time. Same year, same region, same age group, etc. … The only difference should be that this is a new random sample.  Now make the same graphs, find the same summary statistics. Time to draw some conclusions!  How are the samples different?  Are the summary statistics the same? How much did they change from one sample to the next?  Are the graphs the same shape? Did any change shape?  What about comparing groups by gender? Are the groups really different? Or do different random samples tell a different story?

Word of fair warning: These data were entered by students. You may find measurement errors and/or off-topic responses that need to be “cleaned”. That is, just delete them!

.


4-20

Part I Exploring and Understanding Data

Statistics – Random Matters

Chapter 4

Notes to instructors This is a powerful activity. It builds on the Random Matters elements that are throughout the text. Part of the power lies in helping students realize, early in the course, that a statistic can vary upon repeated sampling. If we build this idea early on then it won’t come as such a shock when we hit Sampling Distributions. Depending on your classroom and time constraints, this activity could be done in class, especially in a lab. It is written in an open-ended format, but you could be more specific. For example:  Each student is to find and compare the heights and reaction times separated by gender.  Each student is to include in their report 5 random samples, each of size 100. There are categorical variables in this data set also. Thus, if you wanted to review Chapter 3, you’d have an opportunity to do so. For example:  Create bar charts comparing the favorite superpower for each gender.  Describe the similarities and differences in choice for the genders.  After you have gathered 4 more random samples, compare the graphs. Are the differences and similarities the same? Hopefully students realize that random samples will vary! If they copy from each other, well, the results will be rather telling!

.


Chapter 4 Understanding and Comparing Distributions

Statistics Quiz D – Chapter 4

Name__________________________

1. The following boxplots show monthly sales revenue figures ($ thousands) for a discount office supply company with locations in three different regions of the U.S. (Northeast, Southeast, and West). Which of the following statements is true? Boxplot of Northeast, Southeast, West 200 175

Data

150

125 100

75 50 Northeast

Southeast

West

A. The Northeast has the lowest mean sales revenue. B. The Southeast has the lowest median sales revenue. C. The West has the lowest mean sales revenue. D. The West has the lowest median sales revenue. E. None of the above. 2. The following boxplots show monthly sales revenue figures ($ thousands) for a discount office supply company with locations in three different regions of the U.S. (Northeast, Southeast, and West). Which of the following statements is false? Boxplot of Northeast, Southeast, West 200 175

Data

150

125 100

75 50 Northeast

Southeast

A. The West has the most variable sales revenues. B. The West has the largest IQR. C. The Southeast has the smallest IQR. D. The Northeast has the most variable sales revenues. E. The Southeast has the least variable sales revenues. .

West

4-21


4-22

Part I Exploring and Understanding Data

3. The side-by-side boxplots show the cumulative college GPAs for sophomores, juniors, and seniors taking an intro stats course in Autumn 2003.

In which class (sophomore, junior, or senior) do you find the student with the lowest cumulative college GPA? A. Sophomore B. Junior C. Senior D. Both Junior and Senior E. Both Sophomore and Junior

.


Chapter 4 Understanding and Comparing Distributions

4-23

4. The side-by-side boxplots show the cumulative college GPAs for sophomores, juniors, and seniors taking an intro stats course in Autumn 2003.

What is the approximate value of that GPA of the lowest scoring student? A. 1.6 B. 1.8 C. 2.6 D. 2.8 E. 3.0 5. The side-by-side boxplots show the cumulative college GPAs for sophomores, juniors, and seniors taking an intro stats course in Autumn 2003.

Which class has the highest median GPA? A. Sophomore B. Junior C. Senior D. Both Junior and Senior E. Both Sophomore and Junior

.


4-24

Part I Exploring and Understanding Data

6. The side-by-side boxplots show the cumulative college GPAs for sophomores, juniors, and seniors taking an intro stats course in Autumn 2003.

What is approximately the GPA of the class with the highest median GPA? A. 2.6 B. 2.8 C. 3.2 D. 3.6 E. 3.8 7. The side-by-side boxplots show the cumulative college GPAs for sophomores, juniors, and seniors taking an intro stats course in Autumn 2003.

Which class has the largest range for GPA? A. Sophomore B. Junior C. Senior D. Both Junior and Senior E. Both Sophomore and Junior

.


Chapter 4 Understanding and Comparing Distributions

4-25

8. The side-by-side boxplots show the cumulative college GPAs for sophomores, juniors, and seniors taking an intro stats course in Autumn 2003.

What is the approximate value of the largest range for GPA? A. 0.8 B. 1.0 C. 1.4 D. 2.4 E. 3.0 9. The side-by-side boxplots show the cumulative college GPAs for sophomores, juniors, and seniors taking an intro stats course in Autumn 2003.

Which class has the most skewed set of GPAs? A. Sophomore B. Junior C. Senior D. Both Junior and Senior E. Both Sophomore and Junior

.


4-26

Part I Exploring and Understanding Data

10. One thousand students from a local university were sampled to gather information such as gender, high school GPA, college GPA, and total SAT scores. The results were used to create histograms displaying high school grade point averages (GPAs) for both males and females. Compare the grade distribution of males and females. Check all that apply.

A. Both distributions are skewed to the left. B. Both distributions appear to be centered at a GPA of about 3.0. C. The distributions are skewed in different directions. D. The distributions are differ strongly in center. E. The distribution of male GPA appears slightly more spread out than the distribution of female GPA

.


Chapter 4 Understanding and Comparing Distributions

Statistics Quiz D – Chapter 4 – Key 1.

B

2.

D

3.

B

4.

A

5.

A

6.

C

7.

B

8.

D

9.

A

10.

A, B, E

.

4-27


4-28

Part I Exploring and Understanding Data

Statistics Quiz E – Chapter 4

Name__________________________

1. Data were collected on the hourly wage ($) for two types of marketing managers: (1) advertising / promotion managers and (2) sales managers. The results were used to create the following histograms. Which of the following statements is true? Histogram of Advertising/Promotion Hourly, Sales Hourly 20 Advertising/Promotion Hourly

16

30

40

50

60

Sales Hourly

14

Frequency

12 10 8 6 4 2 0

20

30

40

50

60

A. The distribution of hourly wages for sales managers is unimodal and skewed right. B. The distribution of hourly wages for advertising/promotion managers is unimodal and skewed left. C. The distribution of hourly wages for sales managers is unimodal and skewed left. D. It appears that sales managers earn a lower hourly wage compared to advertising/promotion managers. E. Both C and D.

.


Chapter 4 Understanding and Comparing Distributions

4-29

2. The following boxplots show the closing share prices for a sample of technology companies on the first trading days in August 2007 and in August 2002. Which of the following statement is true? Boxplot of Aug. 2007, Aug. 2002 100

80

Data

60

40

20

0 Aug. 2007

Aug. 2002

A. The median closing share price is higher in August 2007 compared to August 2002. B. Closing prices are more variable in August 2007 compared to August 2002. C. The distribution of closing prices in August 2007 appears more symmetric than the distribution of closing prices in August 2002. D. Both A and B. E. All of the above. 3. At www.census.gov you can create a "population pyramid" for any country. These pyramids are back-to-back histograms. This pyramid shows Mexico's 2000 female population and the census bureau's projection for 2050. Check all that apply.

A. The distribution of ages in 2000 is strongly skewed to the left. B. The distribution of ages in 2000 is strongly skewed to the right. C. The distribution of projected ages in 2050 is strongly skewed to the left. D. The distribution of projected ages in 2050 is strongly skewed to the right. E. The distribution of projected ages in 2050 is approximately uniform.

.


4-30

Part I Exploring and Understanding Data

4. At www.census.gov you can create a "population pyramid" for any country. These pyramids are back-to-back histograms. This pyramid shows Mexico's 2000 female population and the census bureau's projection for 2050. Check all that apply.

A. The expected mean age in 2050 is higher than the mean age in 2000. B. The expected median age in 2050 is higher than the median age in 2000. C. The expected mean age in 2050 is lower than the mean age in 2000. D. The expected median age in 2050 is lower than the median age in 2000. E. The IQR of the distribution of projected ages in 2050 is larger than the IQR of the distribution of ages in 2000.

.


Chapter 4 Understanding and Comparing Distributions

4-31

5. How do sports salaries compare? Two sets of histograms below show the distributions of salaries for Major League Baseball and the National Football League. What set of histograms makes it easier to compare the distributions? And for what reasons? Check all that apply.

A. Both allow equal comparison. B. The first set of histograms, because it uses different scales on the axes. C. The second set of histograms, because it uses the same scales on the axes. D. The first set of histograms, because it is stacked horizontally. E. The second set of histograms, because it is stacked vertically.

.


4-32

Part I Exploring and Understanding Data

6. The boxplots show prices of used cars (in thousands of dollars) advertised for sale at three different car dealers. Which dealer offers the cheapest car offered, and at what price?

A. Car Z: $5000. B. BuyIt: $6000. C. Ace: $10000. D. BuyIt: $10000. E. Cannot say from boxplots only. 7. The boxplots show prices of used cars (in thousands of dollars) advertised for sale at three different car dealers. Which dealer has the lowest median price, and how much is it?

A. Ace: $15000. B. BuyIt: $8000. C. CarZ: $12000. D. BuyIt: $10000. E. Cannot say from boxplots only. .


Chapter 4 Understanding and Comparing Distributions

4-33

8. The boxplots show prices of used cars (in thousands of dollars) advertised for sale at three different car dealers. Which dealer has the smallest price range, and what is it?

A. CarZ: $8000. B. CarZ: $11000. C. Ace: $10000. D. BuyIt: $5000. E. Cannot say from boxplots only. 9. The boxplots show prices of used cars (in thousands of dollars) advertised for sale at three different car dealers. Which dealer's prices have the smallest IQR, and what is it?

A. CarZ: $3000. B. CarZ: $8000. C. Ace: $5000. D. BuyIt: $5000. E. Cannot say from boxplots only. .


4-34

Part I Exploring and Understanding Data

6. The boxplots show prices of used cars (in thousands of dollars) advertised for sale at three different car dealers. Which dealer generally sells cars cheapest?

A. CarZ. B. BuyIt. C. Ace. D. Both CarZ and Ace. E. Cannot say from boxplots only.

.


Chapter 4 Understanding and Comparing Distributions

Statistics Quiz E – Chapter 4 – Key 1. C 2. E 3. B, E 4. A, B, E 5. C, E 6. A 7. D 8. C 9. A 10. B

.

4-35



Chapter 5 The Standard Deviation as Ruler and the Normal Model Solutions to class examples: 1. Measures of center and position are affected by addition and multiplication. Measures of spread are only affected by multiplication. Statistic original +10 x2 x2+20 mean 30 40 60 80 median 32 42 64 84 IQR 8 8 16 16 SD 6 6 12 12 minimum 12 22 24 44 Q1 27 37 54 74 your score 35 45 70 90 2. Think We want to compare athletic performance across several different events, so we need some way to compare without units. We will compare z-scores from each event. Also, don’t forget that lower times are better in the 100-m dash! Show Competitor 100-m dash Shot put Long jump A 10.1 sec 66’ 26’ z-scores Z=-(10.1-10.0)/0.2=-0.5 Z=(66-60)/3=2 Z=(26-26)/6=0 (Total = 1.5) B 9.9 sec 60’ 27’ z-scores Z=-(9.9-10.0)/0.2=0.5 Z=(60-60)/3=0 Z=(27-26)/6=1/6 (Total = 4/6) C 10.3 sec 63’ 27’3” z-scores Z=-(10.3-10.0)/0.2=-1.5 Z=(63-60)/3=1 Z=(273/12-26)/6=5/24 (Total = -7/24) Tell A was the winner, with a total of 1.5 standard deviations above the mean performance. B came in second and C third, with 4/6 and 7/24 standard deviations above and below the mean, respectively.

.


5-2

Part I Exploring and Understanding Data

3.

4. If virtually all 16-year-old girls have heights between 58” and 74,” a reasonable guess for the standard deviation might be (74 – 58)/6  3.” Most high school boys would weigh between 90 and 250 pounds, so the standard deviation in weight might be about 25 pounds. If gasoline prices in a city varied from $2.00 to $2.25 per gallon, the standard deviation in price might be around $0.04 per gallon. It is reasonable to use the ±3 standard deviation estimate in these examples, since the distributions are plausibly Normal. One could not use the ±3 standard deviation estimate for the number of states visited, since this distribution is likely to be skewed to the right. The Normal model is not appropriate. 5. Although answers may vary, it is likely that the distribution of the number of candies and distribution of weights of candies are plausibly Normal.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

6a.

5-3

According to the Normal model, we expect 68% of cars to get between 18 and 30 mpg, 95% of cars to get between 12 and 36 mpg, and 99.7% of cars to get between 6 and 42 mpg.

6b.

According to the Normal model, about 6.68% of cars are expected to get less than 15 mpg.

6c.

According to the Normal model, about 58.89% of cars are expected to get between 20 and 30 mpg.

.


5-4

Part I Exploring and Understanding Data

6d.

According to the Normal model, only about 0.38% of cars are expected to get more than 40 mpg.

6e. According to the Normal model, the worst 20% of cars have fuel efficiency less than about 18.95

6f.

According to the Normal model, the third quartile of the distribution of fuel efficiency is about 28.05 mpg.

6g.

According to the Normal model, we can expect the most efficient 5% of cars to have fuel efficiency in excess of 33.87 mpg.

6h. It depends on your definition of unusual. If ±2 standard deviations is considered unusual, then, according to the Normal model, any gas mileage below 12 mpg or above 36 mpg would be considered unusual. If ±3 standard deviations is considered unusual, then, according to the Normal model, any gas mileage below 6 mpg or above 42 mpg would be considered unusual.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

7a.

5-5

According to the Normal model, about 25.25% of cars have gas mileage less than 20 miles per gallon.

7b. According to the Normal model, in order to meet the goal, the new mean mileage must be at least 27.69 mpg. 7c. According to the Normal model, in order to meet the goal, the new standard deviation must be about 4.68 mpg. 7d. A lower standard deviation would mean more consistent gas mileage. This would be advantageous, since the new goal of “only 10% below 20 mpg” would be met. However, it also means that only 10% of cars would have gas mileage higher than 32 mpg.

.


5-6

Part I Exploring and Understanding Data

Statistics Quiz A – Chapter 5

Name _________________________

1. Students taking an intro stats class reported the number of credit hours that they were taking that quarter. Summary statistics are shown in the table. a. Suppose that the college charges $73 per credit hour plus a flat student fee of $35 per quarter. For example, a student taking 12 credit hours would pay $35 + $73(12) = $911 for that quarter.

x 16.65 s 2.96 min 5 Q1 15 median 16 Q3 19 max 28

i. What is the mean fee paid?

ii. What is the standard deviation for the fees paid?

iii. What is the median fee paid?

iv. What is the IQR for the fees paid?

b. Twenty-eight credit hours seems like a lot. Would you consider 28 credit hours to be unusually high? Explain.

2. Scores on the Wechsler Adult Intelligence Scale – Revised (WAIS-R) follow a Normal model with mean 100 and standard deviation 15. Draw and clearly label this model.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

5-7

3. Adult female Dalmatians weigh an average of 50 pounds with a standard deviation of 3.3 pounds. Adult female Boxers weigh an average of 57.5 pounds with a standard deviation of 1.7 pounds. One statistics teacher owns an underweight Dalmatian and an underweight Boxer. The Dalmatian weighs 45 pounds, and the Boxer weighs 52 pounds. Which dog is more underweight? Explain.

4. Human body temperatures taken through the ear are typically 0.5 F higher than body temperatures taken orally. Making this adjustment and using the 1992 Journal of the American Medical Association article that reports average oral body temperature as 98.2 F, we will assume that a Normal model with an average of 98.7 F and a standard deviation of 0.7 F is appropriate for body temperatures taken through the ear. a. An ear temperature of 97 F may indicate hypothermia (low body temperature). What percent of people have ear temperatures that may indicate hypothermia?

b. Find the interquartile range for ear temperatures.

c. A new thermometer for the ear reports that it is more accurate than the ear thermometers currently on the market. If the average ear temperature reading remains the same and the company reports an IQR of 0.5 F, find the standard deviation for this new ear thermometer.

.


5-8

Part I Exploring and Understanding Data

Statistics Quiz A – Chapter 5 – Key 1. Students taking an intro stats class reported the number of credit hours that they were taking that quarter. Summary statistics are shown in the table. a. Suppose that the college charges $73 per credit hour plus a flat student fee of $35 per quarter. For example, a student taking 12 credit hours would pay $35 + $73(12) = $911 for that quarter. i. What is the mean fee paid?

x 16.65 s 2.96 min 5 Q1 15 median 16 Q3 19 max 28

ii. What is the standard deviation for the fees paid?

$35 + $73(16.65) = $1250.45

$73(2.96) = $216.08

iii. What is the median fee paid?

iv. What is the IQR for the fees paid?

$35 + $73(16) = $1203

IQR = 73(19-15) = $292

b. Twenty-eight credit hours seem like a lot. Would you consider 28 credit hours to be unusually high? Explain. IQR = 19 – 15 = 4 credit hours High outliers will lie above Q3 + 1.5IQR = 19 + 1.5(4) = 25 credit hours. Since 28 credit hours exceeds 25 credit hours, I would consider 28 credit hours to be unusually high.

2. Scores on the Wechsler Adult Intelligence Scale – Revised (WAIS-R) follow a Normal model with mean 100 and standard deviation 15. Draw and clearly label this model.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

5-9

3. Adult female Dalmatians weigh an average of 50 pounds with a standard deviation of 3.3 pounds. Adult female Boxers weigh an average of 57.5 pounds with a standard deviation of 1.7 pounds. One statistics teacher owns an underweight Dalmatian and an underweight Boxer. The Dalmatian weighs 45 pounds, and the Boxer weighs 52 pounds. Which dog is more underweight? Explain. Dalmatian: zD=(45-50)/3.3-1.52 Boxer: zB=(52-57.5)/1.7-3.24 The Dalmatian is 1.52 standard deviations underweight, while the Boxer is 3.24 standard deviations underweight. So, the Boxer is more underweight.

4. Human body temperatures taken through the ear are typically 0.5 F higher than body temperatures taken orally. Making this adjustment and using the 1992 Journal of the American Medical Association article that reports average oral body temperature as 98.2 F, we will assume that a Normal model with an average of 98.7 F and a standard deviation of 0.7 F is appropriate for body temperatures taken through the ear. a. An ear temperature of 97 F may indicate hypothermia (low body temperature). What percent of people have ear temperatures that may indicate hypothermia? Z = (97-98.7)/0.7  -2.43, so P(z<-2.43)  0.0075

About 0.75% of people have ear temperatures that may indicate hypothermia.

b. Find the interquartile range for ear temperatures. The z-scores associated with the IQR are z = -0.6745 and z = 0.6745. The temperatures corresponding to these z-scores are 98.7-0.6745*0.798.23 and 98.7+0.6745*0.799.17. The interquartile range is IQR = 99.17 F – 98.23 F = 0.94 F.

c. A new thermometer for the ear reports that it is more accurate than the ear thermometers currently on the market. If the average ear temperature reading remains the same and the company reports an IQR of 0.5 F, find the standard deviation for this new ear thermometer. The new IQR is 0.5 F, while the old IQR was 0.94 F. Thus, the IQR must become a factor 0.5/0.94  0.532 smaller. Since IQR  1.35s, this implies that s should become a factor 0.532 smaller, or s  0.532*0.7  0.372. Our new standard deviation is 0.37 F.

.


5-10

Part I Exploring and Understanding Data

Statistics Quiz B – Chapter 5

Name __________________________

1. During a budget meeting, local school board members decided to review class size information to determine if budgets were correct. Summary statistics are shown in the table. a. Notice that the third quartile and maximum class sizes are the same. Explain how this can be.

x 33.39 students s 5.66 students min 17 Q1 29 median 33 Q3 40 max 40

b. The school district declares that classes with enrollments fewer than 20 students are “too small.” Would you consider a class of 20 students to be unusually small? Explain.

c. The school district sets the office supply budgets of their high schools on the enrollment of students. The district budgets each class $12 plus $0.75 per student, so a class with one student receives $12.75 and the classes with 40 students receive 12 + 0.75(40) = $42. What is the median class budget for office supplies? And the IQR?

d. What are the mean and standard deviation of the class office supply budgets?

2. The Postmaster of a city’s Post Office believes that a Normal model is useful in projecting the number of letters which will be mailed during the day. They use a mean of 20,000 letters and a standard deviation of 250 letters. Draw and clearly label this model.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

5-11

3. Light bulbs are measured in lumens (light output), watts (energy used), and hours (life). A standard white light bulb has a mean life of 675 hours and a standard deviation of 50 hours. A soft white light bulb has a mean life of 700 hours and a standard deviation of 35 hours. At a local science competition, both light bulbs lasted 750 hours. Which light bulb’s life span was better? Explain.

4. At a large business, employees must report to work at 7:30 A.M. The arrival times of employees can be described by a Normal model with mean of 7:22 A.M. and a standard deviation of 4 minutes. a. What percent of employees are late on a typical work day? (SHOW WORK)

b. A psychological study determined that the typical worker needs five minutes to adjust to their surroundings before beginning their duties. What percent of this business’ employees arrive early enough to make this adjustment?

c. Because late employees are a distraction and cost companies money, all employees need to be on time to work. If the mean arrival time of employees does not change, what standard deviation would the arrival times need to ensure virtually all employees are on time to work?

d. Explain what achieving a smaller standard deviation means in the context of this problem.

.


5-12

Part I Exploring and Understanding Data

Statistics Quiz B – Chapter 5 – Key 1. During a budget meeting, local school board members decided to review class size information to determine if budgets were correct. Summary statistics are shown in the table. a. Notice that the third quartile and maximum class sizes are the same. Explain how this can be. The top 25 percent of all classes have 40 students enrolled.

x 33.39 students s 5.66 students min 17 Q1 29 median 33 Q3 40 max 40

b. The school district declares that classes with enrollments fewer than 20 students are “too small.” Would you consider a class of 20 students to be unusually small? Explain.

When using mean and standard deviation, classes with 20 students enrolled seem unusually small. Twenty is well below the first quartile of 29 students, and only slightly above the minimum size (17). With z = -2.366, this size class is over 2 standard deviation units below the mean. However, the distribution is far from bell shaped, since it is truncated at 1.14 SD above the mean. The use of mean and SD is questionable, so better use median and IQR. If we use the ‘outlier’ rule to find unusual class sizes, IQR=11, 1.5*IQR=16.5, Q1-1.5*IQR=12.5, thus class sizes of 20 are not exceptional by this standard.

c. The school district sets the office supply budgets of their high schools on the enrollment of students. The district budgets each class $12 plus $0.75 per student, so a class with one student receives $12.75 and the classes with 40 students receive 12 + 0.75(40) = $42. What is the median class budget for office supplies? And the IQR? Median budget = $12 + $0.75(33) = $36.75 Q1 budget = $12 + $0.75(29) = $33.75 Q3 budget = $12 + $0.75(40) = $42.00 IQR = $42.00 - $33.75 = $8.25

$36.75 $8.25

d. What are the mean and standard deviation of the class office supply budgets? Mean budget = $12 + $0.75(33.39) = $37.04

$37.04

Standard deviation = $0.75(5.66) = $4.25

$4.25

2. The Postmaster of a city’s Post Office believes that a Normal model is useful in projecting the number of letters which will be mailed during the day. They use a mean of 20,000 letters and a standard deviation of 250 letters. Draw and clearly label this model.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

5-13

3. Light bulbs are measured in lumens (light output), watts (energy used), and hours (life). A standard white light bulb has a mean life of 675 hours and a standard deviation of 50 hours. A soft white light bulb has a mean life of 700 hours and a standard deviation of 35 hours. In a test at a local science competition, both light bulbs lasted 750 hours. Which light bulb’s life span was better? Explain. Standard light bulb: z = (750 – 675)/50 = 1.5 Soft light bulb: z 

750  700  1.4286 35

The standard light bulb lasted more than 1.5 standard deviations above the mean life, compared to the soft light bulb at 1.4286 standard deviations above its mean. The standard light bulb’s performance was slightly better.

4. At a large business, employees must report to work at 7:30 A.M. The arrival times of employees can be described by a Normal model with mean of 7:22 A.M. and a standard deviation of 4 minutes. a. What percent of employees are late on a typical work day? Employees are late if they arrive after 7:30 AM. P(time>30) = P(z>2)  0.02275 According to the Normal model, about 2.28% of employees are expected to arrive after 7:30 AM.

b. A psychological study determined that the typical worker needs five minutes to adjust to their surroundings before beginning their duties. What percent of this business’ employees arrive early enough to make this adjustment? P(time<25) = P(z<0.75)  0.07734 According to the Normal model, about 77.34% of employees arrive at work before 7:25 AM.

c. Because late employees are a distraction and cost companies money, all employees need to be on time to work. If the mean arrival time of employees does not change, what standard deviation would the arrival times need to ensure virtually all employees are on time to work? Virtually all times lie within 3 standard deviations of the mean. (Accept other reasonable z-scores greater than 3). If z*s=8 min with z=3, then 3s=8 min, so s=2.667 minutes.

d. Explain what achieving a smaller standard deviation means in the context of this problem. A smaller standard deviation would mean greater consistency in arrival times.

.


5-14

Part I Exploring and Understanding Data

Statistics Quiz C – Chapter 5

Name __________________________

1. A vendor offers various t-shirts for sale on an Internet commerce site. Some of his t-shirts sell quite well, others have yet to find an interested customer. The summary statistics are listed on right, based on the number of sales per month for each of 80 different t-shirt designs.

x Median Min Q1 Q3 Max sd

23 15 0 3 21 82 7.84

a. What is the shape of this distribution? Explain how you know.

b. If the vendor spends $2 per t-shirt in shipping costs, what is his average monthly costs per t-shirt sold? What is the standard deviation of this cost? Show work. __________ __________ c. As time passes, the vendor has simplified his profit calculation. He estimates that he has monthly flat costs of $35, with a gain of $3.50 per t-shirt sold. What is the median of his profit? The IQR? __________ __________ d. What are the mean and standard deviation of his profit? __________ __________ e. What is more appropriate in this case: using mean and standard deviation, or using median and IQR? __________ __________ 2.

Owners of a minor league baseball team believe that a Normal model is useful in projecting the number of fans who will attend home games. They use a mean of 8500 fans and a standard deviation of 1500 fans. Draw and clearly label this model.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

5-15

3. Although most of us buy milk by the quart or gallon, farmers measure daily production in pounds. Guernsey cows average 39 pounds of milk a day with a standard deviation of 8 pounds. For Jerseys the mean daily production is 43 pounds with a standard deviation of 5 pounds. When being shown at a state fair a champion Guernsey and a champion Jersey each gave 54 pounds of milk. Which cow’s milk production was more remarkable? Explain.

4. A company’s manufacturing process uses 500 gallons of water at a time. A “scrubbing” machine then removes most of a chemical pollutant before pumping the water into a nearby lake. Legally the treated water should contain no more than 80 parts per million of the chemical, but the machine isn’t perfect and it is costly to operate. Since there’s a fine if the discharged water exceeds the legal maximum, the company sets the machine to attain an average of 75 ppm for the batches of water treated. They believe the machine’s output can be described by a Normal model with standard deviation 4.2 ppm. (SHOW WORK) a.

What percent of the batches of water discharged exceed the 80ppm standard?

__________ b.

The company’s lawyers insist that they not have more than 2% of the water over the limit. To what mean value should the company set the scrubbing machine? Assume the standard deviation does not change.

__________ c.

Because achieving a mean that low would raise the costs too much, they decide to leave the mean set at 75 ppm and try to reduce the standard deviation to achieve the “only 2% over” goal. Find the new standard deviation needed.

__________ d.

Explain what achieving a smaller standard deviation means in this context.

.


5-16

Part I Exploring and Understanding Data

Statistics Quiz C – Chapter 5 – Key 1. A vendor offers various t-shirts for sale on an Internet commerce site. Some of his t-shirts sell quite well, others have yet to find an interested customer. The summary statistics are listed on right, based on the number of sales per month for each of 80 different t-shirt designs.

x Median Min Q1 Q3 Max sd

23 15 0 3 21 82 7.84

a. What is the shape of this distribution? Explain how you know. The distribution is clearly skewed to the right. The mean is much larger than the median. The max is to be an outlier and is likely a part of a long tail of the most popular t-shirts. The min and Q1 are very close together, while the spread increases between the median, Q3, and the max.

b. If the vendor spends $2 per t-shirt in shipping costs, what is his average monthly costs per t-shirt sold? What is the standard deviation of this cost? Show work. mean = 23*$2 = $46; sd = 7.84*$2 = $15.68

c. As time passes, the vendor has simplified his profit calculation. He estimates that he has monthly flat costs of $35, with a gain of $3.50 per t-shirt sold. What is the median of his profit? The IQR? median = 3.5*15 - 35 = $17.50; IQR = 3.5*(21-3) = $63

d. What are the mean and standard deviation of his profit? mean = 3.5*23 – 35 = $45.50; sd = 3.5*7.84 = $27.44

e. What is more appropriate in this case: using mean and standard deviation, or using median and IQR? The distribution is strongly right skewed, thus mean and standard deviation are not appropriate.

2. Owners of a minor league baseball team believe that a Normal model is useful in projecting the number of fans who will attend home games. They use a mean of 8500 fans and a standard deviation of 1500 fans. Draw and clearly label this model.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

5-17

3. Although most of us buy milk by the quart or gallon, farmers measure daily production in pounds. Guernsey cows average 39 pounds of milk a day with a standard deviation of 8 pounds. For Jerseys the mean daily production is 43 pounds with a standard deviation of 5 pounds. When being shown at a state fair a champion Guernsey and a champion Jersey each gave 54 pounds of milk. Which cow’s milk production was more remarkable? Explain. The Jersey’s milk production was comparatively higher. That cow gave slightly more than 2 standard deviations above the average amount of milk (z = 2.2), while the Guernsey gave less than 2 standard deviations more than the average for Guernseys (z = 1.875).

4. A company’s manufacturing process uses 500 gallons of water at a time. A “scrubbing” machine then removes most of a chemical pollutant before pumping the water into a nearby lake. Legally the treated water should contain no more than 80 parts per million of the chemical, but the machine isn’t perfect and it is costly to operate. Since there’s a fine if the discharged water exceeds the legal maximum, the company sets the machine to attain an average of 75 ppm for the batches of water treated. They believe the machine’s output can be described by a Normal model with standard deviation 4.2 ppm. (SHOW WORK) a. What percent of the batches of water discharged exceed the 80ppm standard? According to the normal model, we expect about 11.7% of the batches to exceed the 80 ppm standard.

b. The company’s lawyers insist that they not have more than 2% of the water over the limit. To what mean value should the company set the scrubbing machine? Assume the standard deviation does not change. According to the Normal model, a mean of about 71.37ppm would need to be achieved.

c. Because achieving a mean that low would raise the costs too much, they decide to leave the mean set at 75 ppm and try to reduce the standard deviation to achieve the “only 2% over” goal. Find the new standard deviation needed. According to the Normal model, the new standard deviation would need to be at most 2.43ppm.

d. Explain what achieving a smaller standard deviation means in this context. The scrubber must be more consistent in its performance from batch to batch.

.


5-18

Part I Exploring and Understanding Data

Statistics - Investigative Task

Chapter 5

Normal Models A Normal model can be a useful tool for interpreting what data have to say - sometimes. Your task here is to check the usefulness of such a model for data you collect or create. There are three phases in completing this task: 1. Collect data. You need a data set with 30 – 50 values. Find something you are interested in. Use existing data, or create some yourself. Need an idea?  Put 10 pennies in a glass, put your hand over the top, shake well, and then dump them out on a table and count the number that came up heads.  Roll two dice and record the total.  Deal cards from a well-shuffled deck one at a time. Count the number of cards you have to turn over until you find an ace.  Use some data from another class – a science experiment, perhaps.  Look something up in an almanac. For example, there are lots of tables of data about states - crime rates, population density, median income, etc.  Use some sports statistics – number of wins for baseball teams, scores in a golf tournament, weights of players on a football team, etc.  Find something on the Internet – www.census.gov for example. 2. Describe the data. Write a brief but thorough description of your data. Start with the W’s, and remember to include visual, numerical, and verbal descriptions. 3. Check the Normal model. Use the mean and standard deviation of your data to create a Normal model. Compare this model to the distribution that you found and explain why you think the model is or is not useful.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

Statistics Task

Think

Show

Tell

Chapter 5

Components Demonstrates clear understanding of the difference between distributions of actual data and models Data, display, and statistics: o collects appropriate data o shows a well-constructed histogram o scale is appropriate for comparison o calculates summary statistics Normal model: o centered at mean o has correct cutoffs based on s o clearly shows the 68-95-99.7 Rule Describes the data: o the W’s o shape o center o spread o unusual features Evaluates usefulness of the model: o discusses shape o compares the actual distribution of the data to the 68%, 95%, or 99.7% o states a valid conclusion about the usefulness of the model

Comments

4 Components are scored as Essentially correct, Partially correct, or Incorrect 1: Data and analysis: useful data, histogram, comparative scale, and summary statistics. E – All four requirements. P – Only 2 or 3. I – One or none. 2: Normal model: correct center, correct cutoffs, describes 68-95-99.7% regions. E – All three requirements. P – Only 2. I – One or none. 3: Describe the data: W’s, shape, center, spread, unusual features. E – All five requirements. P – Only 3 or 4. I – Fewer than 3. 4: Evaluation: shape, distribution, conclusion E – All three requirements. P – Only 2.

I – One or none

Scoring  E’s count 1 point, P’s are 1/2  Grade: A = 4, B = 3, etc., with +/- based on rounding (ex: 3.5 rounded to 3 is a B+)

Name __________________________________

.

Grade ____

5-19


5-20

Part I Exploring and Understanding Data

Statistics Quiz D – Chapter 5

Name _________________________

1. The ASQ (American Society for Quality) regularly conducts a salary survey of its membership, primarily quality management professionals. Based on the most recently published mean and standard deviation, a quality control specialist calculated the z-score associated with his own salary and found it was -2.50. This tells him that his salary is A. 2 and a half times more than the average salary. B. 2 and a half times less than the average salary. C. is 2.5 standard deviations above the average salary. D. is 2.5 standard deviations below the average salary. E. much higher than the average salary. 2. Adult female Dalmatians weigh an average of 50 pounds with a standard deviation of 3.3 pounds. Adult female Boxers weigh an average of 57.5 pounds with a standard deviation of 1.7 pounds. One statistics teacher owns an underweight Dalmatian and an underweight Boxer. The Dalmatian weighs 45 pounds, and the Boxer weighs 52 pounds. Which dog is more underweight? A. The Boxer because it is 9.6% underweight. B. The Dalmatian because it is 9.6% underweight. C. The Boxer because it is 5.5 pounds underweight. D. The Dalmatian because it is 5 pounds underweight. E. The Boxer because it is 3.24 standard deviations underweight. 3. Based on data collected from its production processes, Crosstiles Inc. determines that the breaking strength of its most popular porcelain tile is normally distributed with a mean of 400 pounds per square inch and a standard deviation of 12.5 pounds per square inch. Based on the 68-95-99.7 Rule, about what percent of its popular porcelain tile will have breaking strengths between 375 and 425 pounds per square inch? A. 95% B. 68% C. 84% D. 32% E. 47.5% 4. Based on data collected from its production processes, Crosstiles Inc. determines that the breaking strength of its most popular porcelain tile is normally distributed with a mean of 400 pounds per square inch and a standard deviation of 12.5 pounds per square inch. Based on the 68-95-99.7 Rule, about what percent of its popular porcelain tile will have breaking strengths greater than 412.5 pounds per square inch? A. 95% B. 68% C. 16% D. 32% E. 47.5%

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

5-21

5. At a local manufacturing plant, employees must complete new machine set ups within 30 minutes. New machine set-up times can be described by a normal model with a mean of 22 minutes and a standard deviation of four minutes. What percent of new machine set ups take more than 30 minutes? A. 97.72% B. 47.72% C. 2.28% D. 52.28% E. none of the above 6. At a local manufacturing plant, employees must complete new machine set ups within 30 minutes. New machine set-up times can be described by a normal model with a mean of 22 minutes and a standard deviation of four minutes. The typical worker needs five minutes to adjust to his or her surroundings before beginning duties. What percent of new machine set ups are completed within 25 minutes to allow for this? A. 77.3% B. 27.3% C. 22.7% D. 72.7% E. none of the above 7. Both SAT and ACT are well-known placement tests that most US colleges require from prospective students to be admitted in their programs. Scores in the SAT test are approximately normally distributed with a mean of 500 and a standard deviation of 100. Scores in the ACT test are approximately normally distributed with a mean of 18 and a standard deviation of 6. What would be the score in the SAT test to get the same z-score as the admission requirement of an ACT score of 27? A. 500 B. 550 C. 600 D. 650 E. 700 8. The distribution of the diameters of a particular variety of grapes is approximately normal with a standard deviation of 0.3 cm. How does the diameter of a grape at the 67th percentile compare with the mean diameter? A. 0.201 cm below the mean B. 0.132 cm below the mean C. 0.132 cm above the mean D. 0.201 cm above the mean E. 0.254 cm above the mean

.


5-22

Part I Exploring and Understanding Data

9. A soft drink dispenser can be adjusted to deliver any fixed number of centiliters. If the machine is operating with a standard deviation in delivery equal to 0.3 centiliter, what should be the mean setting so that a 12-centilitre cup will overflow less than 1% of the time? Assume a normal distribution for centiliters delivered. A. 11.23 B. 11.30 C. 11.70 D. 12.00 E. 12.70 10. Suppose that the weights of vans delivering goods to inner city shops are normally distributed. If 70% of the vans weigh more than 1200 kg, and 80% weigh more than 1000 kg, what are the mean and standard deviation for the weights of vans serving the inner city (round to tens)? A.   1510 kg;   620 kg B.   1530 kg;   630 kg C.   1550 kg;   640 kg D.   1530 kg;   630 kg E. There is not enough information given to determine the mean and standard deviation.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

Statistics Quiz D – Chapter 5 – Key 1. D 2. E 3. A 4. C 5. C 6. A 7. D 8. C 9. B 10. B

.

5-23


5-24

Part I Exploring and Understanding Data

Statistics Quiz E – Chapter 5

Name _________________________

1. Suppose that a Normal model describes the acidity (pH) of rainwater, and that water tested after last week’s storm had a z–score of 1.8. This means that the acidity of that rain … A. varied with a standard deviation of 1.8. B. had a pH 1.8 higher than average rainfall. C. had a pH 1.8 standard deviations higher than that of average rainwater. D. had a pH of 1.8. E. had a pH 1.8 times that of average rainwater. 2. Light bulbs are measured in lumens (light output), watts (energy used), and hours (life). A standard white light bulb has a mean life of 675 hours and a standard deviation of 50 hours. A soft white light bulb has a mean life of 700 hours and a standard deviation of 35 hours. At a local science competition, both light bulbs lasted 750 hours. Which light bulb's life span was better? A. The standard white light bulb because it is 75 hours beyond the mean. B. The soft white light bulb because it is 50 hours beyond the mean. C. The standard white light bulb because it is 11% beyond the mean. D. The soft white light bulb because it is 7% beyond the mean. E. The standard light bulb because it lasted 1.5 standard deviations above the mean life. 3. The time it takes to process phone orders in a small florist/gift shop is normally distributed with a mean of 6 minutes and a standard deviation of 1.24 minutes. What cutoff value would separate the 2.5% of orders that take the most time to process? A. 3.52 minutes B. 4.76 minutes C. 8.48 minutes D. 10.01 minutes E. 11.98 minutes 4. The time it takes to process phone orders in a small florist/gift shop is normally distributed with a mean of 6 minutes and a standard deviation of 1.24 minutes. What cutoff values would separate the 16% of orders that take the least time to process? A. 3.52 minutes B. 4.76 minutes C. 8.48 minutes D. 10.01 minutes E. 11.98 minutes

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

5-25

5. A small flower shop takes orders by phone and then one of the staff florists is assigned to prepare the arrangement. The time it takes to process phone orders is normally distributed with a mean of 6 minutes and a standard deviation of 2.5 minutes. The time it takes for an arrangement to be completed is normally distributed with a mean of 35 minutes and a standard deviation of 8.6 minutes. What is the standard deviation for the total time to process a phone order and complete the floral arrangement at this flower shop (assuming times are independent)? A. 8.96 minutes B. 41 minutes C. 80.28 minutes2 D. 4.87 minutes E. 12.43 minutes 6. A small flower shop takes orders by phone and then one of the staff florists is assigned to prepare the arrangement. The time it takes to process phone orders is normally distributed with a mean of 6 minutes and a standard deviation of 2.5 minutes. The time it takes for an arrangement to be completed is normally distributed with a mean of 35 minutes and a standard deviation of 8.6 minutes. What is the probability that it will take more than 50 minutes to process a phone order and complete the floral arrangement at this flower shop? A. 0.8413 B. 0.3413 C. 0.2167 D. 0.1587 E. 0.6843 7. A company’s manufacturing process uses 500 gallons of water at a time. A “scrubbing” machine then removes most of a chemical pollutant before pumping the water into a nearby lake. To meet federal regulations the treated water must not contain more than 80 parts per million (ppm) of the chemical. Because a fine is charged if regulations are not met, the company sets the machine to attain an average of 75 ppm in the treated water. The machine’s output can be described by a normal model with standard deviation 4.2 ppm. What percent of the batches of water discharged exceed the 80 ppm standard? A. 11.7% B. 1.17% C. 88.3% D. 8.83% E. 3.89%

.


5-26

Part I Exploring and Understanding Data

8. A company’s manufacturing process uses 500 gallons of water at a time. A “scrubbing” machine then removes most of a chemical pollutant before pumping the water into a nearby lake. To meet federal regulations the treated water must not contain more than 80 parts per million (ppm.) of the chemical. The machine’s output can be described by a normal model with standard deviation 4.2 ppm. The company’s lawyers insist that not more than 2% of the treated water should be over the limit. To achieve this, to what mean should the company set the scrubbing machine? A. 80 ppm. B. 75 ppm. C. 71.374 ppm. D. 88.626 ppm. E. 69.459 ppm. 9. Suppose that a Normal model describes fuel economy (miles per gallon) for automobiles and that a Toyota Corolla has a standardized score (z-score) of +2.2. This means that Corollas… A. get 2.2 miles per gallon. B. have a standard deviation of 2.2 mpg. C. get 2.2 mpg more than the average car. D. get 2.2 times the gas mileage of the average car. E. achieve fuel economy that is 2.2 standard deviations better than the average car. 10. Based on the Normal model for yearly snowfall in cm in a certain town N(57, 8), how many cm’s of snow would represent the 80th percentile approximately? A. 64.5 cm. B. 63.7 cm. C. 63.3 cm. D. 62.7 cm. E. 61.7 cm.

.


Chapter 5 The Standard Deviation as a Ruler and the Normal Model

Statistics Quiz E – Chapter 5 – Key 1. C 2. E 3. C 4. B 5. A 6. D 7. A 8. C 9. E 10. B

.

5-27



Statistics Test A – Data Analysis –Part I

Name __________________________

___ 1. School administrators collect data on students attending the school. Which of the following variables is quantitative? A) class (freshman, soph., junior, senior) C) whether the student is in AP* classes E) none of these

B) grade point average D) whether the student has taken the SAT

___ 2. Which of the following variables would most likely follow a Normal model? A) family income C) weights of adult male elephants E) all of these

B) heights of singers in a co-ed choir D) scores on an easy test

___ 3. A professor has kept records on grades that students have earned in his class. If he wants to examine the percentage of students earning the grades A, B, C, D, and F during the most recent term, which kind of plot could he make? A) boxplot

B) stem and leaf plot

C) dotplot

D) pie chart

E) histogram

___ 4. The weight of an Eurasian water shrew has a mean of 0.55 oz., a standard deviation of 0.06 oz., and is approximately normal. If a researcher captures a shrew whose weight is 0.59 oz., what percentile is its weight? A) 67%

B) 4%

C) 25%

D) 75%

E) 33%

___ 5. Two sections of a class took the same quiz. Section A had 15 students who had a mean score of 80, and Section B had 20 students who had a mean score of 90. Overall, what was the approximate mean score for all of the students on the quiz? A) 84.3

B) 85.0

C) 85.7

D) none of these

E) It cannot be determined.

___ 6. Data is collected on the number of movies 18-34 year olds go see in a year. In the random sample of 250 people 23% respond that they have seen 10 or more movies. Which of the following statements are true? I. Because the sample is random, 23% is the right answer. II. If you took more samples of size 250, this percentage would be slightly different. III. The right answer is probably close to 23% but other random samples will produce slightly different results. A) none of these

B) I only

C) II only

D) III only

E) II and III

___ 7. Suppose that a Normal model described student scores in a history class. Parker has a standardized score (z-score) of +2.5. This means that Parker A) is 2.5 points above average for the class. B) is 2.5 standard deviations above average for the class. C) has a standard deviation of 2.5. D) has a score that is 2.5 times the average for the class. E) None of the above.

.


Part I-2

Part I Exploring and Understanding Data

___ 8. The advantage of making a stem-and-leaf display instead of a dotplot is that a stem-and-leaf display A) satisfies the area principle. B) shows the shape of the distribution better than a dotplot. C) preserves the individual data values. D) A stem-and-leaf display is for quantitative data, while a dotplot shows categorical data E) none of these

___ 9. The five-number summary of credit hours for 24 students in a statistics class is: Min Q1 Median Q3 Max 13.0 15.0 16.5 18.0 22.0 Which statement is true? A) There are no outliers in the data. B) There is at least one low outlier in the data. C) There is at least one high outlier in the data. D) There are both low and high outliers in the data. E) None of the above.

___ 10.Which of the following summaries are changed by adding a constant to each data value? I. the mean II. the median III. the standard deviation A) I only

B) III only

C) I and II

D) I and III

E) I, II, and III

11. Cats and dogs The table shows whether students in an introductory statistics class like dogs and/or cats. Like Dogs Yes No Total Yes 194 21 215 Like Cats No 100 20 120 Total 294 41 335

a. Make a graphical display that shows the relationships in this table.

b. What is the marginal distribution (in %) of “liking dogs”?

___________________

c. What is the conditional distribution (in %) of “liking dogs” for students who like cats?

___________________

d. What kind of display(s) would you use to examine the association between “liking dogs” and “liking cats”? (Just name a graph.)

___________________

e. Do “liking dogs” and “liking cats” appear to be independent? Give statistical evidence to support your conclusion. .


Review of Part I: Exploring and Understanding Data

Part I-3

12. House calls A local plumber makes house calls. She charges $30 to come out to the house and $40 per hour for her services. For example, a 4-hour service call costs $30 + $40(4) = $190. a. The table shows summary statistics for the past month. Fill in the table to find out the cost of the service calls. Statistic Hours of Service Call Cost of Service Call Mean 4.5 Median 3.5 SD 1.2 IQR 2.0 Minimum 0.5

b. This past month, the time the plumber spent on one service call corresponded to a z-score of – 1.50. What was the z-score for the cost of that service call? _________

13. Health insurance The World Almanac and Book of Facts 2004 reported the percent of people not covered by health insurance in the 50 states and Washington, D.C., for the year 2002. Computer output gives these summaries for the percent of people not covered by health insurance: Min Q1 Median Q3 Max Mean SD 7.9 10.8 13.4 16.7 25.8 13.9 3.6

a. Were any of the states outliers? Explain how you made your decision.

b. A histogram of the data is as follows:

Is it more appropriate to use the mean and standard deviation or the median and IQR to describe these data? Explain. .


Part I-4

Part I Exploring and Understanding Data

14. Veterinary costs Costs for standard veterinary services at a local animal hospital follow a Normal model with a mean of $80 and a standard deviation of $20.

a. Draw and clearly label this model.

b. Is it unusual to have a veterinary bill for $125? Explain.

c. What is the IQR for the costs of standard veterinary services? Show your work.

15. Soda cans A machine that fills cans with soda fills according to a Normal model with mean 12.1 ounces and standard deviation 0.05 ounces. a. If the cans claim to have 12 ounces of soda each, what percent of cans are under-filled?

b. Management wants to ensure that only 1% of cans are under-filled. i. Scenario 1: If the mean fill of the cans remains at 12.1 ounces, what standard deviation does the filling machine need to have to achieve this goal?

ii. Scenario 2: If the standard deviation is to remain at 0.05 ounces, what mean does the filling machine need to have to achieve this goal?

.


Review of Part I: Exploring and Understanding Data

Part I-5

Statistics Test A – Data Analysis – Part I – Key 1. B 6. E

2. C 7. B

3. D 8. C

4. C 9. A

5. D 10. C

1

11. Cats and dogs

0.8

a. One possible option 

0.6

b. Yes: 87.8%

0.4

c. Yes: 90.2%

No: 12.2%

0.2

No: 9.8%

0

d. segmented bar graph, side-by-side bar

LikesDogsNo LikesDogsYes

graph, or pie charts

e. Perhaps. There is little difference between the percents of those who like dogs, depending on whether they like cats. Of those who like cats, 90.2% like dogs, compared to 87.8% overall.

12. House calls a. The table shows summary statistics for the past month. Fill in the table to find out the cost of the service calls. Statistic Hours of Service Call Cost of Service Call Mean 4.5 $210 Median 3.5 $170 SD 1.2 $48 IQR 2.0 $80 Minimum 0.5 $50

b. -1.50 13. Health insurance a. Were any of the states outliers? Explain how you made your decision. IQR = Q3 – Q1 = 16.7 – 10.8 = 5.9 1.5(IQR) = 1.5(5.9) = 8.85 Q3 + 1.5(IQR) = 16.7 + 8.85 = 25.55 < Max, so there is at least one high outlier Q1 – 1.5(IQR) = 10.8 – 8.85 = 1.95 < Min, so there are no low outliers

b. It is more appropriate to use the median and IQR to describe these data, since the distribution is skewed right.

.


Part I-6

Part I Exploring and Understanding Data

14. Veterinary costs a.

b. z 

125  80 20

 2.25 , more than 2 standard deviations above the mean bill. A veterinary bill of

$125 is unusual.

c. Q1 has z = –0.67 and Q3 has z = +0.67, so 0.67 

y  80 , thus y = 80 – 0.67(20) = 66.67 and 20

y  80 , thus y = 80 + 0.67(20) = 93.33. 20 The IQR = Q3 – Q1 = 93.33 – 66.67 = $26.66. 0.67 

15. Soda cans A machine that fills cans with soda fills according to a Normal model with mean 12.1 ounces and standard deviation 0.05 ounces.

a. z 

12  12.1  2.0 , which suggests that 2.28% of cans are under-filled. 0.05

b. A z-score of -2.33 has 1% to its left, meaning that 1% of the cans would be under-filled. i.

12  12.1  –2.33s = -0.1.  s = –0.1/–2.33  0.043. The standard deviation would s need to be 0.043 ounces. 2.33 

12  m  –2.33(0.05) = 12 – m.  m = 12 + 2.33(0.05)  12.12. The mean would 0.05 need to be 12.12 ounces.

ii. 2.33 

.


Review of Part I: Exploring and Understanding Data

Statistics Test B – Data Analysis – Part I

Part I-7

Name __________________________

___ 1. The SPCA collects the following data about the dogs they house. Which is categorical? A) breed D) number of days housed

B) age E) veterinary costs

C) weight

___ 2. Which of those variables about German Shepherds is most likely to be described by a Normal model? A) breed D) number of days housed

B) age E) veterinary costs

C) weight

___ 3. The SPCA has kept these data records for the past 20 years. If they want to show the distribution of the preferred breed of dog that people choose, which graph would they choose? A) boxplot

B) stem and leaf plot

C) bar graph

D) dotplot

E) histogram

___ 4. A researcher notes that the weight of a male Golden Retriever is approximately normally distributed with a mean of 70 lbs. and a standard deviation of 7.5 lbs. What percentage of these dogs weigh less than 65 lbs? A) 66.7%

B) 33.3%

C) 5%

D) 25%

E) 75%

___ 5. Referring to question #4, approximately how heavy would a male Golden Retriever be in order to be in the heaviest 5% of such dogs? A) 75 lbs.

B) 82 lbs.

C) 65 lbs.

D) 68 lbs.

E) 84 lbs.

___ 6. Last weekend police ticketed 18 men whose mean speed was 72 miles per hour, and 30 women going an average of 64 mph. Overall, what was the mean speed of all the people ticketed? A) 67 mph D) none of those

B) 68 mph C) 69 mph E) It cannot be determined.

___ 7. Which is true of the data shown in the histogram? I. The distribution is skewed to the right. II. The mean is probably smaller than the median. III. We should use median and IQR to summarize these data. A) I only

B) II only

C) III only

D) II and III only

E) I, II, and III

___ 8. If we want to discuss any gaps and clusters in a data set, which of the following should not be chosen to display the data set? A) histogram C) boxplot

B) stem-and-leaf plot D) dotplot E) any of these would work

.


Part I-8

Part I Exploring and Understanding Data

___ 9. Consider the five number summary for salaries of U.S. marketing managers. Min

Q1

Median

Q3

Max

46360 69693

77020

91750 129420

Suppose the marketing manager who was earning $129,420 got a raise and is now earning $140,000. Which of the following statement is true? A) The mean would increase. B) The median would increase. C) The range would increase. D) Both A and C. E) All of the above. ___10. Suppose that a Normal model describes fuel economy (miles per gallon) for automobiles and that a Toyota Corolla has a standardized score (z-score) of +2.2. This means that Corollas . . . A) get 2.2 miles per gallon. B) get 2.2 times the gas mileage of the average car. C) get 2.2 mpg more than the average car. D) have a standard deviation of 2.2 mpg. E) achieve fuel economy that is 2.2 standard deviations better than the average car. 11. Commuting to work The table shows how a company’s employees commute to work. a. What is the marginal distribution (in %) of mode of transportation? Car _____Bus _____Train _____

Job Class Management Labor Total

Transportation Car Bus Train 26 20 44 56 106 168 82 126 212

Total 90 330 420

b. What is the conditional distribution (in %) of mode of transportation for management? Car _____Bus _____Train _____ c. What kind of display would you use to show the association between job class and mode of transportation? (Just name a graph.) _____________________________ d. Do job classification and mode of transportation appear to be independent? Give statistical evidence to support your conclusion. 12. Book sales A publishing company pays its sales staff $600 a week plus a commission of $0.50 per book sold. For example, a salesman who sold 440 books earned 600 + 0.50(440) = $820. a. The table shows summary statistics for the number of books the large sales staff sold last week. Fill in the table to show the statistics for the pay these people earned. b. The newest employee had a pretty good week. Among all the salespeople her pay corresponded to a z-score of +1.80. What was the z-score of the number of books she sold?

.

Statistic Mean

Books Sold $ Earned 640

St. dev.

360

IQR

450

Maximum

1420


Review of Part I: Exploring and Understanding Data

Part I-9

13. Cordless phones In their October 2003 issue, Consumer Reports evaluated the price and performance of 23 models of cordless phones. Computer output gives these summaries for the prices: Min 15

Q1 30

Median 50

Q3 110

Max 200

MidRange 107.5

Mean 71.75

TrMean 67.63

SD 52.08

a. Were any of the prices outliers? Explain how you made your decision.

b. One of the manufacturers advertises a cordless phone “economy-priced at only $31.95.” Would you consider that to be a very low price? Explain.

14. Exercising Owners of an exercise gym believe that a Normal model is useful in projecting the number of clients who will exercise in their gym each week. They use a mean of 800 clients and a standard deviation of 90 clients. a. Draw and clearly label this model.

b. What is the first quartile of the weekly number of clients? (SHOW WORK)

c. An owner of another gym reports that 5% of the time their gym has fewer than 450 clients, and 40% of the time the gym has more than 1085 clients. What parameters should that owner use for his Normal model? N( ________ , _________)

.


Part I-10

Part I Exploring and Understanding Data

15. Concrete thickness A roadway construction process uses a machine that pours concrete onto the roadway and measures the thickness of the concrete so the roadway will measure up to the required depth in inches. The concrete thickness needs to be consistent across the road, but the machine isn’t perfect and it is costly to operate. Since there’s a safety hazard if the roadway is thinner than the minimum 23 inch thickness, the company sets the machine to average 26 inches for the batches of concrete. They believe the thickness level of the machine’s concrete output can be described by a Normal model with standard deviation 1.75 inches. (SHOW WORK) a. What percent of the concrete roadway is under the minimum depth?

_________ b. The company’s lawyers insist that no more than 3% of the output be under the limit. Because of the expense of operating the machine, they cannot afford to reset the mean to a higher value. Instead they will try to reduce the standard deviation to achieve the “only 3% under” goal. What standard deviation must they attain?

_________ c. Explain what achieving a smaller standard deviation means in this context.

.


Review of Part I: Exploring and Understanding Data

Part I-11

Statistics Test B – Data Analysis – Part I – Key 1. A

2. C

3. C

4. D

5. B

6. A

7. D

8. C

9. D

10. E

11. Commuting to work a.

Car: 19.5%

b. Car: 28.9%

Bus: 30%

Train: 50.5%

Bus: 22.2%

Train: 48.9%

c. segmented bar graph, or pie charts d. No, there is a difference between the percents in two types of transportation - Car and Bus categories, depending on the Job Classification. Car

Bus

Train

Management

28.9%

22.2%

48.9%

Labor

17.0%

32.1%

50.9%

Although about half of each group take the train, management are more likely than labor to come by car and less likely to take a bus. 12. Book sales. a.

b.

The table shows summary statistics for the number of books the large sales staff sold last week. Fill in the table to show the statistics for the pay these people earned.

Statistic Mean Standard deviation

360

$180

+1.80

IQR

450

$225

Maximum

1420

$1310

Books Sold $ Earned 640 $920

13. Cordless phones

a. IQR = Q3 – Q1 = 110 – 30 = 80 1.5(IQR) = 1.5(80) = 120 Q3 + 1.5(IQR) = 110 + 120 = 230; Max (200) < Q3 + 1.5(IQR), so no high outliers. Q1 – 1.5(IQR) = 30 – 120 = –90;Min (15) > Q1 – 1.5(IQR), so no low outliers.

b. A price of $31.95 is slightly above the first quartile, so over 25% of phones cost less. No, the advertised price would not be a very low price. (Or: The advertised price is only 0.76 standard deviations below the mean. This is not an unusually low price.)

.


Part I-12

Part I Exploring and Understanding Data

14. Exercising a. Draw and clearly label this model.

b. Q1 => P = 0.25 and z = –0.674, x  800 90 60.66  x  800 x  739.34 0.674 

So the first quartile is at 740 clients.

c. 450 (5th percentile) has z = –1.645 1085 (60th percentile) has z = +0.253

450  

 1.645 and

1084  

 0.253  450 –  = -1.645 and 1084 –  = 0.253 

1.898 = 634    334.0 and   999.5 N(999.5 , 334.0) 15. Concrete thickness a. The concrete roadway is under minimum depth when less then 23 inches in thickness.

z

23  26  1.7  P = 0.043 so the model suggests about 4.3% is under the minimum depth 1.75 23  26

b.

P = 0.03 z = –1.88, so 1.88 

c.

A smaller standard deviation means that the thickness of the concrete will be more consistent.

; then   1.596 inches

.


Review of Part I: Exploring and Understanding Data

Statistics Test C – Data Analysis – Part I

Part I-13

Name __________________________

___ 1. We collect these data from 50 male students. Which variable is categorical? A) eye color B) head circumference D) number of cigarettes smoked daily

C) hours of homework last week E) number of TV sets at home

___ 2. Which of those variables is most likely to follow a Normal model? ___ 3. The data collected in a recent (2012) Nielsen Global Survey analyzed Asian American demographics. Which of the following statements is (are) true? A) Education level is a categorical variable. B) Education level is nominal scaled. C) Education level is ordinal scaled. D) Both A and B E) Both A and C ___ 4. The mean number of hours worked for the 30 males was 6, and for the 20 females was 9. The overall mean number of hours worked … A) is 6.5

B) is 7.2

C) is 7.5

D) is none of these.

E) cannot be determined.

___ 5. We might choose to display data with a stemplot rather than a boxplot because a stemplot I. reveals the shape of the distribution. II. is better for large data sets. III. displays the actual data. A) I only B) II only C) III only D) I and III E) I, II, and III ___ 6. Which is true of the data whose distribution is shown? I. The distribution is skewed to the right. II. The mean is probably smaller than the median. III. We should summarize with mean and standard deviation. A) I only

B) II only

C) I and II

D) II and III

E) I, II, and III

___ 7. The standard deviation of the data displayed in this dotplot is most likely to be … A) 5.

B) 8.

C) 12.

D) 18.

E) 20.

___ 8. Suppose that a Normal model describes the acidity (pH) of rainwater, and that water tested after last week’s storm had a z–score of 1.8. This means that the acidity of that rain … A) had pH of 1.8. B) varied with a st. dev. of 1.8 C) had pH 1.8 higher than avg, rainfall. D) had pH 1.8 times that of average rainwater. E) had pH 1.8 standard deviations higher than average rainwater. ___ 9. A North American mole weighs an average of 3.5 oz. with a standard deviation of 0.9 oz. and a distribution that is approximately normal. In what percentile is the weight of a mole that weighs 4.25 oz.? A) 20.2%

B) 79.8%

C) 83.3%

D) 17.7%

.

E) 6.022%


Part I-14

Part I Exploring and Understanding Data

___10. Environmental researchers have collected rain acidity data. They want to examine the various causes of the acidity. They should graph their data with a… A) dotplot B) bar graph C) boxplot D) histogram E) stem and leaf plot 11. Paying for purchases One day a store tracked the way shoppers paid for their purchases. Their data are summarized in the table. a. What percent of men paid cash? _______ b. What is the conditional relative frequency distribution of payment method for women? c. d.

Male Female Total

Cash Check Charge Total 18 10 12 40 18 12 30 60 36 22 42 100

If you wanted to show the association between gender and method of payment visually, what kind of graph would you make? (Just name it.) ______________________ Is there evidence of an association between gender and method of payment? Explain briefly.

12. Repair bills An automobile service shop reported the summary statistics shown for repair bills (in $) for their customers last month. a. Were any of the bills outliers? Show how you made your decision.

b.

27 Min 88 Q1 Median 132 308 Q3 1442 Max 284 Mean 140 SD

After checking out a problem with your car the service manager gives you an estimate of “only $90.” Is he right to imply that your bill will be unusually low? Explain briefly.

13. Salary conversions You learn that your company is sending you and several other employees to staff a new office in China. While there everyone will earn the equivalent of their current salary, converted to Chinese currency at the rate of 8 yuans per dollar. In addition, everyone will earn a weekly foreign living allowance of 200 yuans. For example, since you are earning $1000 per week, your weekly salary in China will be 1000 x 8 + 200 = 8200 yuans. a. Shown are some summary statistics describing Statistic In the US In China the current salaries of this group being sent Minimum salary $400 overseas. Fill in the table to show what these statistics will be for the salaries you all will Standard deviation $250 earn while in China. Median $750 b. Among this group of employees going to China, your US salary has a z–score of +1.20. IQR $300 What will your new z-score be, based on everyone’s China salary? _____________ .


Review of Part I: Exploring and Understanding Data

Part I-15

14. Copy machines A manufacturer claims that lifespans for their copy machines (in months) can be described by a Normal model N(42,7). Show your work. a. Draw and clearly label the model.

b.

A company with a several large office buildings buys 200 of these copiers. The salesman tells the boss “190 (95%) of your new copiers will last between _____ and _____ months.” Comment on this claim.

c.

What is the 3rd quartile of copier lifespans? _____________

d.

What percent of the copiers are expected to fail before 36 months? _____________

e.

The manufacturer wants to reduce the 36-month failure rate to only 10%. Assuming the mean lifespan will stay the same, what standard deviation must they achieve? _____________

f.

Briefly explain what that change in standard deviation means in this context.

g.

A competing manufacturer says that not only will 90% of their copiers last at least 36 months, 65% will last at least 42 months. What Normal model parameters is that manufacturer claiming? Show your work. N(________ , _________)

.


Part I-16

Part I Exploring and Understanding Data

Statistics Test C – Data Analysis – Part I – Key 1. A

2. B

3. E

4. B

5. D

6. A

7. C

8. E

9. B

10. B

11. Paying for purchases a.

45%

b.

30% cash, 20% check, 50% charge

c.

segmented bar graphs, side-by-side bar graphs, or pie charts

d.

Yes. Women are more likely to charge their purchases than men (50% to 30%) and less likely to pay cash (30% to 45%).

12. Repair bills a. Yes. IQR = 308 – 88 = 220. The upper fence for outliers is one and a half IQR’s above the third quartile, or 308 + 1.5(220) = 638. The maximum repair bill was $1442, well above $638, so it is certainly an outlier. b. No. $90 is higher than over 25% of the bills, so it is not unusually low.

13. Salary conversions a. 3400 yuans, 2000, 6200, 2400

b. z = +1.20

14. Copy machines a.

b. 28, 56. The claim is probably false. This model should provide a useful estimate of what might happen, but is not certain to predict what actually will happen. c. 46.7 months d. 19.6% e. 4.7 months (should all include sketches of labeled curves) f.

A smaller standard deviation means that the copiers would be more consistent in their lifespans

g. 36 months (10th percentile) has z = –1.28 and 42 months (35th percentile) has z = –0.385. 36  

 1.28155 and

42  

 0.38532  36 –  = –1.28155 and 42 –  = –0.38532 

0.89623 = 6    6.695 and   44.580  N(44.580, 6.695)

.


Chapter 6 Scatterplots, Association, and Correlation Solutions to class examples: 1. Answers may vary. Here are some possibilities.  Drug dosage and degree of pain relief Answer: The association is likely to be strong, positive and curved. Assuming, of course, that the drug is an effective pain reliever, as the dosage increases, the degree of pain relief will increase. Eventually, the association is likely to level off, until no further pain relief is possible, since the pain will be gone.  Calories consumed and weight loss Answer: The association is likely to be moderate, negative and linear. As fewer calories are consumed, more weight is likely to be lost. The association will not be strong, since some people lose weight easier than others, and there are other variables involved like overall health, exercise and beginning weight.  Hours of sleep and score on a test Answer: The association is likely weak, positive and possibly linear. Generally, a wellrested person is expected to score higher on a test. The relationship is weak, since there are other variables involved. Maybe a person got less sleep because they were up studying.  Shoe size and grade point average Answer: There is no association between shoe size and GPA. The scatterplot is likely to be randomly scattered.  Time for a mile run and age Answer: The association between time for a mile run and age is likely to be moderate and curved, with no dominant direction. The very young will likely have high run times. Run times are likely to be the lowest for people in their late teens or early twenties. Older people are likely to have high run times.  Age of car and cost of repairs Answer: The association between age of car and cost of repairs is positive, moderate, and linear. As cars get older, they usually require more repairs. 2. Answers may vary. 3. Answers may vary depending on your data. A sample of a class worksheet based on height/weight data is included, preceding the Chapter 6 quizzes. 4. See Class Example 4. 5. See Class Example 5.

.


6-2

Part II Exploring Relationships Between Variables

The variables are year and U.S. population, in millions of people. Both variables are quantitative. The association between year and population is strong, positive, and curved. Population has been increasing over the last 200 years. Furthermore, the rate of population growth has been increasing. The U.S. population has been growing faster in more recent years. We will attempt to straighten the scatter plot using a logarithmic re-expression and a square root re-expression.

Millions of people

6. Think U.S. Population 225 150 75

1800

1880

1960

Year

Show Square Root Re-expression sqrt(population)

Log(population)

Logarithmic Re-expression 2.0 1.5 1.0

1800

1880

16 12 8 4 1800

1960

1880

1960

Year

Year

Tell The scatterplot of log(population) and year shows a strong, positive association, but it is now downward curved. We have gone too far down along the ladder of powers, and should go one step back. The scatterplot of square root of population and year shows a strong, positive, more linear association. The square root re-expression straightens the scatterplot well. Supplemental Resources On the following page is a sample of a worksheet that examines the relationship between height and weight of some students, using association, correlation, regression and prediction. It is designed to be used with the TI-83 graphing calculator, but you can easily adapt it to your own data analysis package. You may also choose to collect and use your own data.

.


Chapter 6 Scatterplots, Association, and Correlation

Statistics – Correlation, Regression, Prediction Worksheet

6-3

Chapters 6 & 7

1. Below are heights and weights data for male and female Stat students. Create TI lists MHT, MWT, FHT, FWT, and enter the data. You will need to keep these data for several days. (Remember, to create a list: (1) put the cursor atop the names (L1,L2,...), then (2) space to the right, and (3) type the new name in the blank space. You can then access these names via the LIST NAMES menu.) 2. Is it reasonable to assume these data are drawn from populations that are normally distributed? Check summary statistics and histograms for each variable. 3. a. Make a scatterplot for each gender (STAT PLOT, first plot type). Which is the explanatory variable? b. Describe the relationship (form, strength, direction, outliers, etc.) for each gender. 4. a. Calculate and interpret r for each gender. b. What would you predict about the weight of . . . someone of average height? . . . a male 2 standard deviations above average in height. . . . a female 1 standard deviation below average in height? c. Explain how this illustrates the concept of regression toward the mean. d. Calculate the coefficients of determination, r2, and interpret each in the context of this relationship. 5. a. b. c. d. e.

Write an equation of the least squares line of best fit for each gender. Check the residuals plots. Do you think that a line is a good model? Explain. Explain what the slope of each line means in the context of this relationship. Predict weights for : a 60” male; a 60” female; a 70” male; a 70” female Predict the weight of a 7’2” male; of a 20” newborn baby girl. Comment on these results. HT(in) 67 71 73 71 74 74 68 73 71 72 69 66 70

WT(lb) 140 165 168 142 200 175 135 145 150 155 168 106 144

MALE HT(in) 71 70 71 70 69 70 74 71 74 72 70 73

WT(lb) 132 140 140 140 130 150 170 175 180 150 150 190

HT(in) 63 62 75 61 62 63 66 63 67 67 64 61

.

FEMALE WT(lb) HT(in) 117 64 107 63 170 64 91 71 118 64 130 62 135 65 120 64.5 125 68.5 117 65.5 135 64 88 64

WT(lb) 110 123 110 134 129 129 123 115 122 120 111 115


6-4

Part II Exploring Relationships Between Variables

Chapters 6 & 7 – Worksheet Key – Correlation, Regression, and Prediction 1. Enter the data, or collect data from your own class, provided you have enough students to make meaningful summaries. 2. All of the distributions are at least roughly unimodal and symmetric. There are several tall females, one of which is also heavier than average. Since the means and medians are roughly the same, it seems reasonable that these data are drawn from populations that are normally distributed.

3. a. The explanatory variable is height. Height determines weight. b. There is a moderate, positive linear relationship between male height and male weight. Taller males are generally heavier. Likewise, there is a moderate, positive, linear relationship between female height and female weight. Taller females are generally heavier. There is one female who is much taller and heavier than the others, but this female’s attributes fit the overall pattern, although her weight is a bit higher than we would expect. 4. a. Males : r = 0.752 Females : r = 0.738 b. Someone of average height is expected to be of average weight. A male 2 SDs above the mean in height is expected to be 2(0.752) = 1.54 SDs above the mean in weight. A female 1 SD below the mean height is expected to be 1(0.738) = 0.738 SDs below the mean in weight. c. The relationship between height and weight is not perfect. When making predictions, we are cautious when straying too far away from the mean, because of this variability. When we back off in our predictions by multiplying standard deviations by r, we are regressing towards the mean. d. 56.5% of the variability in Male Weight is explained by the linear model. 54.5% of the variability in Female Weight is explained by the linear model.

.


Chapter 6 Scatterplots, Association, and Correlation

  364.403  7.29993( MHT ) 5. a. MWT b.

6-5

  116.574  3.66384( FHT ) FWT Both residuals plots appear reasonably scattered. The linear models are both appropriate, although removing the heaviest and two lightest females may yield a better overall model.

c. According to the linear models, each additional inch in male height is associated with about 7.3 additional pounds in weight, and each additional inch in female height is associated with about 3.7 additional pounds in weight. d. 60” male: 73.6 lbs 60” female: 103.3 lbs 70” male: 146.6 lbs 70” female: 139.9 lbs. e. 7’2” male: 263.4 lbs 20” newborn baby girl: – 43.3 lbs The weight of the male seems low. A male this tall would weigh much more. The weight of the newborn baby girl is impossible. These models were designed to predict the weights of typical high school students, not the very tall or babies. We wouldn’t expect these predictions to be accurate.

.


6-6

Part II Exploring Relationships Between Variables

Statistics Quiz A – Chapter 6

Name__________________________

1. After conducting a survey of his students, a professor reported that “There appears to be a strong correlation between grade point average and whether or not a student looks at his or her phone during class.” Comment on this observation.

2. The following scatterplot shows a relationship between x and y that results in a correlation coefficient of r = 0. Explain why r = 0 in this situation even though there appears to be a strong relationship between the x and y variables. Scatterplot of y vs x 18 16 14 12

y

10 8 6 4 2 0 -5

-4

-3

-2

-1

0 x

.

1

2

3

4


Chapter 6 Scatterplots, Association, and Correlation

6-7

3. The following scatterplot shows the relationship between the time (in seconds) it took men to run the 1500m race for the gold medal and the year of the Olympics that the race was run in: Scatterplot of Time vs Year 250

Time

240

230

220

210 1900

1910

1920

1930

1940 1950 Year

1960

1970

1980

1990

a. Write a few sentences describing the association.

b. Estimate the correlation.

r=

4. Identify what is wrong with each of the following statements: a. The correlation between Olympic gold medal times for the 800m hurdles and year is – 0.66 seconds per year. b. The correlation between Olympic gold medal times for the 100m dash and year is -1.37. c. Since the correlation between Olympic gold medal times for the 800m hurdles and 100m dash is –0. 41, the correlation between times for the 100m dash and the 800m hurdles is +0.41. d. If we were to measure Olympic gold medal times for the 800m hurdles in minutes instead of seconds, the correlation would be –0.66/60 = –0.011.

.


6-8

Part II Exploring Relationships Between Variables

Statistics Quiz A – Chapter 6 – Key 1. After conducting a survey of his students, a professor reported that “There appears to be a strong correlation between grade point average and whether or not a student looks at his or her phone during class.” Comment on this observation. Correlation measures the strength of a linear association between two quantitative variables. Whether or not a student looks at his phone is a categorical variable, so correlation cannot be calculated between GPA and whether or not a student looks at his or her phone.

2. The following scatterplot shows a relationship between x and y that results in a correlation coefficient of r = 0. Explain why r = 0 in this situation even though there appears to be a strong relationship between the x and y variables. Scatterplot of y vs x 18 16 14 12

y

10 8 6 4 2 0 -5

-4

-3

-2

-1

0

1

2

3

4

x

The correlation coefficient only measures the strength of linear associations. The relationship between x and y that we see here is far from linear (in fact, it is a parabolic relationship).

.


Chapter 6 Scatterplots, Association, and Correlation

6-9

3. The following scatterplot shows the relationship between the time (in seconds) it took men to run the 1500m race for the gold medal and the year of the Olympics that the race was run in: Scatterplot of Time vs Year 250

Time

240

230

220

210 1900

1910

1920

1930

1940 1950 Year

1960

1970

1980

1990

a. Write a few sentences describing the association. There is a fairly strong, negative, linear relationship between the time (in seconds) it took men to run the 1500m race for the gold medal and the year of the Olympics that the race was run in. It appears that the gold medal times for the 1500m race have decreased over time.

b. Estimate the correlation.

r=

– 0.94

(answers between – 0.7 and – 0.98 are acceptable)

4. Identify what is wrong with each of the following statements: a. The correlation between Olympic gold medal times for the 800m hurdles and year is – 0.66 seconds per year. Correlation has no units.

b. The correlation between Olympic gold medal times for the 100m dash and year is –1.37. Correlation has to be between –1 and +1.

c. Since the correlation between Olympic gold medal times for the 800m hurdles and 100m dash is –0. 41, the correlation between times for the 100m dash and the 800m hurdles is +0.41. Correlation does not change if we reverse the role of the x and y variables.

d. If we were to measure Olympic gold medal times for the 800m hurdles in minutes instead of seconds, the correlation would be –0.66/60 = –0.011. Correlation does not change when we change units. .


6-10

Part II Exploring Relationships Between Variables

Statistics Quiz B – Chapter 6

Name__________________________

1. Given the increase of people carrying their pets in public, a news reporter takes a survey of individuals about this trend. The reporter concludes, “There appears to be a strong correlation between bringing your pet to public places and the happiness of an individual.” Comment on this observation.

2. On the axes below, sketch a scatterplot described: a. a strong positive b. a weak negative association association

3. The National Sleep Foundation reported a moderately strong positive association between the number of hours of sleep a person gets and the person’s ability to listen, learn, and solve problems. a. Explain in the context of this problem what “positive association” means.

b. Hoping to improve academic performance, the professor recommends allowing students to take a nap prior to taking midterms and finals. Discuss the professor’s recommendations.

.


Chapter 6 Scatterplots, Association, and Correlation

6-11

4. A common objective for many school administrators is to increase the number of students taking SAT and ACT tests from their school. The data from each state from 2003 are reflected in the scatterplot at the right. a. Write a few sentences describing the association.

SAT Scores by State

Mean SAT Total

1220 1180 1140 1100 1060 1020 980 0 10 20 30 40 50 60 70 80 90 SATParticipation

b. Estimate the correlation. r =

c. If the point in the top left corner (4, 1215) were removed, would the correlation become stronger, weaker, or remain about the same? Explain briefly.

d. If the point in the very middle (38, 1049) were removed, would the correlation become stronger, weaker, or remain about the same? Explain briefly.

.


6-12

Part II Exploring Relationships Between Variables

Statistics Quiz B – Chapter 6 – Key 1. Given the increase of people carrying their pets in public, a news reporter takes a survey of individuals about this trend. The reporter concludes, “There appears to be a strong correlation between bringing your pet to public places and the happiness of an individual.” Comment on this observation. The variables – owning a pet and happiness of an individual – are both categorical variables. Correlation cannot be calculated with categorical variables.

2. On the axes below, sketch a scatterplot described: a. a strong positive b. a weak negative association association

3. The National Sleep Foundation reported a moderately strong positive association between the number of hours of sleep a person gets and the person’s ability to listen, learn, and solve problems. a. Explain in the context of this problem what “positive association” means. A positive association means in general people who had more sleep were able to listen, learn, and solve problems more successfully.

b. Hoping to improve academic performance, the professor recommends allowing students to take a nap prior to taking midterms and finals. Discuss the professor’s recommendations. The professor is attributing association to cause and effect. There is an implication that more sleep will cause better problem solving, etc…, therefore causing an increase in test grades. Perhaps people who are better at learning were able to sleep more, or perhaps the people who are better learners also need more sleep, for some unknown reason.

.


Chapter 6 Scatterplots, Association, and Correlation

6-13

4. A common objective for many school administrators is to increase the number of students taking SAT and ACT tests from their school. The data from each state from 2003 is reflected in the scatterplot at the right. SAT Scores by State

a. Write a few sentences describing the association. There is a moderate, negative, linear association between the percent of students taking the SAT test and the total SAT score. It appears that the states with a larger percentage of students taking the SAT test have lower average total scores. The relationship is however not as linear as we would hope for, demonstrating upwards curvature.

Mean SAT Total

1220 1180 1140 1100 1060 1020 980 0 10 20 30 40 50 60 70 80 90 SATParticipation

b. Estimate the correlation. r =

– 0.76

(answers between – 0.6 and – 0.9 are acceptable)

c. If the point in the top left corner (4, 1215) were removed, would the correlation become stronger, weaker, or remain about the same? Explain briefly. If the point in the top left corner (4, 1215) were removed, the correlation would become stronger because the remaining points show a pattern with slightly less scatter.

d. If the point in the very middle (38, 1049) were removed, would the correlation become stronger, weaker, or remain about the same? Explain briefly. If the point in the very middle (38, 1049) were removed, the correlation would remain about the same; this point does not contribute much to the scatter.

.


6-14

Part II Exploring Relationships Between Variables

Statistics Quiz C – Chapter 6

Name __________________________

1. After conducting a marketing study to see what consumers thought about a new tinted contact lens they were developing, an eyewear company reported, “Consumer satisfaction is strongly correlated with eye color.” Comment on this observation.

2. On the axes below, sketch a scatterplot described: a. a strong negative b. a strong association, association but r is near 0

c. a weak but positive association

3. A Massachussets study of teens found that there is a moderately strong association between the number of hours of physical activity and scores on English and Language Arts state tests. a. Explain in this context what “moderately strong association” means.

b.

Hoping to improve student performance, the school board passed a resolution urging parents to increase the number of hours that students are physically active. Do you agree or disagree with the school board’s reasoning. Explain. Make sure to comment on at least one other variable that might be influencing this association.

.


Chapter 6 Scatterplots, Association, and Correlation 80

70

S t r n g t h

Strength

4. Researchers investigating the association between the size and strength of muscles measured the forearm circumference (in inches) of 20 teenage boys. Then they measured the strength of the boys’ grips (in pounds). Their data are plotted at the right.

6-15

60

50

40

10

11

12

13

14

Circum

a.

Write a few sentences describing the association.

b.

Estimate the correlation. r = ________

c.

If the point in the lower right corner (at about 14” and 38#) were removed, how would the correlation become stronger, weaker, or remain about the same?

d.

If the point in the upper right corner (at about 15” and 75#) were removed, would the correlation become stronger, weaker, or remain about the same?

.


6-16

Part II Exploring Relationships Between Variables

Statistics Quiz C – Chapter 6 – Key 1. After conducting a marketing study to see what consumers thought about a new tinted contact lens they were developing, an eyewear company reported, “Consumer satisfaction is strongly correlated with eye color.” Comment on this observation. There may be an association between customer satisfaction and eye color, but these are both categorical variables so they cannot be “correlated.”

2. On the axes below, sketch a scatterplot described: a. a strong negative b. a strong association, association but r is near 0

c. a weak but positive association

3. A Massachussets study of teens found that there is a moderately strong association between the number of hours of physical activity and scores on English and Language Arts state tests.. a. Explain in this context what “moderately strong association” means. Students who were more physically active scored higher, in general, then students who were less active.

b. Hoping to improve student performance, the school board passed a resolution urging parents to increase the number of hours that students are physically active. Do you agree or disagree with the school board’s reasoning. Explain. Make sure to comment on at least one other variable that might be influencing this association. They are mistakenly attributing the association to cause and effect. Maybe students who are more active come from homes with better nutrition. Or higher income. (many possible variables could be listed). Exercise alone may not change students’ scores.

.


Chapter 6 Scatterplots, Association, and Correlation 80

70

S t r n g t h

Strength

4. Researchers investigating the association between the size and strength of muscles measured the forearm circumference (in inches) of 20 teenage boys. Then they measured the strength of the boys’ grips (in pounds). Their data are plotted at the right.

6-17

60

50

40

10

11

12

13

14

Circum

a. Write a few sentences describing the association. There is a moderate, positive, linear association between forearm circumference and grip strength among these boys. In general, the larger their forearms, the stronger their grip. One boy in particular had very large forearms and a very strong grip. There was one outlier—the boy with the second largest forearms had one of the weakest grips.

b. Estimate the correlation. Actually r = 0.652 – any guess between 0.5 and 0.8 is pretty good.

c. If the point in the lower right corner (at about 14” and 38#) were removed, how would the correlation become stronger, weaker, or remain about the same? The correlation would become stronger.

d. If the point in the upper right corner (at about 15” and 75#) were removed, would the correlation become stronger, weaker, or remain about the same? The correlation would become weaker.

.


6-18

Part II Exploring Relationships Between Variables

Statistics Quiz D – Chapter 6

Name __________________________

1. A study examined consumption levels of oil and carbon dioxide emissions for sample of counties. The response variable in this study is A. oil. B. oil consumption. C. carbon dioxide emissions. D. countries. E. none of the above. 2. A supermarket chain gathers data on the amount they spend on promotional material (e.g., coupons, etc.) and sales revenue generated each quarter. The predictor variable is A. sales revenue. B. amount spent on promotional material. C. number of coupons offered. D. supermarket chains. E. none of the above. 3. The scatterplot shows monthly sales figures (in units) and number of months of experience for a sample of salespeople. Scatterplot of Sales vs Experience 80

70

Sales

60

50

40

30 0

20

40

60

80

100

Experience

The association between monthly sales and level of experience can be described as A. positive and weak. B. negative and weak. C. negative and strong. D. positive and strong. E. nonlinear.

.


Chapter 6 Scatterplots, Association, and Correlation

6-19

4. The scatterplot shows monthly sales figures (in units) and number of months of experience for a sample of salespeople. Scatterplot of Sales vs Experience 80

70

Sales

60

50

40

30 0

20

40

60

80

100

Experience

The correlation between monthly sales and level of experience is most likely A. –.235. B. 0. C. .180. D. –.914. E. .914. 5. Shown below is a correlation table showing correlation coefficients between stock price, earnings per share (EPS) and price/earnings (P/E) ratio for a sample of 19 publicly traded companies. Which of the following statements is false? Correlations: Stock Price, EPS, PE Stock Price EPS EPS 0.875 PE 0.323 –0.111 A. EPS is the best predictor of stock price. B. The strongest correlation is between EPS and stock price. C. There is a weak negative association between PE and EPS. D. PE is the best predictor of stock price. E. The weakest correlation is between PE and EPS.

.


6-20

Part II Exploring Relationships Between Variables

6. A small independent organic food store offers a variety of specialty coffees. To determine whether price has an impact on sales, the managers kept track of how many pounds of each variety of coffee were sold last month. Based on the scatterplot shown below, which of the following statements is true? Scatterplot of Pounds S old vs Price/Pound 80

Pounds Sold

70 60 50 40 30 20 5.0

7.5

10.0 Price/Pound

A. The quantitative variable condition is satisfied. B. The linearity condition is satisfied. C. There are no obvious outliers. D. All of the above. E. None of the above.

.

12.5

15.0


Chapter 6 Scatterplots, Association, and Correlation

6-21

7. A small independent organic food store offers a variety of specialty coffees. To determine whether price has an impact on sales, the managers kept track of how many pounds of each variety of coffee were sold last month. Scatterplot of Pounds Sold vs Price/Pound 80

Pounds Sold

70 60 50 40 30 20 5.0

7.5

10.0 Price/ Pound

12.5

15.0

Based on the scatterplot, the linear relationship between number of pounds of coffee sold per week and price is A. strong and positive. B. strong and negative. C. weak and negative. D. weak and positive. E. nonexistent. 1 vs. x shows a strong positive linear pattern. It is probably true that y A. accurate predictions can be made for Y even if extrapolation is involved. B. the correlation between X and Y is near +1.0. C. large values of X are associated with large values of Y. D. the scatterplot of Y vs X also shows a linear pattern. E. the residuals plot for regression of Y on X shows a curved pattern.

8. A scatterplot of

9. All but one of the statements below contain a mistake. Which one could be true? A. If the correlation between blood alcohol level and reaction time is 0.73, then the correlation between reaction time and blood alcohol level is -0.73. B. The correlation between the breed of a dog and its weight is 0.435. C. The correlation between gender and age is -0.171. D. The correlation between height and weight is 0.568 inches per pound. E. The correlation between weight and length of foot is 0.488.

.


6-22

Part II Exploring Relationships Between Variables

10. A correlation of zero between two quantitative variables means that A. there is no linear association between the two variables. B. re-expressing the data will guarantee a linear association between the two variables. C. there is no association between the two variables. D. we have done something wrong in our calculation of r. E. none of these

.


Chapter 6 Scatterplots, Association, and Correlation

Statistics Quiz D – Chapter 6 – Key 1. C 2. B 3. D 4. E 5. D 6. D 7. B 8. E 9. E 10. A

.

6-23


6-24

Part II Exploring Relationships Between Variables

Statistics Quiz E – Chapter 6

Name __________________________

1. A consumer research group examining the relationship between the price of meat (per pound) and fat content (in grams) gathered data that produced the following scatterplot. Scatterplot of Price/lb vs Fat Grams 9 8

Price/lb

7 6 5 4 3 2 0

5

10 Fat Grams

15

20

If the point in the lower left hand corner (2 grams of fat; $3.00 per pound) is removed, the correlation would most likely A. remain the same. B. become positive. C. become weaker negative. D. become stronger negative. E. become zero.

.


Chapter 6 Scatterplots, Association, and Correlation

2. For the following scatterplot, Scatterplot of y vs x

20

y

15

10

5

8

10

12

14 x

16

18

20

The likely correlation coefficient is A. +0.35 B. +0.90 C. +0.77 D. –0.89 E. –1.00 3. The disadvantage of re-expressing variables is that A. We have to explain the association in terms of the transformed variables. B. We cannot use the standard regression models C. It is not commonly used D. It can be difficult E. A and D 4. All but one of these statements contain a mistake. Which could be true? A. The correlation between a car's length and its fuel efficiency is 0.71 miles per gallon. B. There is a correlation of 0.63 between gender and political party. C. There is a high correlation (1.09) between height of a corn stalk and its age in weeks. D. The correlation between the amount of fertilizer used and the yield of beans is 0.42. E. The correlation between a football player's weight and the position he plays is 0.54. 5. A company's sales increase by the same amount each year. This growth is . . . A. quadratic B. exponential C. linear D. logarithmic E. power

.

6-25


6-26

Part II Exploring Relationships Between Variables

6. Another company's sales increase by the same percent each year. This growth is … A. logarithmic B. quadratic C. linear D. exponential E. power 7. The correlation coefficient between the hours that a person is awake during a 24-hour period and the hours that same person is asleep during a 24-hour period is most likely to be A. near -0.8 B. exactly –1.0 C. exactly +1.0 D. near 0 E. near +0.8 8. All but one of the statements below contain a mistake. Which one could be true? A. The correlation between the height of a bean plant and the day is 0.78 in/day. B. The correlation between the time it takes to get ready in the morning and gender is 0.78. C. The correlation between your golf score and the number of hours you practice is -0.36. D. The number of apricots on a tree and the amount of fertilizer have a 1.12 correlation. E. There is a strong correlation between type of preferred pet and income level. 9. The correlation between Hours Sleep and Quiz score of the 60 students in your statistics class is r = -0.34. Which of these possible conclusions is justified: A. The more you sleep, the better your quiz score will be. B. The form of the relationship between Hours Sleep and Quiz score is moderately straight. C. There are several outliers that explain the low correlation. D. If we measure Sleep in minutes rather than hours, the correlation will increase. E. There exists a moderately stong, negative association between Gours Sleep and Quiz score. 10. In a study of streams in the Adirondack Mountains, the following relationship was found between the water's pH and its hardness (measured in grains). Is it appropriate to summarize the strength of association with a correlation?

A. All the conditions for correlation are met. B. The scatterplot has outliers; correlation is not appropriate. C. The scatterplot is not linear; correlation is not appropriate. D. The variables are not quantitative; correlation is not appropriate. E. None of the above. .


Chapter 6 Scatterplots, Association, and Correlation

Statistics Quiz E – Chapter 6 – Key 1. D 2. A 3. E 4. D 5. C 6. D 7. B 8. C 9. E 10. C

.

6-27



Chapter 7 Linear Regression Solutions to class examples: 1. Class Example 1: According to the linear model, each additional cricket chirp is associated with an increase in temperature by an average of 0.25 degrees Fahrenheit. 2. Class Example 2: Scatterplots, regression equations, correlations, R2 values, and residuals plots for the heights and weights provided with Chapter 6 materials is on pages 6-3 and 6-4. Of course, if you use your own class data for heights and weights, answers will vary. A worksheet detailing the relationship between distance and airfare is on pages 7-3 and 7-4, with a key on page 7-5. 3. See Class Example 3. 4. Class Example 4:   177.22  0.079( Distance) Fare

r  R 2  0.482  0.694

5. Class Example 5:   644.287  6.13( Fare) Distance

Predicting the distance traveled for a $300 fare, for example:   644.287  6.13(300) Distance

 1194.713miles Some students may think that you could just use the equation that predicts the fare, and merely “back-solve” it. This isn’t true.   177.22  0.079( Distance) Fare

300  177.22  0.079( Distance) Distance  1554.18 miles In order to predict distance, we need to use the model that minimizes the residuals for distances. The correct model predicts a distance of 1194.7 miles for a $300 fare. 6. Class Example 6: The Data Desk Lab Worksheet is on pages 7-17 and 7-18, with a key on page 7-19. Investigative Task Chapter 7’s Investigative Task examines the relationship between smoking and coronary heart disease.

.


7-2

Part II Exploring Relationships Between Variables

Supplemental Resources The following page, 7-2, contains the algebra to back up our assertion that the least squares line passes through  x , y  . This is merely provided for interested students, and is not required. A worksheet for the distance and airfares data is on pages 7-14 and 7-15, with a key on 7-16. A lab activity using the Data Desk Cars datafile is on pages 7-17 and 7-18, with a key on 7-19. Finally, a Random Matters investigation to see how much slopes and ܴଶ varies across different samples. Why does the regression line have to go through  x , y  ? The derivation of the least-squares regression line begins with the assertion that the best line must pass through the mean-mean point (the origin in the scatterplot of z-scores). Most students will not object. If you have some students who insist on a proof of the mean-mean assertion, here are the steps. We advise not doing all this in class lest most eyes glaze over… We seek the line zˆ y  a  mz x that minimizes

  z  zˆ    z   a  mz    z  mz   a    z  mz   2a  z  mz   a   2a  z  mz  2

y

y

2

Substitute the equation:

y

x

2

Rearrange terms:

y

x

y

x

2

Square the binomial:

Consider the “middle term”:

y

2

y

x

x

2a  z y  2am z x

Rewrite the sum:

But the mean (and thus the sum) of a set of z-scores must be zero. Hence this whole middle term is zero and we can turn our attention to the task of trying to minimize what’s left: 2   z y  mzx   a 2  By choosing a = 0 we can be sure that the sum will be minimal. (Adding the square of any other value would make it bigger.) Hence the best line must have a y-intercept of 0 in the standardized plane, proving that the line of regression goes through the mean-mean point.

.


Chapter 7 Linear Regression

Statistics Quiz A – Chapter 7

7-3

Name__________________________

An article in the Journal of Statistics Education reported the price of diamonds of different sizes in Singapore dollars (SGD). The following table contains a data set that is consistent with this data, adjusted to US dollars in 2004: 2004 US $ Carat 494.82 0.12 768.03 0.17 1105.03 0.20 1508.88 0.25 1826.18 0.28 2096.89 0.33

2004 US $ Carat 688.24 0.15 944.90 0.18 1071.75 0.21 1504.44 0.26 1908.28 0.29 2409.76 0.35

2004 US $ Carat 748.10 0.16 1076.18 0.19 1289.20 0.23 1597.63 0.27 2038.09 0.32

1. Make a scatterplot and describe the association between the size of the diamond (carat) and the cost (in US dollars).

2. Create a model to predict diamond costs from the size of the diamond.

.


7-4

Part II Exploring Relationships Between Variables

3. Do you think a linear model is appropriate here? Explain.

4. Interpret the slope of your model in context.

5. Interpret the intercept of your model in context.

6. What is the correlation between cost and size?

7. Explain the meaning of R2 in the context of this problem.

8. Would it be better for a customer buying a diamond to have a negative residual or a positive residual from this model? Explain.

.


Chapter 7 Linear Regression

7-5

Statistics Quiz A – Chapter 7 - KEY

An article in the Journal of Statistics Education reported the price of diamonds of different sizes in Singapore dollars (SGD). The following table contains a data set that is consistent with this data, adjusted to US dollars in 2004: 2004 US $ Carat 494.82 0.12 768.03 0.17 1105.03 0.20 1508.88 0.25 1826.18 0.28 2096.89 0.33

2004 US $ Carat 688.24 0.15 944.90 0.18 1071.75 0.21 1504.44 0.26 1908.28 0.29 2409.76 0.35

2004 US $ Carat 748.10 0.16 1076.18 0.19 1289.20 0.23 1597.63 0.27 2038.09 0.32

1. Make a scatterplot and describe the association between the size of the diamond (carat) and the cost (in US dollars). Scatterplot of 2004 US $ vs Carat 2500

2004 US $

2000

1500

1000

500 0.10

0.15

0.20

0.25

0.30

0.35

Carat

There is a strong, positive, linear association between the size of the diamond and its cost. The cost of a diamond increases with size.

2. Create a model to predict diamond costs from the size of the diamond. The regression equation is 2004 US $ = - 559 + 8225 Carat Predictor Constant Carat

Coef -558.52 8225.1

SE Coef 57.88 239.1

T -9.65 34.40

P 0.000 0.000

S = 64.9355 R-Sq = 98.7% R-Sq(adj) = 98.7% Predicted cost = -558.52 + 8225.1(carat)

Copyright © 2018 Pearson Education, Inc.


7-6

Part II Exploring Relationships Between Variables

3. Do you think a linear model is appropriate here? Explain. Residuals Versus the Fitted Values (response is 2004 US $) 100

Residual

50

0

-50

-100 500

1000

1500 Fitted Value

2000

2500

A linear model is appropriate for this problem. The residual plot shows no obvious pattern.

4. Interpret the slope of your model in context. The slope of the model is 8225.1. The model predicts that for each additional carat, the cost of the diamond will increase by $8225.10, on average. This can also be interpreted as for each additional 0.01 carat, the cost of the diamond will increase by $82.251, on average.

5. Interpret the intercept of your model in context. The intercept of the model is -558.52. The model predicts that a diamond of 0 carats costs $558.52. This is not realistic.

6. What is the correlation between cost and size? The correlation, r, is r  0.987  0.993 . Since the scatterplot shows a positive relationship, the positive value must be used.

7. Explain the meaning of R2 in the context of this problem. R 2  0.987 . So 98.7% of the variation in diamond prices can be accounted for by the variation in the size of the diamond.

8. Would it be better for a customer buying a diamond to have a negative residual or a positive residual from this model? Explain. It would be better for customers to have a negative residual from this model, since a negative residual would indicate that the actual cost of the diamond was less than the model predicted it to be.

.


Chapter 7 Linear Regression

Statistics Quiz B – Chapter 7

7-7

Name__________________________

In an effort to decide if there is an association between the year of a postal increase and the new postal rate for first class mail, the data were gathered from the United States Postal Service. In 1981, the United States Postal Service changed their rates on March 22 and November 1. This information is shown in the table below. 1. Make a scatterplot and describe the association between the year and the first class postal rate.

2. Create a model to predict postal rates from the year.

.

Year Rate 1971 0.08 1974 0.10 1975 0.13 1978 0.15 1981 0.18 1981 0.20 1985 0.22 1988 0.25 1991 0.29 1995 0.32


7-8

Part II Exploring Relationships Between Variables

3. Do you think a linear model is appropriate here? Explain.

4. Interpret the slope of your model in context.

5. Interpret the intercept of your model in context.

6. What is the correlation between year and postal rate?

7. Explain the meaning of R2 in the context of this problem.

8. Would it be better for customers for a year to have a negative residual or a positive residual from this model? Explain.

.


Chapter 7 Linear Regression

7-9

Statistics Quiz B – Chapter 7 - KEY

In an effort to decide if there is an association between the year of a postal increase and the new postal rate for first class mail, the data were gathered from the United States Postal Service. In 1981, the United States Postal Service changed their rates on March 22 and November 1. This information is shown in the table below. 1. Make a scatterplot and describe the association between the year and the first class postal rate.

Year 1971 1974 1975 1978 1981 1981 1985 1988 1991 1995

Rate 0.08 0.10 0.13 0.15 0.18 0.20 0.22 0.25 0.29 0.32

There is a strong, positive, linear association between the year and the first class postal rate. Postal rates have increased over time.

2. Create a model to predict postal rates from the year.

  19.93  0.01015( year ) Rate

.


7-10

Part II Exploring Relationships Between Variables

3. Do you think a linear model is appropriate here? Explain. Yes, a linear model is appropriate for this problem. A review of the residual plots shows no obvious pattern.

4. Interpret the slope of your model in context. Slope of the model is 0.0101518. The model predicts that for every additional year the first class postal rate will increase by $0.01, on average.

5. Interpret the intercept of your model in context. Intercept of the model is -19.93. The model predicts that at Year = 0, the first class postal rate was -$19.93. This is not realistic.

6. What is the correlation between year and postal rate? The correlation, r, is r  0.990  0.9950 . Since the scatterplot shows a positive relationship, the positive value must be used.

7. Explain the meaning of R2 in the context of this problem. R2 = 0.990. So 99.0% of the variation in first class postal rates can be accounted for by the variation in year.

8. Would it be better for customers for a year to have a negative residual or a positive residual from this model? Explain. It would be better for customers to have a negative residual from this model. A negative residual would indicate that the actual first class postal rate is lower than the model predicted it would be.

.


Chapter 7 Linear Regression

Statistics – Investigative Task

7-11

Chapter 7

Smoking Now that cigarette smoking has been clearly tied to lung cancer, researchers are focusing on possible links to other diseases. The data below show annual rates of cigarette consumption and deaths from coronary heart disease for several nations. Some public health officials are urging that the United States adopt a national goal of cutting cigarette consumption in half. Examine these data and write a report. In your report you should:  include appropriate graphs and statistics;  describe the association between cigarette smoking and coronary heart disease;  create a linear model;  evaluate the strength and appropriateness of your model;  interpret the slope and y-intercept of the line;  use your model to estimate the potential benefits of reaching the “national goal” proposed for the United States.

Country Australia Austria Belgium Canada Denmark Finland France Germany Greece Iceland Ireland Italy Mexico Netherlands New Zealand Norway Spain Sweden Switzerland United Kingdom USA

Cigarette Consumption (per adult per year)

CHD Mortality (Deaths per 100,000)

3220 1770 1700 3350 1500 2160 1410 1890 1800 2290 2770 1510 1680 1810 3220 1090 1200 1270 2780 2790 3900

238 182 118 212 145 233 60 150 41 111 187 114 32 125 212 136 44 127 125 194 257

Data Source: American Journal of Public Health, as cited in Exploring Data, Landwehr & Watkins, 1986.

.


7-12

Part II Exploring Relationships Between Variables

Statistics – Investigative Task

Chapter 7

Components Think

Show

Tell

Comments

Demonstrates clear understanding of correlation, regression, prediction, and the concept of a linear model. Visual: The scatterplot… o has correct explanatory/response o is accurate and clearly labeled o shows the regression line Numerical: The analysis… o has correct r and r2 o has correct slope and y-intercept o uses the proper notation Verbal: All interpretations must be in context and must distinguish between the data and the model. The Association o Discusses direction o Verifies linearity w/residuals plot o Describes strength w/correlation o Notes unusual points The Statistics - Interprets… o r2 in context o slope, in context (as model) o y-intercept, in context (as model) The Prediction o Makes correct prediction in context o Clearly indicates it’s a model o Avoids suggesting cause and effect

Components are scored as Essentially correct, Partially correct, or Incorrect 1: Graphs and Statistics: Scatterplot, numerical values, and notation E – Each category has all 3 requirements correct P – Each category has at least 2 of the requirements correct I – Too many errors in graph, statistics, or notation 2: Association: positive (in context), linear (w/residuals), strength (correlation), outliers E – All 4 requirements P – 2 or 3 requirements I – None or 1 of the requirements 3: Interpretation: r-sq, slope, y-intercept, in context, in terms of the model (not reality) E – All 5 requirements P – 3 or 4 requirements I – Fewer than 3 requirements 4: Prediction: correct prediction, in context, model-based, without implying causation E – All 4 requirements

P – 2 or 3 requirements

I – None or 1 of the requirements

Scoring  E’s count 1 point, P’s are 1/2  Grade: A = 4, B = 3, etc. +/- based on rounding (Ex: 3.5 rounded to 3 is B+)

Name __________________________________ Grade ____ .


Chapter 7 Linear Regression

7-13

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Model Solution - Investigative Task – Smoking

Information from the American Journal of Public Health reports on the number of cigarettes consumed per adult per year and the rate of deaths related to coronary heart disease, with statistics reported from 21 countries. Is there a relationship between cigarette consumption and CHD death rates? Furthermore, what are the potential health benefits associated with reducing cigarette consumption in the U.S. by half?

225 150 75

1500 3000 Cigarettes consumed per adult per year

Residuals Plot Residual

The association between cigarette consumption and CHD deaths is moderate, with r = 0.731, linear, and positive. Countries with higher cigarette consumption generally have higher rates of CHD deaths. There are several unusual points. Mexico and Greece have CHD death rates lower than we would expect, given their cigarette consumption, and Finland has a higher rate of CHD deaths than we would expect for its level of cigarette consumption. These points are notable, but not drastic departures from the pattern.

Cigarette Consumption and CHD deaths Coronary Heart Disease deaths per 100,000 people.

In the 21 countries surveyed, the mean cigarette consumption was approximately 2148 cigarettes per adult per year, with a standard deviation of 809 cigarettes per adult per year. The mean CHD rate was 144.9 deaths per 100,000 citizens, with a standard deviation of 66.5 deaths per 100,000 residents.

50 0

-50 Since the relationship is straight enough, both variables are quantitative, and there are no outliers, we can model the 120 200 association with a linear model. The regression equation for   15.6415  0.060176(Cigarettes ) . predicted(C/C) the relationship is CHD The model predicts that a country in which there is no cigarette consumption would have a CHD death rate of about 15.6 deaths per 100,000. Furthermore, according to the model, for each additional 100 cigarettes consumed per adult per year, there is an expected increase of about 6 deaths due to CHD. With R 2  53.4% , the model explains 53.4% of the variability in CHD death rate. The residuals plot shows no pattern, so the linear model is appropriate.

If the U.S. were to cut its cigarette consumption in half, from 3900 cigarettes per adult per year to 1950, the model predicts that the CHD rate would drop from 257 to approximately 133 deaths per 100,000 citizens. There is no guarantee of a decrease in CHD death rate, since we are dealing with a model, not reality, and there may not be a causal relationship between average cigarette consumption and CHD death rate.

.


7-14

Part II Exploring Relationships Between Variables

Statistics – Worksheet – Distance and Ticket Price Atlanta to: Baltimore Boston Dallas Denver Detroit Kansas City Las Vegas Miami Memphis Minneapolis New Orleans NY Okla City Orlando Philadelphia St Louis Salt Lake Seattle

400 350 300 f a r e

250 200 150

200

600

1000

1400

Chapter 7

1800

dist

1. Find r 2 . 2. Explain what r 2 means in this context.

Distance 568 933 720 1190 602 683 1719 589 327 894 419 749 749 392 657 461 1565 2150

Fare 219 222 249 308 249 141 252 229 183 209 199 248 301 238 205 232 371 343

Summary Statistics Mean St Dev Correlation

853.7 497.8

244.33 56.37 0.694

3. Find the slope of the regression line.

8. Using those estimates, draw the line on the scatterplot. 9. Explain what the y-intercept means in this context.

4. Find the y-intercept of the regression line.

10. Explain what the slope means in this context.

5. Write the equation of the linear model.

11. The fare to fly to Los Angeles, 1719 miles from Atlanta, is $212. Find the residual.

6. Estimate the fare for a 200-mile flight.

12. In general, a positive residual means…

7. Estimate the fare for a 2000-mile flight.

13. In general, a negative residual means…

.


Chapter 7 Linear Regression

1. Is the linear model appropriate for estimating airfare from distance flown? Why? 400 350

2. How strong is the model? Explain.

300 250

f a r e

3. Identify outliers. Why are they unusual?

200 150

200

600

1200

1800

dist

Residuals

4. Write the equation of the model _____________________________ 5. Predict the airfare for a 1000-mile flight _____________________________

40 0 -40 -80 240

280

320

predicted(f/d)

6. Write the equation of the model to estimate how far you could fly for a given price. ___________________________

2000 1500 d i s t

1000 500 150

600 300 0 -300 400

800

predicted(d/f)

.

300

fare

Residuals

7. How far does this model suggest you could fly for the fare you estimated in #5?

225

1200

7-15


7-16

Part II Exploring Relationships Between Variables

Distance and Ticket Price Worksheet Key – Side 1:

1. r 2  48.2% 2. 48.2% of the variability in airfare is accounted for by differences in distance traveled. 3. b1  0.078578 6. $192.96

4. b0  177.24

  177.24  0.079( Distance) 5. Fare

7. $334.40 8. Always make two predictions to graph the line.

9. The model predicts a base airfare of about $177.24. This might represent the fixed costs of air travel, such as employee salaries, aircraft maintenance, equipment costs, etc. 10. According to this model, for each additional mile in distance, we expect the airfare to increase by about $0.079. More meaningfully, for each additional 100 miles in distance, we expect the airfare to increase by about $7.90. 11. The model predicts that the fare for 1719 mile trips averages $312.32. The residual is $212 - $312.32 = – $100.32 12. A positive residual means that the model’s prediction was lower than the actual airfare. 13. A negative residual means that the model’s prediction was higher than the actual airfare. Distance and Ticket Price Worksheet Key – Side 2

1. Since the scatterplot is Straight Enough, and the residuals plot shows no pattern, the linear model is appropriate for predicting airfare from distance flown. 2. 48.2% of the variability in airfare is accounted for by differences in distance traveled. 3. $150 for a flight of about 700 miles seems low compared to the other fares.   177.215  0.078619( Distance) . It’s basically the same as 4. The regression equation is Fare the regression equation found by hand from the summary statistics, just a bit more accurate, since it isn’t based on rounded values.

5. $255.83   644.287  6.13101( Fare) 6. Distance

7. About 924 miles. .


Chapter 7 Linear Regression

Statistics – Data Desk Lab

7-17

Name______________________________

Fuel Economy Yes, cars again. But now we aren’t insuring them, we are driving them. To most of us, fuel economy is an important issue. In the US, we measure fuel economy in miles per gallon; in Europe and Canada, the standard is liters per 100 kilometers. A sociologist might say that this difference is a window into the national psyche. A European has to go from A to B and asks, “How much gas will I need to get there?” An American says, “The gas tank is full. Where can I go?” Anyway, we seek a good way to predict a car’s fuel economy. In completing this assignment you will examine several possible explanatory variables, choose the one you think is best, construct the linear model, then make a prediction and interpret the results. Here are the step-by-step instructions. Check off the steps as you go. 1. Launch ActivStats.  Open (double click) the Resources tab on the left side of the Lesson Book.  Scroll down and choose ActivStats DataSets: to launch the List Dataset window.  Choose the file Cars1991, the hit the Insert into Data Desk button.  Load the data by following the instructions. You’ll see these variables of interest: MPG - fuel economy ratings in miles per gallon; Weight - in pounds; Horsepo - engine horsepower; Eng. Dis - engine displacement (size) in cubic inches; Cylinders - number of cylinders (usually 4, 6, or 8); Drive R - drive ratio (how many revolutions the engine makes to rotate the wheels once). 2. Let’s look for the best predictor of fuel economy.  Select the response variable MPG as Y.  Holding the shift key down, select the other variables simultaneously as possible explanatory variables X.  Calculate Correlations (Pearson).  The correlation matrix shows you the strength and direction of the association between fuel economy and each of the other variables. Which variable seems to be most strongly correlated with MPG? _____________ 

Explain: _____________________________________________________________

3. From now on you will be working only with the response variable MPG and whatever you just chose as the best explanatory variable.  Select these variables as Y and X respectively.  Plot the Scatterplot.  Is the pattern you see what the correlation led you to expect? Explain. ________________________________________________________________________ .


7-18

Part II Exploring Relationships Between Variables

4. Now create the model (find the equation of the line of best fit).  Using the scatterplot’s hypermenu (click on the little triangle in the plot’s title bar), Add Regression Line and do the Regression of MPG vs X. ______  How many cars is this analysis based on?  What is the model - the equation of the line of best fit? The constant term and the slope for the equation of the line are the coefficients displayed in the bottom left corner of the regression analysis. (Use meaningful variable names) ___________________________  What does the slope mean in this context? 

_____________________________________________________________________ What does the value of r2 mean in this context? _____________________________________________________________________ _____________________________________________________________________

5. A 1992 Geo Prizm weighed 2608 pounds and had a 102 horsepower, 97 cubic inch, 4 cylinder engine with a final drive ratio of 3.05.  Use your model to estimate how many miles per gallon it should get. _____________ 

The owner actually averaged about 33.7 mpg. What is the residual? _____________ What does the residual mean? ___________________________________________

6. Residuals provide an important look at how successful the model is.  Using the hypermenu on the title bar of the regression analysis, create the Scatterplot residuals vs predicted.  In that scatterplot’s hypermenu you can again Add Regression Line.  The regression line now is horizontal, indicating where residuals would equal 0. See the actual residuals plotted for the cars data? Explain what they represent. _____________________________________________________________________ 7. If the model had successfully extracted all the meaning from the data, the remaining error would be random. In that case the residuals plot should appear to contain no pattern in the scatter. Look at your residuals plot. Does the scatter appear to be random? Look carefully. You should see a hint of a curve. That means you might be able to find a better model. Try the European concept: gallons per 100 miles.  Under Manip, choose Transform, and create a New Derived Variable named GpHM for “gallons per hundred miles.”  Enter the formula 100/MPG.  You will now see a new icon GpHM to use as the response variable Y. Using your chosen explanatory variable as X again, make a new scatterplot, do the regression analysis to create a new model, and check the new residuals. 2  Using the residuals plot and the value of R , explain why you think this model is better than the original. .


Chapter 7 Linear Regression

7-19

Data Desk Lab Key – Fuel Economy

2. Weight. The correlation between weight and MPG is – 0.869, the strongest of any of the correlations with MPG. 3. Yes. The correlation was negative, and so is the association. 4. 50 cars.   48.74  0.0082(Weight ) MPG The model predicts that each additional 100 pound in weight is associated with a decrease of approximately 0.82 miles per gallon. 75.6% of the variability in mileage is explained by the model. 5. 27.35 miles per gallon +6.35 miles per gallon The owner’s mileage was 6.35 miles per gallon more than the model predicted. 6. The residuals are the difference between what the mileage actually was and what the model predicted the mileage to be for a car of that weight. 7. The new model is better, since the new residuals plot shows only random scatter. The original residuals plot showed a curve.

.


7-20

Part II Exploring Relationships Between Variables

Statistics – Random Matters

Chapter 7

Once again we are using statistics to tell a story. Now we focus mostly on a slope—it tells us the rate of change between two variables. But by now we know that the slope we find may not be exactly the right answer. Let’s examine (again) how different random samples are… different! Go to http://ww2.amstat.org/CensusAtSchool. Here you will find a large set of data that has been collected from statistics students around the world. Follow these steps:     

Read about the survey that the students filled out. Click on the Random Sampler. Download 100 randomly chosen students. You can choose if you want restrict your random sample to a particular year, region, age group, etc. … There are some relationships here for which regression is a natural choice. Here are some suggestions: o Students with a longer armspan will be taller. o Student with bigger feet will also be taller.

Verify that a linear model is appropriate for the relationship you chose. Find the slope for one of these relationships. Interpret the slope in context. Time to compare! Now you need another random sample. Or two. Or three. You might work with a partner or a group. Or you might just do this yourself.    

Acquire another sample of 100 students. Make sure that you (or your partner(s)) follow the same process that you did the first time. Same year, same region, same age group, etc. … The only difference should be that this is a new random sample. Run the same regression on the new sample and compare the slopes. You can also compare the strength of the relationship: is the correlation the same for each random sample? Is ܴଶ the same?

Time to draw some conclusions!   

How are the samples different? How much did the slope change? Did the sign of the slope change? How different are the rates? Was the strength of the relationship the same in every sample? How much did ܴଶ vary from sample to sample?

Word of fair warning: These data were entered by students. You may find measurement errors and/or off-topic responses that need to be “cleaned”. That is, just delete them! .


Chapter 7 Linear Regression

Statistics – Random Matters

7-21

Chapter 7

Notes to instructors

This is a powerful activity. It builds on the Random Matters elements that are throughout the text. Part of the power lies in helping students realize early in the course that a statistic can vary. If we build this idea early on then it won’t come as such a shock when we hit Sampling Distributions. Depending on your classroom and time constraints, this activity could be done in class, especially in a lab. It is written in an open-ended format, but you could be more specific. For example: 

Each student is to use armspan to predict height. Find the slope and ܴଶ and compare these values to at least four other random samples.

Hopefully students realize that random samples will vary! If they copy (too much) from each other, well, the results will be rather telling!

.


7-22

Part II Exploring Relationships Between Variables

Statistics Quiz C – Chapter 7

Name__________________________

1. A small independent organic food store offers a variety of specialty coffees. To determine whether price has an impact on sales, the managers kept track of how many pounds of each variety of coffee were sold last month.

Mean Standard Deviation Correlation

PRICE PER POUND $ 3.99 $ 5.99 $ 7.00 $ 12.00 $ 4.50 $ 7.50 $ 15.00 $ 10.00 $ 12.50 $ 8.99

POUNDS SOLD 75 60 65 45 80 70 25 35 40 50

$ 8.75 $ 3.63

54.50 18.33

–0.927

Based on the data and summary statistics shown below, the slope of the estimated regression line that relates the response variable (monthly sales) to the predictor variable (price per pound) is A. 95. 459. B. .858. C. –4.681. D. –.858. E. –8.999.

.


Chapter 7 Linear Regression

7-23

2. A small independent organic food store offers a variety of specialty coffees. To determine whether price has an impact on sales, the managers kept track of how many pounds of each variety of coffee were sold last month.

Mean Standard Deviation Correlation

PRICE PER POUND $ 3.99 $ 5.99 $ 7.00 $ 12.00 $ 4.50 $ 7.50 $ 15.00 $ 10.00 $ 12.50 $ 8.99

POUNDS SOLD 75 60 65 45 80 70 25 35 40 50

$ 8.75 $ 3.63

54.50 18.33

-0.927

Based on the data and summary statistics, the intercept of the estimated regression line that relates the response variable (monthly sales) to the predictor variable (price per pound) is A. 95.459. B. .858. C. –4.684. D. –.858. E. –8.999. 3. A small independent organic food store offers a variety of specialty coffees. To determine whether price has an impact on sales, the managers kept track of how many pounds of each variety of coffee were sold last month. Mean Standard Deviation Correlation

$ 8.75 $ 3.63

54.50 18.33

–0.927

Based on the summary statistics shown below, what percent of the variability in the number of pounds of coffee sold per week can be explained by price? A. 95.47% B. 100% C. 85.9% D. 55.6% E. 4.68%

.


7-24

Part II Exploring Relationships Between Variables

4. Residuals are … A. the difference between observed responses and values predicted by the model. B. data collected from individuals that is not consistent with the rest of the group. C. none of these D. possible models not explored by the researcher. E. variation in the data that is explained by the model. 5. It's easy to measure the circumference of a tree's trunk, but not so easy to measure its height. Foresters developed a model for ponderosa pines that they use to predict the tree's height (in feet) from the circumference of its trunk (in inches): ln hˆ  1.2  1.4(ln C ). A lumberjack finds a tree with a circumference of 60"; how tall does this model estimate the tree to be? A. 11' B. 83' C. 93' D. 19' E. 5' 6. If the point in the upper right corner of this scatterplot is removed from the data set, then what will happen to the slope of the line of best fit (b) and to the correlation (r)?

A. b will increase, and r will decrease. B. b will decrease, and r will increase. C. both will decrease. D. both will increase. E. both will remain the same. 7. The correlation coefficient between high school grade point average (GPA) and college GPA is 0.560. For a student with a high school GPA that is 2.5 standard deviations above the mean, we would expect that student to have a college GPA that is ______ the mean. A. 0.56 SD above B. 1.94 SD above C. 2.5 SD above D. equal to E. 1.4 SD above

.


Chapter 7 Linear Regression

7-25

8. An 8th grade class develops a linear model that predicts the number of cheerios (a small round cereal) that fit on the circumference of a plate by using the diameter in inches. Their   0.56  5.11(diameter). model is cheerios The slope of this model is best interpreted in context as… A. For every 5.11 inches of diameter, the circumference is about 1 cheerio bigger. B. For every 1 inch of diameter, the circumference holds about 0.56 more cheerios. C. It takes 5.11 cheerios to fill a plate's circumference. D. For every 1 inch of diameter, the circumference holds about 5.11 more cheerios. E. A mistake, because π is about 3.14 and that should be the slope. 9. An 8th grade class develops a linear model that predicts the number of cheerios (a small round cereal) that fit on the circumference of a plate by using the diameter in inches. Their   0.56  5.11(diameter). model is cheerios If the diameter is increased from 4 inches to 14 inches, the predicted number of cheerios will increase by about… A. 72 B. 51 C. 10 D. 21 E. none of these above 10. For many people, breakfast cereal is an important source of fiber in their diets. Cereals also contain potassium, a mineral shown to be associated with maintaining a healthy blood pressure. An analysis of the amount of fiber (in grams) and the potassium content (in milligrams) in serving of 77 breakfast cereals produced the regression model   38  27 Fiber and se = 30.23. What does in this context mean that se = 30.23? Potassium A. The true potassium contents of cereals vary from the predicted amounts with a variance of 30.23 milligrams. B. The true fiber contents of cereals vary from the predicted amounts with a variance of 30.23 grams. C. The true potassium contents of cereals vary from the predicted amounts with a standard deviation of 30.23 milligrams. D. The true fiber contents of cereals vary from the predicted amounts with a standard deviation of 30.23 grams. E. None of the above.

.


7-26

Part II Exploring Relationships Between Variables

Statistics Quiz C – Chapter 7 - KEY

1. C 2. A 3. C 4. A 5. C 6. B 7. E 8. D 9. B 10. C

.


Chapter 7 Linear Regression

Statistics Quiz D – Chapter 7

7-27

Name__________________________

1. Data were collected on monthly sales revenues (in $1,000s) and monthly advertising expenditures ($100s) for a sample of drug stores. The regression line relating revenues (Y) to advertising expenditure (X) is estimated to be yˆ  48.3  9.00 x . The correct interpretation of the slope is that for each additional A. $1 spent on advertising, predicted sales revenue increases by $9,000. B. $100 spent on advertising, predicted sales revenue increases by $9,000. C. $100 spent on advertising, predicted sales revenue decreases by $9,000. D. $1,000 in sales revenue, advertising expenditures decrease by $48.30. E. $100 in sales revenue, advertising expenditures decrease by $48.30. 2. Data were collected on monthly sales revenues (in $1,000s) and monthly advertising expenditures ($100s) for a sample of drug stores. The regression line relating revenues (Y) to advertising expenditure (X) is estimated to be yˆ  48.3  9.00 x . The predicted sales revenue for a month in which $1,000 was spent on advertising is A. $50,000. B. $851.70. C. $8,951.70. D. $41,700. E. $90,000. 3. A company studying the productivity of its employees on a new information system was interested in determining if the age (X) of data entry operators influenced the number of completed entries made per hour (Y). The regression equation is yˆ  14.374  0.145 x . Suppose the actual completed entries per hour for an operator who is 35 years old was 8. The residual is A. –1.3 B. 2.6 C. –3.5 D. 1.3 E. –2.2 4. A company studying the productivity of their employees on a new information system was interested in determining if the age (X) of data entry operators influenced the number of completed entries made per hour (Y). The regression equation is yˆ  14.374  0.145 x. If sx = 14.04 and sy = 2.61, then the correlation coefficient between age and productivity is A. .779 B. –.236 C. .575 D. –.929 E. –.779

.


7-28

Part II Exploring Relationships Between Variables

5. Suppose the correlation, r, between two variables x and y is -0.44. What would you predict about a y value if the x value is 2 standard deviations above its mean? A. It will be .88 standard deviations below its mean. B. It will be .88 standard deviations above its mean. C. It will be 2 standard deviations below its mean. D. It will be .44 standard deviations below its mean. E. It will be .44 standard deviations above its mean. 6. Suppose the correlation, r, between two variables x and y is –0.44. What percentage of the variability in y cannot be explained by x? A. 19% B. 44% C. 81% D. 88% E. 12% 7. A regression analysis of students' college grade point averages (GPAs) and their high school GPAs found R 2  0.311. Which of these is true? I. High school GPA accounts for 31.1% of college GPA. II. 31.1% of college GPAs can be correctly predicted with this model. III. 31.1% of the variance in college GPA can be accounted for by the model A. I only B. II only C. III only D. I and II E. none of these 8. If the point in the upper left corner of the scatterplot is removed, what will happen to the correlation (r) and the slope of the line of best fit (b)?

A. r will decrease and b will increase. B. r will increase and b will decrease. C. Both will decrease. D. Both will increase. E. They will not change.

.


Chapter 7 Linear Regression

7-29

9. R-sq is a measure of ... A. the change in the y-variable that corresponds with the change in the x-variable. B. the probability that the regression line makes a correct prediction. C. the percentage of the accuracy of the regression equation. D. the initial predicted starting point of the response variable when x is zero. E. the proportion of the variability in the response variable that is explained by the explanatory variable. 10. The relationship between the number of hours a person practices a task and the time it takes them to complete the task is calculated to have R-sq = 56.7%. The value of the correlation coefficient is A. –0.753 B. 2.38 C. –0.238 D. 0.238 E. 0.753

.


7-30

Part II Exploring Relationships Between Variables

Statistics Quiz D – Chapter 7 - KEY

1. D 2. B 3. A 4. A 5. E 6. A 7. C 8. E 9. E 10. A

.


Chapter 8 Regression Wisdom Solutions to Class Examples: 1. See Class Example 1. 2. See Class Example 2 3. See Class Example 3 Investigative Task The Olympic Long Jump task requires students to model using an appropriate subset of the set of gold medal distances for the men’s long jump. Supplemental Resources The Graduating Classes worksheet deals with modeling an appropriate subset of a set of data. The Wandering Point worksheet explores influential points. It also shows students that an outlier in the x-direction can be influential, while an outlier in the y-direction can have virtually no effect on the slope of the regression line.

.


8-2

Part II Exploring Relationships Between Variables

Statistics Quiz A – Chapter 8

Name __________________________

1. The following is a scatterplot of the average final exam score versus midterm score for 11 sections of an introductory statistics class:

The correlation coefficient for these data is r = 0.829. If you had a scatterplot of the final exam score versus midterm score for all individual students in this introductory statistics course, would the correlation coefficient be weaker, stronger, or about the same? Explain.

2. A plot of the residuals versus the fitted values for record-breaking times of female marathon runners for the years 1998 – 2003 is:

Based on this residuals plot, does it seem reasonable to use linear regression for this model? Explain. .


Chapter 8 Regression Wisdom

8-3

3. Here is a scatterplot of weight versus height for students in an introductory statistics class. The men are coded as ‘1’ and appear as circles in the scatterplot; the women are coded as ‘2’ and appear as squares in the scatterplot.

a. Do you think there is a clear pattern? Describe the association between weight and height.

b. Comment on any differences you see between men and women in the plot.

c. Do you think a linear model from the set of all data could accurately predict the weight of a student with height 70 inches? Explain.

.


8-4

Part II Exploring Relationships Between Variables

Statistics Quiz A – Chapter 8 – Key 1. The following is a scatterplot of the average final exam score versus midterm score for 11 sections of an introductory statistics class:

The correlation coefficient for these data is r = 0.829. If you had a scatterplot of the final exam score versus midterm score for all individual students in this introductory statistics course, would the correlation coefficient be weaker, stronger, or about the same? Explain. Relationships based on averages have higher correlation coefficients than relationships based on individual data. Therefore, a scatterplot of the final exam score versus midterm score for individual students would show much more scatter and a weaker correlation coefficient.

2. A plot of the residuals versus the fitted values for record-breaking times of female marathon runners for the years 1998 – 2003 is:

Based on this residuals plot, does it seem reasonable to use linear regression for this model? Explain. Since we see a clear pattern in the residuals plot, it does not seem reasonable to use linear regression for this model.

.


Chapter 8 Regression Wisdom

8-5

3. Here is a scatterplot of weight versus height for students in an introductory statistics class. The men are coded as ‘1’ and appear as circles in the scatterplot; the women are coded as ‘2’ and appear as squares in the scatterplot.

a. Do you think there is a clear pattern? Describe the association between weight and height. There is a moderately strong, positive association between weight and height. The variation in weight is larger for larger values of height.

b. Comment on any differences you see between men and women in the plot. Men, on average, appear to be taller and heavier than women. We can clearly see two groups (with some overlap) in this scatterplot.

c. Do you think a linear model from the set of all data could accurately predict the weight of a student with height 70 inches? Explain. Since there appears to be a difference between men and women in the plot, it is not correct to use a single model obtained by these data to make a prediction. Furthermore, there appears to be a great deal of scatter at 70”, with weights varying by over 50 pounds for women and well over 100 pounds for men.

.


8-6

Part II Exploring Relationships Between Variables

Statistics Quiz B – Chapter 8

Name __________________________

Current research states that a good diet should contain 20-35 grams of dietary fiber. Research also states that each day should start with a healthy breakfast. The nutritional information for 77 breakfast cereals was reviewed to find the grams of fiber and the number of calories per serving. The scatterplot below shows the relationship between fiber and calories for the cereals. 1. Do you think there is a clear pattern? Describe the association between fiber and calories.

2. Comment on any unusual data point or points in the data set. Explain.

3. Do you think a model could accurately predict the number of calories in a serving of cereal that has 22 grams of fiber? Explain.

.


Chapter 8 Regression Wisdom

8-7

Statistics Quiz B – Chapter 8 – Key Current research states that a good diet should contain 20-35 grams of dietary fiber. Research also states that each day should start with a healthy breakfast. The nutritional information for 77 breakfast cereals was reviewed to find the grams of fiber and the number of calories per serving. The scatterplot below shows the relationship between fiber and calories for the cereals. 1. Do you think there is a clear pattern? Describe the association between fiber and calories. There is no clear pattern. At first glance, there appears to be a weak, negative association between grams of fiber and the number of calories in the cereals. Yet, the five points at the bottom of the graph are outside the pattern, with extremely low numbers of calories. Additionally, the three points on the right of the scatterplot have an unusually high amount of fiber, making them outliers and influential points.

2. Comment on any unusual data point or points in the data set. Explain. The points in the bottom left corner seem to have extremely low calorie content for cereals between zero and six grams of fiber. The points with 9, 10 and 14 grams of fiber appear to have an unusually high amount of fiber for their calorie content, making them outliers and influential points. These three points would also be leverage points, creating the impression that there is a negative association between grams of fiber and the number of calories in the cereals.

3. Do you think a model could accurately predict the number of calories in a serving of cereal that has 22 grams of fiber? Explain. This data contains information about cereals with fiber content between 0 and 14 grams. It would be extrapolation to try to use this data to predict the calorie content of cereals with 22 grams of fiber.

.


8-8

Part II Exploring Relationships Between Variables

Statistics Quiz C – Chapter 8

Name __________________________

California high schools use their total enrollment in 10th through 12th grade to measure how many students they should have taking AP exams. Abraham Lincoln High School in San Jose, as you can see in the graph below has a relatively large number of AP test takers (753) for its relatively small student body (1396). 1. Do you think there is a pattern? Describe the association between Enrollment and the number of AP exams taken.

2. Comment on Abraham Lincoln High School. How will this school effect a regression calculation for these data?

3. Do you think the correlation would be stronger or weaker if we removed Abraham Lincoln High School? Is it appropriate to simply remove an outlier of this type?

4. Do you think a model based on these data could accurately predict the number of AP tests taken at a school with 500 10th through 12th graders? Explain.

.


Chapter 8 Regression Wisdom

8-9

Statistics Quiz C – Chapter 8 - KEY California high schools use their total enrollment in 10th through 12th grade to measure how many students they should have taking AP exams. Abraham Lincoln High School in San Jose, as you can see in the graph below has a relatively large number of AP test takers (753) for its relatively small student body (1396). 1. Do you think there is a pattern? Describe the association between Enrollment and the number of AP exams taken. There is a strong, linear, positvie pattern between Enrollment and the number of AP exams taken. Abraham Lincoln High School is an outlier that is giving more exams than is typical for its size.

2. Comment on Abraham Lincoln High School. How will this school effect a regression calculation for these data? This outlier will weaken the strength of relationship, lowering r and ܴଶ . It will probably not affect the slope to a large degree, pulling the line up towards itself just a small amount.

3. Do you think the correlation would be stronger or weaker if we removed Abraham Lincoln High School? Is it appropriate to simply remove an outlier of this type? The correlation would be stronger. But we cannot remove this point just because it is an outlier. There might be more schools of this type. More research may be needed to discover how strong the relationship truly is. There are certainly many more schools in California!

4. Do you think a model based on these data could accurately predict the number of AP tests taken at a school with 500 10th through 12th graders? Explain. Extrapolation of this sort is risky. Smaller schools may be very different than bigger ones. For example, a small magnet school may have a much larger percentage of students taking AP exams.

.


8-10

Part II Exploring Relationships Between Variables

Statistics – Investigative Task

Chapter 8

Olympic Long Jumps The modern Olympic Games, a modified revival of the ancient Greek Olympian Games, were inaugurated in 1896. Since then, the Games have been held nearly every four years at various sites around the world, and have become a major international athletic competition. Based on the gold medal distances shown, write a report about the men’s long jump. In your report be sure to: DISTANCE  include appropriate graphical and numerical analyses; YEAR (meters)  discuss the trend in long jump performances, based on 1896 6.35 an appropriate linear model; 1900 7.185  explain the decisions you made in creating your model, 1904 7.34 with some historical analysis of gaps in the data and 1908 7.48 departures from the trend; 1912 7.60  predict the gold-medal-winning distance in the men’s 1920 7.15 long jump at the 2016 Games in Rio de Janeiro, with 1924 7.445 comments on your faith in that prediction. 1928 7.73 1932 7.64 1936 8.06 1948 7.825 1952 7.57 1956 7.83 1960 8.12 1964 8.07 1968 8.90 1972 8.24 1976 8.35 1980 8.54 1984 8.54 1988 8.72 1992 8.67 1996 8.50 2000 8.55 2004 8.59 2008 8.34 2012 8.31

.


Chapter 8 Regression Wisdom

Statistics – Investigative Task – Long Jump

Think

Show

Tell

Components Creates a good linear model: o uses a subset of the data, perhaps avoiding years following wars, using only recent jumps, and/or eliminating outliers o justifies modeling decisions, (including historical observations) Visual: The scatterplot… o has correct explanatory/response o is accurate and clearly labeled o shows the regression line Numerical: The analysis… o has correct r-squared o has correct slope and y-intercept o uses the proper notation Interprets the Model: o evaluates the model w/residuals o describes trend in distances o interprets the slope in context o distinguishes model from reality Makes a Prediction for 2016: o makes correct prediction from model o expresses caution based on r-sq o expresses caution about extrapolation

Chapter 8 Comments

Components are scored as Essentially correct, Partially correct, or Incorrect 1: The Model: E – Creates a good model; makes sound and well-supported decisions P – Model is somewhat flawed, or decisions are not supported I – Uses the data for all the Olympics 2: Graphs and Statistics: correct scatterplot, numerical values, equation, and notation E – Each category has all 3 requirements correct P – Each category has at least 2 of the requirements correct I – Too many errors in graph, statistics, or notation 3: Interpretation of Model: discusses residuals, trend, and slope (in context of model) E – All 4 requirements P – 2 or 3 requirements I – None or 1 4: Prediction: makes correct prediction with proper caution E – All 3 requirements P – 2 requirements I – None or 1 Scoring  E’s count 1 point, P’s are 1/2  Grade: A = 4, B = 3, etc., +/- based on rounding. (3.5 rounded to 3 is a B+)

Name ______________________________ .

Grade ___

8-11


8-12

Part II Exploring Relationships Between Variables

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Model Solution – Investigative Task – Olympic Long Jumps Since 1896, winning distances for Olympic long jumps have generally increased, with the exception of Games following the two World Wars. During those wars, no Games were held, training was interrupted, and many potential competitors may have died. In the scatterplot for all years, one can see the drop in winning jump distance following each war, as well as amazing and very unusual 1968 jump, Bob Beamon’s record-breaking Mexico City jump. But, the trends of the somewhat distant past seem to have little bearing on modern jumps. Recently, the trend has actually been a one of decreasing gold medal distance. To predict what will happen in Rio de Janeiro in 2016, I will include only those jumps from 1988 – 2012.

  40.669  0.016071(Year ) , estimates that gold medal long jump distance My model, Jump decreases an average of 0.064 meters per Olympics, with the passage of time explaining 79.8% of the variability in distance of recent jumps. According to the linear model, the gold medal distance in 2016 is expected to be 8.33 meters. The moderate value of R 2 , along with a residuals plot that may show some pattern, as well as the small number of points used in the regression, gives me little faith in the accuracy of a prediction for 2016, especially considering that the prediction is an extrapolation beyond the years present in the model.

.


Chapter 8 Regression Wisdom

Statistics - Worksheet

8-13

Graduating Classes

The table at the right shows the number of seniors who graduated from a small college during the latter half of the last century. Data was not available for all of the years, especially those longer ago. 1. Make a scatterplot and describe the trends you see in the data.

Year 1950 1953 1957 1960 1962 1965 1968 1970 1972 1976 1980 1984 1987 1989 1992 1994 1996 1998 2000

Number Graduated 203 211 288 319 381 446 439 521 509 527 476 413 399 362 379 413 437 426 451

2. These data do not show the size of the graduating class in 1969. Create an appropriate model and use it to estimate the size of that class. Explain what years you used, and why.

3. If we wanted to estimate the graduating class sizes through the year 2010, what model would you use? What aspect of these data makes you cautious about your projections?

.


8-14

Part II Exploring Relationships Between Variables

Statistics Worksheet – Graduating Classes – Key 1. Generally, the association between year and number of graduates at this small college is very strong. The number of graduates increased in a linear fashion from 1950 until the mid-70s, and then began to decrease in a linear fashion until about 1990. The number of graduates began to increase again until 2000, the last year for which the number of graduates is known. 2. In order to estimate the size of the graduating class in 1969, use only the data values from the first positive, linear trend, perhaps years from 1950 to 1976. Using this subset of data will yield a reasonable estimate, without being adversely affected by other years that have a different trend. The linear model is:   27739  8.1567(Year ) Graduate The model is quite strong, explaining 95.6% of the variability in the number of graduates. The residuals plot is unpatterned, which indicates an appropriate model. Using this model, it is estimated that there were about 466 graduates in 1969.   27739  14.3246(Year )  27739  14.3246(1969)  466 graduates. Grad

3. To estimate the size of the graduating class in 2010, use the linear model for 1989 to 2000:   15859.9  8.1567(Year ) Grad This is a strong model, explaining 90.7% of the variability in the number of graduates. The residuals plot is unpatterned, indicating an appropriate model. It is estimated that there will be about 535 graduates in 2010. We should be cautious about this model for two reasons. The model is based on only 6 points, and the prediction is an extrapolation 10 years into the future.

.


Chapter 8 Regression Wisdom

Statistics - Worksheet: The Wandering Point

8-15

Chapter 8

1. The scatterplot shows the four points (1,2), (2, 6), (4, 2) and (5,6) plotted in a 10x10 graphing window. Find the correlation and the equation of the line of best fit. r = _________

ŷ  __________________________

2. Now investigate the influence of one more point on the correlation and the slope. Try each of these as the fifth point and record the new correlation and slope. Also note whether the new point has a small or large residual. (NOTE: Add each point to the original four, one at a time, see what happens, and then remove that point. There are never more than 5 points in the plot!) Fifth Slope of the Size of the point Description Correlation regression line residual None (3,4) (8,6) (10,7) (3,8) (1,7) (8,9) (10,0)

the original four points

N/A

right in the center of the given points also on the line, but far from the other points only close to the line, but much farther away above the center of the original cluster nearby, but not consistent with the apparent pattern farther away, and also not consistent farther and stranger…

3. A point that dramatically changes the apparent slope of the regression line is called an influential point. You need to be able to spot potential influential points in a scatterplot. What should you look for?

4. Originally there were only four points here. Suppose instead that we had started with 50 points clustered in essentially the same region and displaying an association of roughly the same strength and direction. Would our fifty-first point still be as influential? Where would you locate one additional point so influential that it changed the line as dramatically as (10,0) did above?

.


8-16

Part II Exploring Relationships Between Variables

Statistics - Worksheet: The Wandering Point – Key 1. r  0.3162

yˆ  2.8  0.4 x

2. Fifth Point None (3,4) (8,6) (10,7) (3,8) (1,7) (8,9) (10,0)

Correlation 0.3162 0.3162 0.5 0.6157 0.2357 –0.0457 0.7303 –0.4888

Slope of the regression line 0.4 0.4 0.4 0.4228 0.4 –0.06 0.8 –0.3740

Size of the residual N/A 0 0 small large medium? small small

3. To spot influential points in a scatterplot, look for points that depart from the overall pattern, especially those points that are outliers in the explanatory, or x, direction. DO NOT look for points with large residuals. Influential points change the slope of the regression line, so they often have small residuals. 4. The fifty-first point would not be as influential, since the pattern is more established by 50 points than it was by 4. In order to be as influential as (10,0), an additional point would have to be much further out along the x-axis, maybe (100,0).

.


Chapter 8 Regression Wisdom

Statistics Quiz D – Chapter 8

8-17

Name __________________________

1. Linear regression was used to describe the trend in world population over time. Below is a plot of the residuals versus predicted values. What does the plot of residuals suggest? Versus Fits (response is Population (millions)) 200 150

Residual

100 50 0 -50 -100 2000

3000

4000 Fitted Value

A. An outlier is present in the data set. B. The linearity condition is not satisfied. C. A high leverage point is present in the data set. D. The data are not normal. E. The equal spread condition is not satisfied.

.

5000

6000


8-18

Part II Exploring Relationships Between Variables

2. Based on the following residual plot, which condition / assumption for linear regression is not satisfied? Versus Fits (response is y) 200

Residual

100

0

-100

-200 100

200

300

400 Fitted Value

500

600

700

A. Linearity. B. Quantitative Variables. C. Equal Spread. D. Outlier. E. None of the above; all conditions are satisfied. 3

Which statement about influential points is true? I. Removal of an influential point changes the regression line. II. Data points that are outliers in the horizontal direction are more likely to be influential than points that are outliers in the vertical direction. III. Influential points have large residuals. A. I, II, and III B. II and III C. I and II D. I and III E. I only

4. Two variables that are actually not related to each other may nonetheless have a very high correlation because they both result from some other, possibly hidden, factor. This is an example of A. leverage. B. regression. C. an outlier. D. a lurking variable. E. extrapolation.

.


Chapter 8 Regression Wisdom

8-19

5. A residuals plot is useful because I. it will help us to see whether our model is appropriate. II. it might show a pattern in the data that was hard to see in the original scatterplot. III. it will clearly identify influential points. A. I and II only B. I, II, and III C. II only D. I only E. I and III only

  3.30  0.235 (speed. can be used to predict the stopping distance (in 6. The model distance feet. for a car traveling at a specific speed (in mph.) According to this model, about how much distance will a car going 65 mph need to stop? A. 729.0 feet B. 345.0 feet C. 27.0 feet D. 4.3 feet E. 18.6 feet 7. Which of the following is not a source of caution in regression between two variables? A. subgroups with differences B. extrapolation C. a lurking variable D. an outlier E. all of these are potential problems. 8. Which statement about re-expressing data is not true? I. Unimodal distributions that are skewed to the left will be made more symmetric by taking the square root of the variable. II. A curve in which the direction of the association changes from negative to positive will not benefit from re-expression. III. One goal of re-expression may be to make the variability of the response variable more uniform. A. I only B. III only C. II only D. II and III E. I, II, and III

.


8-20

Part II Exploring Relationships Between Variables

9. Over the past decade a farmer has been able to increase his wheat production by about the same number of bushels each year. His most useful predictive model is probably… A. linear B. logarithmic C. quadratic D. exponential E. power 10. Another farmer has increased his wheat production by about the same percentage each year. His most useful predictive model is probably… A. linear B. exponential C. logarithmic D. quadratic E. power

.


Chapter 8 Regression Wisdom

Statistics Quiz D – Chapter 8 – Key 1. B 2. C 3. C 4. D 5. A 6. B 7. E 8. E 9. A 10. B

.

8-21


8-22

Part II Exploring Relationships Between Variables

Statistics Quiz E – Chapter 8

Name __________________________

1. Which is true? I. Random scatter in the residuals indicates a model with high predictive power. II. If two variables are very strongly associated, then the correlation between them will be near +1.0 or –1.0. III. The higher the correlation between two variables the more likely the association is based in cause and effect. A. I, II, and III B. none C. I only D. I and II E. II only 2. When using midterm exam scores to predict a student's final grade in a class, the student would prefer to have a A. negative residual, because that means the student's final grade is lower than we would predict with the model. B. residual equal to zero, because that means the student's final grade is exactly what we would predict with the model. C. positive residual, because that means the student's final grade is lower than we would predict with the model. D. positive residual, because that means the student's final grade is higher than we would predict with the model. E. negative residual, because that means the students final grade is higher than we would predict with the model. 3. Which of the following is not a goal of re-expressing data? A. Make the scatter in a scatterplot spread out evenly rather than following a fan shape. B. Make the spread of several groups more alike. C. All of these are goals of re-expressing data. D. Make the form of a scatterplot more nearly linear. E. Make the distribution of a variable more symmetric. 4. Which statement about correlation is true? I. Regression based on data that are summary statistics tends to result in a higher correlation. II. If r = 0.95, the response variable increases as the explanatory variable increases. III. An outlier always decreases the correlation. A. I only B. II only C. III only D. I, II, and III E. none of these

.


Chapter 8 Regression Wisdom

8-23

5. Which statement about residuals plots is true? I. A curved pattern indicates nonlinear association between the variables. II. A pattern of increasing spread indicates the predicted values become less reliable as the explanatory variable increases. III. Randomness in the residuals indicates the model will predict accurately. A. I only B. II only C. I, II, and III D. I and III only E. I and II only 6. The

s tr  12  20  dia  can be used to predict the breaking strength of a rope (in pounds)

from its diameter (in inches). According to this model, how much force should a rope onehalf inch in diameter be able to withstand? A. 16 lbs B. 22 lbs C. 4.7 lbs D. 256 lbs E. 484 lbs 7. A scatterplot of log(Y) vs. log(X) reveals a linear pattern with very little scatter. It is probably true that … A. the calculator's LnReg function will model the association between X and Y. B. the residuals plot for regression of Y on X shows a curved pattern. C. the correlation between X and Y is near 0. D. the correlation between X and Y is near +1. E. the scatterplot of Y vs X shows a linear association. 8. If a data point is influential it… A. has a small residual. B. is guaranteed to be extreme in the vertical direction. C. is guaranteed to be extreme in the horizontal direction. D. will change the slope of the regression equation. E. none of these. 9. A residual plot that has no pattern is a sign that… A. the original data is straight and the regression line is a good model. B. the original data is straight and the regression line is not a good model. C. the model is not a good one, because there is no pattern. D. the original data is curved and the regression line is not a good model. E. the original data is curved and the regression line is a good model.

.


8-24

Part II Exploring Relationships Between Variables

10. In predicting the growth of the volume of a small bay by measuring the height of the water at   2.34  4.56 (height) where height is a dock, a researcher is using a model of 3 volume measured in m and volume cubic miles. If the height rises to 3.45 m, what is the predicted volume? A. 2.62 m3 B. 1.2 × 1012 m3 C. 5902 m3 D. 7 × 107 m3 E. 18.1 m3

.


Chapter 8 Regression Wisdom

Statistics Quiz E – Chapter 8 – Key 1. B 2. D 3. C 4. A 5. E 6. E 7. B 8. D 9. D 10. C

.

8-25



Chapter 9 Multiple Regression Solutions to Class Examples See Instructor’s Guide for solutions Included are two quizzes. One quiz uses the computer output for students to interpret. The other version of the quiz includes the data would need to be conducted in a lab setting with software.

.


9-2

Part II Exploring Relationships Between Variables

Statistics Quiz A – Chapter 9

Name __________________________

1. There are different ways to predict how many games a baseball team will win. The most obvious is to look at scoring Runs. To win a game you have to score more Runs than the other team. The other statistic is ERA, earned run average. This measures how many runs your pitchers allow the other team to score, on average, per game. Response Variable: Wins Predictor Variable(s): Runs, ERA R-squared: 0.824 Parameter Constant Runs ERA

Estimate 105.70727 0.069896411 -18.409312

Std. Err. 12.63829 0.014380364 1.8603859

T-Stat P-value 8.364049 <0.0001 4.8605454 <0.0001 -9.8954265 <0.0001

a. Write down the regression equation.

b. Interpret the value of ܴଶ in this equation.

c. Interpret the coefficient of Runs in this regression.

d. Explain why the coefficient of ERA is negative.

.


Chapter 9 Multiple Regression

9-3

e. Compute the predicted number of Wins for a team with 700 Runs and an ERA of 4.

f. Here is a plot of the residuals vs. predicted values. What does this plot tell you about the conditions for this regression?

g. The largest residual is the Texas Rangers. They were predicted to win about 80 games and had a residual of 7.86. What does this tell you about the Rangers season?

.


9-4

Part II Exploring Relationships Between Variables

Statistics Quiz A – Chapter 9 Key 1. There are different ways to predict how many games a baseball team will win. The most obvious is to look at scoring Runs. To win a game you have to score more Runs than the other team. The other statistic is ERA, earned run average. This measures how many runs your pitchers allow the other team to score, on average, per game. Response Variable: Wins Predictor Variable(s): Runs, ERA R-squared: 0.824 Parameter Constant Runs ERA

Estimate 105.70727 0.069896411 -18.409312

Std. Err. 12.63829 0.014380364 1.8603859

T-Stat P-value 8.364049 <0.0001 4.8605454 <0.0001 -9.8954265 <0.0001

a. Write down the regression equation.   105.707  0.0699 Runs  18.409 ERA Wins

b. Interpret the value of ܴଶ in this equation. 82.4% of the variability in team Wins has been successfully explained by using this model.

c. Interpret the coefficient of Runs in this regression. For every one more Run a team scores, after taking into account the ERA, a team is predicted to win about 0.07 more games.

d. Explain why the coefficient of ERA is negative. If the pitchers on a team allow more runs, their team will win less often.

.


Chapter 9 Multiple Regression

9-5

d. Compute the predicted number of Wins for a team with 700 Runs and an ERA of 4.

81 wins predicted = 105.707 + 0.0699(700) + –18.409(4)

e. Here is a plot of the residuals vs. predicted values. What does this plot tell you about the conditions for this regression? The residual plot has a random scatter. The residuals do not appear to thicken. This regression appears to be appropriate.

f. The largest residual is the Texas Rangers. They were predicted to win about 80 games and had a residual of 7.86. What does this tell you about the Rangers season? It tells us that the Rangers won 87 or 88 games. That is, they performed better than they should have, given the ERA and Runs scored by the team.

.


9-6

Part II Exploring Relationships Between Variables

Statistics Quiz B – Chapter 9

Name __________________________

1. There are different ways to predict how many games a baseball team will win. The most obvious is to look at scoring Runs. To win a game you have to score more Runs than the other team. The other statistic is ERA, earned run average. This measures how many runs your pitchers allow the other team to score, on average, per game. Use the data to build a model that predicts team Wins from ERA and Runs. a. Write down the regression equation. Team Wins Toronto Blue Jays New York Yankees Baltimore Orioles Tampa Bay Rays Boston Red Sox Kansas City Royals Minnesota Twins Cleveland Indians Chicago White Sox Detroit Tigers Texas Rangers Houston Astros Los Angeles Angels Seattle Mariners Oakland Athletics New York Mets Washington Nationals Miami Marlins Atlanta Braves Philadelphia Phillies St. Louis Cardinals Pittsburgh Pirates Chicago Cubs Milwaukee Brewers Cincinnati Reds Los Angeles Dodgers San Francisco Giants Arizona Diamondbacks San Diego Padres Colorado Rockies

b. Interpret the value of ܴଶ in this equation.

c. Interpret the coefficient of Runs in this regression.

d. Explain why the coefficient of ERA is negative.

.

R 93 87 81 80 78 95 83 81 76 74 88 86 85 76 68 90 83 71 67 63 100 98 97 68 64 92 84 79 74 68

ERA 891 764 713 644 748 724 696 669 622 689 751 729 661 656 694 683 703 613 573 626 647 697 689 655 640 667 696 720 650 737

3.8 4.05 4.05 3.74 4.31 3.73 4.07 3.67 3.98 4.64 4.24 3.57 3.94 4.16 4.14 3.43 3.62 4.02 4.41 4.69 2.94 3.21 3.36 4.28 4.33 3.44 3.72 4.04 4.09 5.04


Chapter 9 Multiple Regression

9-7

e. Compute the predicted number of Wins for a team with 700 Runs and an ERA of 4.

f. Construct an appropriate plot(s) to check the conditions for regression. What can you say about the conditions for this regression?

g. The largest residual is the Texas Rangers. They were predicted to win about 80 games and had a residual of 7.86. What does this tell you about the Rangers season?

.


9-8

Part II Exploring Relationships Between Variables

Statistics Quiz B – Chapter 9 Key 1. There are different ways to predict how many games a baseball team will win. The most obvious is to look at scoring Runs. To win a game you have to score more Runs than the other team. The other statistic is ERA, earned run average. This measures how many runs your pitchers allow the other team to score, on average, per game. Use the data to build a model that predicts team Wins from ERA and Runs. a. Write down the regression equation.

Team Wins Toronto Blue Jays New York Yankees Baltimore Orioles Tampa Bay Rays Boston Red Sox Kansas City Royals Minnesota Twins Cleveland Indians Chicago White Sox Detroit Tigers Texas Rangers Houston Astros Los Angeles Angels Seattle Mariners Oakland Athletics New York Mets Washington Nationals Miami Marlins Atlanta Braves Philadelphia Phillies St. Louis Cardinals Pittsburgh Pirates Chicago Cubs Milwaukee Brewers Cincinnati Reds Los Angeles Dodgers San Francisco Giants Arizona Diamondbacks San Diego Padres Colorado Rockies

  105.707  0.0699 Runs  18.409 ERA Wins

b. Interpret the value of ܴଶ in this equation.

82.4% of the variability in team Wins has been successfully explained by using this model.

c. Interpret the coefficient of Runs in this regression. For every one more Run a team scores, after taking into account the ERA, a team is predicted to win about 0.07 more games.

d. Explain why the coefficient of ERA is negative. If the pitchers on a team allow more runs, their team will win less often.

.

R 93 87 81 80 78 95 83 81 76 74 88 86 85 76 68 90 83 71 67 63 100 98 97 68 64 92 84 79 74 68

ERA 891 764 713 644 748 724 696 669 622 689 751 729 661 656 694 683 703 613 573 626 647 697 689 655 640 667 696 720 650 737

3.8 4.05 4.05 3.74 4.31 3.73 4.07 3.67 3.98 4.64 4.24 3.57 3.94 4.16 4.14 3.43 3.62 4.02 4.41 4.69 2.94 3.21 3.36 4.28 4.33 3.44 3.72 4.04 4.09 5.04


Chapter 9 Multiple Regression

9-9

e. Compute the predicted number of Wins for a team with 700 Runs and an ERA of 4.

81 wins predicted = 105.707 + 0.0699(700) + –18.409(4)

f. Construct an appropriate plot(s) to check the conditions for regression. What can you say about the conditions for this regression?

Numerous plots are possible, depending on software (residual plot is offered here). The residual plot has a random scatter. The residuals do not appear to thicken. This regression appears to be appropriate

g. The largest residual is the Texas Rangers. They were predicted to win about 80 games and had a residual of 7.86. What does this tell you about the Rangers season? It confirms that the Rangers won 88 games. That is, they performed better than they should have, given the ERA and Runs scored by the team.

.


9-10

Part II Exploring Relationships Between Variables

Statistics Quiz D – Chapter 9

Name __________________________

1. The problem of collinearity occurs when A. there is an influential observation in the data set. B. at least one predictor var. has a nonlinear relationship with the response variable. C. two or more predictor variables are linearly related to each other. D. more than one predictor variable is linearly related to the response variable. E. none of these 2. Which of the following are NOT characteristics of a good regression model? A. a relatively high B. a relatively low value of s (the standard deviation of the residuals) C. relatively few predictor variables D. relatively small p-values for the F- and t-statistics E. All of these are characteristics of a good regression model. 3. In regression an observation has high leverage when A. the observation has a combination of x-values that is far from the center of the data. B. the observation is perfectly predicted by the regression. C. the observation is poorly predicted by the regression. D. removing the observation causes a large change in one of more coefficients of the model. E. none of these 4. A sample of 33 companies was randomly selected and data collected on the average annual bonus, turnover rate (%), and trust index (measured on a scale of 0 – 100). According to the output is shown below, what is the estimated multiple regression model? Response Variable is Turnover Rate Predictor Constant Trust Index Average Bonus

Coef 12.1005 -0.07149 -0.0007216

SE Coef 0.7826 0.01966 0.0001481

T 15.46 -3.64 -4.87

P 0.000 0.001 0.000

A. Turnover Rate = 0.01966 Trust Index + 0.0001481 Average Bonus B. Turnover Rate = 0.7826 + 0.01966 Trust Index + 0.0001481 Average Bonus C. Turnover Rate = 12.1005 – 0.07149 Trust Index – 0.0007216 Average Bonus D. Turnover Rate = 12.1005 + 0.07149 Trust Index + 0.0007216 Average Bonus E. Turnover Rate = 15.46 – 3.64 Trust Index – 4.87 Average Bonus

.


Chapter 9 Multiple Regression

9-11

5. A sample of 33 companies was randomly selected and data collected on the average annual bonus, turnover rate (%), and trust index (measured on a scale of 0 – 100). Based on the output, how much of the variability in Turnover Rate is explained by the estimated multiple regression model? Response Variable is Turnover Rate Predictor Constant Trust Index Average Bonus

A. B. C. D. E.

Coef 12.1005 -0.07149 -0.0007216

SE Coef 0.7826 0.01966 0.0001481

T 15.46 -3.64 -4.87

P 0.000 0.001 0.000

78.3% 79.6% 12.1% 95.4% None of the above.

6. A sample of 33 companies was randomly selected and data collected on the average annual bonus, turnover rate (%), and trust index (measured on a scale of 0 – 100). In a multiple regression estimating turnover rate using average bonus and trust index, what is the correct null hypotheses for testing the regression coefficient of Trust Index? Response Variable is Turnover Rate Predictor Constant Trust Index Average Bonus

Coef 12.1005 -0.07149 -0.0007216

SE Coef 0.7826 0.01966 0.0001481

T 15.46 -3.64 -4.87

P 0.000 0.001 0.000

A. β TI ≠ 0 B. T TI > 0 C. β TI = 0 D. β TI < 0 E. T TI ≤ 0 7. A sample of 33 companies was randomly selected and data collected on the average annual bonus, turnover rate (%), and trust index (measured on a scale of 0 – 100). According to the output below, what is the F statistic to determine the overall significance of the estimated is? Analysis of Variance Source Regression Residual Error Total

DF 2 30 32

SS 262.73 67.27 330.00

MS 131.36 2.24

A. 58.64 B. 1.497 C. 131.36 D. 78.3 E. 2.24

.


9-12

Part II Exploring Relationships Between Variables

8. A sample of 33 companies was randomly selected and data collected on the average annual bonus, turnover rate (%), and trust index (measured on a scale of 0 – 100). Using the output below, and a significance level of α = .01, we can conclude that Response Variable is Turnover Rate Predictor Constant Trust Index Average Bonus

Coef 12.1005 -0.07149 -0.0007216

S = 1.49746

R-Sq = 79.6%

SE Coef 0.7826 0.01966 0.0001481

T 15.46 -3.64 -4.87

P 0.000 0.001 0.000

R-Sq(adj) = 78.3%

Analysis of Variance Source Regression Residual Error Total

DF 2 30 32

SS 262.73 67.27 330.00

MS 131.36 2.24

A. The multiple regression model is significant overall. B. Trust Index is a significant independent variable in explaining turnover rate. C. Average Annual Bonus is a significant independent variable in explaining turnover rate. D. The predictor Constant is a significant independent variable in explaining turnover rate. E. All of these. 9. Using the output below, calculate the predicted turnover rate for a company having a trust index score of 70 and an average annual bonus of $6500. Response Variable is Turnover Rate Predictor Constant Trust Index Average Bonus

Coef 12.1005 -0.07149 -0.0007216

SE Coef 0.7826 0.01966 0.0001481

T 15.46 -3.64 -4.87

P 0.000 0.001 0.000

A. 3.5% B. 4.2% C. 1.9% D. 2.4 % E. None of the above. 10. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. A multiple regression model was fit to the data and the graph of residuals shows a unimodal and symmetric pattern. What does this graph suggest? A. the nearly normal assumption is satisfied. B. the randomization condition is satisfied. C. the linearity condition is satisfied. D. both A and B. E. both A and C.

.


Chapter 9 Multiple Regression

Statistics Quiz D – Chapter 9 Key 1. C 2. E 3. A 4. C 5. B 6. C 7. A 8. A 9. D 10. A

.

9-13


9-14

Part II Exploring Relationships Between Variables

Statistics Quiz E– Chapter 9

Name __________________________

1. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. The regression coefficient for Price was found to be -0.03055, which of the following is the correct interpretation for this value? A. Increasing the price of the Sony Bravia by $100 will result in at least 3 fewer TV’s sold. B. For a given amount spent on advertising, a $100 increase in price of the Sony Bravia is associated with a decrease in sales of 3.055 units, on average. C. Holding the amount spent on advertising constant, an increase of $100 in the price of the Sony Bravia will decrease sales by 3.055 units. D. Holding the amount spent on advertising constant, an increase of $100 in the price of the Sony Bravia will decrease sales by .03%. E. None of the above. 2. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. The plot of residuals versus predicted values is shown below. What does the residual plot suggest?

A. The Linearity condition is not satisfied. B. There is an extreme departure from normality. C. The variance is not constant. D. The presence of a couple of outliers. E. The plot thickens from left to right. 3. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. The adjusted R2 value was reported as 83.3%. This means that A. Selling price and amount spent advertising do not describe TV sales well. B. 83.3% of the variance in TV sales can be accounted for by selling price. C. 83.3% of the variance in TV sales can be accounted for by amount spent advertising. D. 83.3% of the variance in TV sales can be accounted for by the model including both selling price and amount spent advertising. E. Both selling price and amount spent advertising are significant predictors of TV sales.

.


Chapter 9 Multiple Regression

9-15

4. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. The correct null and alternative hypotheses for testing the regression coefficient of Price is A. H0 : βP ≠ 0 vs. HA : β P = 0 B. H0 : βP ≥ 0 vs. HA : β P < 0 C. H0 : βP ≤ 0 vs. HA : β P > 0 D. H0 : βP = 0 vs. HA : β P ≠ 0 E. H0 : The regression is not significant vs. HA: The regression is significant. 5. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. According to the output below, the calculated t-statistic to determine if amount spent on advertising is a significant independent variable in explaining Sony Bravia sales is Response Variable is Sales Predictor Constant Price Advertising

Coef 90.19 -0.03055 3.0926

SE Coef 25.08 0.01005 0.3680

T 3.60 -3.04 8.40

P 0.001 0.005 0.000

A. 3.60 B. –3.04 C. 8.40 D. 10.61 E. None of the above 6. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. Using the output below, estimate the number of units sold on average at a store that sells the Sony Bravia for $2199 and spends 10% of its advertising budget on the product. Response Variable is Sales Predictor Constant Price Advertising

Coef 90.19 -0.03055 3.0926

SE Coef 25.08 0.01005 0.3680

T 3.60 -3.04 8.40

P 0.001 0.005 0.000

A. 53.94 units B. 120 units C. 66.54 units D. 90.34 units E. None of the above.

.


9-16

Part II Exploring Relationships Between Variables

7. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. Using the output below, calculated F statistic to determine the overall significance of the estimated multiple regression model is Analysis of Variance Source Regression Residual Error Total

DF 2 27 29

SS 16477.3 3038.0 19515.4

MS 8238.7 112.5

A. 10.61 B. 73.23 C. 112.5 D. 3.60 E. None of the above 8. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. Use the output shown below, calculate the amount of variability in Sales is explained by the estimated multiple regression model. Analysis of Variance Source Regression Residual Error Total

DF 2 27 29

SS 16477.3 3038.0 19515.4

MS 8238.7 112.5

A. 15.57% B. 6.90% C. 84.43% D. 29% E. None of the above.

.


Chapter 9 Multiple Regression

9-17

9. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. Based on the output below, which of the following statements is/are true? Response Variable is Sales Predictor Constant Price Advertising

Coef 90.19 -0.03055 3.0926

SE Coef 25.08 0.01005 0.3680

S = 10.6075

R-Sq = 84.4%

T 3.60 -3.04 8.40

P 0.001 0.005 0.000

R-Sq(adj) = 83.3%

Analysis of Variance Source Regression Residual Error Total

DF 2 27 29

SS 16477.3 3038.0 19515.4

MS 8238.7 112.5

A. The multiple regression model is significant overall. B. Selling Price is a significant independent variable in explaining Bravia sales. C. Amount Spent on Advertising is a significant independent variable in explaining Bravia sales. D. Only A and B E. A, B and C 10. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. If the scatterplots of sales versus both selling price and amount spent on advertising are created, they are being used to determine whether A. the nearly normal assumption is satisfied. B. the randomization condition is satisfied. C. the linearity condition is satisfied. D. both A and B. E. both A and C.

.


9-18

Part II Exploring Relationships Between Variables

Statistics Quiz E – Chapter 9 Key 1. C 2. D 3. D 4. D 5. C 6. A 7. B 8. C 9. E 10. C

.


Statistics Test A – Exploring Relationships Between Variables

Name _________________

__ 1.

Researchers studying growth patterns of children collect data on the heights of fathers and sons. The correlation between the fathers’ heights and the heights of their 16 year-old sons is most likely to be: A) near –1.0 B) near 0 C) near +0.7 D) exactly +1.0 E) somewhat greater than 1.0

__ 2.

The auto insurance industry crashed some test vehicles into a cement barrier at speeds of 5 to 25 mph to investigate the amount of damage to the cars. They found a correlation of r = 0.60 between speed (MPH) and damage ($). If the speed at which a car hit the barrier is 1.5 standard deviations above the mean speed, we expect the damage to be _?__ the mean damage. A) equal to B) 0.36 SD above C) 0.60 SD above D) 0.90 SD above E) 1.5 SD above

__ 3.

Which scatterplot shows a strong association between two variables even though the correlation is probably near zero?

A)

B)

C)

D)

E)

__ 4.

The correlation between X and Y is r = 0.35. If we double each X value, decrease each Y by 0.20, and interchange the variables (put X on the Y-axis and vice versa), the new correlation A) is 0.35 B) is 0.50 C) is 0.70 D) is 0.90 E) cannot be determined.

__ 5.

A consumer group collected information on HDTVs . They created a linear model to estimate the cost of an HDTV (in $) based on the screen size (in inches). Which is the most likely value of the slope of the line of best fit? A) 0.70 B) 7 C) 70 D) 700 E) 7000

__ 6.

A multiple regression model uses the height and weight of young child to predict his shoe size, ܴଶ = 62%. A correct interpretation of this value would be… A) 62% of the change in shoe size is caused by height and weight. B) The regression model successfully explains 62% of the variation in shoe size. C) The regression predictions are 62% accurate. D) 62% of points follow the regression equation. E) None of these are correct.

__ 7.

In the multiple regression problem in exercise 6, which conditions would the researcher want to check? I. The scatterplot has no pattern. II. The residual plot has no pattern. III. The residuals do not become thicker. A) none

__ 8.

B) I only

C) II only

D) III only

E) II and III only

Education research consistently shows that students from wealthier families tend to have higher SAT scores. The slope of the line that predicts SAT score from family income is 6.25 points per $1000, and the correlation between the variables is 0.48. Then the slope of the line that predicts family income from SAT score (in $1000 per point) … A) is 0.037 B) is 0.16 C) is 3.00 D) is 6.25 E) is 13.02.

.


Part II-2 __ 9.

Part II Exploring Relationships Between Variables

A regression analysis of company profits and the amount of money the company spent on 2 advertising found r  0.72 . Which of these is true? I. This model can correctly predict the profit for 72% of companies. II. On average, about 72% of a company’s profit results from advertising. III. On average, companies spend about 72% of their profits on advertising. A) none B) I only C) II only D) III only E) I and III

__ 10. A least squares line of regression has been fitted to a scatterplot; the model’s residuals plot is shown. Which is true? A) The linear model is appropriate. B) The linear model is poor because some residuals are large. C) The linear model is poor because the correlation is near 0. D) A curved model would be better. E) None of the above. 11. Earning power A college’s job placement office collected data about students’ GPAs and the salaries they earned in their first jobs after graduation. The mean GPA was 2.9 with a standard deviation of 0.4. Starting salaries had a mean of $47200 with a SD of $8500. The correlation between the two variables was r = 0.72. The association appeared to be linear in the scatterplot. (SHOW WORK) a. Write an equation of the model that can predict salary based on GPA.

________________________________ b. Do you think these predictions will be reliable? Explain.

c. Your brother just graduated from that college with a GPA of 3.30. He tells you that based on this model the residual for his pay is -$1880. What salary is he earning? ____________

12. Assembly line Your new job at Panasony is to do the final assembly of camcorders. As you learn how, you get faster. The company tells you that you will qualify for a raise if after 13 weeks your assembly time averages under 20 minutes. The data shows your average assembly time during each of your first 10 weeks. a. Which is the explanatory variable?

________________________

b. What is the correlation between these variables? _____________ c. You want to predict whether or not you will qualify for that raise. Would it be appropriate to use a linear model? Explain. .

Week 1 2 3 4 5 6 7 8 9 10

Time(min) 43 39 35 33 32 30 30 28 26 25


Review of Part II: Exploring Relationships Between Variables

Part II-3

13. Associations For each pair of variables, indicate what association you expect: positive(+), negative(-), curved(C), or none(N). ___ ___ ___ ___ ___

power level setting of a microwave number of days it rained in a month (during the summer) number of hours a person has been up past a normal bedtime number of hockey games played in Minnesota during a week length of a student’s hair

number of minutes it takes to boil water number of times you mowed your lawn that month number of minutes it takes the person to do a crossword puzzle sales of suntan lotion in Minnesota during that week number of credits the student earned last year

14. Music and grades (True Story) A couple of years ago, a local newspaper published research results claiming a positive association between the number of years high school children had taken instrumental music lessons and their performances in school (GPA). a. What does “positive association” mean in this context?

b. A group of parents then went to the School Board demanding more funding for music programs as a way to improve student chances for academic success in high school. As a statistician, do you agree or disagree with their reasoning? Explain briefly.

15. AP Scores Using the data from over 1500 California high schools, we construct a model that uses the average SAT and ACT math scores to predict the percentage of students that will pass their AP tests with a score of 3 or 4 or 5. Response Variable: PctAP345 Parameter Estimate Std. Err. Alternative DF T-Stat P-value Intercept -77.289013 2.862604 ≠ 01012-26.999548<0.0001 AvgScrMath SAT 0.188070930.019336307 ≠ 01012 9.7263108<0.0001 AvgScrMath ACT 1.6279173 0.38526582 ≠ 01012 4.2254392<0.0001 a. What is the regression equation?

b. Interpret the coefficient of AvgScrACT in context.

.


Part II-4

Part II Exploring Relationships Between Variables

16. Crawling Researchers at the University of Denver Infant Study Center investigated whether babies take longer to learn to crawl in cold months (when they are often bundled in clothes that restrict their movement) than in warmer months. The study sought an association between babies' first crawling age (in weeks) and the average temperature during the month they first try to crawl (about 6 months after birth). Between 1988 and 1991 parents reported the birth month and age at which their child was first able to creep or crawl a distance of four feet in one minute. Data were collected on 208 boys and 206 girls. The graph below plots average crawling ages (in weeks) against the mean temperatures when the babies were 6 months old. The researchers found a correlation of r = -0.70 and their line of best fit was  AvAge  36  0.08 AvTemp .

a. Draw the line of best fit on the graph. (Show your method clearly.)

b. Describe the association in context.

c. Explain (in context) what the slope of the line means.

d. Explain (in context) what the y-intercept of the line means.

e. Explain (in context) what r 2 means.

f.

In this context, what does a negative residual indicate?

.


Review of Part II: Exploring Relationships Between Variables

Part II-5

Statistics Test A – Exploring Relationships Between Variables – Key 1. C

2. D

3. C

4. A

5. C

6. B

7. E

8. A

9. A

10. A

11. Earning power a. $̂  2830  15300gpa b. Somewhat reliable; based on this model, differences in GPA explain only 52% of the variability in salaries. c. $51,440 12. Assembly line a. Weeks worked b. r = – 0.97 c. No. The residuals plot shows a distinct curve, and predictions about what will happen three weeks in the future are likely to be unreliable. 13. Associations –, C, +, –, N 14. Music and grades. a. In general, kids who studied music longer had higher GPA’s. b. Disagree; association does not mean cause and effect. Perhaps the greater parental commitment that supports music lessons also encourages higher grades. (or higher SES enhances both, or people who are better students anyway take music, etc.) 15. AP scores a. PctAP345 = -77.29 + 0.188 AvgScrMath SAT + 1.628 AvgScrMath ACT b. For every one more point a student scores on the ACT Math test, after taking into account SAT math scores, we predict a 1.6% increase in AP pass rate. 16. Crawling a. Plot 2 points; for example (30,33.6) and (70,30.4) b. The association is linear, moderately strong, and negative, with one outlier. Children seem to crawl earlier when the temperature is higher, though there was an unusually early age observed for a temperature just above 50°. c. The model suggests that, on average, babies crawl 0.8 weeks earlier for every 10° higher the temperature is. d. The model predicts that at a temperature of 0° babies would crawl at an average of 36 weeks old (though this may not mean much as no data were collected at such cold temperatures.) e. 49% of the variability in crawling age can be explained by variations in temperature. f. A negative residual would indicate that babies crawled at a younger age than the model predicted.

.


Part II-6

Part II Exploring Relationships Between Variables

Statistics Test B – Exploring Relationships Between Variables

Name _________________

___ 1. It takes a while for new factory workers to master a complex assembly process. During the first month new employees work, the company tracks the number of days they have been on the job and the length of time it takes them to complete an assembly. The correlation is most likely to be A) exactly –1.0 B) near -0.6 C) near 0 D) near +0.6 E) exactly +1.0 ___ 2. A lakeside restaurant found the correlation between the daily temperature and the number of meals they served to be 0.40. On a day when the temperature is two standard deviations above the mean, the number of meals they should plan on serving is _?_ the mean. A) equal to B) 0.16 SD above C) 0.4 SD above D) 0.8 SD above E) 2.0 SD above ___ 3. For families who live in apartments the correlation between the family’s income and the amount of rent they pay is r = 0.60. Which is true? I. In general, families with higher incomes pay more in rent. II. On average, families spend 60% of their income on rent. III. The regression line passes through 60% of the (income$, rent$) data points. A) I only

B) II only

C) I and II only

D) I and III only

E) I, II, and III

___ 4. The residuals for a regression model should be… I. unimodal and symmetric. II. have a strong linear pattern. III. should get thicker as the predictive values increase. A) none B) I only C) II only D) III only

E) I and II

___ 5. Variables X and Y have r = 0.40. If we decrease each X value by 0.1, double each Y value, and then interchange them (put X on the Y- axis and vice versa) the new correlation will be A) –0.40 B) 0.15 C) 0.40D) 0.60 E) 0.80 ___ 6. The residuals plot for a linear model is shown. Which is true? A) The linear model is okay because approximately the same number of points are above the line as below it. B) The linear model is okay because the association between the two variables is fairly strong. C) The linear model is no good since the correlation is near 0. D) The linear model is no good since some residuals are large. E) The linear model is no good because of the curve in the residuals.

.


Review of Part II: Exploring Relationships Between Variables

Part II-7

___ 7. In checking conditions for multiple regression, which of these should be verified? I. A scatterplot of the variables is straight. II. The residual plot is straight. III. The residual plot shows no thickening. A) none B) I only

C) III only

D) I and II

E) I and III

___ 8. Suppose we collect data hoping to be able to estimate the prices of commonly owned new cars (in $) from their lengths (in feet). Of these possibilities, the slope of the line of best fit is most likely to be A) 3

B) 30

C) 300

D) 3000

E) 30000

___ 9. Medical records indicate that people with more education tend to live longer; the correlation is 0.48. The slope of the linear model that predicts lifespan from years of education suggests that on average people tend to live 0.8 extra years for each additional year of education they have. The slope of the line that would predict years of education from lifespan is A) 0.288 ___ 10.

B) 0.384

C) 0.8

D) 1.25 E) 1.67

Response variable is Income This regression analysis examines the relationship between the number of years of formal education a R-squared = 25.8% person has and their annual income. According to s = 3888 with 57 degrees of freedom this model, about how much more money do Variable Coefficient s.e. of Coeff people who finish a 4-year college program earn Constant 3984.45 6600 each year, on average, than those with only a 2Education 2668.45 600.1 year degree? A) $2006 B) $2710 C) $5337 D) $7968 E) $9321

11. Associations For each pair of variables, indicate what association you expect: positive(+), negative(-), curved(C), or none(N). ___ a. the number of miles a student lives from school; the student’s GPA ___ b.

a person’s blood alcohol level; time it takes the person to solve a maze

___ c.

weekly sales of hot chocolate at a Montana diner; the number of auto accidents that week in that town

___ d.

the price charged for fund-raising candy bars; number of candy bars sold

___ e.

the amount of rainfall during growing season; the crop yield (bushels per acre)

.


Part II-8

Part II Exploring Relationships Between Variables

12. Email At CPU every student gets a college email address. Data collected by the college showed a negative association between student grades and the number of emails the student sent during the semester. a. Briefly explain what “negative association” means in this context.

b. After seeing this study the college proposes trying to improve academic performance by limiting the amount of email students can send through the college address. As a statistician, what do you think of this plan? Explain briefly.

13. Car commercials A car dealer investigated the association between the number of TV commercials he ran each week and the number of cars he sold the following weekend. He also used the number of ads he places in newspaper to predict his sales. He found the following regression model. Response Variable: CarsSold Parameter Intercept NumTVAds NumNewsPrAds

Estimate –5.678 0.148 0.2379

a. Write this regression equation.

b. Interpret the coefficient for the NumTVAds in context.

c. If the dealer runs 20 TV ads and 45 newspaper ads, what do you expect his sales to be?

.


Review of Part II: Exploring Relationships Between Variables

Part II-9

14. Taxi tires A taxi company monitoring the safety of its cabs kept track of the number of miles tires had been driven (in thousands) and the depth of the tread remaining (in mm). Their data are displayed in the scatterplot. They found the equation of the least squares regression line to be

  36  0.6 miles , with r 2  0.74 . tread a. Draw the line of best fit on the graph. (Show your method clearly.)

b. What is the explanatory variable? __________________ c. The correlation r = ____ d. Describe the association in context.

e. Explain (in context) what the slope of the line means.

f.

Explain (in context) what the y - intercept of the line means.

g. Explain (in context) what r 2 means.

h. In this context, what does a negative residual mean?

.


Part II-10

Part II Exploring Relationships Between Variables

Statistics Test B – Exploring Relationships Between Variables – Key 1. B

2. D

3. A

11. Associations

4. B

5. C

6. E

7. E

8. D

9. A

10. C

A—none; B—positive(+); C—positive(+); D—negative(-); E—curved

12. Email a. Negative association implies that students who sent out more emails during the semester tended to have lower grades. b. This plan assumes that association means cause and effect. The college incorrectly proposes to limit emails through the college address as a way of increasing student grades. Perhaps students with bad grades console themselves by emailing friends.

13. Car commercials a. CarsSold = –5.678 + 0.148(NumTVAds) + 0.2379(NumNewsPrAds) b. For every one more TV ad, after taking into account the number of newspaper ads, we 0.148 more cars sold. c. About 8.

14. Taxi tires a. Draw the line of best fit on the graph. With 25000 miles, With 45,000 miles 36 – 0.6(25) = 21 36 – 0.6(45) = 9 (25, 21) (45, 9) Model goes through points: (25, 21) and (45, 9). b. The explanatory variable is the number of miles tires had been driven (in thousands). c. The correlation must have the same sign as the slope. r  r 2  0.74  0.86 d. The association between the number of miles tires have been driven (in thousands) and the tire tread depth (in mm) is a moderately strong negative linear relationship. Tires with more miles tend to have lower tread depth. (In this model, the tire tread is expected to be an additional 0.6 mm lower for every additional 1000 miles the tires have been driven.) One tire had unusually deep tread for the number of miles driven. e. This model suggests that for every additional 1000 miles the tires are driven, the depth of the tire tread will decrease by 0.6 mm, on average. f. The model predicts that brand new tires (number of miles equals zero) have tread averaging 36 mm deep. g. r 2 means that 74% of the variability in tread depth is explained by the variations in the number of miles the tires have been driven. h. Residual equals the observed tread depth minus the predicted tread depth. A negative residual means that the observed amount of tread depth is less than the predicted amount of tread depth, using this model. This means that the tire tread is actually wearing out faster than the model predicts.

.


Review of Part II: Exploring Relationships Between Variables

Statistics Test C – Exploring Relationships Between Variables

Part II-11

Name _________________

__ 1.

All but one of these statements contain a mistake. Which could be true? A) The correlation between a football player’s weight and the position he plays is 0.54. B) The correlation between the amount of fertilizer used and the yield of beans is 0.42. C) The correlation between a car’s length and its fuel efficiency is 0.71 miles per gallon. D) There is a high correlation (1.09) between height of a corn stalk and its age in weeks. E) There is a correlation of 0.63 between gender and political party.

__ 2.

Residuals are . . . A) possible models not explored by the researcher. B) variation in the data that is explained by the model. C) the difference between observed responses and values predicted by the model. D) data collected from individuals that is not consistent with the rest of the group. E) none of these.

__ 3.

Which statement about influential points is true? I. Removal of an influential point changes the regression line. II. Data points that are outliers in the horizontal direction are more likely to be influential than points that are outliers in the vertical direction. III. Influential points have large residuals. A) I only B) I and II C) I and III D) II and III E) I, II, and III

__ 4.

In a multiple regression model, we are using height (inches) and hours of exercise monthly to predict the body weight (lbs.) of a college freshmen. The coefficient of height will tell us … A) the increase in weight for every one more inch of height. B) the increase in weight for every increase height and exercise. C) the increase in weight for every one more inch of height, after accounting for the amount of exercise. D) the increase in weight for every one more hour of exercise, after accounting for the amount of height. E) none of these

__ 5.

A company’s sales increase by the same amount each year. This growth is . . . A) linear B) exponential C) logarithmic D) power E) quadratic

__ 6.

A residual plot/residuals should be . . . A) random B) non-thickening C) lacking outliers E) all of these

D) roughly symmetrical

__ 7.

A correlation of zero between two quantitative variables means that A) we have done something wrong in our calculation of r. B) there is no association between the two variables. C) there is no linear association between the two variables. D) re-expressing the data will guarantee a linear association between the two variables. E) None of the above.

__ 8.

A regression analysis of students’ college grade point averages (GPAs) and their high school GPAs found R2 = 0.311. Which of these is true? I. High school GPA accounts for 31.1% of college GPA. II. 31.1% of college GPAs can be correctly predicted with this model. III. 31.1% of the variance in college GPA can be accounted for by the model A) I only B) II only C) III only D) I and II E) None

.


Part II-12 __ 9.

Part II Exploring Relationships Between Variables

Two variables that are actually not related to each other may nonetheless have a very high correlation because they both result from some other, possibly hidden, factor. This is an example of A) leverage. B) a lurking variable. C) extrapolation. D) regression. E) an outlier.

__ 10. If the point in the upper right corner of this scatterplot is removed from the data set, then what will happen to the slope of the line of best fit (b) and to the correlation (r) ? A) both will increase. B) both will decrease. C) b will increase, and r will decrease. D) b will decrease, and r will increase. E) both will remain the same.

11. Storks Data show that there is a positive association between the population of 17 European countries and the number of stork pairs in those countries. a. Briefly explain what “positive association” means in this context.

b. Wildlife advocates want the stork population to grow, and jokingly suggest that citizens should be encouraged to have children. As a statistician, what do you think of this suggestion? Explain briefly.

12. Math and Verbal Suppose the correlation between SAT Verbal scores and Math scores is 0.57 and that these scores are normally distributed. If a student’s Verbal score places her at the 90th percentile, at what percentile would you predict her Math score to be? (SHOW WORK)

.


Review of Part II: Exploring Relationships Between Variables

Part II-13

13. Penicillin Doctors studying how the human body assimilates medication inject some patients with penicillin, and then monitor the concentration of the drug (in units/cc) in the patients’ blood for seven hours. First they tried to fit a linear model. The regression analysis and residuals plot are shown.

Source Regression Residual

Sum of Squares 4900.55 494.199

Variable Constant Time 30

Coefficient 40.3266 -5.95956

df 1 41

Mean Square 4900.55 12.0536

s.e. of Coeff 1.295 0.2956

F-ratio 407

Residual

Response Variable is Concentration No Selector R squared = 90.8% R squared (adjusted) = 90.6% s = 3.472 with 43 – 2 = 41 degrees of freedom

t-ratio prob  0.0001 31.1  0.0001 -20.2

a. Find the correlation between time and concentration.

b. Using this model, estimate what the concentration of penicillin will be after 4 hours.

c. Is that estimate likely to be accurate, too low, or too high? Explain.

14. Coasters A roller coaster enthusiast uses the Length (ft.) and Drop (ft.) to predict the Speed (mph). The regression model is below: Response Variable: Speed Parameter Estimate Std. Err. Alternative DF T-Stat P-value Intercept 29.129699 0.91034488 ≠ 017531.998531<0.0001 Drop 0.210855 0.0061093522 ≠ 0175 34.51348<0.0001 Length 0.00093890.00031342751 ≠ 01752.9952698 0.0031 a. Write the regression equation.

b. Interpret the coefficient of Length in context.

.


Part II-14

Part II Exploring Relationships Between Variables

Statistics Test C – Exploring Relationships Between Variables – Key 1. B

2. C

3. B

4. C

5. A

6. E

7.C

8.C

9. B

10. D

11. Storks a. Positive association implies that countries with larger populations tend to have more stork pairs and countries with smaller populations tend to have fewer stork pairs. b. This suggestion assumes that association means causation. The wildlife advocates incorrectly propose human population growth as a way to increase the number of stork pairs. Perhaps there is a lurking variable, like land mass, that accounts for the positive association between the two variables. 12. Math and Verbal

zM  0.57 zv  0.57(1.28)  0.73 , corresponding to the 76th percentile 13. Penicillin a. –0.953 b. 16.5 units/cc c. Too high; the residuals are generally negative for times between 2 and 5 hours. 14. Coasters a. Speed = 29.13 + 0.211 Drop + 0.000939 Length b. For every 1 foot Longer, after accounting for Drop, we predict a 0.000939 mph increase in Speed.

.


Chapter 10 Sample Surveys Solutions to Class Examples 1. See Class Example 1. 2. Solution: To obtain an SRS of the students at your school, you could number the names in the student directory, and choose your sample by generating random numbers and matching the numbers with the students in your sample. This may raise some issues, however. Are all students included in the directory? If not, do the students excluded differ in any systematic way from the students in the directory? A stratified sample may consist of a certain number each of freshman, sophomores, juniors, and seniors. Other strata may be males and females, athletes and non-athletes, or any other groupings that you feel would decrease variability in each stratum. You might choose a cluster sample by choosing a homeroom at random, and sampling each student in that homeroom, or choosing a class at random and sampling all of the students in that class. A systematic sample could be obtained from a list of all students at the school. Suppose you want to sample 50 students from a school of 500. Number the students 001 to 500. Generate a random number from 001 to 010, and start with that student. Every 10th student in the list becomes part of your sample. 3. See Class Example 3. 4. See Class Example 4.

.


10-2

Part III Gathering Data

Statistics Quiz A – Chapter 10

Name__________________________

1. A statistics teacher wants to know how her students feel about an introductory statistics course. She decides to administer a survey to a random sample of students taking the course. She has several sampling plans to choose from. Name the sampling strategy in each. a. There are four ranks of students taking the class: freshmen, sophomores, juniors, and seniors. Randomly select 15 students from each class rank. b. Randomly select a class rank (freshmen, sophomores, juniors, and seniors) and survey every student in that class rank. c. Each student has a nine-digit student number. Randomly choose 60 numbers. d. Using the class roster, select every fifth student from the list. 2. Explain why the second plan suggested above, sampling all students from one class rank, might be biased. Be sure to name the kind(s) of bias you describe.

3. Listed below are the names of 20 students who are juniors. Use the random numbers listed below to select five of them to be in your sample. Clearly explain your method. Adam Ellen Joy Peter

Chris Eric Kenny Rachel

Dave Joan Laura Rob

Deirdre John Mary Sara

Dick Judi Paul Stacey

04028 55259 81183 40754 60209 06765 27306 28370 82669 83236 77479 90618 43707 78695

4. Name and describe the kind of bias that might be present if the statistics teacher decides that instead of randomly selecting students to survey on how they feel about the course she just… a. asks students to volunteer for the survey.

b. gives the survey during class one day.

.


Chapter 10 Sample Surveys

10-3

Statistics Quiz A – Chapter 10 – Key 1. A statistics teacher wants to know how her students feel about an introductory statistics course. She decides to administer a survey to a random sample of students taking the course. She has several sampling plans to choose from. Name the sampling strategy in each. a. There are four ranks of students taking the class: freshmen, sophomores, juniors, and seniors. Randomly select 15 students from each class rank. stratified

b. Randomly select a class rank (freshmen, sophomores, juniors, and seniors) and survey every student in that class rank. cluster

c. Each student has a nine-digit student number. Randomly choose 60 numbers. simple

d. Using the class roster, select every fifth student from the list. systematic

2. Explain why the second plan suggested above, sampling all students from one class rank, might be biased. Be sure to name the kind(s) of bias you describe. Undercoverage might occur. The class rank selected may not be representative of all students taking the course. For example, perhaps the senior rank contains students who have failed the course previously, and these students might have a more negative attitude toward the class.

3. Listed below are the names of 20 students who are juniors. Use the random numbers listed below to select five of them to be in your sample. Clearly explain your method. 01 - Adam 06 - Ellen 11 - Joy 16 - Peter

02 - Chris 07 - Eric 12 - Kenny 17 - Rachel

03 - Dave 08 - Joan 13 - Laura 18 - Rob

04 - Deirdre 09 - John 14 - Mary 19 - Sara

05 - Dick 10 - Judi 15 - Paul 20 - Stacey

04028 55259 81183 40754 60209 06765 27306 28370 82669 83236 77479 90618 43707 78695 Assign two digit numbers to each of the juniors, as noted in the table. Use the random digits, in groups of 2, to select the first five people in the sample. Ignore any digits 21-99 and 00, because there are no corresponding people, and ignore repeated numbers. 04 – Deirdre, 02 – Chris, 85 – ignore, 52 – ignore, 59 – ignore, 81 – ignore, 18 – Rob, 34 – ignore, 07 – Eric, 54 – ignore, 60 – ignore, 20 – Stacey People selected: Deirdre, Chris, Rob, Eric, and Stacey

4. Name and describe the kind of bias that might be present if the statistics teacher decides that instead of randomly selecting students to survey on how they feel about the course she just… a. asks students to volunteer for the survey. Voluntary response sample—the bias would probably be towards those students who enjoy the course, and these students would rate the course more favorably.

b. gives the survey during class one day. Convenience sample—the bias would probably be towards students who happen to attend class that day. They may enjoy the course more than students who are skipping class! .


10-4

Part III Gathering Data

Statistics Quiz B – Chapter 10

Name __________________________

1. Management at a retail store is concerned about the possibility of drug abuse by people who work there. They decide to check on the extent of the problem by having a random sample of the employees undergo a drug test. Several plans for choosing the sample are proposed. Name the sampling strategy in each. a. Randomly select an employee classification and test all the people who work in that classification - supervisors, fulltime clerks, part-time clerks, and maintenance staff. b. Choose the fourth person that arrives to work for each shift. c. There are four employee classifications: supervisors, fulltime clerks, part-time clerks, and maintenance staff. Randomly select ten people from each category. d. Each employee has a three-digit identification number. Randomly choose 40 numbers. 2. Explain why the first plan suggested above, sampling an entire classification, might be biased. Be sure to name the kind(s) of bias you describe.

3. Listed at the right are the names of twenty full-time clerks on the retail staff. Use the random numbers listed below to select four of them to be in your sample. Clearly explain your method. 18192 95318 02975 01191 Larry Marissa Tim Carol Joshua Barbara Rich Sharyn Brian Nicole John June Steve Andrea Matthew

29958 09275 89141 61909

Diane Allan Kelly Frank Erin

4. Name and describe the kind of bias that might be present if the management decides that instead of subjecting people to random testing they’ll just… a. Hold department meetings and drug test the employees that attend.

b. Offer additional employee discounts for those employees who agree to be drug tested.

.


Chapter 10 Sample Surveys

10-5

Statistics Quiz B – Chapter 10 – Key 1. Management at a retail store is concerned about the possibility of drug abuse by people who work there. They decide to check on the extent of the problem by having a random sample of the employees undergo a drug test. Several plans for choosing the sample are proposed. Name the sampling strategy in each. a. Randomly select an employee classification and test all cluster the people who work in that classification - supervisors, full-time clerks, part-time clerks, and maintenance staff. systematic b. Choose the fourth person that arrives to work for each shift. c. There are four employee classifications: supervisors, stratified full-time clerks, part-time clerks, and maintenance staff. Randomly select ten people from each category. simple d. Each employee has a three-digit identification number. Randomly choose 40 numbers. 2. Explain why the first plan suggested above, sampling an entire classification, might be biased. Be sure to name the kind(s) of bias you describe. Undercoverage might occur. The classification selected is not a representative of all employees or may not represent the drug usage by employees at the retail store. Perhaps the pressure of being a supervisor leads to more drug use than other classifications.

3. Listed at the right are the names of twenty full-time clerks on the retail staff. Use the random numbers listed below to select four of them to be in your sample. Clearly explain your method. 18192 95318 02975 01191 29958 09275 89141 61909

01 - Larry 02 - Marissa 03 - Tim 04 - Diane 05 - Carol 06 - Joshua 07 - Barbara 08 - Allan 09 - Rich 10 - Sharyn 11 - Brian 12 - Kelly 13 - Nicole 14 - John 15 - June 16 - Frank 17 - Steve 18 - Andrea 19 - Matthew 20 - Erin

Assign two digit numbers to each clerk on the retail staff, as noted in the table. Use the random digits, in groups of 2, to select the first four people in the sample. Ignore any digits 21 – 99 and 00 because there are no corresponding people, and ignore repeated numbers. 18 – Andrea, 19 – Matthew, 29 – ignore, 53 – ignore, 18 – ignore repeat, 02 – Marissa, 97 – ignore, 50 – ignore, 11 - Brian People selected: Andrea, Matthew, Marissa, and Brian

4. Name and describe the kind of bias that might be present if the management decides that instead of subjecting people to random testing they’ll just… a. Hold department meetings and drug test the employees that attend. Convenience sample; bias: under or overcoverage. We are sampling the employees that show up, which might not be representative of the entire retail staff. Drug use may lead to absence, causing us to underestimate the proportion of drug users.

b. Offer additional employee discounts for those employees who agree to be drug tested. Voluntary response sample; bias: toward employees who shop at the store. Employees who do use drugs would probably not volunteer for any reason, which may lead us to underestimate drug use.

.


10-6

Part III Gathering Data

Statistics Quiz C – Chapter 10

Name__________________________

1. Administrators at a hospital are concerned about the possibility of drug abuse by people who work there. They decide to check on the extent of the problem by having a random sample of the employees undergo a drug test. a. Define the population and the parameter of interest in context. b. Describe a method that you would use to collect your sample. Make sure to name the sampling strategy chose.

c. There are four employee classifications: doctors, medical staff (nurses, technicians, etc.) office staff, and support staff (custodians, maintenance, etc.). Describe how knowing about these classifications would modify your plan in part (b) and why.

2. Listed at the right are the names of the 20 pharmacists on the hospital staff. Use the random numbers listed below to select three of them to be in the sample. Clearly explain your method. Pastore Back 04905 83852 29350 91397 Spiridinov Ahl Hedge MacDowell 19994 65142 05087 11232 Schissel Novelli Lavine Kaplan Highland Roundy Grubb Markowitz Glass Davies Golkowski Reeves Janis Yen 3. Name and describe the kind of bias that might be present if the administration decides that instead of subjecting people to random testing they’ll just… a. interview employees about possible drug abuse.

b. ask people to volunteer to be tested.

.


Chapter 10 Sample Surveys

10-7

Statistics Quiz C – Chapter 10 – Key 1. Administrators at a hospital are concerned about the possibility of drug abuse by people who work there. They decide to check on the extent of the problem by having a random sample of the employees undergo a drug test. a. Define the population and the parameter of interest in context. The population is the employees of at the hospital and the parameter is the percentage who are abusing drugs (perhaps broken down by different drug types).

b. Describe a method that you would use to collect your sample. Make sure to name the sampling strategy chose. Answers will vary. Should include some form of random selection of employees. If stratification or cluster sampling is described, it should be named appropriately.

c. There are four employee classifications: doctors, medical staff (nurses, technicians, etc.) office staff, and support staff (custodians, maintenance, etc.). Describe how knowing about these classifications would modify your plan in part (a) and why. As these four different groups may very well have different levels of drug abuse, it is important to sample some workers from each of these groups. Thus we should stratify by these groups to ensure that some workers from each group are chosen.

2. Listed at the right are the names of the 20 pharmacists on the hospital staff. Use the random numbers listed below to select three of them to be in the sample. Clearly explain your method. 01 Pastore 11 Back 04905 83852 29350 91397 02 Spiridinov 12 Ahl 03 Hedge 13 MacDowell 19994 65142 05087 11232 04 Schissel 14 Novelli 05 Lavine 15 Kaplan 06 Highland 16 Roundy Number the people 01 – 20 (by columns). Go across the random digits two at a time, ignoring 20-99 and 07 Grubb 17 Markowitz any repeats, until three are chosen: 04 – Schissel, 08 Glass 18 Davies (ignore 90, 58, 38, etc), 09 – Golkowski, 09 Golkowski 19 Reeves 13 – MacDowell 10 Janis 20 Yen 3. Name and describe the kind of bias that might be present if the administration decides that instead of subjecting people to random testing they’ll just… a. interview employees about possible drug abuse. Response bias; people will feel threatened so they won’t answer truthfully. This would likely result in underestimating the proportion of drug users.

b. ask people to volunteer to be tested. Voluntary response bias; only those who are “clean” would volunteer. It is likely that the proportion of drug users would be underestimated.

.


10-8

Part III Gathering Data

Statistics Quiz D – Chapter 10

Name__________________________

1. The main ideas of sampling do not include A. Sample size matters B. Examine a part of the whole C. The fraction of the larger population determines the precision of the resulting statistics D. Randomization E. All ideas of sampling should be included 2. The administration of a large university is interested in learning about the types of wellness programs that would interest its employees. Suppose that there are five categories of employees (administration, faculty, professional staff, clerical and maintenance) and the university decides to randomly select ten individuals from each category. This sampling plan is called A. Simple Random Sampling B. Stratified Sampling C. Cluster Sampling D. Systematic Sampling E. Convenience Sampling 3. The administration of a large university is interested in learning about the types of wellness programs that would interest its employees. Suppose that the university randomly selects a school (e.g., the Business School) and surveys all of the individuals (administration, faculty, professional staff, clerical and maintenance) who work in that school. This sampling plan is called A. Simple Random Sampling B. Stratified Sampling C. Cluster Sampling D. Systematic Sampling E. Convenience Sampling 4. A basketball player has a 70% free throw percentage. Which plan could be used to simulate the number of free throws she will make in her next five free throw attempts? I. Let 0,1 represent making the first shot, 2, 3 represent making the second shot,…, 8, 9 represent making the fifth shot. Generate five random numbers 0-9, ignoring repeats. II. Let 0, 1, 2 represent missing a shot and 3, 4,…, 9 represent making a shot. Generate five random numbers 0-9 and count how many numbers are in 3-9. III. Let 0, 1, 2 represent missing a shot and 3, 4,…, 9 represent making a shot. Generate five random numbers 0-9 and count how many numbers are in 3-9, ignoring repeats. A. I only B. II only C. III only D. II and III E. I, II, and III

.


Chapter 10 Sample Surveys

10-9

5. A consumer research group is interested in how older drivers view hybrid cars. Specifically, they wish to assess the percentage of drivers in the U.S. 50 years of age or older who intend to purchase a hybrid in the next two years. They selected a systematic sample from a list of AARP (American Association of Retired Persons) members. Based on this sample, they estimated the percentage to be 17%. The sampling frame for this study is A. all drivers in the U.S. 50 years of age or older B. 17% C. the list of AARP members D. how older drivers view hybrid cars E. none of the above 6. A consumer research group is interested in how older drivers view hybrid cars. Specifically, they wish to assess the percentage of drivers in the U.S. 50 years of age or older who intend to purchase a hybrid in the next two years. They selected a systematic sample from a list of AARP (American Association of Retired Persons) members. Based on this sample, they estimated the percentage to be 17%. Which of the following statements about this study is true? A. 17% of all U.S. drivers 50 years of age or older intend to purchase a hybrid in the next two years. B. 17% is a parameter. C. 17% is a statistic. D. Both A and B E. Both A and C 7. The online MBA director at a large business school surveys a sample of current students to determine their level of satisfaction with the program. She finds that 67% are “very satisfied” with the online program. This is a A. statistic. B. parameter. C. target population. D. sampling frame. E. good result. 8. Which of these is not an advantage of using a stratified sample instead of a simple random sample? A. the stratified sample reduces bias B. the stratified sample reduces sample to sample variability C. the stratified sample eliminates the need for randomization D. the stratified random sample allows you to get information about each stratum E. the stratified sample allows you to get more reliable estimates using the same sample size

.


10-10

Part III Gathering Data

9. Shortly after the Sandy Hook Elementary School shooting in December 2012, a local television news program asked viewers to call in with their opinion about gun control. These results were likely biased because A. of a bad sampling frame. B. of an undefined target population. C. leading questions. D. of a voluntary response sample. E. of measurement error. 10. A school district administrator sent a survey to all teachers in the district. Only 30% of the teachers responded to the survey. Which of the following is true? I. The people that did not responded are likely to be similar to those that did so he should use them as the sample. II. This survey design suffered from non-response bias. III. Because he sent the survey to everyone, this is a census and the results can be applied to the whole population. A. I only B. II only C. I and II only D. II and III only E. I, II, and III

.


Chapter 10 Sample Surveys

Statistics Quiz D – Chapter 10 – Key 1. C 2. B 3. C 4. B 5. C 6. C 7. A 8. A 9. D 10. B

.

10-11


10-12

Part III Gathering Data

Statistics Quiz E – Chapter 10

Name__________________________

1. Among a dozen eggs, three are rotten. A cookie recipe calls for two eggs; they'll be selected randomly from that dozen. Which plan could be used to simulate the number of rotten eggs that might be chosen? I. Let 0, 1, and 2 represent the rotten eggs, and 3, 4, …, 11 the good eggs. Generate two random numbers 0-11, ignoring repeats. II. Randomly generate a 0, 1, or 2 to represent the number of rotten eggs you get. III. Since 25% of the eggs are rotten, let 0 = rotten and 1, 2, 3 = good. Generate two random numbers 0-3 and see how may 0's you get. A. I only B. II only C. III only D. I and III E. I, II, and III 2. Which of the following survey questions is leading? A. Given the prevalence of identity theft, are you reluctant to provide credit card information online? B. Are you confident that any information you provide online is secure? C. Are you concerned about the security of online transactions? D. Both A and B E. Both B and C 3. Hoping to get information that would allow them to negotiate new rates with their advertisers, Natural Health magazine phoned a random sample of 600 subscribers. 64% of those polled said they use nutritional supplements. Which is true? I. The population of interest is the people who read this magazine. II. "64%" is not a statistic; it's the parameter of interest. III. This sampling design should provide the company with a reasonably accurate estimate of the percentage of all subscribers who use supplements. A. I only B. I and II only C. I and III only D. II and III only E. I, II, and III

.


Chapter 10 Sample Surveys

10-13

4. Molly’s Reach, a regional restaurant and gift shop, has recently launched an increased sales campaign, particularly focused on their gift shop. To determine if their proposal is of interest, they plan to survey a random sample of their regular customers. Suppose that Molly’s Reach’s regular customers belong to a rewards program and have a customer rewards ID number. Molly’s Reach decides to randomly select 100 numbers. This sampling plan is called A. Simple Random Sampling B. Stratified Sampling C. Cluster Sampling D. Systematic Sampling E. Convenience Sampling 5. Molly’s Reach, a regional restaurant and gift shop, has recently launched an increased sales campaign, particularly focused on their gift shop. To determine if their proposal is of interest, they plan to survey a random sample of their regular customers. Suppose that Molly’s Reach has an alphabetized list of regular customers who belong to their rewards program. After randomly selecting a customer on the list, every 25th customer from that point on is chosen to be in the sample. This sampling plan is called A. Simple Random Sampling B. Stratified Sampling C. Cluster Sampling D. Systematic Sampling E. Convenience Sampling 6. Molly’s Reach, a regional restaurant and gift shop, has recently launched an increased sales campaign, particularly focused on their gift shop. To determine if their proposal is of interest, they plan to survey a random sample of their regular customers. All regular Molly’s Reach customers is known as the ________ of the study. A. parameter B. statistic C. target population D. sampling frame E. sample 7. Molly’s Reach, a regional restaurant and gift shop, has recently launched an increased sales campaign, particularly focused on their gift shop. To determine if their proposal is of interest, they plan to survey a random sample of their regular customers. Which of the following is the parameter of interest in the Molly’s Reach study? A. all regular Molly’s Reach customers B. % of regular Molly’s Reach customers who have concerns about online security C. Molly’s Reach customers who belong to the rewards program D. % of Molly’s Reach customers who belong to the rewards program that don’t shop online E. none of the above

.


10-14

Part III Gathering Data

8. Molly’s Reach, a regional restaurant and gift shop, has recently launched an increased sales campaign, particularly focused on their gift shop. To determine if their proposal is of interest, they plan to survey a random sample of their regular customers. One member of the management team at Molly’s Reach suggests that their survey could be conducted online. Customers logging on to the online store would be asked to take a few minutes to complete the survey and would be offered a coupon as incentive to participate. Which of the following statements is true? A. This is a voluntary response sample. B. This would result in an unbiased random sample. C. This would result in a biased sample. D. Both A and B E. Both A and C 9. The principal of a small elementary school wants to select a simple random sample of 24 students. The school has 12 classrooms with 18 students in each class. She decides to randomly select two students from each classroom. Is this a simple random sample? A. Yes, because a stratified sample is a type of simple random sample. B. Yes, because the students were selected at random. C. Yes, because each student had an equal chance to be selected. D. No, because not all combinations of 24 students could have been chosen. E. No, because each student did not have an equal chance of being selected. 10. Does Procellera® Antimicrobial Wound Dressing help injuries heal faster? Researchers checked records of 38 patients who had been treated for acute or chronic wounds between 2010 and 2012. They found that those who had been treated with Procellera® healed almost twice as fast. This is a A. randomized experiment B. matched experiment C. survey D. retrospective study E. prospective study

.


Chapter 10 Sample Surveys

Statistics Quiz E – Chapter 10 – Key 1. A 2. A 3. C 4. A 5. D 6. C 7. B 8. E 9. A 10. D

.

10-15



Chapter 11 Experiments and Observational Studies Solutions to Class Examples 1. See Class Example 1. 2. See Class Example 2. 3. See Class Example 3. 4. See Class Example 4. 5. See Class Example 5. Investigative Task The Investigative Task involves designing an experiment to determine the efficacy of an industrial hydraulic fuel additive. Supplemental Resources We provide an outline for a group data collection project.

.


11-2

Part III Gathering Data

Statistics Quiz A – Chapter 11

Name ___________________________

An article in a local newspaper reported that dogs kept as pets tend to be overweight. Veterinarians say that diet and exercise will help these chubby dogs get in shape. The veterinarians propose two different diets (Diet A and Diet B) and two different exercise programs (Plan 1 and Plan 2). Diet A: owners control the portions of dog food and dog treats; Diet B: a mixture of fresh vegetables with the dog food and substitute regular dog treats with baby carrots. Plan 1: three 30-minute walks a week; Plan 2: 20-minute walks daily. Sixty dog owners volunteer to take part in an experiment to help their chubby dogs lose weight. 1. Identify the following: a. the subjects:

_________________________

b. the factor(s) and the number of level(s) for each:

_________________________

c. the number of treatments:

_______

d. whether or not the experiment is blind (or double-blind): e. the response variable:

___________________

_________________________

2. Design an experiment to determine whether the diet and exercise programs are effective in helping dogs to lose weight.

.


Chapter 11 Experiments and Observational Studies

11-3

Statistics Quiz A – Chapter 11 - Key An article in a local newspaper reported that dogs kept as pets tend to be overweight. Veterinarians say that diet and exercise will help these chubby dogs get in shape. The veterinarians propose two different diets (Diet A and Diet B) and two different exercise programs (Plan 1 and Plan 2). Diet A: owners control the portions of dog food and dog treats; Diet B: a mixture of fresh vegetables with the dog food and substitute regular dog treats with baby carrots. Plan 1: three 30-minute walks a week; Plan 2: 20-minute walks daily. Sixty dog owners volunteer to take part in an experiment to help their chubby dogs lose weight. 1. Identify the following: a. the subjects: 60 chubby dogs

b. the factor(s) and the number of level(s) for each: diet (two levels) and exercise (two levels)

c. the number of treatments: four treatments (Diet A and Plan 1, Diet A and Plan 2, Diet B and Plan 1, Diet B and Plan 2)

d. whether or not the experiment is blind (or double-blind): This design is at best single-blind, since the owners know which diet and exercise plan their dogs are on, but the evaluators do not have to be given this information.

e. the response variable: weight loss

2. Design an experiment to determine whether the diet and exercise programs are effective in helping dogs to lose weight.

.


11-4

Part III Gathering Data

Statistics Quiz B – Chapter 11

Name

A group of people are concerned that the coach of a local high school men’s and women’s basketball teams alters the amount of air in the basketball to gain an unfair advantage over opponents during home games. The idea is that the basketballs are pumped up with one pound per square inch less air than required, and his teams practiced with these altered balls all week prior to home basketball games. Since these under-pumped basketballs would react differently to being shot at a basket, the team that practiced with these balls would have an unfair advantage when it came to shooting free throws. 1. Describe how to use a retrospective study to determine if the home teams have an unfair advantage when shooting free-throws.

2. Describe how to use a prospective study to determine if the home teams have an unfair advantage when shooting free-throws.

3. Design an experiment to determine if the home teams have an unfair advantage when shooting free-throws.

.


Chapter 11 Experiments and Observational Studies

11-5

Statistics Quiz B – Chapter 11 – Key 1. Describe how to use a retrospective study to determine if the home teams have an unfair advantage when shooting free-throws. A retrospective study is an observational study in which past games are reviewed to see how many free-throws the team when they had the altered basketballs versus when they had basketballs pumped up correctly. The differences in the number or percentage of free-throws made would be compared between the two types of basketballs.

2. Describe how to use a prospective study to determine if the home teams have an unfair advantage when shooting free throws. A prospective study is an observational study in which teams are selected and then their future freethrows are recorded to see if they improve when they have the altered basketballs. The differences in the number or percentage of free-throws made would be compared between the two types of basketballs.

3. Design an experiment to determine if the home teams have an unfair advantage when shooting free-throws. Subjects being studied: basketball games Treatment: Factor in the experiment: air pressure in the basketball Levels of factor: standard basketball air pressure and one pound less than standard basketball air pressure Response variable: number (or percentage) of free-throws for the team that practiced with the underinflated basketballs. Design: Home team will practice all week with under-inflated basketballs. Game will be played with under-inflated basketballs or regulation basketballs, determined at random at game time. We’ll record the team’s free-throw performance. Results will be compared for games using the under-inflated basketballs to games played with properly inflated basketballs. Double-blind experiment: The type of basketball used for each game will be determined by a person handling the basketballs, without letting the players or the coach know which ball is being used. Basketballs will look identical. Nature and scope of conclusion: the number or percentage of free-throws made by each team will be evaluated to determine if there is an improvement in rebounding when using the under-inflated basketballs. If there is a difference with the altered basketballs, practicing during the week with the under-inflated basketballs would give the home team an unfair advantage over the visiting team who has not practiced with the under-inflated basketballs.

.


11-6

Part III Gathering Data

Statistics Quiz C – Chapter 11

Name

Researchers plan to investigate the level of distraction created by social media. A group of 150 college students are recruited. 50 will study for a midterm for 2 hours while keeping all devices and social media put away. 50 students will study for 2 hours and check their devices for updates only every 30 minutes. The third group will keep their devices on hand and use them normally. The students will be randomly assigned to one of the three groups and their scores on the midterm will be compared. 1. Identify the subjects.

2. Identify the treatments.

3. Identify the response variable.

4. Describe an advantage to random assignment of treatments.

5. Which group is acting as a control? Why is this group important?

6. Describe a disadvantage of using volunteers in this study.

7. Can this experiment be conducted blind? Double blinded?

.


Chapter 11 Experiments and Observational Studies

11-7

Statistics Quiz C – Chapter 11 – Key Researchers plan to investigate the level of distraction created by social media. A group of 150 college students are recruited. 50 will study for a midterm for 2 hours while keeping all devices and social media put away. 50 students will study for 2 hours and check their devices for updates only every 30 minutes. The third group will keep their devices on hand and use them normally. The students will be randomly assigned to one of the three groups and their scores on the midterm will be compared. 1. Identify the subjects. The 150 college students

2. Identify the treatments. 3 treatments: studying for 2 hours with no devices, with social media check every 30 minutes, with normal device use.

3. Identify the response variable. Midterm scores

4. Describe an advantage to random assignment of treatments. If students picked their own treatments, they could pick the treatment that will work best for them, which make the effect of the treatment unclear.

5. Which group is acting as a control? Why is this group important? The group that uses their devices normally are the control group. We need a baseline of midterm scores with normal use to compare the two treatments against.

6. Describe a disadvantage of using volunteers in this study. The volunteers may be different than most students. Perhaps students who are very addicted to their phones would not be willing to volunteer. Then our experiment will not actually measure how the use of social media is associated with midterm scores.

7. Can this experiment be conducted blind? Double blinded? Students cannot be blind as to whether or not they use devices. However, the professor grading the exams could be blind as to which students had which treatment.

.


11-8

Part III Gathering Data

Statistics – Investigative Task

Chapter 11

Backhoes & Forklifts A heavy equipment manufacturer introduces new models of their backhoes and forklifts, and sells several of these machines. A few months later they discover a problem. An unacceptably high number of their customers needed to seek repairs of the machines’ hydraulic systems within the first few weeks of operation. The company’s engineers go to work on this problem, and soon think that they have found a solution. They believe that an additive poured into the hydraulic oil may greatly extend the number of hours these machines can be used before repairs become necessary. A few tests conducted under laboratory conditions indicate that this solution shows promise, but “real world” customer experience is needed before they can be sure. Impressed by these preliminary results, the company’s management gives the research team the green light to test their theory using up to 20 newly manufactured machines that will be sold during the next few weeks. It is your job to design the experiment by specifying the procedure the company should use. Be sure to use the appropriate vocabulary throughout your description. 

Clearly describe how you would design this experiment.

Outline your design with a diagram.

Explain and perform your randomization procedure, showing the resulting assignments.

Indicate additional measures, if any, that you would take to eliminate potential sources of bias.

Explain how you would decide whether the additive has solved the problem.

.


Chapter 11 Experiments and Observational Studies

Statistics – Investigative Task – Backhoes and Forklifts Components

Chapter 11 Comments

Understands blocking: Demonstrates clear understanding of randomized block design by randomly assigning each type of machine to both treatment groups. Creates an effective design: o controls effects of known variables o blinds dealers & owners o replicates the treatments o defines correct response variable Randomizes: o randomly assigns machines to groups o clearly explains randomization method o shows the resulting assignments Clearly describes the experiment: o identifies the question o has a good diagram o uses proper vocabulary o does not confuse experimentation and sampling

Think

Show

Tell

Components are scored as Essentially correct, Partially correct, or Incorrect 1: The Blocks E – Creates an experiment that uses blocking P – Uses blocking, but does not understand why or randomizes inappropriately I – Fails to use a block design 2: The Design E – Has all four requirements P – Has 3 of the requirements, or two including a response variable I – Has fewer than 2, or has 2 requirements but fails to specify a response variable 3: Randomization E – Has all 3 requirements P – Has 1 or 2 of the requirements I – Does not randomize 4: Description E – Has all 4 requirements P – Has 2 or 3 of the requirements I – Has none or only 1 of the requirements Scoring

 

E’s count 1 point, P’s are 1/2 Grade: A = 4, B = 3, etc., with +/- based on rounding (ex: 3.5 rounded to 3 is a B+)

Name __________________________________ Grade ____

.

11-9


11-10

Part III Gathering Data

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Model Solution – Investigative Task – Backhoes and Forklifts We want to know whether heavy machinery treated with a hydraulic oil additive will last longer before needing repairs than heavy machinery in similar conditions that are not treated with the hydraulic oil additive. To evaluate the machines, we will use the hour meters to note the number of hours of operation before repair is required. There is a single factor, additive, at two levels, additive and no additive, resulting in two treatments groups. Since backhoes and forklifts are used in different situations, the experiment will be blocked according to machine type to reduce variability. Each machine will be numbered, and 5 machines of each type will be randomly assigned to receive the additive. The remaining machines will receive a placebo additive consisting of regular hydraulic oil. The use of the placebo additive will serve as a mechanism to blind the equipment dealer, who will add the additive, which will in turn keep the equipment owner from knowing whether or not the machine contains the special additive. As an incentive to the owner, we will offer free service and repair on the hydraulic system of each machine. The mechanics evaluating the machine will also be unaware of the type of additive in any specific machine. This makes the experiment double blind. If the machines in the group that received the additive last significantly longer on average than the machines in the group that did not receive the additive, we will know that the additive was effective. Furthermore, we can determine whether or not the additive works differently in backhoes than in forklifts.

.


Chapter 11 Experiments and Observational Studies

Statistics

11-11

Group Project

ASSIGNMENT: Conduct a survey, study, or experiment, or do a simulation. Present your findings to the class, and submit a written report.

(1) PROPOSAL (Due __________ ) Decide what your group will investigate. Identify your research question(s) and design your procedure. Submit a detailed written proposal for approval.

(2) DO IT! (Week of __________ ) Collect and analyze your data. Prepare handouts, transparencies, or posters for your presentation.

(3) CLASS PRESENTATION (Date __________ ) Tell us about your project. (appropriate verbal, visual, and numerical descriptions)

(4) WRITTEN REPORT (Due __________ ) Submit a written report, including: your original question; an explanation of how you collected your data; your visual, verbal, and numerical analysis; and your conclusions. Also include an evaluation of how well your plan worked, with suggestions of how you could improve your methodology next time.

.


11-12

Part III Gathering Data

GRADES will be based on your design, analysis, conclusions, presentation, report, and selfevaluation, as well as meeting deadlines, full group involvement, completeness, accuracy, clarity, and your use of proper statistical techniques and vocabulary. NOTE: You will find that planning and carrying out a good study are more difficult than you expect. Pick a manageable topic; try to do it well. Things will probably go wrong. Goal #1: apply the ideas we have studied so far this year: good design, exploratory data analysis, clear descriptions and sound analysis of results. Goal #2: learn from any mistakes or problems that arise.

SOME IDEAS (or think up something else; get advice and approval before starting.)  Do after school jobs or participation in sports affect grades?  Can we predict height or weight from shoe size?  Are smokers less likely to wear seatbelts?  Which grocery store or drugstore has the lowest prices?  Do males get higher Math SAT scores than females?  Do boys and girls enroll equally in advanced math, science, and computer courses?  Do ninth graders study more or less than juniors or seniors?  How much stronger is a person’s dominant hand?  Are lefties more coordinated with their right hands than righties with their lefts?  Can people tell the difference between national brands and store brands?  Can people tell whether soda comes from a plastic bottle, a glass bottle, or a can?  Does mail arrive faster with zip codes?  Does ESP or astrology work?  Are reaction times faster for males or females? athletes or non-athletes? right or left hand?  Are homeruns, RBI, or batting averages good predictors of baseball salaries?  Are NFL or NHL teams more likely to be able to come from behind in home games?  What is the trend in swimming records? college costs? birth rates?

.


Chapter 11 Experiments and Observational Studies

Statistics Project Evaluation

Names ____________________ ____________________

General Issues  Clearly states the research question  Uses proper vocabulary throughout  Identifies problems and suggests viable solutions  Class presentation was clear and thorough Data Collection  Uses an appropriate design  Properly randomizes  Explains the procedure clearly  Executes the procedure well Data Description and Analysis  Creates appropriate graphs and/or tables  Computes appropriate summary statistics  Has appropriate verbal descriptions  Makes appropriate comparisons Conclusions  Conclusions are clear and in context  Conclusions are appropriate, based on the research  Considers possible statistical significance of results  Answers the original research question Comments:

.

11-13


11-14

Part III Gathering Data

Statistics Quiz D – Chapter 11

Name ___________________________

1. A survey of online consumers asked respondents to indicate how much money they spent for online purchases during the most recent Christmas holiday season. These data were analyzed to see if gender and income level influence the amount consumers spend online for holiday shopping. Which of the following statements is true? I. This is an experimental study. II. This is a randomized block design. III. This is an observational study. A. I only B. II only C. III only D. Both I and III E. Both II and III 2. Quality control engineers fire tiles made with this new material at three different temperature settings (1500°F, 1800°F and 2200°F) and for four different times (2 hours, 6 hours, 10 hours and 12 hours) to determine the optimal parameter settings for firing that will result in tiles with high flexural strength. The response variable in this experiment is A. firing temperature. B. the new material. C. tiles. D. flexural strength. E. firing time. 3. Which is not a critical part of designing a good experiment? A. Control of known sources of variability. B. Replication of the on a sufficient number of subjects. C. Random selection of subjects. D. Random assignment of subjects to treatments. E. All of these are important. 4. An education researcher was interested in examining the effect of the teaching method and the effect of the particular teacher on students' scores on a reading test. In a study, there are four different teachers (Juliana, Felix, Sonia, and Helen) and three different teaching methods (method A, method B, and method C). The number of students participating in the study is 258. Students are randomly assigned to a teaching method and teacher. Identify the response variable measured. A. Teaching method B. The education researcher C. Score on reading test D. Method A, method B, method C E. Teacher

.


Chapter 11 Experiments and Observational Studies

11-15

5. An education researcher was interested in examining the effect of the teaching method and the effect of the particular teacher on students' scores on a reading test. In a study, there are four different teachers (Juliana, Felix, Sonia, and Helen) and three different teaching methods (method A, method B, and method C). The number of students participating in the study is 258. Students are randomly assigned to a teaching method and teacher. Identify the factor(s) in the experiment. A. The education researcher B. Teaching method and teacher C. Juliana, Felix, Sonia, and Helen D. Method A, method B, method C E. Score on reading test 6. An education researcher was interested in examining the effect of the teaching method and the effect of the particular teacher on students' scores on a reading test. In a study, there are four different teachers (Juliana, Felix, Sonia, and Helen) and three different teaching methods (method A, method B, and method C). The number of students participating in the study is 258. Students are randomly assigned to a teaching method and teacher. Identify the levels of the factor "teaching method." A. Teaching method and teacher B. Juliana, Felix, Sonia, and Helen C. The education researcher D. Score on reading test E. Method A, method B, method C 7. An education researcher was interested in examining the effect of the teaching method and the effect of the particular teacher on students' scores on a reading test. In a study, there are two different teachers (Juliana and Felix) and three different teaching methods (method A, method B, and method C). The number of students participating in the study is 258. Students are randomly assigned to a teaching method and teacher. Identify the treatments. A. Juliana and method A, Juliana and method B, Juliana and method C, Felix and method A, Felix and method B, Felix and method C B. Juliana, Felix, Sonia, and Helen C. Method A, method B, method C D. The education researcher E. Teaching method and teacher 8. A company has tried to improve the effectiveness of its dishwashing detergent and wants to see if it works better than the original formula. They use 6 identical new dishwashers and load them identically with dirty dishes. Three packs of each of the two types of detergent are used, and they are randomly assigned to one of the six dishwashers. After the load is run, they rate each load for overall cleanliness. Which of the following is true? A. The response variable is the type of detergent. B. The explanatory variable is the different dishwashers. C. Because each brand is used in three dishwashers, replication is used properly. D. Blinding is impossible in this experiment because they must be able to see the dishes. E. A control group with no detergent at all is needed. .


11-16

Part III Gathering Data

9. Is the aspirin produced by World's Best Pharmaceutical Company better than that of a competitor at relieving headaches? 200 headache suffers are chosen at random. Migraned Testing Service administers the experiment and provides the results evaluation. Three levels are made: participants receive contents from Bottle A, Bottle B, or Bottle C. Other than the fact that one bottle contains placebo aspirin, no other information is given to the testing service regarding the bottles' contents. A. Double blind B. Single blind, only participant is blinded C. Single blind, only evaluator is blinded D. Neither 10. A Columbia University study linked soda consumption to behavior problems in children. Researchers examined data from a previous study that followed 2929 mother-child pairs. One survey asked about behaviors of the child and also about soda consumption. They found that the more soda the kids drank, the more behavior problems they had. What aspect of a welldesigned experiment is absent from this study? A. blinding B. a placebo C. randomization D. a control group E. all of these

.


Chapter 11 Experiments and Observational Studies

Statistics Quiz D – Chapter 11 - Key 1. C 2. D 3. C 4. C 5. B 6. E 7. A 8. D 9. A 10. E

.

11-17


11-18

Part III Gathering Data

Statistics Quiz E – Chapter 11

Name ___________________________

1. An advocacy group is interested in determining if gender (1 = Female, 2 = Male) affects executive level salaries. They take a random sample of executives in three different industries: (1 = Consumer Goods, 2 = Financial, 3 = Health Care). Salary data are collected. Which of the following statements is true? I. This is an experimental study. II. This is an observational study. III. This is a randomized block design. A. I only B. II only C. III only D. Both I and III E. Both II and III 2. The experiment is run once for each combination of brand of AAA batteries and device (TV remote, hand-held game, flashlight and digital camera). The twelve runs are ordered randomly. The time (in minutes) that each battery lasts under continuous usage is recorded. Which of the following statements is true about this design? A. The devices serve as blocks to account for the variability between the lengths of time batteries last in different devices. B. This is a completely randomized design in one factor. C. The interaction effect between brand of battery and type of device is significant. D. This is an observational study. E. This is a retrospective study. 3. In a clinical trial, 780 participants suffering from high blood pressure were randomly assigned to one of three groups. Over a one-month period, the first group received a low dosage of an experimental drug, the second group received a high dosage of the drug, and the third group received a placebo. The diastolic blood pressure of each participant was measured at the beginning and at the end of the period and the change in blood pressure was recorded. Identify the response variable measured. A. Change in diastolic blood pressure B. The treatment received (placebo, low dosage, high dosage. C. The participants in the experiment D. The dosage of the drug E. The one-month period

.


Chapter 11 Experiments and Observational Studies

11-19

4. In a clinical trial, 780 participants suffering from high blood pressure were randomly assigned to one of three groups. Over a one-month period, the first group received a low dosage of an experimental drug, the second group received a high dosage of the drug, and the third group received a placebo. The diastolic blood pressure of each participant was measured at the beginning and at the end of the period and the change in blood pressure was recorded. Identify the factor in the experiment. A. Diastolic blood pressure B. The dosage of the drug C. The participants in the experiment D. The experimental drug E. The one-month period 5. In a clinical trial, 780 participants suffering from high blood pressure were randomly assigned to one of three groups. Over a one-month period, the first group received a low dosage of an experimental drug, the second group received a high dosage of the drug, and the third group received a placebo. The diastolic blood pressure of each participant was measured at the beginning and at the end of the period and the change in blood pressure was recorded. Identify the levels of the factor. A. Placebo, low dosage, high dosage B. Diastolic blood pressure at the start, diastolic blood pressure at the end C. The experimental drug D. High blood pressure, low blood pressure E. The one-month period 6. In a clinical trial, 780 participants suffering from high blood pressure were randomly assigned to one of three groups. Over a one-month period, the first group received a low dosage of an experimental drug, the second group received a high dosage of the experimental drug, and the third group received a placebo. The diastolic blood pressure of each participant was measured at the beginning and at the end of the period and the change in blood pressure was recorded. Identify the treatments. A. The experimental drug B. The one-month period C. Placebo, low dosage of drug, high dosage of drug D. Low dosage of drug, high dosage of drug E. Diastolic blood pressure at start, diastolic blood pressure at end 7. Do energy snack bars improve afternoon alertness? 100 factory workers were randomly assigned to two groups. One group ate energy snack bars at lunch and the other group ate placebo snack bars. At the afternoon break (approximately 1-1/2 hour after lunch), a significant increase in alertness was observed. Later analysis of the experiment results showed 87% of the participants exhibiting increased alertness were from the group which ate energy snack bars. A. Double blind B. Single blind, only participant is blinded C. Single blind, only evaluator is blinded D. Neither .


11-20

Part III Gathering Data

8. Which is true about randomized experiments? I. Randomization reduces the effects of confounding variables. II. Random assignment of treatments allows results to be generalized to the larger population. III. Blocking can be used to reduce the within-treatment variability. A. I only B. II only C. III only D. I and III E. I, II, and III 9. It was discovered that a larger proportion of children who slept with nightlights later developed nearsightedness, compared to children who did not sleep with nightlights. The headlines read, “Leaving a light on for you children causes nearsightedness!” Later it was pointed out that nearsighted people have more trouble seeing in the dark and are more likely to leave lights on at night for their kids. And those same nearsighted parents are likely to have nearsighted kids. This is an example of A. a randomized block design. B. a control group. C. bias. D. a placebo. E. a lurking variable. 10. In order to see which variety of apple tree produces more fruit, a farmer sets up an experiment. He has three plots of land with different soil and natural water availability. Each plot has room for eight trees. The farmer randomly selects four locations in each plot for the first variety of tree and the other four get the second variety. This experiment is… A. completely randomized with one factor: the variety of tree B. randomized block, blocked by plot of land C. completely randomized with two factors D. completely randomized with one factor: the plot of land E. randomized block, blocked by variety of tree

.


Chapter 11 Experiments and Observational Studies

Statistics Quiz E – Chapter 11 - Key 1. B 2. A 3. A 4. D 5. A 6. C 7. A 8. D 9. E 10. B

.

11-21



Statistics Test A – Gathering Data – Part III

Name ________________________

___ 1. A company sponsoring a new Internet search engine wants to collect data on the ease of using it. Which is the best way to collect the data? A) census B) sample survey C) observational study D) experiment E) simulation ___ 2. The January 2005 Gallup Youth Survey telephoned a random sample of 1,028 U.S. teens and asked these teens to name their favorite movie from 2004. Napoleon Dynamite had the highest percentage with 8% of teens ranking it as their favorite movie. Which is true? I. The population of interest is all U.S. teens. II. 8% is a statistic and not the actual percentage of all U.S. teens who would rank this movie as their favorite. III. This sampling design should provide a reasonably accurate estimate of the actual percentage of all U.S. teens who would rank this movie as their favorite. A) I only B) II only C) III only D) I and II E) I, II, and III ___ 3. Suppose a school district decides to randomly test high school students for attention deficit disorder (ADD). There are three high schools in the district, each with grades 9-12. The school board pools all of the students together and randomly samples 250 students. Is this a simple random sample? A) Yes, because the students were chosen at random. B) Yes, because each student is equally likely to be chosen. C) Yes, because they could have chosen any 250 students from the district. D) No, because we can’t guarantee that students from each school in the sample. E) No, because we can’t guarantee that students from each grade in the sample. ___ 4. Because a medical experiment makes use of volunteers… A) there is no randomization used. B) we are cautious about generalizing to people not like our volunteers. C) we can’t draw causal conclusions. D) there are no difficulties because of the control group. E) All of these ___ 5. More dogs are being diagnosed with thyroid problems than have been diagnosed in the past. A researcher identified 50 puppies without thyroid problems and kept records of their diets for several years to see if any developed thyroid problems. This is a(n) A) randomized experiment B) survey C) prospective study D) retrospective study E) blocked experiment ___ 6. A chemistry professor who teaches a large lecture class surveys his students who attend his class about how he can make the class more interesting, hoping he can get more students to attend. This survey method suffers from A) voluntary response bias B) nonresponse bias C) response bias D) undercoverage E) None of the above ___ 7. Placebos are a tool for A) sampling B) blocking C) blinding D) control E) randomization

.


Part III-2

Part III Gathering Data

___ 8. Double-blinding in experiments is important so that I. The evaluators do not know which treatment group the participants are in. II. The participants do not know which treatment group they are in. III. No one knows which treatment any of the participants is getting. A) I only B) II only C) III only D) I and II E) I, II, and III ___ 9. Which of the following is not required in an experimental design? A) blocking B) control C) randomization D) replication E) All are required in an experimental design. ___ 10. A researcher wants to compare the effect of a new type of shampoo on hair condition. The researcher believes that men and women may react to the shampoo differently. Additionally, the researcher believes that the shampoo will react differently on hair that is dyed. The subjects are split into four groups: men who dye their hair; men who do not dye their hair; women who dye their hair; women who do not dye their hair. Subjects in each group are randomly assigned to the new shampoo and the old shampoo. This experiment A) is completely randomized. B) has three factors (shampoo type, gender, whether hair is dyed). C) has two factors (gender and whether hair is dyed) blocked by shampoo type. D) has two factors (shampoo type and whether hair is dyed) blocked by gender E) has one factor (shampoo type), blocked by gender and whether hair is dyed. 11. Video games A headline in a local newspaper announced “Video game playing can lead to better spatial reasoning abilities.” The article reported that a study found “statistically significant differences” between teens who play video games and teens who do not, with teens who play video games testing better in spatial reasoning. Do you think the headline was appropriate? Explain.

.


Review of Part III: Gathering Data

Part III-3

12. College students’ spending A consumer group wants to see if a new education program will improve the spending habits of college students. Students in an economics class are randomly assigned to three different courses on spending habits.

a. What are the experimental units?

_____________________________________

b. How many factors are there? c. How many treatments are there?

_____________ _____________

d. What is the response variable? ____________________________________________ 13. Bone Builder Researchers believe that a new drug called Bone Builder will help bones heal after children have broken or fractured a bone. The researchers believe that Bone Builder will work differently on bone breaks than on bone fractures because of differences in initial bone condition. Bone Builder will be used in conjunction with traditional casts. To test the impact of Bone Builder on bone healing, the researchers recruit 18 children with bone breaks and 30 children with bone fractures. Design an appropriate experiment to determine if Bone Builder will help bones heal.

14. Military funding A college group is investigating student opinions about funding of the military. They phone a random sample of students at the college, asking each person one of these questions (randomly chosen): A: “Do you think that funding of the military should be increased so that the United States can better protect its citizens?” B: “Do you think that funding of the military should be increased?” Which question do you expect will elicit greater support for increased military funding? Explain. What kind of bias is this?

.


Part III-4

Part III Gathering Data

Statistics Test A – Gathering Data – Part III – Key 1. B

2. E

3. C

4. B

5. C

6. D

7. C

8. D

9. A

10. E

11. Video games No, this was not a controlled experiment, so no determination of cause and effect can be made. Perhaps there is something about teens who play video games that make them good at video games and at spatial reasoning, or maybe teens with good spatial reasoning enjoy games more. 12. College students’ spending a. The subjects are the students in the economics class. b. There is one factor (course on spending habits) with three levels. c. Since there is only one factor at three levels, there are three treatments. d. The response variable is the spending habits of the students. 13. Bone Builder

Blocking is employed since breaks and fractures have different initial bone condition. This experiment can be double blind, if patients and bone evaluators don’t know whether or not the patient was given Bone Builder. 14. Military funding The first question will elicit greater support for increased military funding. The wording of the question appeals to the feelings of safety of the respondent. The second question does not do this – it is more neutral and will elicit less response. This is a form of response bias.

.


Review of Part III: Gathering Data

Statistics Test B – Gathering Data – Part III

Part III-5

Name ___________________________

___ 1. A group of volunteers is recruited to test a new lip balm that prevents chapping in winter. The volunteers are paid $100 to participate. Which of these is true? A) The paying of volunteers may indicate that our volunteers are not similar to the general population. B) The paying of volunteers invalidates the experiment. C) The researcher should instead pick a SRS to experiment upon. D) The researcher needs to convince the volunteers to work for free. E) There will no randomization in the experiment. ___ 2. We wish to compare the average ages of the math and science teachers at your high school. Which is the best way to collect the data? A) census B) sample survey C) observational study D) experiment E) simulation ___ 3. Hoping to get information that would allow them to negotiate new rates with their advertisers, Natural Health magazine phoned a random sample of 600 subscribers. 64% of those polled said they use nutritional supplements. Which is true? I. The population of interest is the people who read this magazine. II. “64%” is not a statistic; it’s the parameter of interest. III. This sampling design should provide the company with a reasonably accurate estimate of the percentage of all subscribers who use supplements. A) I only B) I and II only C) I and III only D) II and III only E) I, II, and III ___ 4. Suppose the state decides to randomly test high school wrestlers for steroid use. There are 16 teams in the league, and each team has 20 wrestlers. State investigators plan to test 32 of these athletes by randomly choosing two wrestlers from each team. Is this a simple random sample? A) Yes, because the wrestlers were chosen at random. B) Yes, because each wrestler is equally likely to be chosen. C) Yes, because stratified samples are a type of simple random sample. D) No, because not all possible groups of 32 wrestlers could have been the sample. E) No, because a random sample of teams was not first chosen. ___ 5. Which statement about bias is true? I. Bias results from random variation and will always be present. II. Bias results from a sampling method likely to produce samples that do not represent the population. III. Bias is usually reduced when sample size is larger. A) I only B) II only C) III only D) I and III only E) II and III only ___ 6. A researcher identified 100 men over forty who were not exercising and another 100 men over forty with similar medical histories who were exercising regularly. She followed all the men for several years to see if there was any difference between the two groups in the rate of heart attacks. This is a(n) … A) survey B) prospective study C) retrospective study D) randomized experiment E) matched pairs experiment. ___ 7. Of A-D, which is not a critical part of designing a good experiment? A) Control of known sources of variability B) Random selection of subjects. C) Random assignment of subjects to treatments. D) Replication of the on a sufficient number of subjects. E) All of these are important.

.


Part III-6

Part III Gathering Data

___ 8. In an experiment the primary purpose of blinding is to reduce … A) bias. B) confounding. C) randomness. D) undercoverage.

E) variation.

___ 9. Does donating blood lower cholesterol levels? 50 volunteers have a cholesterol test, then donate blood, and then have another cholesterol test. Which aspect of experimental design is present? A) randomization B) a control group C) a placebo D) blinding E) none of these ___ 10. A researcher wants to compare the performance of three types of pain relievers in volunteers suffering from arthritis. Because people of different ages may suffer arthritis of varying degrees of severity, the subjects are split into two groups: younger than 60 and older than 60. Subjects in each group are randomly assigned to take one of the medications. Twenty minutes later they rate their levels of pain. This experiment … A) is completely randomized. B) uses matched pairs. C) has two factors, medication and age. D) has one factor (med.) blocked by age. E) has one factor (age) blocked by medication type. 11. Insulators Ceramics engineers are testing a new formulation for the material used to make insulators for power lines. They will try baking the insulators at four different temperatures, followed by either slow or rapid cooling. They want to try every combination of the baking and cooling options to see which produces insulators least likely to break during adverse weather conditions. A) What are the experimental units? __________________ B) How many factors are there?

__________________

C) How many treatments are there?

__________________

D) What is the response variable?

__________________

12. Cloning A polling organization is investigating public opinion about cloning. They phone a random sample of 1200 adults, asking each person one of these questions (randomly chosen): A: “Do you favor allowing doctors to use cloned cells in attempts to find cures for such terrible diseases as Alzheimer’s, diabetes, and Parkinson’s?” B: “Should research scientists be allowed to use cloned human embryos in their experiments?” Which question do you expect will elicit greater support for cloning? Explain. What kind of bias is this?

13. Moods A headline in the New York Times announced “Research shows running can alter one’s moods.” The article reported that researchers gave a Personality Assessment Test to 231 males who run at least 20 miles a week, and found “statistically significant personality differences between the runners and the male population as a whole.” Do you think the headline was appropriate? Explain.

.


Review of Part III: Gathering Data

Part III-7

14. Property taxes Administrators of the fire department are concerned about the possibility of implementing a new property tax to raise moneys needed to replace old equipment. They decide to check on public opinion by having a random sample of the city’s population. a. Several plans for choosing the sample are proposed. Write the letter corresponding to the sampling strategy in the blank next to each plan. _____i.

The city has five property classifications: single family homes, apartments, condominiums, temporary housing (hotel and campgrounds), and retail property. Randomly select ten residents from each category. _____ii. Each property owner has a 5-digit ID number. Use a random number table to choose forty numbers. _____iii. At the start of each week, survey every tenth person who arrives at the city park. _____iv. Randomly select a housing classification (say, apartments) and survey all the people who live in that property classification. _____v. Have each firefighter survey 10 of his/her neighbors.

A. convenience B. stratified C. simple D. cluster E. systematic

b. Briefly explain why plan iv suggested above, sampling an entire housing classification, might be biased. Be sure to name the kind(s) of bias you describe.

c. Name and briefly describe the kind of bias that might be present if the administration decides that instead of subjecting people to a random sample they’ll just… i. interview people about the new property tax at a fire station open house.

ii. ask people who are willing to be taxed to sign a petition.

.


Part III-8

Part III Gathering Data

15. Grape juice and blood pressure Researchers who wanted to see if drinking grape juice could help people lower their blood pressure got 120 non-smokers to volunteer for a study. They measured each person’s blood pressure and then randomly divided the subjects into two groups. One group drank a glass of grape juice every day while the other did not. After sixty days the researchers measured everyone’s blood pressure again. They reported that differences in changes in blood pressure between the groups were not statistically significant. a. Was this an experiment or an observational study? Explain briefly.

b. Briefly explain what “not statistically significant” means in this context.

c. Briefly explain why the researchers randomly assigned the subjects to the groups.

d. Since everyone’s blood pressure was measured at the beginning and at the end of the study, the researchers could have simply looked at the grape juice drinkers to see if their blood pressure changed. Briefly explain why the researchers bothered to include the control group.

e. Briefly explain why the researchers studied only non-smokers.

f.

Other researchers now plan to replicate this study using both smokers and non-smokers. Briefly describe the design strategy you think they should use.

.


Review of Part III: Gathering Data

Part III-9

Statistics Test B – Gathering Data – Part III – Key 1. A 2. A

3. C

4. D

5. B

6. B

7. B

8. A

9. E

10. D

11. Insulators a. Material for insulators b. 2 – baking temp and cooling method c. 8 d. Likeliness to break during adverse weather 12. Cloning The first question will elicit greater support for cloning. The wording of the question appeals to the emotions of the respondent. The second question conjures up images of scientists experimenting on humans and cloned embryos, and will elicit less support. This is a form of response bias. 13. Moods No, this was not a controlled experiment, so no determination of cause and effect can be made. Perhaps people in better moods are more likely to be runners, or perhaps something in their body chemistry makes some people enjoy running and also affects their personality. 14. Property taxes a. Several plans for choosing the sample are proposed. Name the sampling strategy in each. i. (B)stratified ii. (C)simple iii. (E)systematic iv. (D)cluster v. (A)convenience b. The housing classification selected might not be a representative group of the population (undercoverage). In fact the housing classification might be made up of residents with similar viewpoints on property taxes, different from the views of the nonresidents. c. i. Convenience sample: People attending a fire station open house are likely to be very supportive of the fire station and not be representative of the entire city (undercoverage). ii. Voluntary Response Bias: individuals with strong opinions or who have reason to need fire equipment would sign. (Or possibly no one volunteers to pay an extra tax.) 15. Grape juice and blood pressure a. Experiment – there was an application of a treatment, grape juice, to groups containing randomly assigned subjects with a comparison of the blood pressure across the treatment groups. b. There was not a large reduction of blood pressure between the group who drank grape juice and the group who did not drink grape juice; the reduction could be reasonably attributed to random variation (or sampling error). c. Randomly assigning subjects to groups allows us to equalize the effects of unknown or uncontrollable sources of variation. d. The control group provided a basis for comparison after the sixty days to determine if the group drinking grape juice had lower blood pressure because of the juice. Maybe everyone’s blood pressure naturally drops at a certain time of the year. e. By studying only non-smokers the researchers were trying to reduce any impact the smoking might have had on blood pressure. The researchers were trying to remove any confounding the smoking might have made on blood pressure. The new study must be blocked by whether a person smokes or not. Half of each group would then be randomly assigned to the grape juice or no grape juice groups. f. Block by smoking status, randomly assigning people in each block to the treatments. Compare blood pressure after 60 days within each block.

.


Part III-10

Part III Gathering Data

Statistics Test C – Gathering Data – Part III

Name _____________________

___ 1. If we wish to compare the average PSAT scores of boys and girls taking Statistics at a high school, which would be the best way to gather these data? A) census B) SRS C) stratified sample D) observational study E) experiment ___ 2. A factory has 20 assembly lines producing a popular toy. To inspect a representative sample of 100 toys, quality control staff randomly selected 5 toys from each line’s output. Was this a simple random sample? A) Yes, because the toys were selected at random. B) Yes, because each toy produced had an equal chance to be selected. C) Yes, because a stratified sample is a type of simple random sample. D) No, because not all combinations of 100 toys could have been chosen. E) No, because toys do not come off the assembly line at random. ___ 3. Which is true about sampling? I. An attempt to take a census will always result in less bias than sampling. II. Sampling error is usually reduced when the sample size is larger. III. Sampling error is the result of random variations and is always present. A) I only B) II only C) III only D) II and III E) all three ___ 4. The owner of a car dealership planned to develop strategies to increase sales. He hoped to learn the reasons why many people who visit his car lot do not eventually buy a car from him. For one month he asked his sales staff to keep a list of the names and addresses of everyone who came in to test drive a car. At the end of the month he sent surveys to the people who did not buy the car, asking them why. About one third of them returned the survey, with 44% of those indicating that they found a lower price elsewhere. Which is true? I. The population of interest is all potential car buyers. II. This survey design suffered from non-response bias. III. Because it comes from a sample 44% is a parameter, not a statistic. A) I only B) II only C) I and II only D) II and III only E) I, II, and III ___ 5. Does regular exercise decrease the risk of cancer? A researcher finds 200 women over 50 who exercise regularly, pairs each with a woman who has a similar medical history but does not exercise, then follows subjects for 10 years to see which group develops more cancer. This is a A) survey B) retrospective study C) prospective study D) randomized experiment E) matched experiment ___ 6. Which is important in designing a good experiment? I. Randomization in assigning subjects to treatments. II. Control of potentially confounding variables. III. Replication of the experiment on a sufficient number of subjects. A) I only B) I and II C) I and III D) II and III E) all three ___ 7. Can watching a movie temporarily raise your pulse rate? Researchers have 50 volunteers check their pulse rates. Then they watch an action film, after which they take check their pulse rates once more. Which aspect of experimentation is present in this research? A) a placebo B) blinding C) randomization D) a control group E) none of these ___ 8. In an experiment the primary purpose of blocking is to reduce A) bias. B) confounding. C) randomness. D) undercoverage.

.

E) variation.


Review of Part III: Gathering Data

Part III-11

___ 9. To check the effect of cold temperatures on the battery’s ability to start a car researchers purchased a battery from Sears and one from NAPA. They disabled a car so it would not start, put the car in a warm garage, and installed the Sears battery. They tried to start the car repeatedly, keeping track of the total time that elapsed before the battery could no longer turn the engine over. Then they moved the car outdoors where the temperature was below zero. After the car had chilled there for several hours the researchers installed the NAPA battery and repeated the test. Is this a good experimental design? A) Yes B) No, because the car and the batteries were not chosen at random. C) No, because they should have tested other brands of batteries, too. D) No, because they should have tested more temperatures. E) No, because temperature is confounded by brand. ___ 10. Twenty dogs and 20 cats were subjects in an experiment to test the effectiveness of a new flea control chemical. Ten of the dogs were randomly assigned to an experimental group that wore a collar containing the chemical, while the others wore a similar collar without the chemical. The same was done with the cats. After 30 days veterinarians were asked to inspect the animals for fleas and evidence of flea bites. This experiment is… A) completely randomized with one factor: the type of collar B) completely randomized with one factor: the species of animal C) randomized block, blocked by species D) randomized block, blocked by type of collar E) completely randomized with two factors 11. Public opinion A member of the City Council has proposed a resolution opposing construction of a new state prison there. The council members decide they want to assess public opinion before they vote on this resolution. Below are some of the methods that are proposed to sample local residents to determine the level of public support for the resolution. Match each with one of the listed sampling techniques. ___ a) Place an announcement in the newspaper asking people to call their council representatives to register their opinions. Council members will tally the calls they receive. ___ b) Have each council member survey 50 friends, neighbors, or co-workers. 1 cluster ___ c) Have the Board of Elections assign each voter a number, 2 convenience then select 400 of them using a random number table. 3 judgment ___ d) Go to a downtown street corner, a grocery store, and a 4 multistage shopping mall; interview 100 typical shoppers at each 5 simple (SRS) location. 6 stratified ___ e) Randomly pick 50 voters from each election district. 7 systematic ___ f) Call every 500th person in the phone book. 8 voluntary response ___ g) Randomly pick several city blocks, then randomly pick 10 residents from each block.. ___ h) Randomly select several city blocks; interview all the adults living on each block. 12. Telephone poll The City Council decides to conduct a telephone poll. Pollsters ask a carefully chosen random sample of adults this question: “Do you favor the construction of a new prison to deal with the high level of violent crime in our State?” In what way might the proportion of “Yes” answers fail to accurately reflect true public opinion? Explain briefly. What kind of bias is this?

.


Part III-12

Part III Gathering Data

13. Preservative Leather furniture used in public places can fade, crack, and deteriorate rapidly. An airport manager wants to see if a leather preservative spray can make the furniture look good longer. He buys eight new leather chairs and places them in the waiting area, four near the south-facing windows and the other four set back from the windows as shown. He assigned the chairs randomly to these spots. a. Use the random numbers given to decide which chairs to spray. Explain your method clearly. 32219 00597 86374

b. Briefly explain why your assignment strategy is important in helping the manager assess the effectiveness of the leather preservative.

14. Candy packaging Marketing researchers wonder if the color and type of a candy’s packaging may influence sales of the candy. They manufacture test packages for chocolate mints in three colors (white, green, and silver) and three types (box, bag, and roll). Suspecting that sales may depend on a combination of package color and type, the researchers prepare nine different packages, then market them for several weeks in convenience stores in various locations. In this experiment. a. what are the experimental units?

____________________

b. how many factors are there?

_______

c. how many treatments are there?

_______

d. what is the response variable?

____________________

15. Aggressiveness A recent study evaluated elementary age children for aggressiveness. This study found that the children who played video games were more likely to engage in aggressive or violent play at school. The researchers said the difference was statistically significant. a. Briefly explain what “statistically significant” means in this context.

b. The news media reported that this study proved that playing computer games causes children to be aggressive or violent. Briefly explain why this conclusion is not justified.

c. But perhaps it is true. We wonder if playing computer games can lead to aggressive or violent behavior in elementary school children. We find 50 young children whose families volunteer to participate in our research. Design an appropriate experiment. (You need not explain how to randomize.)

.


Review of Part III: Gathering Data

Part III-13

Statistics Test C – Gathering Data – Part III – Key 1. A

2. D

3. D

4. C

5. C

6. E

7. E

8. E

9. E

10. C

11. Public opinion a) 8 b) 2 c) 5 d) 3 e) 6 f) 7 g) 4 h) 1 12. Telephone poll Response bias: the wording will induce people to answer “yes,” so the poll will overestimate the public’s support. 13. Preservative a. Use one digit at a time, ignoring 0, 9 and any repeated numbers. Choose two chairs from each row. (Example: 3, 2, 5, and 7) b. Windows may play a role in the leather’s life (sun, heat, etc.). Blocking can reduce variability from that source. 14. Candy packaging a. packages b. 2 c. 9 d. sales 15. Aggressiveness a. The difference in aggressiveness was greater than random variation might reasonably be expected to produce. b. This was not a controlled experiment, so no determination of cause and effect can be made. Perhaps only violent kids like to play video games. c. Randomly divide the kids into two groups of 25. Have one group play video games, and prevent the other group from playing them for a period of time. Have observers evaluate the kids’ aggressiveness on the playground. Don’t let the kids know the purpose of the experiment or that they are being observed. And don’t let the evaluators know which kids are in each group. Compare the aggressiveness ratings for the two groups.

.



Chapter 12 From Randomness to Probability Quiz A covers sections 12.1-12.3. Quiz B covers sections 12.4-12.7. Quizzes C and D are longer quizzes that cover the entire chapter. Solutions to Class Examples 1. Solutions to Class Example 1: 

A car is not U.S.-made. Think – The events U.S.-made and not U.S.-made are complementary events. The Complement Rule can be used. Show – P(not U.S.-made)  1  P(U.S.-made)  1  0.40  0.60 Tell – The probability that a randomly selected car is not made in the U.S. is 60%.

It is made in Japan or Germany. Think – A car cannot be Japanese and German, so the events are disjoint. The Addition Rule can be used. Show – P(Japanese or German)  P(Japanese)  P(German)  0.30  0.10  0.40 Tell – The probability that a randomly selected car was made in Japan or Germany is 40%.

You see two in a row from Japan. Think – Since the cars are selected at random, the events are independent. The Multiplication Rule can be used. Show – P(two Japanese in a row)  P (Japanese)P (Japanese)  (0.30)(0.30)  0.09 Tell – The probability that two randomly selected cars were made in Japan is 9%.

None of three cars came from Germany. Think – Since the cars are selected at random, the events are independent. The Multiplication Rule can be used. Also, German and not German are complementary events, so the Complement Rule can be used. Show

P (No German cars in three cars)  P (Not German)P (Not German)P(Not German)  (0.90)(0.90)(0.90)  0.729

Tell – The probability that none of the three randomly selected cars are made in Germany is 72.9%. .


12-2

Part IV From the Data at Hand to the World at Large

At least one of three cars is U.S.-made. Think – Since the cars are selected at random, the events are independent. The Multiplication Rule can be used. Also, “at least one” and “none” are complementary events, so the Complement Rule can be used. Show – P (at least one U.S. in three)  1  P(No U.S. in three)

 1  (0.60)(0.60)(0.60)  0.784 Tell – The probability that at least one of the three randomly selected cars is made in the U.S. is 78.4%

The first Japanese car is the fourth one you choose. Think – Since the cars are selected at random, the events are independent. The multiplication rule can be used. Also, “Japanese” and “Not Japanese” are complementary events, so the Complement Rule can be used. Show – P (First J is the fourth car)  P (not J and not J and not J and J)   P(Not J)  P(J) 3

 (0.70)3 (0.30)  0.1029

Tell – The probability that the first Japanese car is the fourth car chosen is 10.29%. 2. Solutions will vary depending on the object you choose and the data you collect. 3. See Class Example 3. 4. See Class Example 4. 5. Solutions to Class Example 5:

 One card is drawn. What is the probability it is an ace or red? (General Addition Rule) Answer:

Think – The events “ace” and “red” are not disjoint. There are 2 red aces in the deck. The General Addition Rule can be used. 26 28  522  52 Show – P(ace or red)  P(ace)  P(red)  P(ace and red)  524  52 28 Tell – The probability that a randomly selected card is an ace or red is 52 .

 Two cards are drawn without replacement. What is the probability they are both aces? (General Multiplication Rule) Extend to the probability of getting 5 hearts in a row. Answer:

Think – Two cards are drawn without replacement. The two draws are not independent. The General Multiplication Rule can be used. 12 Show – P(both cards are aces)  P(ace)P(ace|first draw was ace)   524  513   2652 12 . Tell – The probability that two randomly selected cards are both aces is 2652 10 9 12 11 P (5 hearts)   13 52  51  50  49  48   0.0005

.


Chapter 12 From Randomness to Probability

12-3

 I draw one card and look at it. I tell you it is red. What is the probability it is a heart? And what is the probability it is red, given that it is a heart? (Conditional probability) Answer: Think – The card is selected at random. Show –

P(heart|red) 

1 P(heart and red) 13  52  (half of the red cards are hearts) 26 P(red) 2 52

P(red|heart) 

P(red and heart) 13 52  13  1 (all of the hearts are red) P(heart) 52

Tell – If a randomly chosen card is a red, the probability that it is a heart is 0.5. If a randomly chosen card is a heart, it is certain to be red.

 Are “red card” and “spade” independent? Mutually exclusive?

Answer: “Red card” and “spade” are not independent events. There’s a 25% chance a card is a spade, but if you know that you have a red card, you are certain that you don’t have a spade. The events are mutually exclusive (disjoint), since there are no red cards that are also spades.

 Are “red card” and “ace” independent? Mutually exclusive?

Answer: “Red card” and “ace” are independent events. The probability that a card is red is 0.5. The probability that a card chosen from the four aces is red is also 0.5. P(Red)  P(Red | Ace) The events are not mutually exclusive (not disjoint), since there are two red aces.

 Are “face card” and “king” independent? Mutually exclusive?

Answer: “Face card” and “king” are not independent. The probability of drawing a face card is 12 52 . The probability that a face card is drawn from the four kings is 1. P(face)  P(face | King) The events are not mutually exclusive (not disjoint), since all four kings are face cards.

6. Solutions to Class Example 6:

Collect your own data, or use the two-way table provided. Remember that your own data may or may not show evidence that gender and wearing jeans are independent.

 What is the probability that a male wears

M F Total

Jeans 12 8 20

Other 5 11 16

Total 17 19 36

jeans? Answer:

Think – Assume that a male student is selected at random from the class. Consider only the “M” row. Show – P(Jeans|M)  12 17

Tell – The probability that a male wears jeans is 12 17 .

.


12-4

Part IV From the Data at Hand to the World at Large

 What is the probability that someone wearing jeans is a male? Answer:

Think – Assume that a student wearing jeans is selected at random from the class. Show – P(M|Jeans)  12 20 Tell – The probability that someone wearing jeans is male is 12 20 .

 Are being male and wearing jeans disjoint?

Answer: No, being male and wearing jeans are not disjoint events. There are 12 males who were wearing jeans.

 Are sex and attire independent? Have students explain in context what they need to know

to ascertain independence. There are many ways to check for independence; write several such statements down, in words and formula. Answer: Sex and attire are not independent in this class, since P(Jeans)  P(Jeans|M ). 20 , or 56%. The probability that a male is The overall probability of wearing jeans is 36 wearing jeans is 12 17 , or 71%. Males are more likely to be wearing jeans than students in general.

.


Chapter 12 From Randomness to Probability

12-5

7. Solution to Class Example 7: Think – We’re given conditional probabilities, so a tree diagram is useful in this case. We are interested in the probability that a woman who tests positive actually has ovarian cancer. Show –

P (Ovarian Cancer | positive) = =

P (Ovarian Cancer and positive) P (positive) (0.0002)(0.9997) (0.0002)(0.9997) + (0.9998)(0.05)

» 0.00398

Tell – The probability that a woman who tests positive using this method actually has ovarian cancer is only 0.4%. Since the cancer is so rare, the overwhelming majority of positive test results are false positives.

.


12-6

Part IV From the Data at Hand to the World at Large

Statistics Quiz A – Chapter 12

Name

1. Five multiple choice questions, each with four possible answers, appear on your history exam. What is the probability that if you just guess, you a. get none of the questions correct? b. get all of the questions correct? c. get at least one of the questions wrong? d. get your first incorrect answer on the fourth question? 2. Mars, Incorporated manufactures bags of Peanut Butter M&M’s. They report that they make 10% each brown and red candies, and 20% each yellow, blue, and orange candies. The rest of the candies are green. a. If you pick a Peanut Butter M&M at random, what is the probability that i. it is green? ii. it is a primary color (red, yellow, or blue)? iii. it is not orange? b. If you pick four M&M’s in a row, what is the probability that i. they are all blue? ii. none are green? iii. at least one is red? iv. the fourth one is the first one that is brown? c. After picking 10 M&M’s in a row, you still have not picked a red one. A friend says that you should have a better chance of getting a red candy on your next pick since you have yet to see one. Comment on your friend’s statement.

.


Chapter 12 From Randomness to Probability

12-7

Statistics Quiz A – Chapter 12 - Key

1. Five multiple choice questions, each with four possible answers, appear on your history exam. What is the probability that if you just guess, you a. get none of the questions correct? 5 P  get no questions correct   P  get all questions wrong    0.75   0.2373

b. get all of the questions correct? 5 P  get all questions correct    0.25   0.00098

c. get at least one of the questions wrong? 5 P  get at least 1 wrong   1  P  get all questions correct   1   0.25   0.999 d. get your first incorrect answer on the fourth question? P  first incorrect on fourth   P  correct   P  correct   P  correct   P  wrong    0.25    0.25   0.25    0.75   0.0117

2. Mars, Incorporated manufactures bags of Peanut Butter M&M’s. They report that they make 10% each brown and red candies, and 20% each yellow, blue, and orange candies. The rest of the candies are green. a. If you pick a Peanut Butter M&M at random, what is the probability that i. it is green? P  green   1  P  not green   1   0.1  0.1  0.2  0.2  0.2   1  0.8  0.2 ii. it is a primary color (red, yellow, or blue)? P  red, yellow, or blue   P  red   P  yellow   P  blue   0.1  0.2  0.2  0.5 iii. it is not orange? P  not orange   1  P  orange   1  0.2  0.8 b. If you pick four M&M’s in a row, what is the probability that i. they are all blue? 4 P  all blue    0.2   0.0016 ii. none are green? 4 P  none green    0.8   0.4096 iii. at least one is red? 4 P  at least one red   1  P  none red   1   0.9   1  0.6561  0.3439 iv. the fourth one is the first one that is brown? P  fourth is first brown   P  not br   P  not br   P  not br   P  br 

  0.9    0.9    0.9    0.1  0.0729

c. After picking 10 M&M’s in a row, you still have not picked a red one. A friend says that you should have a better chance of getting a red candy on your next pick since you have yet to see one. Comment on your friend’s statement. Assuming a very large number of M&Ms are manufactured, selections of candies are essentially independent, so the colors of the M&M’s picked so far have no effect on the color of the next M&M selected.

.


12-8

Part IV From the Data at Hand to the World at Large

Statistics Quiz B – Chapter 12

Name

1. A survey of families revealed that 58% of all families eat turkey at holiday meals, 44% eat ham, and 16% have both turkey and ham to eat at holiday meals. a. What is the probability that a family selected at random had neither turkey nor ham at their holiday meal?

b. What is the probability that a family selected at random had only ham without having turkey at their holiday meal?

c. What is the probability that a randomly selected family having turkey had ham at their holiday meal?

d. Are having turkey and having ham disjoint events? Explain.

2. Many school administrators watch enrollment numbers for answers to questions parents ask. Some parents wondered if preferring a particular science course is related to the student’s preference in foreign language. Students were surveyed to establish their preference in science and foreign language courses. Does it appear that preferences in science and foreign language are independent? Explain. Chemistry Physics Biology Total French 16 10 8 34 35 23 44 102 Spanish 51 33 52 136 Total

3. For purposes of making budget plans for staffing, a college reviewed student’s year in school and area of study. Of the students, 22.5% are seniors, 25% are juniors, 25% are sophomores, and the rest are freshmen. Also, 40% of the seniors major in the area of humanities, as did 39% of the juniors, 40 % of the sophomores, and 36% of the freshmen. What is the probability that a randomly selected humanities major is a junior? Show your work.

.


Chapter 12 From Randomness to Probability

12-9

Statistics Quiz B – Chapter 12 – Key

1. a. P(neither ham nor turkey)  1  P(ham  turkey)  1  [ P(ham)  P(turkey)  P (ham  turkey)]  1  [0.44  0.58  0.16]  1  0.86  0.14 Or, using the Venn diagram at the right, 14% b.

P(ham only)  P(ham) – P(ham  turkey)  0.44 – 0.16  0.28 Or, using Venn Diagram at the right, 28%.

c. P  ham | turkey  

P  ham  turkey  0.16   0.2759 P(turkey) 0.58

d. No, the events are not disjoint, since some families (16%) have both ham and turkey at their holiday meals. 2. Overall, 102 of 136, or 75%, preferred Spanish. 35 of 51, or 68.6%, of students in Chemistry had Spanish. 23 of 33, or 69.6%, of students in Physics had Spanish, and 44 of 52, or 84.6% of students in Biology had Spanish. Chemistry and Physics students were somewhat less likely to take Spanish than Biology students. It appears that there is an association between preference in science and foreign language. 3.

P(humanities major) = 0.09 + 0.0975 + 0.1 + 0.099 = 0.3865 P(J  H) 0.0975 P( junior | humanities major)    0.2523 P(H) 0.3865

.


12-10

Part IV From the Data at Hand to the World at Large

Statistics Quiz C – Chapter 12

Name

1. A new clothing store advertises that during its Grand Opening every customer that enters the store can throw a bouncy rubber cube onto a table that has squares labeled with discount amounts. The table is divided into ten regions. Five regions award a 10% discount, two regions award a 20% discount, two regions award a 30% discount, and the remaining region awards a 50% discount. (SHOW WORK) 10 30 10 30 10 20 10 50 10 20 a. What is the probability that a customer gets more than a 20% discount?

b. What is the probability that a customer gets less than a 20% discount?

c. What is the probability that the first two customers both get a 50% discount?

d. What is the probability that none of the first three customers gets more than a 30% discount?

e. What is the probability that the first customer to win a 30% discount is the sixth customer that enters the store?

f. What is the probability that there is at least one customer to win a 50% discount among the first five customers that enter the store?

g. As you enter the store you watch the four people in front of you all win 50% discounts. The store manager tells you how lucky you are to be throwing the cube while it is on a hot streak, but the friend with you says you’re unlucky because the streak can’t continue. Comment on their statements.

.


Chapter 12 From Randomness to Probability

12-11

2. According to the American Pet Products Manufacturers Association (APPMA) 2003-2004 National Pet Owners Survey, 39% of U.S. households own at least one dog and 34% of U.S. households own at least one cat. Assume that 60% of U.S. households own a cat or a dog. a. What is the probability that a randomly selected U.S. household owns neither a cat nor a dog?

b. What is the probability that a randomly selected U.S. household owns both a cat and a dog?

c. What is the probability that a randomly selected U.S. household owns a cat if the household has a dog?

3. A manufacturing firm orders computer chips from three different companies: 10% from Company A; 20% from Company B; and 70% from Company C. Some of the computer chips that are ordered are defective: 4% of chips from Company A are defective; 2% of chips from Company B are defective; and 0.5% of chips from Company C are defective. A worker at the manufacturing firm discovers that a randomly selected computer chip is defective. What is the probability that the computer chip came from Company B? Show your work.

.


12-12

Part IV From the Data at Hand to the World at Large

4. A survey of an introductory statistics class asked students whether or not they ate breakfast the morning of the survey. Results are as follows: Breakfast Yes No Total Male 66 66 132 Sex Female 125 74 199 Total 191 140 331 a. What is the probability that a randomly selected student is female?

b. What is the probability that a randomly selected student ate breakfast?

c. What is the probability that a randomly selected student is a female who ate breakfast

d. What is the probability that a randomly selected student is female, given that the student ate breakfast?

e. What is the probability that a randomly selected student ate breakfast, given that the student is female?

f. Does it appear that whether or not a student ate breakfast is independent of the student’s sex? Explain.

.


Chapter 12 From Randomness to Probability

12-13

Statistics Quiz C – Chapter 12 – Key

1. A new clothing store advertises that during its Grand Opening every customer that enters the store can throw a bouncy rubber cube onto a table that has squares labeled with discount amounts. The table is divided into ten regions. Five regions award a 10% discount, two regions award a 20% discount, two regions award a 30% discount, and the remaining region awards a 50% discount. (SHOW WORK) 10 30 10 30 10 20 10 50 10 20 a. What is the probability that a customer gets more than a 20% discount? 2 1 3 a. P(more than 20%) = P(30  50)  P(30)  P (50)     0.3 10 10 10 b. What is the probability that a customer gets less than a 20% discount? 5 a. P(less than 20%) = P(10)   0.5 10

c. What is the probability that the first two customers both get a 50% discount? 2 1 1 a. P  50  50   P(50)  P(50)      0.01  10  100 d. What is the probability that none of the first three customers gets more than a 30% discount? a. P(not over 30%) = 1 – P(50%) = 1 – 0.1 = 0.9 b. P(three in a row not over 30%) = (0.9)3 = 0.729 e. What is the probability that the first customer to win a 30% discount is the sixth customer that enters the store? 2 a. P(not 30%) = 1 – P(30%) = 1   0.8 10 5 b. P(5 not 30%, then 30%) =  0.8  0.2   0.065536  0.07 f. What is the probability that there is at least one customer to win a 50% discount among the first five customers that enter the store? a. P(at least one 50% in five) = 1 – P(no 50% in five) = 5 9 1     1  0.59049  0.40951  0.41  10  g. As you enter the store you watch the four people in front of you all win 50% discounts. The store manager tells you how lucky you are to be throwing the cube while it is on a hot streak, but the friend with you says you’re unlucky because the streak can’t continue. Comment on their statements. a.

The tosses are independent, so if the table and cube are fair the four winners in front have no effect on the next person’s chances of tossing a 50% discount.

.


12-14

Part IV From the Data at Hand to the World at Large

2. According to the American Pet Products Manufacturers Association (APPMA) 2003-2004 National Pet Owners Survey, 39% of U.S. households own at least one dog and 34% of U.S. households own at least one cat. Assume that 60% of U.S. households own a cat or a dog. NOTE: solutions based on a Venn Diagram are acceptable. a. What is the probability that a randomly selected U.S. household owns neither a cat nor a dog? P  neither cat nor dog   1  P  cat  dog   1  0.6  0.4 b. What is the probability that a randomly selected U.S. household owns both a cat and a dog? P  cat  dog   P  cat   P  dog   P  cat  dog  0.60  0.34  0.39  P  cat  dog 

P  cat  dog   0.13

c. What is the probability that a randomly selected U.S. household owns a cat if the household has a dog? P  cat  dog  0.13 1 P  cat dog     P  dog  0.39 3 3. A manufacturing firm orders computer chips from three different companies: 10% from Company A; 20% from Company B; and 70% from Company C. Some of the computer chips that are ordered are defective: 4% of chips from Company A are defective; 2% of chips from Company B are defective; and 0.5% of chips from Company C are defective. A worker at the manufacturing firm discovers that a randomly selected computer chip is defective. What is the probability that the computer chip came from Company B? Show your work.

P  B defective  

P  B  defective  0.0040   0.3478 P  defective  0.0040  0.0040  0.0035

.


Chapter 12 From Randomness to Probability

12-15

4. A survey of an introductory statistics class asked students whether or not they ate breakfast the morning of the survey. Results are as follows: Breakfast Yes No Total Male 66 66 132 Sex Female 125 74 199 Total 191 140 331 a. What is the probability that a randomly selected student is female? 199 P  female    0.601 331 b. What is the probability that a randomly selected student ate breakfast? 191 P  ate breakfast    0.577 331 c. What is the probability that a randomly selected student is a female who ate breakfast? 125 P  female  ate breakfast    0.378 331 d. What is the probability that a randomly selected student is female, given that the student ate breakfast? P(female  breakfast) 125 P  female breakfast     0.654 P(breakfast) 191 e. What is the probability that a randomly selected student ate breakfast, given that the student is female? P(breakfast  female) 125 P  breakfast female     0.628 P(female) 199 f. Does it appear that whether or not a student ate breakfast is independent of the student’s sex? Explain. Since P  breakfast female   0.628  0.577  P  breakfast  , eating breakfast is not independent of the sex of the student. Females seem to be a bit more likely to eat breakfast.

.


12-16

Part IV From the Data at Hand to the World at Large

Statistics Quiz D – Chapter 12

Name

1. A survey of local car dealers revealed that 64% of all cars sold last month had CD players, 28% had alarm systems, and 22% had both CD players and alarm systems. a. What is the probability one of these cars selected at random had neither a CD player nor an alarm system?

__________ b.

What is the probability that a car had a CD player unprotected by an alarm system?

__________ c.

What is the probability a car with an alarm system had a CD player? __________

d.

Are having a CD player and an alarm system disjoint events? Explain.

2. A sporting goods store announces a “Wheel of Savings” sale. Customers select the merchandise they want to purchase, then at the cash register they spin a wheel to determine the size of the discount they will receive. The wheel is divided into 12 regions, like a clock. Six of those regions are red, and award a 10% discount. The three white regions award a 20% discount and two blue regions a 40% discount. The remaining region is gold, and a customer whose lucky spin lands there gets a 100% discount - the merchandise is free! (SHOW WORK)

a. What is the probability that a customer gets at least a 40% discount? __________ b. What is the probability that three consecutive customers all get 20% discounts? __________ c. What is the probability that none of the first four customers gets a discount over 20%? __________ d. What is the probability that the first gold winner (100%) is the fifth customer in line? __________ .


Chapter 12 From Randomness to Probability

12-17

e. As you wait your turn in line there are three gold winners in a row. A lively discussion ensues between the next two customers. One thinks that streak about kills her chances of winning free merchandise, as the wheel won’t come up gold again for a very long time. The other says that the wheel is clearly on a hot streak, so they are lucky to be next in line. Comment on their opinions.

3. All airline passengers must pass through security screenings, but some are subjected to additional searches as well. Some travelers who carry laptops wonder if that makes them more likely to be searched. Data for 420 passengers aboard a cross-country flight are summarized in the table shown. Does it appear that being subjected to an additional search is independent of carrying a laptop computer? Explain. Searched? Yes No Total Laptop 30 42 72 No laptop 145 203 348 Total 175 245 420

4. For purposes of making on-campus housing assignments, a college classifies its students as Priority A (seniors), Priority B (juniors), and Priority C (freshmen and sophomores). Of the students who choose to live on campus, 10% are seniors, 20% are juniors, and the rest are underclassmen. The most desirable dorm is the newly constructed Gold dorm, and 60% of the seniors elect to live there. 15% of the juniors also live there, along with only 5% of the freshmen and sophomores. What is the probability that a randomly selected resident of the Gold dorm is a senior? Show your work clearly on the back of this page.

.


12-18

Part IV From the Data at Hand to the World at Large

Statistics Quiz D – Chapter 12 – Key

1. a. P  CD  A   P(CD)  P( A)  P(CD  A)  0.64  0.28  0.22  0.70 , so P(neither) = 1 – 0.70 = 0.30 b. P  CD  AC   0.64  0.22  0.42 c. P(CD | A) 

P(CD  A) 0.22   0.786 P( A) 0.28

d. No. P(CD  A)  0.22 . If they were disjoint, the probability of a car having both features would be 0. 2. A sporting goods store announces a “Wheel of Savings” sale. Customers select the merchandise they want to purchase, then at the cash register they spin a wheel to determine the size of the discount they will receive. The wheel is divided into 12 regions, like a clock. Six of those regions are red, and award a 10% discount. The three white regions award a 20% discount and two blue regions a 40% discount. The remaining region is gold, and a customer whose lucky spin lands there gets a 100% discount the merchandise free! (SHOW WORK) a. What is the probability that a customer gets at least a 40% discount? P(at least 40) = P(40 or 100) = P(40) + P(100) = 122  121  123 = 0.25 b. What is the probability that three consecutive customers all get 20% discounts? 3 P(20 and 20 and 20) = P(20)  P(20)  P(20) =  123  = 0.016 c. What is the probability that none of the first four customers gets a discount over 20%? P(not over 20) = 1 – P(over 20) = 1 – P(at least 40) = 1 – 0.25 = 0.75 4 P(4 in a row not over 20) =  0.75  = 0.316 d. What is the probability that the first gold winner (100%) is the fifth customer in line? 11 P(not gold) = 1 – P(gold) = 1 - 121 = 12 11 P(4 not gold, then gold) =  12   121  = 0.059 4

.


Chapter 12 From Randomness to Probability

12-19

e. As you wait your turn in line there are three gold winners in a row. A lively discussion ensues between the next two customers. One thinks that streak about kills her chances of winning free merchandise, as the wheel won’t come up gold again for a very long time. The other says that the wheel is clearly on a hot streak, so they are lucky to be next in line. Comment on their opinions. The spins are independent, so if the wheel is fair the three gold winners in a row have no effect on the next person’s chances. The first argument is based on the Law of Averages, and certainly wrong. If the wheel is somehow imbalanced to favor the gold space, the second argument might be valid, but we haven’t enough evidence.

3. Overall, 175 of 420 passengers, or 41.7% were searched. 30 of 72 people carrying laptops were searched, also 41.7%. Carrying a laptop does not change the probability that a passenger is searched, so the events are independent.

4.

P( A | G) 

P( A  G) 0.060 0.060    0.48 P(G ) 0.060  0.030  0.035 0.125

.


12-20

Part IV From the Data at Hand to the World at Large

Statistics Quiz E – Chapter 12

Name

1. The weather reporter predicts that there is a 20% chance of snow tomorrow for a certain region. What is meant by this phrase? A. It will rain 20% of the day tomorrow. B. The occurrence of snow is "truly random" and will occur 20% of the time. C. 20% of the time it snows on this date. D. In circumstances "like this," snow occurs 20% of the time. E. Snow occurs 20% of the time in this region. 2. A manufacturing process has a 70% yield, meaning that 70% of the products are acceptable and 30% are defective. If three of the products are randomly selected, find the probability that all of them are acceptable. A. 0.343 B. 2.1 C. 0.027 D. 0.429 E. 0.7 3. In a blood testing procedure, blood samples from 5 people are combined into one mixture. The mixture will only test negative if all the individual samples are negative. If the probability that an individual sample tests positive is 0.12, what is the probability that the mixture will test positive? A. 0.528 B. 0.0144 C. 1.00 D. 0.0000249 E. 0.472 4. Determine whether the events of rolling a fair die two times are disjoint, independent, both, or neither. A. Disjoint B. Independent C. Both D. Neither

.


Chapter 12 From Randomness to Probability

12-21

5. A professor divided the students in her business class into three groups: those who have never taken a statistics class, those who have taken only one semester of a statistics class, and those who have taken two or more semesters of statistics. The professor randomly assigns students to groups of three to work on a project for the course. If 10% of the students have never taken a statistics class, 35% have taken only one semester of a statistics class, and the rest have taken two or more semesters of statistics, what is the probability that at least one of the first two groupmates you meet has studied more than one semester of statistics? A. 0.303 B. 0.798 C. 0.45 D. 0.203 E. 0.55 6. A survey of senior citizens at a doctor's office shows that 40% take blood pressure-lowering medication, 47% take cholesterol-lowering medication, and 13% take both medications. What is the probability that a senior citizen takes either blood pressure-lowering or cholesterol-lowering medication? A. 1 B. 0.6 C. 0.74 D. 0.87 E. 0 7. For a person selected randomly from a certain population, events A and B are defined as follows. A = event the person is male B = event the person is a smoker Find P( A or B) . Round approximations to two decimal places. A. 0.4128 B. 0.57 C. 0.41 D. 0.87 E. 0.72

.


12-22

Part IV From the Data at Hand to the World at Large

8. College students were given three choices of pizza toppings and asked to choose one favorite. The following table shows the results. toppings freshman sophomore junior senior cheese 16 16 26 29 meat 25 29 16 16 veggie 16 16 25 29 Find P(favorite topping is meat | student is junior). A. 0.239 B. 0.16 C. 0.186 D. 0.301 E. 0.062 9. College students were given three choices of pizza toppings and asked to choose one favorite. The following table shows the results. toppings freshman sophomore junior senior cheese 16 16 26 29 meat 25 29 16 16 veggie 16 16 25 29 Given that a student's favorite topping is meat, what is the probability that the student is a junior? A. 0.209 B. 0.14 C. 0.059 D. 0.289 E. 0.179 10. You roll a fair die five times. What is the probability that the numbers you roll are not all 3's? A. 0.9999 B. 0.0001 C. 0.833 D. 0.402 E. 0.598

.


Chapter 12 From Randomness to Probability

Statistics Quiz E – Chapter 12 - Key

1. D 2. A 3. E 4. B 5. B 6. C 7. B 8. A 9. E 10. A

.

12-23


12-24

Part IV From the Data at Hand to the World at Large

Statistics Quiz F – Chapter 12

Name

1. At the track, a gambler bets on the wrong horse in a 10-horse field nine times in a row. Later, when talking to a friend, he said he was confident that he would pick the winner the next time, because he was "due to pick a winner." Comment on his reasoning. A. If he doesn't pick the winning horse next time, he will shortly after that. B. When there are 10 horses in a race and he has chosen the wrong horse nine times in a row, he statistically should pick a winner the next time. C. This is false reasoning because there is no "law of averages" for independent events. D. This is false reasoning because he doesn't appear to be lucky. E. None of the above apply. 2. A professor divided the students in her business class into three groups: those who have never taken a statistics class, those who have taken only one semester of a statistics class, and those who have taken two or more semesters of statistics. The professor randomly assigns students to groups of three to work on a project for the course. If 65% of the students have never taken a statistics class, 15% have taken only one semester of a statistics class, and the rest have taken two or more semesters of statistics, what is the probability that the first groupmate you meet has studied at least two semesters of statistics? A. 0.85 B. 0.35 C. 0.15 D. 0.80 E. 0.20 3. A professor divided the students in her business class into three groups: those who have never taken a statistics class, those who have taken only one semester of a statistics class, and those who have taken two or more semesters of statistics. The professor randomly assigns students to groups of three to work on a project for the course. If 55% of the students have never taken a statistics class, 25% have taken only one semester of a statistics class, and the rest have taken two or more semesters of statistics, what is the probability that the first groupmate you meet has studied some statistics? A. 0.75 B. 0.80 C. 0.45 D. 0.25 E. 0.20

.


Chapter 12 From Randomness to Probability

12-25

4. A professor divided the students in her business class into three groups: those who have never taken a statistics class, those who have taken only one semester of a statistics class, and those who have taken two or more semesters of statistics. The professor randomly assigns students to groups of three to work on a project for the course. If 35% of the students have never taken a statistics class, 25% have taken only one semester of a statistics class, and the rest have taken two or more semesters of statistics, what is the probability that neither of the first two groupmates you meet has studied any statistics? A. 0.35 B. 0.70 C. 0.65 D. 0.123 E. 0.40 5. A professor divided the students in her business class into three groups: those who have never taken a statistics class, those who have taken only one semester of a statistics class, and those who have taken two or more semesters of statistics. The professor randomly assigns students to groups of three to work on a project for the course. If 10% of the students have never taken a statistics class, 35% have taken only one semester of a statistics class, and the rest have taken two or more semesters of statistics, what is the probability that both of the first two groupmates you meet have studied at least one semester of statistics? A. 0.90 B. 0.35 C. 0.123 D. 0.81 E. 1.8 6. One ball is removed from a bag containing 1 blue ball, 1 red ball, 1 yellow ball, and 1 green ball. Without returning the first ball to the bag a second ball is removed. Determine whether the events are disjoint, independent, both, or neither. A. Disjoint B. Independent C. Both D. Neither 7. A survey revealed that 45% of people are entertained by reading books, 46% are entertained by watching TV, and 9% are entertained by both books and TV. What is the probability that a person will be entertained by either books or TV? A. 0.82 B. 0.9 C. 0.91 D. 0.8 E. 1

.


12-26

Part IV From the Data at Hand to the World at Large

8. In one city, 47.5% of adults are female, 10.2% of adults are left-handed, and 4.8% are females. For an adult selected at random from the city, let F = event the person is female L = event the person is left-handed. Find P(F or L). Round approximations to three decimal places. A. 0.482 B. 0.529 C. 0.577 D. 0.6245 E. 0.679 9. College students were given three choices of pizza toppings and asked to choose one favorite. The following table shows the results. toppings freshman sophomore junior senior cheese 16 16 26 29 meat 25 29 16 16 veggie 16 16 25 29 Find P(favorite topping is veggie | student is junior or senior). A. 0.316 B. 0.46 C. 0.359 D. 0.605 E. 0.197 10. College students were given three choices of pizza toppings and asked to choose one favorite. The following table shows the results. toppings freshman sophomore junior senior cheese 16 16 26 29 meat 25 29 16 16 veggie 16 16 25 29 Given that a student's favorite topping is veggie, what is the probability that the student is a junior or a senior? A. 0.208 B. 0.380 C. 0.385 D. 0.41 E. 0.631

.


Chapter 12 From Randomness to Probability

Statistics Quiz F – Chapter 12 - Key

1. C 2. E 3. C 4. D 5. D 6. A 7. A 8. B 9. C 10. E

.

12-27



Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions Solutions to Class Examples 1. Solution to Class Example 1: Think – We want a confidence interval for the proportion of all voters who will vote “yes” on the upcoming school budget, based on a random sample of 330 voters. In our sample, 144 respondents stated that they would vote “yes.” We want to be reasonably confident in our results, so we will construct a confidence interval of ±2 standard errors around our sample proportion, or approximately 95% confidence. Plausible Independence Condition: It is reasonable to think that the responses were mutually independent, provided good surveying techniques were used. Random Sampling Condition: The voters were sampled randomly. 10% Condition: Provided there are more than 3300 eligible voters, 330 is less than 10% of the population of voters. Success/Failure Condition: npˆ  144 and nqˆ  186 , which are both at least 10, so the sample is large enough.

The conditions are satisfied, so I can use the Normal model to find a one proportion z-interval. ˆ Show – n  330 and pˆ  144 330  0.436 , so SE ( p ) 

ˆˆ pq (0.436)(0.564)   0.0273 330 n

The margin of error is ME  z *  SE ( pˆ )  2(0.0273)  0.055 The confidence interval is 0.436  0.055 or (0.381, 0.491) Tell – We are (approximately) 95% confident that between 38.1% and 49.1% of voters will vote “yes” on the upcoming budget. 2. Solution to Class Example 2: Think – We want a confidence interval for the proportion of all people who will improve after using a new medication, based on an experiment involving 53 subjects. In our sample, 27% of the subjects improved. Plausible Independence Condition: One patient responding to the medication shouldn’t have an influence on other patients responding to the medication. Random Sampling Condition: The patients were part of an experiment, which hopefully included random assignment of volunteers to treatment groups. 10% Condition: The 10% condition doesn’t apply. We are testing medication, not patients. Success/Failure Condition: npˆ  53(0.27)  14 and nqˆ  53(0.73)  39 , which are both at least 10, so the sample is large enough.

.


13-2

Part IV From the Data at Hand to the World at Large

The conditions are satisfied, so I can use the Normal model to find a one proportion z-interval. Show – n  53 and pˆ  0.27 , so SE ( pˆ ) 

ˆˆ pq (0.27)(0.73)   0.061 53 n

The margin of error is ME  z *  SE ( pˆ )  1.96(0.061)  0.12 The confidence interval is 0.27  0.12 or (0.15, 0.39) Tell – We are 95% confident that between 15% and 39% of people will improve after using the new medication.

This interval is quite wide for a couple of reasons. The sample size of 53 people doesn’t provide a great deal of accuracy in our estimate. The standard error is still quite large. Also, the more confident we are that we have succeeded in capturing the true proportion, the less precise our interval becomes. 90% Confidence Interval

The margin of error is ME  z *  SE ( pˆ )  1.645(0.061)  0.10 The confidence interval is 0.27  0.10 or (0.17, 0.37) We are 90% confident that between 17% and 37% of people will improve after using the new medication. This interval has the advantage of being more precise, but we are less confident in our ability to capture the true proportion within our interval. Finding the sample size required for ± 5%, with 98% confidence.

Consider the formula for margin of error. We believe the improvement rate to be 0.27 from our preliminary study. The value of z * for 98% confidence is 2.326. p̂q̂

To be safe, always round the estimate up.

ME  z

We need to run an experiment with at least 427 people in receiving the new medication to have a margin of error of ± 5%, with 98% confidence.

0.05  2.326

*

n (0.27)(0.73) n

 (0.27)(0.73)   n 

0.052  2.3262  n

.

2.3262 (0.27)(0.73) 0.052

 426.546


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-3

3. Solution to Class Example 3:

We’ll do what polling organizations usually do, using 95% confidence and choosing the most cautious proportion, 50%. A sample size of at least 1068 likely voters is required.

ME  z

*

ˆˆ pq n

(0.5)(0.5) n (0.5)(0.5) 0.032 1.962 n 0.031.96

n

1.962 (0.5)(0.5)  1067.11 0.032

4. See Class Example 4. 5. See Class Example 5. 6. Answers will vary depending on the class data. 7. Solutions to Class Example 7:

 Of all cars on the interstate, 80% exceed the speed limit. What proportion of speeders might we see among the next 50 cars? Solution: Think – We want to find the distribution of the proportion of the next 50 cars that may be speeding on the highway. 80% of all cars on the highway are speeding. 10% Condition: The 50 cars can be considered a representative sample of all cars on the highway, and 50 is less than 10% of all cars on the highway. Success/Failure Condition: np = 50(0.80) = 40, nq = 50(0.20) = 10 are both at least 10.

Therefore, the sampling model for the proportion of speeders in 50 cars has mean 0.80 pq (0.80)(0.20) and standard deviation of SD( pˆ )    0.057 . n 50 The model for p̂ is N(0.80,0.057). Show –

Tell – According to the Normal model, we expect 68% of the samples of 50 cars to have proportions of speeders between 0.743 and 0.857, 95% of the samples to have proportions between 0.686 and 0.914, and 99.7% of the samples to have proportions between 0.629 and 0.971.

 We don’t know it, but 52% of voters plan to vote “Yes” on the upcoming school budget.

We poll a random sample of 300 voters. What might the percentage of yes-voters appear to be in our poll? .


13-4

Part IV From the Data at Hand to the World at Large

Solution: Think – We want to find the distribution of the proportion of 300 voters that may vote “Yes” on the upcoming school budget. 52% of voters plan to vote “Yes.” 10% Condition: The 300 voters can be considered a representative sample of all voters, and 300 is less than 10% of all voters. Success/Failure Condition: np = 300(0.52) = 156, nq = 300(0.48) = 144 are at least 10.

Therefore, the sampling model for the proportion of “Yes”-voters in a sample of 300 has pq (0.52)(0.48)   0.029 . mean 0.52 and standard deviation of SD( pˆ )  n 300 The model for p̂ is N(0.52,0.029). Show –

Tell – According to the Normal model, we expect 68% of the samples of 300 voters to have proportions of “Yes”-voters between 0.491 and 0.549, 95% of the samples to have proportions between 0.462 and 0.578, and 99.7% of the samples to have proportions between 0.433 and 0.607.

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-5

8. Solution to Class Example 8:

 Blue M&M’s are supposed to make up 24% of the candies sold. In a large bag of 250 M&M’s, what is the probability that we get at least 20% blue candies? Solution: Think – We want to find the probability of getting a bag of 250 M&M’s that has at least 20% blue candies. 10% Condition: The 250 M&M’s can be considered a representative sample of all M&M’s, and 250 is less than 10% of all M&M’s. Success/Failure Condition: np = 250(0.24) = 60, nq = 250(0.76) = 190 are at least 10.

Therefore, the sampling model for the proportion of “groovy” M&M’s in a sample of 250 pq (0.24)(0.76) has mean 0.24 and standard deviation of SD( pˆ )    0.027 . n 250 The model for p̂ is N(0.24,0.027). Show –

z

z

pˆ  p pq n 0.20  0.24

(0.24)(0.76) 250 z  1.48

Tell – According to the Normal model, the probability that a bag contains at least 20% blue M&M’s is approximately 93.1%. Investigative Tasks

An Investigative Task is provided. Quiz A covers only sampling distributions. Quiz B covers only confidence intervals. Quizzes C and D are comprehensive and longer.

.


13-6

Part IV From the Data at Hand to the World at Large

Statistics – Investigative Task

Chapter 15

Simulated Coins In this Task you will investigate the sampling distribution and model for the proportion of heads that may show up when a coin is tossed repeatedly. Toss the coins if you want, but it’s much easier and faster to do a simulation! 1. Set up the calculator’s or computer software package’s random number generator to simulate tossing a coin 25 times. (The easiest way to do this is to generate 0’s and 1’s with equal probability, with 1 representing heads. By adding up all the 0’s and 1’s you can effectively count the number of heads. Dividing that count by the number of tosses will get you p̂ , the sample proportion of heads. For example, on a TI-84, the complete calculator command would be sum(randInt(0,1,25))/25.) 2. Run 20 trials, recording all the sample proportions and make a histogram of the results. 3. Repeat your simulation, this time tossing the coin 100 times. Again make a histogram of twenty sample proportions. 4. Compare your two distributions of the proportions of heads observed in your simulations. 5. What should have happened? Describe the sampling model for 100 tosses. 6. Compare the actual distribution of your twenty sample proportions for 100 tosses to what the sampling model predicts. 7. Describe how your results might differ if you had run 1000 trials of the simulation instead of only 20.

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

Statistics – Investigative Task

Chapter 13

Components Demonstrates clear understanding of sampling distributions and models. Simulations and Histograms: o completes 20 trials for each o constructs well-labeled graphs o uses the same scale for comparison Model: o checks conditions o has correct parameters Compares histograms: o shapes (roughly symmetric?) o centers (both near 50%?) o spreads (less variability for 100?) Compares histogram to model: o sketches curve on comparable scale or uses the 68-95-99.7 Rule o compares shape, center, and spread o suggests that more trials should produce a better fit

Think

Show

Tell

13-7

Comments

Components are scored as Essentially correct, Partially correct, or Incorrect 1: Simulations and Histograms E – Have all 3 features P – Graphs constructed or labeled poorly, have different scales, or not 20 trials I – Graphs are inappropriate or incorrect.

2: Comparison of Histograms E – Correctly compares shapes, centers, and spreads P – Correctly compares only two of the three features I – At most one comparison is correct

3: The Model E – Checks conditions, then correctly describes the shape, center, and spread P – Fails to check conditions, or some aspect of description is incorrect I – Makes several mistakes in verifying or describing the sampling model.

4: Comparison of Histogram and Model – A complete comparison will discuss shapes, centers, and spreads, will have comparative graphs or invoke the 68-95-99.7 Rule, and will note that more runs should produce an observed distribution in closer agreement with the theoretical model. E – Comparison has all 5 listed properties. P – Comparison has 3 or 4 of the listed properties. I – Comparison has fewer than 3 of the properties.

Scoring

 

E’s count 1 point, P’s are 1/2 Grade A = 4, B = 3, etc. +/- based on rounding (Ex: 3.5 rounded to 3 is B+)

Name ________________________________

.

Grade ____


13-8

Part IV From the Data at Hand to the World at Large

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Model Solution - Investigative Task – Simulated Coins

In order to investigate the sampling distribution and model for the proportion of heads that may show up when a coin is tossed repeatedly, I will use simulation. I will examine the similarities and differences between the sampling distributions of the proportions of heads in 25 flips and 100 flips. Component – One coin toss is a component. Generate random digits 0-1, with 0 = tails and 1 = heads. Trial – A trial consists of 25 simulated flips for the first simulation, then 100 flips for the second simulation. Response variable – The response variable is the number of heads. Statistic – The statistics calculated is the proportion of heads. This will be calculated by dividing the number of heads by 25 or 100, depending on which simulation is being conducted. The results of the simulations are summarized in the following histograms.

Both distributions of proportions are at least roughly unimodal and symmetric. The distribution of 25 simulated flips has a mean proportion of heads of 0.464, while the distribution of 100 simulated flips has a mean proportion of heads of 0.505, both fairly close to the expected proportion of 0.5. The distribution of 25 simulated flips is much more spread out than the distribution of 100 simulated flips, with standard deviations of 0.122 and 0.057, respectively. Coin flips are independent of one another, and 100 flips is a sufficiently large sample size. Furthermore, we know that coin flips follow a binomial model, which is unimodal and symmetric. The sampling model is Normal, with mean 0.50 and standard deviation, pq (0.5)(0.5) SD( pˆ )    0.05 . n 100

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-9

According to the Normal model, we expect 68% of samples of 100 flips to have between 45% and 55% heads, 95% of samples to have between 40% and 60% heads, and 99.7% of samples to have between 35% and 65% heads.

Both the simulated distribution and the sampling model are unimodal and symmetric, centered around 50% heads, with similar spread. The simulated distribution has standard deviation 5.7%, compared to 5% for the sampling model. If 1000 trials had been run, we would expect the sampling model to generally be a much better fit to the simulated distribution of the proportion of heads.

.


13-10

Part IV From the Data at Hand to the World at Large

Statistics Quiz A – Chapter 13

Name

1. According to Gallup, about 33% of Americans polled said they frequently experience stress in their daily lives. Suppose you are in a class of 45 students. a. What is the probability that no more than 12 students in the class will say that they frequently experience stress in their daily lives? (Make sure to identify the sampling distribution you use and check all necessary conditions.)

b. If 20 students in the class said they frequently experience stress in their daily lives, would you be surprised? Explain, and use statistics to support your answer.

2. A fair coin is flipped 100 times. a. Describe the center, shape, and spread of the sampling distribution of the proportion of heads.

b. What conditions are satisfied that ensure your answers in part (a) are correct?

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-11

Statistics Quiz A – Chapter 13 – Key

1. a. We want to find the probability that no more than 12 students in the class will say that they frequently experience stress. This is the same as asking the probability of finding less than 26.7% of “stressed” students in a class of 45 students. Check the conditions: 1. 10% condition: 45 students is less than 10% of all students who could take the class 2. Success/failure cond.: np  45(0.33)  14.85 , nq  45(0.67)  30.15 , which both exceed 10 (0.33)(0.67) We can use the N(0.33 ,  0.070 ) to model the sampling distribution. 45 We need to standardize the 26.7% and the find the probability of getting a z-score less 0.267  0.33  0.90 than or equal to the one we find: z  0.070 P( pˆ  0.267)  P  z  0.90   0.1841 , so the probability is about 18.4% that no more than 12 students will say they frequently experience stress in their daily lives. b. From part a, we can use N( 0.33, 0.070) to model the sampling distribution. Twenty students is about 44.4% of the class. This is about 1.63 standard deviations above what we would expect, which is not a surprising result. 2. A fair coin is flipped 100 times. a. Describe the center, shape, and spread of the sampling distribution of the proportion of heads. The center will be at 0.50, the shape will be approximately normal, and the spread will be (0.5)(0.5)  0.05 100

b. What conditions are satisfied that ensure your answers in part (a) are correct? A fair coin is … fair, so the center will be at 0.50. Coin flips are independent, so the standard deviation is correct. Because 0.5*100 = 50 > 10, we are assured a normal model is appropriate.

.


13-12

Part IV From the Data at Hand to the World at Large

Statistics Quiz B – Chapter 13

Name

The countries of Europe report that 46% of the labor force is female. The United Nations wonders if the percentage of females in the labor force is the same in the United States. Representatives from the United States Department of Labor plan to check a random sample of over 10,000 employment records on file to estimate a percentage of females in the United States labor force. 1. They select a random sample of 525 employment records, and find that 229 of the people are females. Check the required conditions to construct a confidence interval.

2. Create the confidence interval.

3. Interpret the confidence interval in this context.

4. Explain what 90% confidence means in this context.

5. Should the representatives from the Department of Labor conclude that the percentage of females in their labor force is lower than Europe’s rate of 46%? Explain.

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-13

Statistics Quiz B – Chapter 13 – Key

The countries of Europe report that 46% of the labor force is female. The United Nations wonders if the percentage of females in the labor force is the same in the United States. Representatives from the United States Department of Labor plan to check a random sample of over 10,000 employment records on file to estimate a percentage of females in the United States labor force. 1. They select a random sample of 525 employment records, and find that 229 of the people are females. Check the required conditions to construct a confidence interval. We have a random sample of less than 10% of the employment records, with 229 successes (females) and 296 failures (males), so a Normal model applies.

2. Create the confidence interval. ˆˆ  0.436  0.564   0.022 pq  n 525 margin of error: ME  z *  SE ( pˆ )  1.645  0.022   0.0362 n = 525, pˆ  0.436 and qˆ  0.564 , so SE ( pˆ ) 

Confidence interval: pˆ  ME  0.436  0.0362 or (0.3998, 0.4722)

3. Interpret the confidence interval in this context. We are 90% confident that between 40.0% and 47.2% of the employment records from the United States labor force are female.

4. Explain what 90% confidence means in this context. If many random samples were taken, 90% of the confidence intervals produced would contain the actual percentage of all female employment records in the United States labor force.

5. Should the representatives from the Department of Labor conclude that the percentage of females in their labor force is lower than Europe’s rate of 46%? Explain. No. Since 46% lies in the confidence interval, (0.3998, 0.4722), it is possible that the percentage of females in the labor force matches Europe’s rate of 46% female in the labor force.

.


13-14

Part IV From the Data at Hand to the World at Large

Statistics Quiz C – Chapter 13

Name

1. It is generally believed that electrical problems affect about 14% of new cars. An automobile mechanic conducts diagnostic tests on 128 new cars on the lot. a. Describe the sampling distribution for the sample proportion by naming the model and telling its mean and standard deviation. Justify your answer.

b. Sketch and clearly label the model.

2. A statistics professor asked her students whether or not they were registered to vote. In a sample of 50 of her students (randomly sampled from her 700 students), 35 said they were registered to vote. a. Find a 95% confidence interval for the true proportion of the professor’s students who were registered to vote. (Make sure to check any necessary conditions and to state a conclusion in the context of the problem.)

b. Explain what 95% confidence means in this context.

c. According to a Gallup poll, about 73% of 18- to 29-year-olds said that they were registered to vote. Does the 73% figure from Gallup seem reasonable for the professor’s students? Explain.

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-15

Statistics Quiz C – Chapter 13 – Key

1. It is generally believed that electrical problems affect about 14% of new cars. An automobile mechanic conducts diagnostic tests on 128 new cars on the lot. a. Describe the sampling distribution for the sample proportion by naming the model and telling its mean and standard deviation. Justify your answer. We can assume these cars are a representative sample of all new cars, and certainly less than 10% of them. We expect np = (128)(0.14) = 17.92 successes (electrical problems) and nq = (128)(0.86) = 110.08 failures (no problems) so the sample is large enough to use the sampling model N(0.14, 0.031).  0.14  0.86   0.031 pq SD  pˆ    n 128

b. Sketch and clearly label the model.

2. A statistics professor asked her students whether or not they were registered to vote. In a sample of 50 of her students (randomly sampled from her 700 students), 35 said they were registered to vote. a. Find a 95% confidence interval for the true proportion of the professor’s students who were registered to vote. (Make sure to check any necessary conditions and to state a conclusion in the context of the problem.) We have a random sample of less than 10% of the professor’s students, with 35 expected successes (registered) and 15 expected failures (not registered), so a Normal model applies. ˆˆ  0.70  0.30   0.065 35 pq n  50, pˆ   0.70, qˆ  1  pˆ  0.30 , so SE  pˆ    50 n 50 Our 95% confidence interval is: pˆ  z  SE  pˆ   0.70  1.96  0.065   0.70  0.127  0.573 to 0.827 We are 95% confident that between 57.3% and 82.7% of the professor’s students are registered to vote.

b. Explain what 95% confidence means in this context. If many random samples were taken, 95% of the confidence intervals produced would contain the actual percentage of the professor’s students who are registered to vote.

c. According to a Gallup poll, about 73% of 18- to 29-year-olds said that they were registered to vote. Does the 73% figure from Gallup seem reasonable for the professor’s students? Explain. The 73% figure from Gallup seems reasonable since 73% lies in our confidence interval.

.


13-16

Part IV From the Data at Hand to the World at Large

Statistics Quiz D – Chapter 13

Name

1. It is generally believed that nearsightedness affects about 12% of children. A school district gives vision tests to 133 incoming kindergarten children. a. Describe the sampling distribution model for the sample proportion by naming the model and telling its mean and standard deviation. Justify your answer.

b. Sketch and clearly label the model.

c. What is the probability that in this group over 15% of the children will be found to be nearsighted? __________ 2. A state’s Department of Education reports that 12% of the high school students in that state attend private high schools. The State University wonders if the percentage is the same in their applicant pool. Admissions officers plan to check a random sample of the over 10,000 applications on file to estimate the percentage of students applying for admission who attend private schools. a. They select a random sample of 450 applications, and find that 46 of those students attend private schools. Check the conditions required for inference.

b. Create the 90% confidence interval.

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-17

c. Interpret the confidence interval in this context.

d. Explain what 90% confidence means in this context.

e. Should the admissions officers conclude that the percentage of private school students in their applicant pool is lower than the statewide enrollment rate of 12%? Explain.

.


13-18

Part IV From the Data at Hand to the World at Large

Statistics Quiz D – Chapter 13 – Key

1. It is generally believed that nearsightedness affects about 12% of children. A school district gives vision tests to 133 incoming kindergarten children. a. Describe the sampling distribution model for the sample proportion by naming the model and telling its mean and standard deviation. Justify your answer. We can assume these kids are a random sample of all children, and certainly less than 10% of them. We expect np = 133(0.12) = 15.96 successes and 117.04 failures so the sample size is large enough to use the sampling model N(0.12, 0.028).

b. Sketch and clearly label the model.

c. What is the probability that in this group over 15% of the children will be found to be nearsighted? 0.15  0.12  1.07 0.028 P( pˆ  0.15)  P(z > 1.07) = 0.142 z

2. A state’s Department of Education reports that 12% of the high school students in that state attend private high schools. The State University wonders if the percentage is the same in their applicant pool. Admissions officers plan to check a random sample of the over 10,000 applications on file to estimate the percentage of students applying for admission who attend private schools. a. They select a random sample of 450 applications, and find that 46 of those students attend private schools. Check the conditions required for inference. We have a random sample of less than 10% of the applicants, with 46 successes and 404 failures, so a Normal model applies

b. Create the confidence interval. The confidence interval is (0.079, 0.126).

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-19

c. Interpret the confidence interval in this context. We are 90% confident that between 7.9% and 12.6% of the applicants attend private high schools.

d. Explain what 90% confidence means in this context. If many random samples were taken, 90% of the confidence intervals produced would contain the actual percentage of all applicants who attend private schools.

e. Should the admissions officers conclude that the percentage of private school students in their applicant pool is lower than the statewide enrollment rate of 12%? Explain. No. Since 12% lies in the confidence interval it’s possible that the percentage of private school students in the applicant pool matches the statewide enrollment rate.

.


13-20

Part IV From the Data at Hand to the World at Large

Statistics Quiz E – Chapter 13

Name

1. In a large class, the professor has each person toss a coin several times and calculate the proportion of his or her tosses that were heads. The students then report their results, and the professor plots a histogram of these several proportions. If the students toss the coin 80 times each, about 95% should have proportions between what two numbers? Use the 68-95-99.7 Rule to provide the appropriate response. A. 0.025 and 0.975 B. 0.39 and 0.62 C. 0.1118 and 0.1677 D. 0.4875 and 0.5125 E. 0.2375 and 0.7375 2. Based on past experience, a bank believes that 4% of the people who receive loans will not make payments on time. The bank has recently approved 300 loans. What is the mean of the proportion of clients in this group who may not make timely payments? A.   4% B.   3.46% C.   1.1% D.   2% E.   12% 3. Based on past experience, a bank believes that 6% of the people who receive loans will not make payments on time. The bank has recently approved 300 loans. What is the standard deviation of the proportion of clients in this group who may not make timely payments? A.   1.88% B.   1.41% C.   1.37% D.   4.11% E.   4.24% 4. Based on past experience, a bank believes that 7% of the people who receive loans will not make payments on time. The bank has recently approved 300 loans. What is the probability that over 8% of these clients will not make timely payments? A. 0.752 B. 0.504 C. 0.248 D. 0.68 E. 0.496

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-21

5. The number of hours per week that high school seniors spend on computers is normally distributed, with a mean of 4 hours and a standard deviation of 2 hours. 60 students are chosen at random. Let represent the mean number of hours spent on the computer for this group. Find the probability that is between 4.2 and 4.4. A. 0.041 B. 0.159 C. 0.317 D. 0.720 E. 0.775 6. In a survey of 300 T.V. viewers, 40% said they watch network news programs. Find the margin of error for this survey if we want 95% confidence in our estimate of the percent of T.V. viewers who watch network news programs. A. 4.16% B. 7.28% C. 6.37% D. 5.54% E. 8.36% 7. Of 346 items tested, 12 are found to be defective. Construct a 98% confidence interval for the percentage of all such items that are defective. A. (0.13%, 6.80%) B. (3.34%, 3.59%) C. (0.93%, 6.00%) D. (1.85%, 5.09%) E. (1.18%, 5.76%) 8. A survey of shoppers is planned to see what percentage use credit cards. Prior surveys suggest 54% of shoppers use credit cards. How many randomly selected shoppers must we survey in order to estimate the proportion of shoppers who use credit cards to within 1% with 95% confidence? A. 16,471 B. 8589 C. 20,745 D. 6718.3 E. 9543 9. In a poll of 929 voters in a certain city, 68% said that they backed a bill which would limit growth and development in their city. The margin of error in the poll was reported as 3 percentage points (with a 95% degree of confidence). Which statement is correct? A. The reported margin of error is consistent with the sample size. B. The sample size is too large to achieve the stated margin of error. C. For the given sample size, the margin of error should be smaller than stated D. The sample size is too small to achieve the stated margin of error. E. There is not enough information to determine whether the margin of error is consistent with the sample size.

.


13-22

Part IV From the Data at Hand to the World at Large

10. The college daily reported: "300 students living in university housing were polled. 180 said that they were satisfied with their living conditions. Based on this survey we conclude that 60% of students living in dormitories are satisfied. The margin of error of the study is ± 6 percentage points (with a 95% degree of confidence). Which statement is correct? A. A larger sample should be used to achieve the stated margin of error. B. The margin of error is consistent with the sample size. C. The sample size is too small to achieve the stated margin of error. D. The stated margin of error could have been achieved with a smaller sample size. E. There is not enough information to determine whether the margin of error is consistent with the sample size.

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

Statistics Quiz E – Chapter 13 – Key

1. B 2. A 3. C 4. C 5. B 6. C 7. E 8. E 9. A 10. D

.

13-23


13-24

Part IV From the Data at Hand to the World at Large

Statistics Quiz F – Chapter 13

Name

1. In a large class, the professor has each person toss a coin several times and calculate the proportion of his or her tosses that were heads. The students then report their results, and the professor plots a histogram of these several proportions. If the students toss the coin 100 times each, about 68% should have proportions between what two numbers? Use the 68-95-99.7 Rule to provide the appropriate response. A. 0.34 and 0.67 B. 0.45 and 0.55 C. 0.16 and 0.84 D. 0.495 and 0.505 E. 0.05 and 0.1 2. Assume that 15% of students at a university wear contact lenses. We randomly pick 200 students. What is the mean of the proportion of students in this group who may wear contact lenses? A.   5.48% B.   2.52% C.   7.5% D.   15% E.   30% 3. Assume that 10% of students at a university wear contact lenses. We randomly pick 200 students. What is the standard deviation of the proportion of students in this group who may wear contact lenses? A.   4.5% B.   2.24% C.   2.12% D.   4.47% E.   4.24% 4. Assume that 15% of students at a university wear contact lenses. We randomly pick 200 students. What is the probability that more than 16% of this sample wear contact lenses? A. 0.309 B. 0.397 C. 0.654 D. 0.346 E. 0.692

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

13-25

5. The number of hours per week that high school seniors spend on homework is normally distributed, with a mean of 10 hours and a standard deviation of 3 hours. 60 students are chosen at random. Let represent the mean number of hours spent on homework for this group. Find the probability that is between 9.8 and 10.4. A. 0.547 B. 0.5161 C. 0.080 D. 0.1528 E. 0.3043 6. A survey found that 79% of a random sample of 1024 American adults approved of cloning endangered animals. Find the margin of error for this survey if we want 90% confidence in our estimate of the percent of American adults who approve of cloning endangered animals. A. 2.49% B. 21.4% C. 2.09% D. 5.43% E. 4.56% 7. When 346 college students are randomly selected and surveyed, it is found that 121 own a car. Construct a 99% confidence interval for the percentage of all college students who own a car. A. (12.1%, 57.8%) B. (29.9%, 40.0%) C. (29.0%, 40.9%) D. (28.4%, 41.6%) E. (30.8%, 39.2%) 8. A university's administrator proposes to do an analysis of the proportion of graduates who have not found employment in their major field one year after graduation. In previous years, the percentage averaged 15%. He wants the margin of error to be within 5% at a 99% confidence level. What sample size will suffice? A. 17 B. 138.3 C. 407 D. 196 E. 339

.


13-26

Part IV From the Data at Hand to the World at Large

9. In a poll of 268 voters in a certain city, 60% said that they backed a bill which would limit growth and development in their city. The margin of error in the poll was reported as 6 percentage points (with a 95% degree of confidence). Which statement is correct? A. The stated margin of error could be achieved with a larger sample size. B. The reported margin of error is consistent with the sample size. C. The stated margin of error could be achieved with a smaller sample size. D. The sample size is too small to achieve the stated margin of error. E. There is not enough information to determine whether the margin of error is consistent with the sample size. 10. After conducting a survey, a researcher wishes to cut the standard error (and thus the margin of error) to of its original value. How will the necessary sample size change? A. It will decrease by a factor of 3. B. It will increase by a factor of 3. C. It will decrease by a factor of 9. D. It will increase by a factor of 9. E. Not enough information is given.

.


Chapter 13 Sampling Distribution Models and Confidence Intervals for Proportions

Statistics Quiz F – Chapter 13 – Key

1. B 2. D 3. C 4. D 5. A 6. C 7. D 8. E 9. C 10. D

.

13-27



Chapter 14 Confidence Intervals for Means Solutions to Class Examples 1. Solution to Class Example 1:

 Is there evidence that the company has met its goal? We want to know if the mean battery lifespan exceeds the 300-minute goal set by the manufacturer. We have 12 battery lifespans in our sample to test the claim. We will use our confidence interval (below) to answer this question. 

Find a 90% confidence interval for the mean lifespan of this type of battery.

4 3 2 1 240

280

320

Battery Lifespans (min)

Randomization Condition: This is not a random sample of batteries, but merely 12 batteries produced for preliminary testing. However, it is reasonable to assume that these batteries are representative of all batteries and independent from each other. Nearly Normal Condition: The distribution of battery lifespans is roughly unimodal and symmetric, so it’s reasonable to assume that the lifespans of all batteries could be described by a Normal model.

Since the conditions have been met, we can do a one sample tinterval for the mean, with 11 degrees of freedom.

 29.31  y  t11*  SE ( y )  306.25  1.796    (291.05,321.45)  12  I am 90% confident that the mean battery lifespan is between 291.05 and 321.45 minutes. Since 300 is contained in our confidence interval, we can comfortably state that we have met our goal.

 If we wish to conduct another trial, how many batteries must we test to be 95% sure of estimating the mean lifespan to within 15 minutes? To within 5 minutes?

We want to know how many batteries to test to be 95% sure of estimating the mean lifespan to within 15 minutes. First, do a preliminary estimate using z *  1.96 as the critical value. Our first estimate is about 15 batteries, a small sample.  s  ME  z *    n  29.31  15  1.96    n  (1.96)(29.31) n 15 n  14.67

.


14-2

Part IV From the Data at Hand to the World at Large

Now, do a better estimate, using t14*  2.145 as the critical value.  s  ME  t *    n  29.31  15  2.145    n  (2.145)(29.31) n 15 n  17.56

We would need to sample about 18 batteries in order to estimate the mean battery lifespan to within 15 minutes, with 95% confidence.

Finally, to estimate the mean battery lifespan to within 5 minutes, you could do the entire process again, perhaps using a critical value with much higher degrees of freedom. We know that it’s going to take lots more batteries to cut the margin of error to a third of what it was. Alternatively, we know it will take a sample about 9 times as large, 18(9) = 162 batteries, since the margin of error was decreased to a third of its size. (The standard error involves the square root of the sample size in its denominator.) 2. See Class Example 2. (If you used our data, the mean SAT I Math Score is 611.32) Investigative Tasks

An Investigative Task is provided. Quiz A covers only sampling distributions. Quiz B covers only confidence intervals. Quiz C covers both topics.

.


Chapter 14 Confidence Intervals for Means

Statistics Quiz A – Chapter 14

14-3

Name

1. Herpetologists (snake specialists) found that a certain species of reticulated python have an average length of 20.5 feet with a standard deviation of 2.3 feet. The scientists collect a random sample of 30 adult pythons and measure their lengths. a. Describe the sampling distribution of this sample mean.

b. What conditions need to be satisfied for your answer on part (a) to be correct?

c. In their sample, the mean length was 19.5 feet long. One of the herpetologists fears that pollution might be affecting the natural growth of the pythons. Do you think this sample result is unusually small? Explain.

.


14-4

Part IV From the Data at Hand to the World at Large

Statistics Quiz A – Chapter 14-Key

1. Herpetologists (snake specialists) found that a certain species of reticulated python have an average length of 20.5 feet with a standard deviation of 2.3 feet. The scientists collect a random sample of 30 adult pythons and measure their lengths. a. Describe the sampling distribution of this sample mean. The approximate sampling model for sample means will be N(20.5, 0.42).

b. What conditions need to be satisfied for your answer on part (a) to be correct? We have a random sample of adult pythons drawn from a much larger population. A sample size of 30 is large enough, by the CLT to ensure normality if the sample is not strongly skewed or contains outliers.

c. In their sample, the mean length was 19.5 feet long. One of the herpetologists fears that pollution might be affecting the natural growth of the pythons. Do you think this sample result is unusually small? Explain. A sample mean of only 19.5 feet is about 2.38 standard deviations below what we expect. The sample mean of 19.5 feet is unusually small (probability of this length or smaller is 0.87%).

.


Chapter 14 Confidence Intervals for Means

Statistics Quiz B – Chapter 14

14-5

Name

A professor at a large university believes that students take an average of 15 credit hours per term. A random sample of 24 students who reported the following number of credit hours that they were taking: 12 13 14 14 15 15 15 16 16 16 16 16 17 17 17 18 18 18 18 19 19 19 20 21 1. What does this sample tell us about students at the university in general? Find and interpret a 95% confidence for the mean number of credit hours students are taking. Make sure to check the necessary conditions.

2. What does this interval say about the professor’s belief that students take an average of 15 credit hours?

.


14-6

Part IV From the Data at Hand to the World at Large

Statistics Quiz B – Chapter 14 – Key 1. * Randomization condition: Students from the class were randomly sampled, so the students are likely to be independent from each other. * Nearly Normal condition: The histogram of credit hours is unimodal and reasonably symmetric.

We know: n = 24, y  16.6 , s  2.22 , 2.22 and SE ( y )   0.453 . 24 s . We have n  t23  2.069 . Our 95% confidence interval is then 2.22 16.6  2.069  16.6  0.94 , or 15.66 to 17.54. 24

Our confidence interval has the form y  tn1

We are 95% confident that the interval 15.66 to 17.54 contains the true mean number of credit hours that students at the university are taking. 2. This interval is above 15 (its lower bound is 15.66). Thus, we conclude that the professor’s estimate is too low. We are fairly confident that students are taking more than 15 credit hours, on average.

.


Chapter 14 Confidence Intervals for Means

Statistics Quiz C – Chapter 14

14-7

Name

Insurance companies track life expectancy information to assist in determining the cost of life insurance policies. The insurance company knows that, last year, the life expectancy of its policyholders was 77 years. They want to know if their clients this year have a longer life expectancy, on average, so the company randomly samples some of the recently paid policies to see if the mean life expectancy of policyholders has increased. The insurance company will only change their premium structure if there is evidence that people who buy their policies are living longer than before. 86 75 83 84 81 77 78 79 79 81 76 85 70 76 79 81 73 74 72 83 1. To begin to answer this question, we will use a confidence interval. Find and interpret a 90% confidence interval for the mean life expectancy of these policyholders. Make sure to check the conditions required for this interval.

2. Using your confidence interval, what can you conclude the average life expectancy? Do we have evidence that it is increasing?

3. Your work on #1 depends on a sampling distribution model. Describe the center, shape, and spread of this model.

.


14-8

Part IV From the Data at Hand to the World at Large

Statistics Quiz C – Chapter 14 – Key 1. * Randomization condition: The records from the insurance company were randomly sampled, so the clients are likely to be independent from each other.

* Nearly Normal condition: The histogram of the ages at death is unimodal and reasonably symmetric. This is close enough to Normal for our purposes. Under these conditions, the sampling distribution of the mean can be modeled by Student’s t with df = n – 1 = 20 – 1 = 19. We will use a one-sample t-interval for the mean. We know: n = 20, y  78.6 years, and s  4.48 years. Our confidence interval has the form s y  tn1 . We have t19  1.729 . Our 90% confidence interval is then n 4.48 78.6  1.729  78.6  1.732 , or 76.9 to 80.3 years. 20

We are 90% confident that the interval 76.9 to 80.3 contains the true mean life expectancy for this company’s clients. 2. This interval contains 77. Thus, we don’t have evidence that the average life expectancy is increasing. 3. If conditions and assumptions are satisfied, the sampling distribution follows a t-distribution with 19 degrees of freedom with the estimated mean of 78.6 and an estimated standard deviation of 1.001.

.


Chapter 14 Confidence Intervals for Means

14-9

Investigative Task

This is the first of a pair of Tasks that use the same data set. The thrust of the Tasks is more important than the particular data you use. We use the Math and Verbal SAT scores reported for one high school, but other data will work. You may want to find something more relevant to your students and revise the Task accordingly. Ideally you need a reasonably large but manageable data set; 250 to 500 cases would work well. Math and Verbal SAT’s (or ACT’s) are ideal because each individual has two scores (making paired comparisons possible) and lots of statistics are available. The first part of the Task asks students to select a random sample, reviewing use of random numbers and sampling issues from Chapter 10. Based on the various samples, students construct confidence intervals, and then use their interval to compare the local performance to state or national results. And this affords you the opportunity to look at all their confidence intervals together, noting that most hit the target while others miss. The second in this pair of these Tasks comes after Chapter 17, asking students to do two hypothesis tests – one involving paired data and the other comparing two independent means. Have them hang on to the data sheet; they’ll need it again.

.


14-10

Part IV From the Data at Hand to the World at Large

Statistics – Investigative Task

Chapter 14

SAT Performance This is the first of two related Tasks. In these Tasks you will investigate four questions about students’ SAT scores at one high school. 1. What is the mean SAT-Math score at this high school? 2. How do these students’ SAT Math scores stack up against the rest of the state? 3. Is there a significant difference between Verbal and Math scores for these students? 4. Nationally males tend to have higher Math scores than females. Is that true at this high school, too? You have a copy of data from the SAT Score Roster sent to the high school by the College Board. It shows gender, Verbal scores, and Math scores, for 303 students in a recent graduating class. NOTE: You will not use all of these data, just a sample of 20 – 30 students. Your current assignment: investigate the first two questions posed above.  Draw a sample of these students, carefully explaining your procedure.  Use your sample to create a 95% confidence interval for the mean SAT-I Math score at this high school.  Based on your confidence interval, compare the performance of these students with the statewide mean Math score of 503.  Prepare a written report that includes complete demonstrations of the statistical procedures you use and your conclusions (in context, of course). Important: Save the Score Roster and your sample. You’ll be using them to answer the other two questions in a future Investigative Task – coming soon to a Statistics class near you.

.


Chapter 14 Confidence Intervals for Means

Statistics – SAT Data Gender F F M M M M F M F F M F F M M M F M F M M M F F F M M F F M F F M F M F F F M M M M M M M F F F F

Verbal 450 640 590 400 600 610 630 660 660 590 580 500 480 530 690 480 450 650 620 630 490 560 710 360 590 520 510 630 740 680 660 690 540 670 580 600 600 710 640 490 650 580 710 740 510 650 580 600 670

Math 450 540 570 400 590 610 610 570 720 640 650 540 360 520 640 570 420 650 600 740 520 620 720 390 530 630 600 630 760 690 700 750 560 520 720 560 560 640 650 590 680 710 730 700 560 760 660 640 700

Gender F F F M F M M F M F M M M M M M M F F M M M F M F M F M M M F M M M M F F M F F F F F M F M M M F F F

Verbal 640 370 620 680 620 590 430 680 650 650 660 640 460 430 460 800 310 510 770 750 580 410 430 630 630 480 420 800 650 580 560 670 610 510 560 770 530 650 730 690 610 710 730 450 400 700 580 600 630 600 630

.

Math 670 460 470 650 620 690 540 740 650 680 700 680 680 480 600 720 530 550 700 670 630 390 400 680 630 550 430 650 500 690 540 760 740 610 530 690 430 740 790 640 510 680 590 570 410 620 660 670 690 640 700

Gender M F F F F M M M M F F M M F F M M F F F F M F F F F F F F M F M F M M M F M F M F F M M M M F F M F F

Verbal 690 630 480 680 650 460 560 610 620 590 390 510 450 520 470 690 510 770 520 550 680 740 660 740 730 560 570 560 670 650 690 610 500 560 640 430 700 620 610 580 730 520 540 640 680 580 640 700 600 540 480

14-11

Math 740 560 500 630 680 410 560 760 650 640 440 530 440 690 410 620 540 710 570 600 580 700 690 640 680 630 530 540 520 710 700 740 650 700 650 490 570 670 640 640 570 530 580 610 720 490 630 650 630 510 540


14-12

Part IV From the Data at Hand to the World at Large

Gender F M F F M F F M M M F F M F M M F M F M F F M F M M M F M F M M M F F M F F M M M M F M M F M F M F M M

Verbal 710 650 640 370 710 630 590 750 600 610 490 680 520 680 650 600 550 490 530 560 630 510 710 550 690 700 540 280 710 640 600 610 680 520 730 510 620 530 550 590 620 640 490 530 560 710 480 490 460 630 510 600

Math 700 780 570 410 700 660 580 800 690 550 800 610 540 660 700 560 560 390 530 560 590 520 740 560 620 700 620 500 760 710 590 670 670 470 740 680 740 500 680 670 640 570 530 580 600 700 530 490 560 520 520 610

Gender F F M F F F F M M M M M M F M F F F M M M M F M F M M M F F F M M M F F M F F M M F F F M F F F F F M M

Verbal 550 670 560 600 630 440 650 650 690 360 510 610 700 540 600 610 700 490 390 580 760 730 590 640 610 760 570 700 630 530 490 600 580 710 600 560 570 630 640 550 690 530 580 550 690 420 690 560 560 490 430 690

.

Math 570 580 520 600 610 470 530 560 670 290 510 570 740 490 670 710 570 560 630 530 760 760 620 740 620 700 630 740 630 670 700 660 760 700 450 590 690 610 570 600 670 490 640 560 740 410 720 570 560 510 570 670

Gender F M F M M M M M F M F M M M F M M F F F M M M F M F M M M M M F F F F F F M F F M M F M M M M M

Verbal 600 700 630 620 600 450 690 590 550 760 650 540 580 670 650 660 500 610 580 570 690 570 600 540 640 680 730 610 500 580 620 570 690 610 630 540 760 470 340 480 690 620 710 440 400 650 600 700

Math 550 630 710 610 670 610 680 600 550 580 600 630 530 650 680 660 580 510 500 620 620 540 670 670 760 600 670 550 550 630 710 630 610 570 570 660 690 470 450 500 760 610 630 740 710 740 660 570


Chapter 14 Confidence Intervals for Means

Statistics – Investigative Task

14-13

Chapter 14

This task has 4 components; each is scored as Essentially correct, Partially correct, or Incorrect. 1. The Sample E – Selects a random sample, explains the sampling process clearly, calculates the correct summary statistics (perhaps with minor arithmetic errors), and uses the proper vocabulary and notation. P – Selects a random sample, but explanation of the process may be unclear or there may be mistakes in notation, or vocabulary. I – Sample is not random or the process is not explained or there are several major mistakes in arithmetic, notation, or vocabulary. 2. The Conditions E – Cites randomness, <10% of all possible students, and checks normality with a plot. P – Discusses normality but omits the plot or fails to mention one of the other conditions. I – Misunderstands or omits the conditions, or lists irrelevant issues (np ≥ 10). 3. The Mechanics E – Identifies the procedure, shows the sample statistics and degrees of freedom, writes the formula using correct critical value and notation, and calculates the correct interval (perhaps with minor arithmetic or rounding errors). P – Appears to be doing the proper procedure but omits important information, uses the wrong critical value, uses the wrong notation, or makes major errors in calculations. I – Uses the wrong procedure or shows no work or makes several major mistakes. 4. The Interpretation E – Correctly interprets the confidence interval in the proper context, and compares the local performance to statewide results. P – Writes a conclusion that is correct but not in context or doesn’t compare local performance to statewide results. I – Does not interpret the confidence interval correctly. Comments:

Scoring  E’s count 1 point, P’s are 1/2  Grade: A = 4, B = 3, etc., with +/- based on rounding (ex: 3.5 rounded to 3 is a B+)

Name ___________________________________ Grade ____ .


14-14

Part IV From the Data at Hand to the World at Large

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Model Solution - Investigative Task – SAT Performance

I want to determine the mean SAT-Math score at this high school. I will take a simple random sample of 25 students by assigning a number 001-300 to each of the students, and then choosing 25 random numbers 001-300, ignoring repeats. I want to find a 95% confidence interval for the mean SAT-Math score of all students at this high school. I have data on the scores of 25 students, from a simple random sample of the 300 students at this high school. Model. Randomization Condition: The students were chosen by a simple random sample. 10% Condition: 25 students represent less than 10% of the population of 300 students. Nearly Normal Condition: The distribution of SAT-Math scores in the sample is unimodal and symmetric.

The conditions are satisfied, so I will use a Student’s t-model with (n – 1) = 25 – 1 = 24 degrees of freedom to find a one-sample t-interval for the mean. Mechanics. From my sample of 25 students: n  25 scores

y  602.8

s  74.3034

 74.3034  * y  t24  SE ( y )  602.8  2.064    602.8  30.672  (572.13, 633.47) 25  

Conclusion. I am 95% confident that the interval from 572.13 to 633.47 contains the true mean SAT-Math score for students at this high school. According to my confidence interval, students at this high school had a higher mean SAT-Math score than students nationwide. The national mean of 503 was not included in my 95% confidence interval.

.


Chapter 14 Confidence Intervals for Means

Statistics Quiz D – Chapter 14

14-15

Name

1. Using the t-tables, software, or a calculator, estimate the critical value of t for the given confidence interval and degrees of freedom: 80% confidence interval with df = 11 A. 1.372 B. 2.718 C. 1.280 D. 1.356 E. 1.363 2. Use the t-tables, software, or a calculator to estimate the indicated P-value: P-value for t ≥ 1.76 with 24 degrees of freedom. A. 0.0592 B. 0.9544 C. 0.0456 D. 0.0912 E. 0.0228 3. n  10, x  13.0, s  4.5. Find a 95% confidence interval for the mean. A. (9.80, 16.20) B. (9.78, 16.22) C. (10.39, 15.61) D. (9.83, 16.20) E. (9.83, 16.17) 4. A sociologist develops a test to measure attitudes about public transportation, and 27 randomly selected subjects are given the test. Their mean score is 76.2 and their standard deviation is 21.4. Construct the 95% confidence interval for the mean score of all such subjects. A. (64.2, 83.2) B. (64.2, 88.2) C. (74.6, 77.8) D. (69.2, 83.2) E. (67.7, 84.7) 5. The principal randomly selected six students to take an aptitude test. Their scores were: 71.6

81.0

88.9

80.4

78.1

72.0

Determine a 90% confidence interval for the mean score for all students. A. (73.26, 84.07) B. (84.07, 73.26) C. (83.97, 83.97) D. (73.36, 83.97) E. (83.97, 73.36)

.


14-16

Part IV From the Data at Hand to the World at Large

6. Analysis of a random sample of 250 Illinois nurses produced a 95% confidence interval for the mean annual salary of $42,846 < μ(Nurse Salary) < $49,686 A. About 95% of Illinois nurses earn between $42,846 and $49,686. B. We are 95% confident that the average nurse salary in the U.S. is between $42,846 and $49,686. C. If we took many random samples of Illinois nurses, about 95% of them would produce this confidence interval. D. We are 95% confident that the interval from $42,846 to $49,686 contains the true mean of all Illinois nurses. E. About 95% of the nurses surveyed earn between $42,846 and $49,686. 7. A credit union took a random sample of 40 accounts and yielded the following 90% confidence interval for the mean checking account balance at the institution: $2183 < μ(balance) < $3828. A. If we took random samples of checking accounts at this credit union, about nine out of ten of them would produce this confidence interval. B. We are 90% confident that the mean checking account balance at this credit union is between $2183 and $3828, based on this sample. C. We are 90% confident that the mean checking account balance in the U.S. is between $2183 and $3828. D. About 9 out of 10 people have a checking account balance between $2183 and $3828. E. We are 90% sure that the mean balance for checking accounts in the sample was between $2183 and $3828 8. Based on a sample of 30 randomly selected years, a 90% confidence interval for the mean annual precipitation in one city is from 48.7 inches to 51.3 inches. Find the margin of error. A. 0.39 inches B. 0.10 inches C. 2.6 inches D. 1.3 inches E. There is not enough information to find the margin of error. 9. A certain population is bimodal. We want to estimate its mean, so we will collect a sample. Which should be true if we use a large sample rather than a small one? I. The distribution of our sample data will be more clearly bimodal. II. The sampling distribution of the sample means will be approximately normal. III. The variability of the sample means will be smaller. A. I only B. II only C. III only D. II and III E. I, II, and III

.


Chapter 14 Confidence Intervals for Means

14-17

10. We have calculated a confidence interval based on a sample of size n = 100. Now we want to get a better estimate with a margin of error that is only one-fourth as large. How large does our new sample need to be? A. 200 B. 1600 C. 50 D. 25 E. 400

.


14-18

Part IV From the Data at Hand to the World at Large

Statistics Quiz D – Chapter 14-Key

1. E 2. C 3. B 4. E 5. D 6. D 7. B 8. D 9. E 10. A

.


Chapter 14 Confidence Intervals for Means

Statistics Quiz E – Chapter 14

14-19

Name

1. Using the t-tables, software, or a calculator, estimate the critical value of t for the given confidence interval and degrees of freedom: 90% confidence interval with df = 4. A. 1.645 B. 1.533 C. 2.353 D. 2.132 E. 4.604 2. Use the t-tables, software, or a calculator to estimate the indicated P-value: P-value for t ≤ 1.76 with 24 degrees of freedom A. 0.0228 B. 0.9088 C. 0.9544 D. 0.9772 E. 0.0456 3. n  12, x  28.0, s  5.7. Find a 99% confidence interval for the mean. A. (22.79, 33.21) B. (22.89, 33.11) C. (22.79, 33.09) D. (22.91, 33.09) E. (23.53, 32.47) 4. Thirty randomly selected students took the calculus final. If the sample mean was 95 and the standard deviation was 6.4, construct a 99% confidence interval for the mean score of all students. A. (92.12, 97.88) B. (91.78, 98.22) C. (93.01, 98.21) D. (93.01, 96.99) E. (91.79, 98.21) 5. The amounts (in ounces) of juice in eight randomly selected juice bottles are: 15.0 15.9 15.8 15.7 15.4 15.2 15.2 15.3 Construct a 98% confidence interval for the mean amount of juice in all such bottles. A. (15.04, 15.84) B. (15.74, 15.14) C. (15.14, 15.74) D. (15.04, 15.74) E. (15.84, 15.04)

.


14-20

Part IV From the Data at Hand to the World at Large

6. Data collected by child development scientists produced the following 90% confidence interval for the average age (in months) at which children say their first word: 10.6 < μ(age) < 13.7. A. Based on this sample, we can say, with 90% confidence, that the mean age at which children say their first word is between 10.6 and 13.7 months. B. 90% of the children in this sample said their first word when they were between 10.6 and 13.7 months old. C. We are 90% sure that a child will say his first word when he is between 10.6 and 13.7 months old. D. We are 90% sure that the average age at which children in this sample said their first word was between 10.6 and 13.7 months. E. If we took many random samples of children, about 90% of them would produce this confidence interval. 7. A random sample of clients at a weight loss center were given a dietary supplement to see if it would promote weight loss. The center reported that the 100 clients lost an average of 48 pounds, and that a 95% confidence interval for the mean weight loss this supplement produced has a margin of error of ±7 pounds. A. 95% of the people who use this supplement will lose between 41 and 55 pounds. B. The average weight loss of clients who take this supplement will be between 41 and 55 pounds. C. 95% of the clients in the study lost between 41 and 55 pounds. D. We are 95% sure that the average weight loss among the clients in this study was between 41 and 55 pounds. E. We are 95% confident that the mean weight loss produced by the supplement in weight loss center clients is between 41 and 55 pounds. 8. Based on a sample of size 49, a 95% confidence interval for the mean score of all students, μ, on an aptitude test is from 59.2 to 64.8. Find the margin of error. A. 5.6 B. 0.78 C. 0.05 D. 2.8 E. There is not enough information to find the margin of error. 9. A certain population is strongly skewed to the right. We want to estimate its mean, so we will collect a sample. Which should be true if we use a large sample rather than a small one? I. The distribution of our sample data will be closer to normal. II. The sampling model of the sample means will be closer to normal. III. The variability of the sample means will be greater. A. I only B. II only C. III only D. I and III only E. II and III only

.


Chapter 14 Confidence Intervals for Means

10. Which is true about a 95% confidence interval based on a given sample? I. The interval contains 95% of the population. II. Results from 95% of all samples will lie in the interval. III. The interval is narrower than a 98% confidence interval would be. A. I only B. II only C. III only D. II and III only E. none of these

.

14-21


14-22

Part IV From the Data at Hand to the World at Large

Statistics Quiz A – Chapter 14-Key

1. D 2. C 3. B 4. B 5. C 6. A 7. E 8. D 9. B 10. C

.


Chapter 15 Testing Hypotheses Solutions to Class Examples 1. See Class Example 1. 2. Solution to Class Example 2: Hypotheses. The null hypothesis is that the proportion of M&M’s that are orange is 20%. The alternative hypothesis is that the proportion of M&M’s that are orange is something other than 20%. In symbols: H 0 : p  0.20, H A : p  0.20 Model. Independence Assumption: It is reasonable to think that M&M colors are mutually independent. Random Sampling Condition: The 122 M&M’s we have can be considered representative of all M&M’s. 10% Condition: 122 M&M’s is certainly less than 10% of all M&M’s. Success/Failure Condition: np  122(0.20)  24.4 and nq  122(0.80)  97.6 , which are both at least 10, so the sample is large enough.

The conditions are satisfied, so I can use the Normal model to perform a one proportion z-test. Since we are testing the hypothesis that the proportion of orange M&M’s is different than 0.20, we will use a 2 tailed test. Mechanics.

21  0.172 , n  122 , x  21 , pˆ  122

pˆ  p pq n 0.172  0.2 z (0.2)(0.8) 122 z  0.77

z

P  2  P( pˆ  0.172)  2  P ( z  0.77)  0.44

Conclusion. The P-value is quite high, so we fail to reject the null hypothesis. There is no evidence to suggest that the proportion of orange M&M’s differs from the advertised 20%. 3. See Class Example 3. 4. See Class Example 4.

Quizzes A and B are only proportions. Quiz C is proportions and means. Quiz D is only means.

.


15-2

Part IV From the Data at Hand to the World at Large

Statistics Quiz A – Chapter 15

Name

A report on health care in the United States said that 28% of Americans have experienced times when they haven’t been able to afford medical care. A news organization randomly sampled 801 black Americans, of whom 38% reported that there had been times in the last year when they had not been able to afford medical care. Does this indicate that this problem is more severe among black Americans? 1. Test an appropriate hypothesis and state your conclusion. (Make sure to check any necessary conditions and to state a conclusion in the context of the problem.)

2. Was your test one-tail upper tail, one-tail lower tail, or two-tail? Explain why you chose that kind of test in this situation.

3. Explain what your P-value means in this context.

.


Chapter 15 Testing Hypotheses

15-3

Statistics Quiz A – Chapter 15 – Key

A report on health care in the United States said that 28% of Americans have experienced times when they haven’t been able to afford medical care. A news organization randomly sampled 801 black Americans, of whom 38% reported that there had been times in the last year when they had not been able to afford medical care. Does this indicate that this problem is more severe among black Americans? 1. Test an appropriate hypothesis and state your conclusion. (Make sure to check any necessary conditions and to state a conclusion in the context of the problem.) Hypotheses: H 0 : p  0.28 . The proportion of all black Americans that were unable to afford medical care in the last year is 28%. H A : p  0.28 . The proportion of all black Americans that were unable to afford medical care in the last year is greater than 28%.

Model: Okay to use the Normal model because the sample is random, these 801 black Americans are less than 10% of all black Americans, and np  (801)(0.28)  224.28  10 and nq  (801)(0.72)  576.72  10 . We will do a one-proportion z-test. Mechanics:

n  801, pˆ  0.38 0.38  0.28 z  6.30  0.28 0.72  801 P  P( pˆ  0.38)  P  z  6.30   0

Conclusion: With a P-value so small (barely above zero), I reject the null hypothesis. There is evidence to suggest that the proportion of black Americans who were not able to afford medical care in the past year is more than 28%.

2. Was your test one-tail upper tail, one-tail lower tail, or two-tail? Explain why you chose that kind of test in this situation. One-tail, upper tail test. We are concerned that the proportion of people who are not able to afford medical care is higher among black Americans.

3. Explain what your P-value means in this context. If the proportion of black Americans was 28%, we would almost never expect to find at least 38% of 801 randomly selected black Americans responding “yes.”

.


15-4

Part IV From the Data at Hand to the World at Large

Statistics Quiz B – Chapter 15

Name

Females participated in every event at the 2012 Summer Olympic Games. Prior to 2012, there had been a steady increase in the female participation rate. The International Olympic Committee states that the female participation in the 2004 Summer Olympic Games was 42%, even with new sports such as weight lifting, hammer throw, and modern pentathlon recently added to the Games. Broadcasting and clothing companies wanted to change their advertising and marketing strategies if the female participation increased at the next games. Prior to the 2008 games, an independent sports expert arranged for a random sample of pre-Olympic exhibitions. The sports expert reported that 202 of 454 athletes in the random sample were women. Is this strong evidence that the participation rate may have increased? 1. Test an appropriate hypothesis and state your conclusion.

2. Was your test one-tail upper tail, lower tail, or two-tail? Explain why you choose that kind of test in this situation.

3. Explain what your P-value means in this context.

.


Chapter 15 Testing Hypotheses

15-5

Statistics Quiz B – Chapter 17 – Key

Females participated in every event at the 2012 Summer Olympic Games. Prior to 2012, there had been a steady increase in the female participation rate. The International Olympic Committee states that the female participation in the 2004 Summer Olympic Games was 42%, even with new sports such as weight lifting, hammer throw, and modern pentathlon recently added to the Games. Broadcasting and clothing companies wanted to change their advertising and marketing strategies if the female participation increased at the next games. Prior to the 2008 games, an independent sports expert arranged for a random sample of pre-Olympic exhibitions. The sports expert reported that 202 of 454 athletes in the random sample were women. Is this strong evidence that the participation rate may have increased? 1. Test an appropriate hypothesis and state your conclusion. Hypotheses: H 0 : p  0.42 . The female participation rate in the Olympics is 42%. H A : p  0.42 . The female participation rate in the Olympics is greater than 42%. Model: Okay to use the Normal model because the sample is random, these 454 athletes are less than 10% of all athletes at exhibitions, and np  (454)(0.42)  190.68  10 and nq  (454)(0.58)  263.32  10 . Use a N(0.42, 0.023) model, do a 1-proportion z-test. Mechanics: 202  0.445 454 0.445  0.42 z  1.08  0.42 .58 454

n  454, x  202, pˆ 

P  P( pˆ  0.445)  P( z  1.08)  0.140

Conclusion: With a P-value (0.140) so large, I fail to reject the null hypothesis that the proportion of female athletes is 0.42. There is not enough evidence to suggest that the proportion of female athletes had increased.

2. Was your test one-tail upper tail, lower tail, or two-tail? Explain why you choose that kind of test in this situation. One-tail, upper test. The companies will change strategies only if there is strong evidence of an increase in female participation rate from current rate of 42%.

3. Explain what your P-value means in this context. If the proportion of female athletes has not increased, we could expect to find at least 202 females out of 454 pre-Olympic athletes about 14.0% of the time.

.


15-6

Part IV From the Data at Hand to the World at Large

Statistics Quiz C – Chapter 15

Name

A company claims to have invented a hand-held sensor that can detect the presence of explosives inside a closed container. Law enforcement and security agencies are very interested in purchasing several of the devices if they are shown to perform effectively. An independent laboratory arranged a preliminary test. If the device can detect explosives at a rate greater than chance would predict, a more rigorous test will be performed. They placed four empty boxes in the corners of an otherwise empty room. For each trial they put a small quantity of an explosive in one of the boxes selected at random. The company’s technician then entered the room and used the sensor to try to determine which of the four boxes contained the explosive. The experiment consisted of 50 trials, and the technician was successful in finding the explosive 16 times. Does this indicate that the device is effective in sensing the presence of explosives, and should undergo more rigorous testing? 1. Test an appropriate hypothesis and state your conclusion.

2. Was your test one-tail upper tail, lower tail, or two-tail? Explain why you chose that kind of test in this situation.

3. Explain what your P-value means in this context.

.


Chapter 15 Testing Hypotheses

15-7

Textbook authors must be careful that the reading level of their book is appropriate for the target audience. Some methods of assessing reading level require estimating the average word length. We’ve randomly chosen 20 words from a randomly selected page in Intro Stats and counted the number of letters in each word: 5, 5, 2, 11, 1, 5, 3, 8, 5, 4, 7, 2, 9, 4, 8, 10, 4, 5, 6, 6 4. Suppose that our editor was hoping that the book would have a mean word length of 6.5 letters. Does this sample indicate that the authors failed to meet this goal? Test an appropriate hypothesis and state your conclusion.

.


15-8

Part IV From the Data at Hand to the World at Large

Statistics Quiz C – Chapter 15 – Key

A company claims to have invented a hand-held sensor that can detect the presence of explosives inside a closed container. Law enforcement and security agencies are very interested in purchasing several of the devices if they are shown to perform effectively. An independent laboratory arranged a preliminary test. If the device can detect explosives at a rate greater than chance would predict, a more rigorous test will be performed. They placed four empty boxes in the corners of an otherwise empty room. For each trial they put a small quantity of an explosive in one of the boxes selected at random. The company’s technician then entered the room and used the sensor to try to determine which of the four boxes contained the explosive. The experiment consisted of 50 trials, and the technician was successful in finding the explosive 16 times. Does this indicate that the device is effective in sensing the presence of explosives, and should undergo more rigorous testing? 1. Test an appropriate hypothesis and state your conclusion. Hypotheses: H 0 : p  0.25 . The device can detect explosives at the same level as guessing. H A : p  0.25 . The device can detect explosives at a level greater than chance.

Model: OK to use a Normal model because trials are independent (box is randomly chosen each time), and np = 12.5, nq = 37.5. Do a 1-proportion z-test. Mechanics: pˆ  0.32 , z = 1.14, P = 0.13 Conclusion: With a P-value so high I fail to reject the null hypothesis. This test does not provide convincing evidence that the sensor can detect the presence of explosives inside a box.

2. Was your test one-tail upper tail, lower tail, or two-tail? Explain why you chose that kind of test in this situation. One-tail, upper tail. The device is effective only if it can detect explosives at a rate higher than chance (25%).

3. Explain what your P-value means in this context. Even if the device actually performs no better than guessing, we could expect to find the explosives 16 or more times out of 50 about 13% of the time.

.


Chapter 15 Testing Hypotheses

4. H 0 :   6.5 . The mean word length in the book is 6.5 words. H A :   6.5 . The mean word length in the book is not 6.5 words. * Randomization Condition: We have a random sample of 8 the words in the book. 6 * 10% Condition: The words are less than 10% of all words in the book. 4 * Nearly Normal Condition: A histogram of the observed 2 word lengths looks roughly unimodal and symmetric, so the population of all word lengths may be approximately Normal. Since the conditions are satisfied, we will to use a two-tail, one-sample t-test. y = 5.5 n = 20 s = 2.685 df = 19 5.5  6.5 t  1.67 2.865 20 P = 2  P(t19  1.67)  0.11

0

15-9

6 Word Length (letters)

Because the P-value is so high we do not reject the null hypothesis. This sample does not provide evidence that the average word length differs from the goal of 6.5 letters.

.

12


15-10

Part IV From the Data at Hand to the World at Large

Statistics Quiz D – Chapter 15

Name

A collegiate long jumper is hoping to improve his distance with improved conditioning. His new conditioning routine should help him be able to jump further than he has in the past. Before the conditioning began, the jumper was averaging 24.5 feet. After his new routine was finished, he jumps a series of 30 jumps over the course of a week. His new average is 25.4 feet with a standard deviation of 1.26 feet. Does this new sample give evidence that he has actually increased his average jumping distance?

.


Chapter 15 Testing Hypotheses

15-11

Statistics Quiz D – Chapter 15—Key

A collegiate long jumper is hoping to improve his distance with improved conditioning. His new conditioning routine should help him be able to jump further than he has in the past. Before the conditioning began, the jumper was averaging 24.5 feet. After his new routine was finished, he jumps a series of 30 jumps over the course of a week. His new average is 25.2 feet with a standard deviation of 1.26 feet. Does this new sample give evidence that he has actually increased his average jumping distance? H 0 :   24.5 The mean jumping distance is 24.5 feet. H A :   24.5 The mean jumping distance is greater than 24.5 feet. *Randomization Condition: His 30 jumps are the sample he has to work with. We will assume that they are representative of his current ability and that they are independent from each other. *Nearly Normal Condition: 30 is a large sample size, so by the CLT it is safe to assume this mean is approximately normal. Since the conditions are satisfied, we will perform a one-tailed t-test. n = 30 y = 24.5 s = 1.26 df = 29 25.2  24.5 t  3.04 1.26 30 P = P(t29  3.04)  0.002

With a p-value with low (0.2%) we reject the null hypothesis. We find significant evidence that this jumper has increased his average distance.

.


15-12

Part IV From the Data at Hand to the World at Large

Statistics Quiz E – Chapter 15

Name

1. A mayor is concerned about the percentage of city residents who express disapproval of his job performance. His political committee pays for a newspaper ad, hoping that it will keep the disapproval rate below 12%. They will use a follow-up poll to assess the ad's effectiveness. A. H 0 : p  0.12, H A : p  0.12 B. H 0 : p  0.12, H A : p  0.12 C. H 0 : p  0.12, H A : p  0.12 D. H 0 : p  0.12, H A : p  0.12 E. H 0 : p  0.12, H A : p  0.12 2. Data in 1980 showed that about 40% of the adult population had never smoked cigarettes. In 2004, a national health survey interviewed a random sample of 2000 adults and found that 50% had never been smokers. Create a 95% confidence interval for the proportion of adults (in 2004) who had never been smokers. A. Based on the data, we are 95% confident the proportion of adults in 2004 who had never smoked cigarettes is between 47.8% and 52.2%. B. Based on the data, we are 95% confident the proportion of adults in 2004 who had never smoked cigarettes is between 37.4% and 52.2%. C. Based on the data, we are 95% confident the proportion of adults in 2004 who had never smoked cigarettes is between 47.8% and 58.2%. D. Based on the data, we are 95% confident the proportion of adults in 2004 who had never smoked cigarettes is between 40% and 60%. E. Based on the data, we are 95% confident the proportion of adults in 2004 who had never smoked cigarettes is between 47.8% and 63.2%. 3. An entomologist writes an article in a scientific journal which claims that fewer than 17% of male fireflies are unable to produce light due to a genetic mutation. A. H 0 : p  0.17, H A : p  0.17 B. H 0 : p  0.17, H A : p  0.17 C. H 0 : p  0.17, H A : p  0.17 D. H 0 : p  0.17, H A : p  0.17 E. H 0 : p  0.17, H A : p  0.17 4. Using the t-tables, software, or a calculator, estimate the critical value of t for the given confidence interval and degrees of freedom: 99% confidence interval with df = 24 A. 2.492 B. 2.797 C. 2.807 D. 2.779 E. 1.711

.


Chapter 15 Testing Hypotheses

15-13

5. Use the t-tables, software, or a calculator to estimate the indicated P-value: P-value for t ≥ 1.76 with 24 degrees of freedom A. 0.0456 B. 0.9544 C. 0.0228 D. 0.0592 E. 0.0912 6. In the past, the mean running time for a certain type of flashlight battery has been 8.5 hours. The manufacturer has introduced a change in the production method and wants to perform a hypothesis test to determine whether the mean running time has changed as a result. A. Lower-tailed B. Upper-tailed C. Two-sided 7. The owner of a football team claims that the average attendance at games is over 74,900, and he is therefore justified in moving the team to a city with a larger stadium. An independent investigator will conduct a hypothesis test to determine whether his claim is accurate. A. Lower-tailed B. Upper-tailed C. Two-sided 8. A report on the U.S. economy indicates that 28% of Americans have experienced difficulty in making mortgage payments. A news organization randomly sampled 400 Americans from 10 cities named the "fastest dying cities in the U.S." (Forbes Magazine, August 2008) and found that 136 reported such difficulty. Does this indicate that the problem is more severe among these cities? The correct null and alternative hypotheses for testing this claim are A. H 0 : p  0.28 and H A : p  0.28 B. H 0 : p  0.28 and H A : p  0.28 C. H 0 : p  0.28 and H A : p  0.28 D. H 0 : p  0.28 and H A : p  0.28 E. H 0 : p  0.28 and H A : p  0.28 9. A report on the U.S. economy indicates that 28% of Americans have experienced difficulty in making mortgage payments. A news organization randomly sampled 400 Americans from 10 cities named the "fastest dying cities in the U.S." (Forbes Magazine, August 2008) and found that 136 reported such difficulty. Does this indicate that the problem is more severe among these cities? The correct value of the test statistic for testing this claim is A. z = 1.28 B. z = –1.28 C. z = 1.96 D. z = –2.67 E. z = 2.67

.


15-14

Part IV From the Data at Hand to the World at Large

10. A report on the U.S. economy indicates that 28% of Americans have experienced difficulty in making mortgage payments. A news organization randomly sampled 400 Americans from 10 cities named the “fastest dying cities in the U.S.” (Forbes Magazine, August 2008) and found that 136 reported such difficulty. Does this indicate that the problem is more severe among these cities? The P-value associated with the test statistic for testing this claim is 0.0029. At α = .05, A. We can conclude that the percentage of Americans in these cities experiencing difficulty making mortgage payments is equal to 28%. B. We can conclude that the percentage of Americans in these cities experiencing difficulty making mortgage payments is not significantly different from 28%. C. We can conclude that the percentage of Americans in these cities experiencing difficulty making mortgage payments is significantly lower than 28%. D. We can conclude that the percentage of Americans in these cities experiencing difficulty making mortgage payments is significantly higher than 28%. E. None of these

.


Chapter 15 Testing Hypotheses

Statistics Quiz E – Chapter 15 – Key

1. D 2. A 3. C 4. B 5. A 6. C 7. B 8. B 9. E 10. D

.

15-15


15-16

Part IV From the Data at Hand to the World at Large

Statistics Quiz F – Chapter 15

Name

1. 5% of trucks of a certain model have needed new engines after being driven between 0 and 100 miles. The manufacturer hopes that the redesign of one of the engine's components has solved this problem. A. H 0 : p  0.05, H A : p  0.05 B. H 0 : p  0.05, H A : p  0.05 C. H 0 : p  0.05, H A : p  0.05 D. H 0 : p  0.05, H A : p  0.05 E. H 0 : p  0.05, H A : p  0.05 2. A company hopes to improve its engines, setting a goal of no more than 3% of customers using their warranty on defective engine parts. A random survey of 1400 customers found only 30 with complaints. Create a 95% confidence interval for the true level of warranty users among all customers. A. Based on the data, we are 95% confident the proportion of warranty users is between 2.0% and 2.9%. Therefore, the company has met its goal. B. Based on the data, we are 95% confident the proportion of warranty users is between 1.4% and 2.9%. Therefore, the company has met its goal. C. Based on the data, we are 95% confident the proportion of warranty users is between 1.4% and 3.8%. Therefore, the company has not met its goal. D. Based on the data, we are 95% confident the proportion of warranty users is between 1% and 3%. Therefore, the company has met its goal. E. Based on the data, we are 95% confident the proportion of warranty users is between 0% and 2.9%. Therefore, the company has met its goal. 3. A skeptical paranormal researcher claims that the proportion of Americans that have seen a UFO is less than 5%. A. H 0 : p  0.05, H A : p  0.05 B. H 0 : p  0.05, H A : p  0.05 C. H 0 : p  0.05, H A : p  0.05 D. H 0 : p  0.05, H A : p  0.05 E. H 0 : p  0.05, H A : p  0.05 4. Using the t-tables, software, or a calculator, estimate the critical value of t for the given confidence interval and degrees of freedom: 98% confidence interval with df = 20 A. 2.518 B. 2.539 C. 2.330 D. 1.325 E. 2.528

.


Chapter 15 Testing Hypotheses

15-17

5. Use the t-tables, software, or a calculator to estimate the indicated P-value: P-value for t ≤ 1.76 with 24 degrees of freedom A. 0.0456 B. 0.0228 C. 0.9544 D. 0.9088 E. 0.9772 6. A manufacturer claims that the mean amount of juice in its 16 ounce bottles are 16.1 ounces. A consumer advocacy group wants to perform a hypothesis test to determine whether the mean amount is actually less than this. A. Lower-tailed B. Two-sided C. Upper-tailed 7. The manufacturer of a refrigerator system produces refrigerators that are supposed to maintain a true mean temperature, μ, of 48°F, ideal for certain beverages. The owner of a beverage company does not agree with the refrigerator manufacturer, and will conduct a hypothesis test to determine whether the true mean temperature differs from this value. A. Two-sided B. Upper-tailed C. Lower-tailed 8. A statistician consulting for a state department of transportation conducts a study to see if the proportion of drivers who are intoxicated has decreased since last year. H 0 : The proportion is the same as last year H A : The proportion is less than last year. The P-value of a significance test is greater than the significance level. She can conclude: A. the alternative hypothesis is true. B. the null hypothesis is true. C. there is not enough evidence to reject the null hypothesis. D. the test was too powerful. E. there is enough evidence to reject the null hypothesis.

.


15-18

Part IV From the Data at Hand to the World at Large

9. A pharmaceutical company wants to answer the question whether it takes longer than 45 seconds for a drug in pill form to dissolve in the gastric juices of the stomach. A sample was taken from patients taken the given drug in pill form and times for the pills to be dissolved were measured. State assumptions of the test. I. The subjects must be part of a medical study group investigating the drug. II. The subjects in the sample must be randomly selected from a population. III. The sample data must come from a normally distributed population of observations for the variable under study. A. I only B. II only C. III only D. II and III E. None of these 10. A pharmaceutical company wants to answer the question whether it takes longer than 45 seconds for a drug in pill form to dissolve in the gastric juices of the stomach. A sample was taken from patients taken the given drug in pill form and times for the pills to be dissolved were measured. State the hypotheses to test this question. A. H 0 :   45 sec; H A :   45 sec B. H 0 :   45 sec; H A :   45 sec C. H 0 :   45 sec; H A :   45 sec D. H 0 :   45 sec; H A :   45 sec E. H 0 :   45 sec; H A :   45 sec

.


Chapter 15 Testing Hypotheses

Statistics Quiz F – Chapter 15 – Key

1. D 2. B 3. D 4. E 5. C 6. A 7. A 8. C 9. D 10. C

.

15-19



Chapter 16 More About Tests and Intervals Solutions to Class Examples 1. See Class Example 1. 2. See Class Example 2. 3. See Class Example 3. 4. See Class Example 4. A note about the Investigative Task The investigative task here is quite comprehensive, covering topics from three chapters. It asks students to determine a sample size needed for a survey, and then gives them the survey results and asks for a confidence interval, a hypothesis test, and interpretations of the possible errors. The investigative tasks now start to become more complex, and that is reflected in the rubrics.

.


16-2

Part IV From the Data at Hand to the World at Large

Statistics Quiz A – Chapter 16

Name

A company manufacturing computer chips finds that 8% of all chips manufactured are defective. Management is concerned that employee inattention is partially responsible for the high defect rate. In an effort to decrease the percentage of defective chips, management decides to offer incentives to employees who have lower defect rates on their shifts. The incentive program is instituted for one month. If successful, the company will continue with the incentive program. 1. Write the company’s null and alternative hypotheses. 2. In this context describe a Type I error and the impact such an error would have on the company.

3. In this context describe a Type II error and the impact such an error would have on the company.

4. Based on the data they collected during the trial program, management found that a 95% confidence interval for the percentage of defective chips was (5.0%, 7.0%). What conclusion should management reach about the new incentive program? Explain.

5. What level of significance did management use? 6. Describe to management an advantage and disadvantage of using a 1% alpha level of significance instead.

7. Management decided to extend the incentive program so that the decision can be made on three months of data instead. Will the power increase, decrease, or remain the same?

8. Over the trial month, 6% of the computer chips manufactured were defective. Management decided that this decrease was significant. Why might management choose not to permanently institute the employee incentive program?

.


Chapter 16 More About Tests and Intervals

16-3

Statistics Quiz A – Chapter 16 – Key A company manufacturing computer chips finds that 8% of all chips manufactured are defective. Management is concerned that employee inattention is partially responsible for the high defect rate. In an effort to decrease the percentage of defective chips, management decides to offer incentives to employees who have lower defect rates on their shifts. The incentive program is instituted for one month. If successful, the company will continue with the incentive program. 1. Write the company’s null and alternative hypotheses. The null hypothesis is that the defect rate is 8%. The alternative hypothesis is that the defect rate is lower than 8%. In symbols: H 0 : p  0.08 and H A : p  0.08

2. In this context describe a Type I error and the impact such an error would have on the company. A Type I error would be deciding the percentage of defective chips has decreased, when in fact it has not. The company would waste money on a new incentive program that does not decrease the defect rate of the chips.

3. In this context describe a Type II error and the impact such an error would have on the company. A Type II error would be deciding the percentage of defective chips has not decreased, when in fact it has. The company would miss an opportunity to decrease the defect rate of the chips.

4. Based on the data they collected during the trial program, management found that a 95% confidence interval for the percentage of defective chips was (5.0%, 7.0%). What conclusion should management reach about the new incentive program? Explain. The confidence interval contains values that are all below the hypothesized value of 8%, so the data provide convincing evidence that the incentive program lowers the defect rate of the computer chips.

5. What level of significance did management use? 

1  0.95  0.025  2.5% 2

6. Describe to management an advantage and disadvantage of using a 1% alpha level of significance instead. Advantage: management would be less likely to conclude the incentive program was effective if it really were not. Disadvantage: the test would have less power to detect a positive effect of the new incentive program.

7. Management decided to extend the incentive program so that the decision can be made on three months of data instead. Will the power increase, decrease, or remain the same? The power would increase because of the larger sample size.

8. Over the trial month, 6% of the computer chips manufactured were defective. Management decided that this decrease was significant. Despite this, management might choose not to permanently institute the employee incentive program? Why? Although statistically significant, the practical significance (cost of the incentive program compared to the savings due to the decrease in defect rate of the chips) might not be great enough to warrant instituting the program permanently.

.


16-4

Part IV From the Data at Hand to the World at Large

Statistics Quiz B – Chapter 16

Name

The board of directors for Procter & Gamble is concerned that only 19.5% of the people who use toothpaste buy Crest toothpaste. A marketing director suggests that the company invest in a new marketing campaign which will include advertisements and new labeling for the toothpaste. The research department conducts product trials in test markets for one month to determine if the market share increases with new labels. 1. Write the company’s null and alternative hypotheses. 2. In this context describe a Type I error and the impact such an error would have on the company.

3. In this context describe a Type II error and the impact such an error would have on the company.

4. Based on the data they collected during the trial the research department found that a 98% confidence interval for the proportion of all consumers who might buy Crest toothpaste was (16%, 28%). What conclusion should the company reach about the new marketing campaign? Explain.

5. What level of significance did the research department use? 6. Describe to the board of directors an advantage and a disadvantage of using a 5% alpha level of significance instead.

7. The board of directors asked the research department to extend the trial period so that the decision can be made on two months worth of data. Will the power increase, decrease, or remain the same?

8. Over the trial month the market share in the sample rose to 22% of shoppers. The company’s board of directors decided this increase was significant. Now that they have concluded the new marketing campaign works, why might they still choose not to invest in the campaign? .


Chapter 16 More About Tests and Intervals

16-5

Statistics Quiz B – Chapter 16 – Key The board of directors for Procter & Gamble is concerned that only 19.5% of the people who use toothpaste buy Crest toothpaste. A marketing director suggests that the company invest in a new marketing campaign which will include advertisements and new labeling for the toothpaste. The research department conducts product trials in test markets for one month to determine if the market share increases with new labels. 1. Write the company’s null and alternative hypotheses. The null hypothesis is that 19.5% of all people who use toothpaste buy Crest. The alternative hypothesis is that the percentage of all people who use toothpaste who use Crest is greater than 19.5 %. In symbols: H 0 : p  0.195 and H a : p  0.195

2. In this context describe a Type I error and the impact such an error would have on the company. A Type I error would be concluding the proportion of people who will buy Crest toothpaste will go up when in fact it won’t. The company would waste money on a new marketing campaign that will not increase sales.

3. In this context describe a Type II error and the impact such an error would have on the company. A Type II error would be deciding the proportion of people who will buy Crest toothpaste won’t go up when in fact it would have. The company would miss an opportunity to increase sales.

4. Based on the data they collected during the trial the research department found that a 98% confidence interval for the proportion of all consumers who might buy Crest toothpaste was (16%, 28%). What conclusion should the company reach about the new marketing campaign? Explain. The confidence interval contains the current value of 19.5%, so the product trials do not provide convincing evidence that the new marketing campaign would increase sales.

5. What level of significance did the research department use? 

1  0.98  0.01  1% 2

6. Describe to the board of directors an advantage and a disadvantage of using a 5% alpha level of significance instead. Advantage: the test would have greater power to detect a positive effect of the new marketing campaign. Disadvantage: they would be more likely to think the marketing campaign was effective even if it were not.

7. The board of directors asked the research department to extend the trial period so that the decision can be made on two months worth of data. Will the power increase, decrease, or remain the same? The power would increase because of the larger sample size.

8. Over the trial month the market share in the sample rose to 22% of shoppers. The company’s board of directors decided this increase was significant. Now that they have concluded the new marketing campaign works, why might they still choose not to invest in the campaign? Although statistically significant, the small increase in sales from 19.5% to 22% might not be enough to justify the expense of the new marketing campaign nationwide.

.


16-6

Part IV From the Data at Hand to the World at Large

Statistics Quiz C – Chapter 16

Name

The owner of a small clothing store is concerned that only 28% of people who enter her store actually buy something. A marketing salesman suggests that she invest in a new line of celebrity mannequins (think Seth Rogan modeling the latest jeans…). He loans her several different “people” to scatter around the store for a two-week trial period. The owner carefully counts how many shoppers enter the store and how many buy something so that at the end of the trial she can decide if she’ll purchase the mannequins. She’ll buy the mannequins if there is evidence that the percentage of people that buy something increases. 1. Write the owner’s null and alternative hypotheses. 2. In this context describe a Type I error and the impact such an error would have on the store.

3. In this context describe a Type II error and the impact such an error would have on the store.

4. Based on data that she collected during the trial period the store’s owner found that a 98% confidence interval for the proportion of all shoppers who might buy something was (27%, 35%). What conclusion should she reach about the mannequins? Explain.

5. What alpha level did the store’s owner use? _______ 6. Describe to the owner an advantage and a disadvantage of using an alpha level of 5% instead.

7. The owner talked the salesman into extending the trial period so that she can base her decision on data for a full month. Will the power of the test increase, decrease, or remain the same?

8. Over the trial month the rate of in-store sales rose to 30% of shoppers. The store’s owner decided this increase was statistically significant. Now that she’s convinced the mannequins work, why might she still chose not to purchase them?

.


Chapter 16 More About Tests and Intervals

16-7

Statistics Quiz C – Chapter 16 – Key The owner of a small clothing store is concerned that only 28% of people who enter her store actually buy something. A marketing salesman suggests that she invest in a new line of celebrity mannequins (think Seth Rogan modeling the latest jeans…). He loans her several different “people” to scatter around the store for a two-week trial period. The owner carefully counts how many shoppers enter the store and how many buy something so that at the end of the trial she can decide if she’ll purchase the mannequins. She’ll buy the mannequins if there is evidence that the percentage of people that buy something increases. 1. Write the owner’s null and alternative hypotheses. The null hypothesis is that percentage of all customers who buy something is 28%. The alternative hypothesis is that the percentage of all customers who buy something is greater than 28%. In symbols: H 0 : p  0.28 H A : p  0.28

2. In this context describe a Type I error and the impact such an error would have on the store. A Type I error would be deciding the percentage of customers who’ll make purchases will go up when in fact it won’t. The store’s owner would waste money buying useless mannequins.

3. In this context describe a Type II error and the impact such an error would have on the store. A Type II error would be deciding the percentage of customers who’ll make purchases won’t go up when in fact it would have. The store’s owner would miss an opportunity to increase sales.

4. Based on data that she collected during the trial period the store’s owner found that a 98% confidence interval for the proportion of all shoppers who might buy something was (27%, 35%). What conclusion should she reach about the mannequins? Explain. Because the hypothesized value of 28% is in the confidence interval the trial results do not provide convincing evidence that the mannequins would help increase sales.

5. What alpha level did the store’s owner use? 

1  0.98  0.01  1% 2

6. Describe to the owner an advantage and a disadvantage of using an alpha level of 5% instead. Advantage: the test would have greater power to detect a positive effect of the mannequins. Disadvantage: she’d be more likely to think the mannequins were effective even if they were not.

7. The owner talked the salesman into extending the trial period so that she can base her decision on data for a full month. Will the power of the test increase, decrease, or remain the same? The power of the test will increase.

8. Over the trial month the rate of in-store sales rose to 30% of shoppers. The store’s owner decided this increase was statistically significant. Now that she’s convinced the mannequins work, why might she still chose not to purchase them? Although statistically significant, the small increase in sales from 28% to 30% of shoppers might not be enough to justify the expense of purchasing the mannequins.

.


16-8

Part IV From the Data at Hand to the World at Large

Statistics – Investigative Task

Chapter 16

Life After High School? Did the events of September 11 impact the future plans of high school seniors? Educators have expressed fear that the economic impact may mean that fewer students can afford to attend college. Some commentators have suggested that a heightened sense of patriotism may increase military enlistments, while others think that the existence of actual hostilities may deter young people from choosing a military path. A polling organization wants to investigate what this year’s high school seniors are planning to do after they graduate. The pollsters randomly select 5 cities in Upstate New York and then randomly select one high school in each city. The guidance office at each of the chosen schools is instructed to ask 100 randomly selected seniors what their current plans are, and to report the results back to the pollsters. The data collected from the 5 schools are summarized in the table. Plans Count College 289 Question #1: Determine a 90% confidence interval for the percentage of seniors planning to go to college this year. Explain carefully what your interval means.

Employment Military Other (travel, parenting, etc) Undecided/No response

112 26 51 22

Question #2: During the 90’s about 4.5%1 of high school seniors enlisted in the military. Do these data suggest that the percentage who enlist is different this year? Test an appropriate hypothesis and state your conclusion. Question #3: A few of the seniors did not respond to the guidance queries, and others said they were undecided. Some of these people might eventually decide to enlist in the military. Suppose that half of this small group also enlist. Would that cause you to change your conclusion in Question #3? Explain. Question #4: Explain, in this context, what a Type I error and a Type II error would be.

1

Defense Technical Information Center .


Chapter 16 More About Tests and Intervals

Statistics – Investigative Task

16-9

Chapter 16

This task consists of three components. Each is rated on a 0 - 4 scale, based on Essentially correct, Partially correct, or Incorrect solutions. Confidence Interval E: Checks conditions and identifies procedure by name or formula Model P: Checks conditions but does not identify the procedure E: Determines the correct confidence interval (minor errors okay) Mechanics P: Has correct procedure, but uses wrong critical value or notation E: Correctly interprets the interval in context Conclusion P: Has a correct interpretation but the wrong population

Score:

Hypothesis Test E: Writes correct hypotheses (two-tail) with proper notation Hypotheses P: Writes one tail hypotheses or uses wrong notation E: Checks conditions and identifies test by name or formula Model P: Checks conditions but does not identify the test E: Has correct sketch, notation, z, and P-value (consistent with Ha) Mechanics P: Makes one mistake among the four listed “E” requirements

Score:

Conclusion

E: Links P-value to the correct conclusion in proper context P: Has the correct conclusion, but the linkage is missing or unclear

More about the Test E: Correctly finds the revised P-value Mechanics P: Makes one mistake in the sketch, notation, z, or P-value E: Links P-value to the revised conclusion in proper context Conclusion P: Has the correct conclusion, but the linkage is missing or unclear E: Correctly identifies errors, consistent with stated hypotheses I: Reverses the errors (with respect to stated hypotheses) Errors E: Correctly explains both kinds of error in context. P: Explains only one type of error correctly in context

E P I E P I E P I

E P I E P I E P I E P I Score: E P I E P I E

E P I

Scoring  E’s count 1 point, P’s are 1/2  Component scores = sum of parts; rounding based on quality of “Partial” responses.  Grade: based on the sum of the three components, 11=A+, 10=A, 9=A-, etc.

Name _________________________________

.

Grade ____

I


16-10

Part IV From the Data at Hand to the World at Large

Model Solution – Investigative Task – Life After High School? Question #1 Think – We want a 90% confidence interval for the proportion of all seniors who plan to attend college, based on a random sample of 500 students. In the sample, 289 stated that they plan to attend college. Plausible Independence Condition: It is reasonable to think that the responses were mutually independent. Random Sampling Condition: The students were selected using a random, multi-stage sample. 10% Condition: 500 seniors represent less than 10% of all seniors. Success/Failure Condition: npˆ  289 and nqˆ  211 , which are both at least 10, so the sample is large enough.

The conditions are satisfied, so I can use the Normal model to find a one proportion z-interval. 289 Show – n  500 and pˆ  500  0.578 , so SE ( pˆ ) 

ˆˆ pq (0.578)(0.422)   0.022 500 n

The margin of error is ME  z *  SE ( pˆ )  1.645(0.022)  0.036 The confidence interval is 0.578  0.036 or (0.542, 0.614) Tell – I am 90% confident that between 54.2% and 61.4% of seniors plan to go to college. This means that 90% of samples of 500 students will produce intervals that will succeed in capturing the true proportion of seniors who plan to go to college. Question #2 Hypotheses – The null hypothesis is that the proportion of seniors who enlist in the military is 4.5%. The alternative hypothesis is that the proportion of seniors who enlist in the military is different from 4.5%. In symbols: H 0 : p = 0.045, H A : p ¹ 0.045 Model – Independence Assumption: It is reasonable to think responses were mutually independent. Random Sampling Condition: Students were selected using a random, multi-stage sample. 10% Condition: 500 seniors represent less than 10% of all seniors. Success/Failure Condition: np  500(0.045)  22.5 and nq  500(0.955)  477.5 , which are both at least 10, so the sample is large enough.

The conditions are satisfied, so I can use the Normal model to perform a one proportion z—test. Since we are testing the hypothesis that the proportion of students who enlist in the military is different than 0.045, we will use a 2 tailed test.

.


Chapter 16 More About Tests and Intervals

16-11

26 Mechanics– n  500 , x  26 , pˆ  500  0.052 ,

pˆ  p pq n 0.052  0.045 z (0.045)(0.955) 500 z  0.755 z

P  2  P( pˆ  0.052)  2  P( z  0.755)  0.45

Conclusion – The P-value is quite high, so we fail to reject the null hypothesis. There is no evidence to suggest that the proportion of students enlisting in the military differs from the 4.5% that enlisted in the 1990s. Question #4

If half of the undecided/no response group enlists in the military, then 11 additional students would enlist in the military, for a total of 36 students. 37 Mechanics– n  500 , x  37 , pˆ  500  0.074 ,

pˆ  p pq n 0.074  0.045 z (0.045)(0.955) 500 z  3.128 z

P  2  P( pˆ  0.074)  2  P( z  3.128)  0.0018

Conclusion – The P-value is low, so we reject the null hypothesis. There is strong evidence to suggest that the proportion of students enlisting in the military differs from the 4.5% that enlisted in the 1990s. It appears that the proportion of students entering the military this year is higher than 4.5%. Adding 11 students from the undecided/no response group would cause me to change my decision. Question #5

If the pollsters believe that the percentage of students entering the military has changed, when, in fact, the percentage has stayed the same, they have committed a Type I error. If the pollsters believe that the percentage of students entering the military has stayed the same, when, in fact, the percentage has changed, they have committed a Type II error.

.


16-12

Part IV From the Data at Hand to the World at Large

Statistics Quiz D – Chapter 16

Name

1. An entomologist writes an article in a scientific journal which claims that fewer than 6% of male fireflies are unable to produce light due to a genetic mutation. Identify the Type I error in this context. A. The error of failing to reject the claim that the true proportion is at least 6% when it is actually less than 6%. B. The error of rejecting the claim that the true proportion is less than 6% when it really is less than 6%. C. The error of accepting the claim that the true proportion is at least 6% when it really is at least 6%. D. The error of failing to accept the claim that the true proportion is at least 6% when it is actually less than 6%. E. The error of rejecting the claim that the true proportion is at least 6% when it really is at least 6%. 2. A psychologist claims that more than 2.2% of the population suffers from professional problems due to extreme shyness. Identify the Type II in this context. A. The error of accepting the claim that the true proportion is more than 2.2% when it really is more than 2.2%. B. The error of rejecting the claim that the true proportion is more than 2.2% when it really is more than 2.2%. C. The error of rejecting the claim that the true proportion is at most 2.2% when it really is at most 2.2%. D. The error of failing to reject the claim that the true proportion is at most 2.2% when it is actually more than 2.2%. E. The error of failing to accept the claim that the true proportion is at most 2.2% when it is actually more than 2.2%. 3. A large software development firm recently relocated its facilities. Top management is interested in fostering good relations with their new local community and has encouraged their professional employees to engage in local service activities. They believe that the firm’s professionals volunteer an average of more than 15 hours per month. If this is not the case, they will institute an incentive program to increase community involvement. A random sample of 24 professionals yields a mean of 16.6 hours and a standard deviation of 2.22 hours. The correct value of the test statistic for the appropriate hypothesis test is A. t = –3.532 B. t = 3.532 C. t = 0.789 D. t = –1.223 E. t = 1.223

.


Chapter 16 More About Tests and Intervals

16-13

4. A large software development firm recently relocated its facilities. Top management is interested in fostering good relations with their new local community and has encouraged their professional employees to engage in local service activities. They believe that the firm’s professionals volunteer an average of more than 15 hours per month. If this is not the case, they will institute an incentive program to increase community involvement. A random sample of 24 professionals yields a mean of 16.6 hours and a standard deviation of 2.22 hours. The P-value associated with the resulting test statistic is 0.0009. At α = 0.05, which of the following is the correct conclusion? I. We reject the null hypothesis. II. We fail to reject the null hypothesis. III. The firm shouldn’t need to institute an incentive program because the evidence indicates that professional employees volunteer an average of more than 15 hours per month in their local community. A. I only B. Both II and III C. Both I and III D. III only E. II only 5. A large software development firm recently relocated its facilities. Top management is interested in fostering good relations with their new local community and has encouraged their professional employees to engage in local service activities. They believe that the firm’s professionals volunteer an average of more than 15 hours per month. If this is not the case, they will institute an incentive program to increase community involvement. Based on data collected they perform the appropriate hypothesis test. A Type II error in this context means I. They failed to detect that the average number of hours volunteered by the firm’s professional employees is more than 15 hours when in fact it is. II. They would decide not to institute an incentive program to increase community involvement when they should have. III. They would waste money instituting an incentive program to increase community involvement among its professional employees that was not needed. A. I only B. II only C. III only D. Both I and II E. Both I and III

.


16-14

Part IV From the Data at Hand to the World at Large

6. In the past, the mean running time for a certain type of flashlight battery has been 8.3 hours. The manufacturer has introduced a change in the production method and wants to perform a hypothesis test to determine whether the mean running time has increased as a result. The hypotheses are: H 0 :   8.3 hours H A :   8.3 hours Explain the result of a Type I error. A. The manufacturer will decide the mean battery life is less than 8.3 hours when in fact it is greater than 8.3 hours. B. The manufacturer will decide the mean battery life is greater than 8.3 hours when in fact it is less than 8.3 hours. C. The manufacturer will decide the mean battery life is greater than 8.3 hours when in fact it is greater than 8.3 hours. D. The manufacturer will decide the mean battery life is 8.3 hours when in fact it is greater than 8.3 hours. E. The manufacturer will decide the mean battery life is greater than 8.3 hours when in fact it is 8.3 hours. 7. In the past, the mean running time for a certain type of flashlight battery has been 8.4 hours. The manufacturer has introduced a change in the production method and wants to perform a hypothesis test to determine whether the mean running time has increased as a result. The hypotheses are: H 0 :   8.4 hours H A :   8.4 hours Explain the result of a Type II error. A. The manufacturer will decide the mean battery life is greater than 8.4 hours when in fact it is greater than 8.4 hours. B. The manufacturer will decide the mean battery life is less than 8.4 hours when in fact it is greater than 8.4 hours. C. The manufacturer will decide the mean battery life is 8.4 hours when in fact it is 8.4 hours. D. The manufacturer will decide the mean battery life is 8.4 hours when in fact it is greater than 8.4 hours. E. The manufacturer will decide the mean battery life is greater than 8.4 hours when in fact it is 8.4 hours.

.


Chapter 16 More About Tests and Intervals

16-15

8. Top management of a large multinational corporation wants to create a culture of innovativeness and change. A consultant hired to assess the company’s organizational culture finds that only 15% of employees are open to new ideas and approaches toward their work. Consequently the company conducts a program for employees in order to reinforce the new corporate philosophy. After the program is completed, employees are surveyed to see if a greater percentage is now open to innovativeness and change. The correct alternative hypothesis is A. p > 0.15 B. μ > 0.15 C. p = 0.15 D. p < 0.15 E. μ ≠ 0.15 9. Top management of a large multinational corporation wants to create a culture of innovativeness and change. A consultant hired to assess the company's organizational culture finds that only 15% of employees are open to new ideas and approaches toward their work. Consequently the company conducts a program for employees in order to reinforce the new corporate philosophy. After the program is completed, employees are surveyed to see if a greater percentage is now open to innovativeness and change. Suppose that based on the sample results the company rejects the null hypothesis when in fact it is true. Which of the following statements is correct? A. This led to the conclusion that the percentage of employees open to new ideas did not increase when in fact it did. B. This led to the conclusion that the percentage of employees open to new ideas did increase when in fact it did not. C. This is known as a fatal error. D. This is known as a Type II error. E. This led to a P-value that was greater than the level of significance.

.


16-16

Part IV From the Data at Hand to the World at Large

10. The average diastolic blood pressure of a group of men suffering from high blood pressure is 97 mmHg. During a clinical trial, the men receive a medication which it is hoped will lower their blood pressure. After three months, the researcher wants to perform a hypothesis test to determine whether the average diastolic blood pressure of the men has decreased. The hypotheses are: H 0 :   97 mmHg H A :   97 mmHg Explain the result of a Type II error. A. The researcher will conclude that the average diastolic blood pressure of the men is the same when in fact it has decreased. B. The researcher will conclude that the average diastolic blood pressure of the men is the same when in fact it is the same. C. The researcher will conclude that the average diastolic blood pressure of the men has increased when in fact it has decreased. D. The researcher will conclude that the average diastolic blood pressure of the men has decreased when in fact it has increased. E. The researcher will conclude that the average diastolic blood pressure of the men has decreased when in fact it is the same.

.


Chapter 16 More About Tests and Intervals

Statistics Quiz D – Chapter 16 – Key

1. E 2. D 3. C 4. B 5. E 6. E 7. D 8. A 9. B 10. A

.

16-17


16-18

Part IV From the Data at Hand to the World at Large

Statistics Quiz E – Chapter 16

Name

1. A skeptical paranormal researcher claims that the proportion of Americans that have seen a UFO, p, is less than 5%. Identify the Type II error in this context. A. The error of rejecting the claim that the true proportion is at least 5% when it really is at least 5%. B. The error of failing to reject the claim that the true proportion is at least 5% when it is actually less than 5%. C. The error of accepting the claim that the true proportion is at least 5% when it really is at least 5%. D. The error of rejecting the claim that the true proportion is less than 5% when it really is less than 5%. E. The error of accepting the claim that the true proportion is at least 5% when it is actually less than 5%. 2. Insurance companies track life expectancy information to assist in determining the cost of life insurance policies. Last year the average life expectancy of all policyholders was 77 years. ABI Insurance wants to determine if their clients now have a longer life expectancy, on average, so they randomly sample 20 of their recently paid policies. The sample has a mean of 78.6 years and a standard deviation of 4.48 years. The P-value associated with the resulting test statistic is 0.063. At α = 0.05, which of the following is the correct conclusion? I. We fail to reject the null hypothesis. II. We reject the null hypothesis. III. There is not significant evidence to indicate an increase in average life expectancy. A. I only B. II only C. III only D. Both I and III E. Both II and III 3. Insurance companies track life expectancy information to assist in determining the cost of life insurance policies. Last year the average life expectancy of all policyholders was 77 years. ABI Insurance wants to determine if their clients now have a longer life expectancy, on average, so they randomly sample some of their recently paid policies. The insurance company will only change their premium structure if there is evidence that people who buy their policies are living longer than before. Which of the following statement is true about this hypothesis test? A. It is a one tailed test about a mean. B. It is a one tailed test about a proportion. C. It is a two tailed test about a mean. D. It is a two tailed test about a proportion. E. None of these.

.


Chapter 16 More About Tests and Intervals

16-19

4. A pharmaceutical company has a drug they think is better at lowering blood pressure than the current medication, but it costs more money for consumers. They conduct a significance test on their clinical study (H0: The new drug works the same as the old drug, HA: The new drug works better than the old drug.) Which statement about errors is correct? A. If they fail to reject the null hypothesis, they might be committing a Type I error. B. A Type I error means people pay more money for a drug that doesn't work better. C. A Type II error would mean that they did not use randomization in their experiment. D. A Type I error means people keep using the old drug when a new drug works better. E. If they reject the null hypothesis, it is possible that they are rejecting a Type II error. 5. Which of the following would increase the power of a test? I. Increase the sample size II. Decrease the significance level III. Increase alpha A. I only B. II only C. III only D. I and II E. I and III 6. In 1990, the average math SAT score for students at one school was 472. Five years later, a teacher wants to perform a hypothesis test to determine whether the average math SAT score of students at the school has changed from the 1990 mean of 472. The hypotheses are: H 0 :   472 H A :   472 Explain the result of a Type I error. A. The teacher will conclude that the mean math SAT score at the school is the same when in fact it is different. B. The teacher will conclude that the mean math SAT score at the school is the same when in fact it is the same. C. The teacher will conclude that the mean math SAT score at the school has decreased when in fact it is the same. D. The teacher will conclude that the mean math SAT score at the school is different when in fact it is the same. E. The teacher will conclude that the mean math SAT score at the school has increased when in fact it is the same.

.


16-20

Part IV From the Data at Hand to the World at Large

7. In 1990, the average math SAT score for students at one school was 490. Five years later, a teacher wants to perform a hypothesis test to determine whether the average SAT score of students at the school has changed from the 1990 mean of 490. The hypotheses are: H 0 :   490 H A :   490 Explain the result of a Type II error. A. The teacher will conclude that the mean math SAT score at the school is the same when in fact it is the same. B. The teacher will conclude that the mean math SAT score at the school has increased when in fact it is the same. C. The teacher will conclude that the mean math SAT score at the school is different when in fact it is the same. D. The teacher will conclude that the mean math SAT score at the school has decreased when in fact it is the same. E. The teacher will conclude that the mean math SAT score at the school is the same when in fact it is different 8. Marcy’s Consignment shop is based in Port Angeles, Washington. It offers both male and female never or slightly used clothing and accessories on a consignment basis. Marcy’s recently redesigned their website and wants to show that sales have increased. The sales manager selects 30 sales records at random from the store’s online client list and she finds a mean increase in spending of $3.50 with a standard deviation of $7.20. The resulting P-value for an α = 0.05 is A. 0.012 B. 0.315 C. 0.006 D. 0.024 E. 0.630 9. A company that sells eco-friendly cleaning products is concerned that only 19.5% of people who use such products select their brand. A marketing director suggests that the company invest in new advertising and labeling to strengthen its green image. The company decides to do so in a test market so that the effectiveness of the marketing campaign may be evaluated. In this context, committing a Type I error I. Occurs when they conclude that the percentage of customers purchasing the company’s brand has increased when in fact it has not. II. Occurs then they conclude that the percentage of customers purchasing the company’s brand has not increased when in fact it has. III. Would result in the company wasting money on a new marketing campaign that does not increase the percentage of customers buying their brand. A. I only B. II only C. III only D. Both I and III E. Both II and III .


Chapter 16 More About Tests and Intervals

16-21

10. A man is on trial accused of murder in the first degree. The prosecutor presents evidence that he hopes will convince the jury to reject the hypothesis that the man is innocent. This situation can be modeled as a hypothesis test with the following hypotheses: H 0 : The defendant is not guilty. H A : The defendant is guilty. Explain the result of a Type II error. A. The jury will fail to reach a decision. B. The jury will conclude that the defendant is guilty when in fact he is not guilty. C. The jury will conclude that the defendant is not guilty when in fact he is not guilty. D. The jury will conclude that the defendant is guilty when in fact he is guilty. E. The jury will conclude that the defendant is not guilty when in fact he is guilty.

.


16-22

Part IV From the Data at Hand to the World at Large

Statistics Quiz E – Chapter 16 – Key

1. B 2. D 3. A 4. B 5. E 6. D 7. E 8. C 9. D 10. E

.


Statistics Test A – Inference – Part IV __ 1.

We have calculated a 95% confidence interval and would prefer for our next confidence interval to have a smaller margin of error without losing any confidence. In order to do this, we can I. change the z  value to a smaller number. II. take a larger sample. III. take a smaller sample. A) I only

__ 2.

B) II only

C) III only

D) I and II

E) I and III

Which is true about a 98% confidence interval for a population mean based on a given sample? I. We are 98% confident that other sample means will be in our interval. II. There is a 98% chance that our interval contains the population mean. III. The interval is wider than a 95% confidence interval would be. A) None

__ 3.

Name ____________________

B) I only

C) II only

D) III only

E) I and II

A certain population is bimodal. We want to estimate its mean, so we will collect a sample. Which should be true if we use a large sample rather than a small one? I. The distribution of our sample data will be more clearly bimodal. II. The sampling distribution of the sample means will be approximately normal. III. The variability of the sample means will be smaller. A) I only

B) II only

C) III only

D) II and III

E) I, II, and III

Use the following information for questions 4-5: In a Stats class, 57% of students eat breakfast in the morning and 80% of students floss their teeth. Fortysix percent of students eat breakfast and also floss their teeth. __ 4.

What is the probability that a student from this class eats breakfast but does not floss? A) 9% B) 11% C) 34% D) 57% E) 91%

__ 5.

What is the probability that a student from this class eats breakfast or flosses? A) 9% B) 11% C) 34% D) 57% E) 91%

__ 6.

A P-value indicates A) the probability that the null hypothesis is true. B) the probability that the alternative hypothesis is true. C) the probability the null is true given the observed statistic. D) the probability of the observed statistic given that the null hypothesis is true. E) the probability of the observed statistic given that the alternative hypothesis is true.

__ 7.

A statistics professor wants to see if more than 80% of her students enjoyed taking her class. At the end of the term, she takes a random sample of students from her large class and asks, in an anonymous survey, if the students enjoyed taking her class. Which set of hypotheses should she test? A) D)

H 0 : p  0.80 H A : p  0.80

H 0 : p  0.80 H A : p  0.80

B) E)

H 0 : p  0.80 H A : p  0.80 H 0 : p  0.80 H A : p  0.80

.

C)

H 0 : p  0.80 H A : p  0.80


Part IV-2 __ 8.

Part IV From the Data at Hand to the World at Large

Not wanting to risk poor sales for a new soda flavor, a company decides to run one more taste test on potential customers, this time requiring a higher approval rating than they had for earlier tests. This higher standard of proof will increase I. the risk of Type I error II. the risk of Type II error III. power A) I only

B) II only C) III only

D) I and II

E) I and III

__ 9.

Suppose that a manufacturer is testing one of its machines to make sure that the machine is producing more than 97% good parts  H 0 : p  0.97 and H A : p  0.97  . The test results in a P-value of 0.122. Unknown to the manufacturer, the machine is actually producing 99% good parts. What probably happens as a result of the testing? A) They correctly fail to reject H 0 . B) They correctly reject H 0 . C) They reject H 0 , making a Type I error. D) They fail to reject H 0 , making a Type I error. E) They fail to reject H 0 , making a Type II error.

__10.

Which of the following is true about Type I and Type II errors? I. Type I errors are always worse than Type II errors. II. The severity of Type I and Type II errors depends on the situation being tested. III. In any given situation, the higher the risk of Type I error, the lower the risk of Type II error. A) I only

B) II only C) III only

D) I and III

E) II and III

11. Approval rating The President’s job approval rating is always a hot topic. Your local paper conducts a poll of 100 randomly selected adults to determine the President’s job approval rating. A CNN/USA Today/Gallup poll conducts a poll of 1010 randomly selected adults. Which poll is more likely to report that the President’s approval rating is below 50%, assuming that his actual approval rating is 54%? Explain.

.


Review of Part IV: From the Data at Hand to the World at Large

Part IV-3

12. Cereal A box of Raspberry Crunch cereal contains a mean of 13 ounces with a standard deviation of 0.5 ounce. The distribution of the contents of cereal boxes is approximately Normal. What is the probability that a case of 12 cereal boxes contains a total of more than 160 ounces?

13. Too much TV? A father is concerned that his teenage son is watching too much television each day, since his son watches an average of 2 hours per day. His son says that his TV habits are no different than those of his friends. Since this father has taken a stats class, he knows that he can actually test to see whether or not his son is watching more TV than his peers. The father collects a random sample of television watching times from boys at his son’s high school and gets the following data: 1.9 2.3 2.2 1.9 1.6 2.6 1.4 2.0 2.0 2.2 Is the father right? That is, is there evidence that other boys average less than 2 hours of television per day? Conduct a hypothesis test, making sure to state your conclusions in the context of the problem.

.


Part IV-4

Part IV From the Data at Hand to the World at Large

14. Passing the test Assume that 70% of teenagers who go to take the written drivers license test have studied for the test. Of those who study for the test, 95% pass; of those who do not study for the test, 60% pass. What is the probability that a teenager who passes the written drivers license test did not study for the test?

15. Sleep Do more than 50% of U.S. adults feel they get enough sleep? According to Gallup’s Lifestyle poll, 55% of U.S. adults said that that they get enough sleep. The poll was based on a random sample of 1003 U.S. adults. Test an appropriate hypothesis and state your conclusion in the context of the problem.

.


Review of Part IV: From the Data at Hand to the World at Large

Part IV-5

Statistics Test A – Inference – Part IV – Key 1. B

2. D

3. E

4. B

5. E

6. D

7. B

8. B

9. E

10. E

11. President’s approval rating: The smaller poll would have more variability and would thus be more likely to vary from the actual approval rating of 54%. We would expect the larger poll to be more consistent with the 54% rating. So, it is more likely that the smaller poll would report that the President’s approval rating is below 50%. 12. Cereal:  160  13.33  13   P y    P  z  2.286   0.0111 .   P  y  13.33  P  z  12  0.5 12    There is a 1.11% chance that a case of 12 cereal boxes will weigh more than 160 ounces. 13. Too much TV? H 0 :   2.0 hours H A :   2.0 hours Conditions: * Randomization condition: Boys from the high school were randomly sampled. * 10% condition: The sample is less than 10% of all boys at the high school. * Nearly Normal condition: The histogram of credit hours is unimodal and reasonably symmetric. This is close enough to Normal for our purposes. (A histogram must be shown.) Under these conditions, the sampling distribution of the mean can be modeled by Student’s t with degrees of freedom: df = n – 1 = 10 – 1 = 9. We will use a one-sample t-test for the mean. 0.3446  0.109 . We know: n = 9, y  2.01 , and s = 0.3446. So, SE  y   10 y  0 2.01  2 t   0.092 The P-value is P(t9  0.092)  0.536 . SE ( y ) 0.109 The P-value of 0.536 is high, so I fail to reject the null hypothesis. There is no evidence that the son is watching more TV, on average, than his peers.

14. Passing the test

P(Didn't study | Pass) 

P(Didn't study  Pass) P(Pass) 

.

0.180  0.213 0.665  0.180


Part IV-6

Part IV From the Data at Hand to the World at Large

15. Sleep Hypothesis: H 0 : p  0.50 H A : p  0.50 Plan: Okay to use the Normal model because the trials are independent (random sample of U.S. adults), these 1003 U.S. adults are less than 10% of all U.S. adults, and np0  1003 (0.50)  501.5  10 and nq0  1003 (0.50)  501.5  10 . We will do a one-proportion z-test.

 0.50  0.50   0.0158 ; sample proportion: pˆ  0.55 p0 q0  n 1003 0.55  0.50 P( pˆ  0.55)  P( z  )  P  z  3.167   0.0008 0.0158 With a P-value of 0.0008, I reject the null hypothesis. There is strong evidence that the proportion of U.S. adults who feel they get enough sleep is more than 50%. Mechanics: SD ( p0 ) 

.


Review of Part IV: From the Data at Hand to the World at Large

Statistics Test B – Inference – Part IV

Part IV-7

Name ____________________

__ 1.

Insurance company records indicate that 12% of all teenage drivers have been ticketed for speeding and 9% for going through a red light. If 4% have been ticketed for both, what is the probability that a teenage driver has been issued a ticket for speeding but not for running a red light? A) 3% B) 8% C) 12% D) 13% E) 17%

__ 2.

A certain population is strongly skewed to the right. We want to estimate its mean, so we will collect a sample. Which should be true if we use a large sample rather than a small one? I. The distribution of our sample data will be closer to normal. II. The sampling model of the sample means will be closer to normal. III. The variability of the sample means will be greater. A) I only B) II only

__ 3.

C) III only

E) II and III only

A coffee house owner knows that customers pour different amounts of coffee into their cups. She samples cups from 10 costumers she believes to be representative of the customers and weighs the cups, finding a mean of 12.5 ounces and standard deviation of 0.5 ounces. Assuming these cups of coffee can be considered a random sample of all cups of coffee which of the following formulas gives a 95% confidence interval for the mean weight of all cups of coffee?  0.5   0.5   0.5  A) 12.5  1.969   B) 12.5  2.228   C) 12.5  2.262    10   10   10 

 0.5  D) 12.5  2.228    9 __ 4.

D) I and III only

 0.5  E) 12.5  2.262    9

Which is true about a 95% confidence interval based on a given sample? I. The interval contains 95% of the population. II. Results from 95% of all samples will lie in the interval. III. The interval is narrower than a 98% confidence interval would be. A) None B) I only

C) II only

D) III only

E) II and III only

__ 5.

A truck company wants on-time delivery for 98% of the parts they order from a metal manufacturing plant. They have been ordering from Hudson Manufacturing but will switch to a new, cheaper manufacturer (Steel-R-Us) unless there is evidence that this new manufacturer cannot meet the 98% on-time goal. As a test the truck company purchases a random sample of metal parts from Steel-R-Us, and then determines if these parts were delivered on-time. Which hypothesis should they test? H 0 : p  0.98 H 0 : p  0.98 H 0 : p  0.98 A) B) C) H A : p  0.98 H A : p  0.98 H A : p  0.98 H 0 : p  0.98 H 0 : p  0.98 D) E) H A : p  0.98 H A : p  0.98

__ 6.

We are about to test a hypothesis using data from a well-designed study. Which is true? I. A small P-value would be strong evidence against the null hypothesis. II. We can set a higher standard of proof by choosing  = 10% instead of 5%. III. If we reduce the alpha level, we reduce the power of the test. A) None

B) I only

C) II only

D) III only

.

E) I and III only


Part IV-8

Part IV From the Data at Hand to the World at Large

__ 7.

A pharmaceutical company investigating whether drug stores are less likely than food markets to remove over-the-counter drugs from the shelves when the drugs are past the expiration date found a P-value of 2.8%. This means that: A) 2.8% more drug stores remove over-the-counter drugs from the shelves when the drugs are past the expiration date. B) 97.2% more drug stores remove over-the-counter drugs from the shelves when the drugs are past the expiration date than drug stores. C) There is a 2.8% chance the drug stores remove more expired over-the-counter drugs. D) There is a 97.2% chance the drug stores remove more expired over-the-counter drugs. E) None of these. __ 8. To plan the course offerings for the next year a university department dean needs to estimate what affect the “No Child Left Behind” legislation might have on the teacher credentialing program. Historically, 40% of this university’s pre-service teachers have qualified for paid internship positions each year. The Dean of Education looks at a random sample of internship applications to see what proportion indicate the applicant has achieved the content-mastery that is required for the internship. Based on these data he creates a 90% confidence interval of (33%, 41%). Could this confidence interval be used to test the hypothesis H 0 : p  0.40 versus H A : p  0.40 at the  = 0.05 level of significance? A) Yes, since 40% is in the confidence interval he accepts the null hypothesis, concluding that the percentage of applicants qualified for paid internship positions will stay the same. B) Yes, since 40% is in the confidence interval he fails to reject the null hypothesis, concluding that there is not strong enough evidence of any change in the percent of qualified applicants. C) Yes, since 40% is not the center of the confidence interval he rejects the null hypothesis, concluding that the percentage of qualified applicants will decrease. D) No, because he should have used a 95% confidence interval. E) No, because the dean only reviewed a sample of the applicants instead of all of them. __ 9. Suppose that a conveyor used to sort packages by size does not work properly. We test the conveyor on several packages (with H 0 : incorrect sort) and our data results in a P-value of 0.016. What probably happens as a result of our testing? A) Correctly fail to reject H 0 . B) Correctly reject H 0 . C) Reject H 0 , making a Type I error. D) Reject H 0 , making a Type II error. E) Fail to reject H 0 , committing a Type II error. __ 10. We test the hypothesis that p  35% versus p  35% . We don’t know it but actually p  26% . With which sample size and significance level will our test have the greatest power? A)   0.01, n  250 B)   0.01, n  400 C)   0.03, n  250 D)   0.03, n  400 E) The power will be the same as long as the true proportion p remains 26%

.


Review of Part IV: From the Data at Hand to the World at Large

Part IV-9

11. Dolphin births A state has two aquariums that have dolphins, with more births recorded at the larger aquarium than at the smaller one. Records indicate that in general babies are equally likely to be male or female, but the gender ratio varies from season to season. Which aquarium is more likely to report a season when over two-thirds of the dolphins born were males? Explain.

12. Annual review You are up for your annual job performance review. You estimate there’s a 30% chance you’ll get a promotion, a 40% chance of a raise, and a 20% chance of getting both a raise and a promotion. a. Find the probability that you get a raise or promotion.

b. Are the raise and the promotion independent events? Explain.

13. Pumpkin pie A can of pumpkin pie mix contains a mean of 30 ounces and a standard deviation of 2 ounces. The contents of the cans are normally distributed. What is the probability that four randomly selected cans of pumpkin pie mix contain a total of more than 126 ounces?

14. Depression A recent psychiatric study from the University of Southampton observed a higher incidence of depression among women whose birth weight was less than 6.6 pounds than in women whose birth weight was over 6.6 pounds. Based on a P-value of 0.0248 the researchers concluded there was evidence that low birth weights may be a risk factor for susceptibility to depression. Explain in context what the reported P-value means.

.


Part IV-10

Part IV From the Data at Hand to the World at Large

15. Truckers On many highways state police officers conduct inspections of driving logbooks from large trucks to see if the trucker has driven too many hours in a day. At one truck inspection station they issued citations to 49 of 348 truckers that they reviewed. a. Based on the results of this inspection station, construct and interpret a 95% confidence interval for the proportion of truck drivers that have driven too many hours in a day.

b. Explain the meaning of “95% confidence” in part A.

16. Graduation tests Many states mandate tests that have to be passed in order for students to graduate with a high school diploma. A local school superintendent believes that after-school tutoring will improve the scores of students in his district on the state’s graduation test. A tutor agrees to work with 15 students for a month before the superintendent will approach the school board about implementing an after-school tutoring program. The after-school tutoring program will be implemented if student scores increase by more than 20 points. The superintendent will test a hypothesis using   0.02 . a. Write appropriate hypotheses (in words and in symbols).

b. In this context, which do you consider to be more serious – a Type I or a Type II error? Explain.

c. After this trial produced inconclusive results, the superintendent decided to test the after-school tutoring program again with another group of students. Describe two changes he could make in the trial to increase the power of the test, and explain the disadvantages of each.

.


Review of Part IV: From the Data at Hand to the World at Large

Part IV-11

Statistics Test B – Inference – Part IV – Key 1. B

2. B

3. C

4. D

5. C

6. E

7. E

8. B

9. C

10. D

11. Dolphin births: The smaller aquarium would experience more variability in the season percentage of male births. We would expect the larger aquarium to stay more consistent and closer to the 50-50 ratio for gender births. 12. Approval Annual review a. P(raise or prom.) = P(raise) + P(prom.) – P(raise and prom.) = 0.4 + 0.3 – 0.2 = 0.5 or from Venn diagram at right, 50%. b. P(raise | promotion) = P(raise  promotion) 0.20   0.67  P(raise) P(promotion) 0.30 No, the raise and the promotion are not independent events. The probability of a raise is higher if you get the promotion. 13. Pumpkin pie: sample

Using the Central Limit Theorem approach, let y = average content of cans in

2   Since the contents are Normally distributed, y is modeled by N  30, . 4  126  31.5  30    P y    P( y  31.5)  P  z    P  z  1.5   0.067 There is about a 6.7% chance 4  1    that 4 randomly selected cans will contain a total of over 126 ounces. 14. Depression If birth weight was not a risk factor for susceptibility to depression, an observed difference in incidence of depression this large (or larger) would occur in only 2.48% of such samples.

.


Part IV-12

Part IV From the Data at Hand to the World at Large

15. Truckers a. Conditions: * Independence: We assume that one trucker’s driving times do not influence other trucker’s driving times. * Random Condition: We assume that trucks are stopped at random. * 10% Condition: This sample of 348 truckers is less than 10% of all truckers. * Success/Failure: 49 and 299 tickets both at least 10, so our sample is large enough. Under these conditions the sampling distribution of the proportion can be modeled by a Normal model. We will find a one-proportion z-interval.

ˆˆ  0.14 .86   0.0186 pq  n 348 The sampling model is Normal, for a 95% confidence interval the critical value is z *  1.96 . The margin of error is ME  z *  SE ( pˆ )  1.96  0.0186   0.0365 . The 95% confidence interval is 0.14 ± 0.0365 or (0.1035, 0.1765). We are 95% confident that between 10.4% and 17.7% of truck drivers have driven too many hours in a day. b. If we repeated the sampling and created new confidence intervals many times we would expect about 95% of those intervals to contain the actual proportion of truck drivers that have driven too many hours in a day. We know n  348 and pˆ  0.14 , so SE ( pˆ ) 

16. Graduation Tests a. H 0 :   20 ; The difference between the mean number of points before and after the tutoring program is not more than 20. H A :   20 ; The difference between the mean number of points before and after the tutoring program is more than 20. b. A Type I error would be very expensive for the school district. A Type I error would mean that the superintendent thought the tutoring program was working better than it was, so he would implement the program. In reality, the after-school tutoring program did not improve student test scores, so the school district would be spending money on a program that did not help. (Note: Students could argue that a Type II error would be worse.) c. To increase the power of the test, we could increase the level of significance () or increase the sample size. By increasing the level of significance, it could lead to adopting a tutoring program that actually doesn’t help. By increasing the sample size, the trial cost would increase.

.


Review of Part IV: From the Data at Hand to the World at Large

Statistics Test C – Inference – Part IV __ 1.

Name ____________________

A certain population is strongly skewed to the left. We want to estimate its mean, so we collect a sample. Which should be true if we use a large sample rather than a small one? I. The distribution of our sample data will be more clearly skewed to the left. II. The sampling model of the sample means will be more skewed to the left. III. The variability of the sample means will greater. A) I only

__ 2.

Part IV-13

B) II only

C) III only

D) I and III only

E) II and III only

Which is true about a 99% confidence interval based on a given sample? I. The interval contains 99% of the population. II. Results from 99% of all samples will lie in this interval. III. The interval is wider than a 95% confidence interval would be. A) None

B) I only

C) II only

D) III only

E) II and III only

__ 3.

Political analysts estimate the probability that Candidate A will run for president in 2016 is 45%, and the probability that Candidate B will run is 20%. If their political decisions are independent, then what is the probability that only Candidate A runs for president? A) 9% B) 11% C) 25% D) 36% E) 45%

__ 4.

An online catalog company wants on-time delivery for at least 90% of the orders they ship. They have been shipping orders via UPS and FedEx but will switch to a more expensive service (ShipFast) if there is evidence that this service can exceed the 90% on-time goal. As a test the company sends a random sample of orders via ShipFast, and then makes follow-up phone calls to see if these orders arrived on time. Which hypotheses should they test? H 0 : p  0.90 H 0 : p  0.90 H 0 : p  0.90 H 0 : p  0.90 H 0 : p  0.90 B) C) D) E) A) H A : p  0.90 H A : p  0.90 H A : p  0.90 H A : p  0.90 H A : p  0.90

__ 5.

A researcher investigating whether joggers are less likely to get colds than people who do not jog found a P-value of 3%. This means that: A) 3% of joggers get colds. B) Joggers get 3% fewer colds than non-joggers. C) There’s a 3% chance that joggers get fewer colds. D) There’s a 3% chance that joggers don’t get fewer colds. E) None of these.

__ 6.

To plan the budget for next year a college needs to estimate what affect the current economic downturn might have on student requests for financial aid. Historically this college has provided aid to 35% of its students. Officials look at a random sample of this year’s applications to see what proportion indicate a need for financial aid. Based on these data they create a 90% confidence interval of (32%, 40%). Could this confidence interval be used to test the hypothesis H 0 : p  0.35 versus H A : p  0.35 at the   0.10 level of significance? A) No, because financial aid amounts may not be normally distributed. B) No, because they only used a sample of the applicants instead of all of them. C) Yes; since 35% is in the confidence interval they accept the null hypothesis, concluding that the percentage of students requiring financial aid will stay the same. D) Yes; since 35% is in the confidence interval they fail to reject the null hypothesis, concluding that there is not strong evidence of any change in financial aid requests. E) Yes; since 35% is not at the center of the confidence interval they reject the null hypothesis, concluding that the percentage of students requiring aid will increase.

.


Part IV-14

__ 7.

Part IV From the Data at Hand to the World at Large

We are about to test a hypothesis using data from a well-designed study. Which is true? I. A large P-value would be strong evidence against the null hypothesis. II. We can set a higher standard of proof by choosing  = 10% instead of 5%. III. If we reduce the risk of committing a Type I error, then the risk of a Type II error will also decrease. A) None

__ 8.

B) I only

C) II only

D) III only

E) I and II only

Suppose that a device advertised to increase a car’s gas mileage really does not work. We test it on a small fleet of cars (with H0: not effective), and our data results in a P-value of 0.004. What probably happens as a result of our experiment? A) We correctly fail to reject H0. B) We correctly reject H0. C) We reject H0, making a Type I error. D) We reject H0, making a Type II error. E) We fail to reject H0, committing Type II error.

__ 9.

We will test the hypothesis that p = 60% versus p > 60%. We don’t know it, but actually p is 70%. With which sample size and significance level will our test have the greatest power? A)   0.01, n = 200 B)   0.01, n = 500 C)   0.05, n = 200 D)   0.05, n = 500 E) The power will be the same so long as the true proportion p remains 70%.

__ 10. A wildlife biologist wants to determine the mean weight of adult red squirrels. She captures 10 squirrels she believes to be representative of the species and weighs them, finding a mean of 12.32 grams and standard deviation of 1.88gm. Assuming these squirrels can be considered a random sample of all red squirrels which of the following formulas gives a 95% confidence interval for the mean weight of all squirrels? A) 12.32  1.96(1.88 / 10)

B) 12.32  2.228(1.88 / 10)

D) 12.32  2.228(1.88 / 9)

E) 12.32  2.268(1.88 / 9)

C) 12.32  2.262(1.88 / 10)

11. Births A city has two hospitals, with many more births recorded at the larger hospital than at the smaller one. Records indicate that in general babies are about equally likely to be boys or girls, but the actual gender ratio varies from week to week. Which hospital is more likely to report a week when over two-thirds of the babies born were girls? Explain.

12. Traffic accidents Police reports about the traffic accidents they investigated last year indicated that 40% of the accidents involved speeding, 25% involved alcohol, and 10% involved both risk factors. a. What is the probability that an accident b. Do these two risk factors appear to be involved neither alcohol nor speed? independent? Explain.

.


Review of Part IV: From the Data at Hand to the World at Large

Part IV-15

13. Egg weights The weights of hens’ eggs are normally distributed with a mean of 56 grams and a standard deviation of 4.8 grams. What is the probability that a dozen randomly selected eggs weighs over 690 grams?

14. Roadblocks From time to time police set up roadblocks to check cars to see if the safety inspection is up to date. At one such roadblock they issued tickets for expired inspection stickers to 22 of 628 cars they stopped. a. Based on the results at this roadblock, construct and interpret a 95% confidence interval for the proportion of autos in that region whose safety inspections have expired.

b. Explain the meaning of “95% confidence” in part a).

.


Part IV-16

Part IV From the Data at Hand to the World at Large

15. Baldness and heart attacks A recent medical study observed a higher frequency of heart attacks among a group of bald men than among another group of men who were not bald. Based on a P-value of 0.062 the researchers concluded there was some evidence that male baldness may be a risk factor for predicting heart attacks. Explain in this context what their P-value means.

16. Housing costs A government report on housing costs says that single-family home prices nationwide are skewed to the right, with a mean of $235,700. We want to see how home prices in Orange County, California compare with those nationwide. a. We collect price data from a random sample of 50 homes in Orange County, California. Why is it okay to use these data for inference even though the population is skewed?

b. The standard deviation of the 50 homes in our sample was $25,500. Specify the sampling model for the mean price of such samples.

c. This sample of randomly chosen homes produced a 90% confidence interval for the mean price in Orange County of ($233954, $246046). Does this interval provide evidence that single-family home prices are unusually high in this county? Explain briefly.

.


Review of Part IV: From the Data at Hand to the World at Large

Part IV-17

Statistics Test C – Inference for Proportions – Part IV – Key 1. A

2. D

3. D

4. E

5. E

6. D

7. A

8. C

9. D

10. C

11. Births: The smaller hospital should experience greater variability in the weekly percentage of male babies. We’d expect the larger hospital to stay closer to the expected 50-50 ratio. 12. Traffic accidents a. 0.45 b. Yes; 25% of all speeding accidents involved alcohol – same as overall. 13. Egg weights:    690  57.5  56   P y    P  z  4.8   P( z  1.08)  14% 12       12  

14. Roadblocks a. We assume the cars stopped are representative of all cars in the area, and that 628 < 10% of the cars. The 22 successes (expired stickers) and 606 failures are both greater than 10. It’s okay to use a Normal model and find a 1-proportion z-interval. (0.035)(0.965) 0.035  1.96  (0.021, 0.049) pˆ  0.035 ; 628 The interval (0.021,0.049) means that we are 95% confident that between 2% and 5% of local cars have expired safety inspection stickers. b. If we repeated the sampling and created new confidence intervals many times we’d expect about 95% of those intervals to contain the actual proportion of cars with expired stickers. 15. Baldness and heart attacks: If baldness is not a risk factor an observed level of heart attacks this much higher (or more) would occur in only 6% of such samples. 16. Housing costs a. According to the Central Limit Theorem, the sample size is large enough so that the sampling distribution of the mean is approximately Normal. b. t with 49 degrees of freedom c. No, this county’s mean home price could be the same as the national average because $235,700 is in the confidence interval.

.



Chapter 17 Comparing Groups Solutions to Class Examples 1. See Class Example 1. 2. Solutions to Class Example 2:

 They purchased both name brand (Perdue) and store brand chicken in 25 U.S. cities.

Laboratory tests found campylobacter contamination in 33% of the 75 Perdue packages, and in 45% of the 75 store brand packages. Does this indicate that shoppers would be safer buying the name brand product?

Hypotheses. The null hypothesis is that the proportion of contaminated packages of Perdue chicken is the same as the proportion of contaminated packages of store brand chicken. The alternative hypothesis is that the proportion of contaminated packages of store brand chicken is greater than the proportion of contaminated packages of Perdue chicken. In symbols: H 0 : pS  pP or pS  pP  0,

H A : pS  pP or pS  pP  0

Model. Independence Assumption: The packages may NOT be mutually independent. Packages that are handled improperly might be more likely to be contaminated, and those packages may be delivered to the same stores. Random Sampling Condition: There is no mention of randomness. Hopefully, Consumer Reports used randomness in selection of packages. 10% Condition: 75 Perdue and 75 store brand are each less than 10% of all packages. Independent Groups Assumption: The groups might NOT be independent. We have to assume that Consumer Reports didn’t purchase packages of each brand from each store. Certain stores might handle the chicken improperly, leading to contamination of both brands. Success/Failure Condition: nS pˆ S  75(0.45)  34 , nS qˆS  75(0.55)  41 , nP pˆ P  75(0.33)  25 , nP qˆ P  75(0.67)  50 , which are all at least 10, so the samples are large enough. The conditions may be satisfied, so I’ll use the Normal model for the difference in two proportions. I will perform a two proportion z-test. Mechanics.

nS  75 , yS  34 , pˆ S  34 75  0.453 25 nP  75 , yP  25 , pˆ P  75  0.333

pˆ S  pˆ P  0.453  0.333  0.120

.


17-2

Part V Inference for Relationships

z

 pˆ S  pˆ P    pS  pP  pˆ S qˆ S pˆ P qˆ P  nS nP

(0.453  0.333)  0 (0.453)(0.547) (0.333)(0.667)  75 75 z  1.516 z

P  value = P (( pˆ S  pˆ P )  0.120)  P( z  1.50)  0.065 Conclusion. The P-value is beyond 5%, so we cannot reject the null hypothesis. There is no more than weak evidence to suggest that the proportion of contaminated packages is greater for store brand than for Perdue chicken. We’d like to know more about the study design before making definitive statements, though.

 They also found campylobacter in 56% of 75 Tyson brand packages. Create a 90%

confidence interval for the difference in contamination levels between Tyson and Perdue chicken.

The conditions have been met for the hypothesis test, so we can construct a two proportion z-interval for the difference in two proportions, with 90% confidence. ( pˆ T  pˆ P )  z *

pˆ T qˆT pˆ P qˆ P  75 75

( pˆ T  pˆ P )  1.645

(0.56)(0.44) (0.333)(0.6667)  75 75

(0.097, 0.357)

I am 90% confident that the rate of Tyson package contamination is between 9.7 and 35.7 percentage points higher than the rate of Perdue package contamination. 3. See Class Example 3. 4. Solution to Class Example 4: We want to know if the two pizza chains have significantly different mean saturated fat contents. Hypotheses. The null hypothesis is that there is no difference is mean saturated fat content. The alternative hypothesis is that there is a difference in mean saturated fat content. H 0 :  D   PJ  0 H A :  D   PJ  0 .


Chapter 17 Comparing Groups

17-3

Model. Independent Groups Assumption – The two samples of saturated fat contents were chosen independently of one another. Randomization Condition: There is no mention of randomness, so we will assume that these pizzas are representative of all pizzas by these two chains.

10 8 6 4 2 5.0

Nearly Normal Condition: Both distributions of saturated fat content are roughly unimodal and symmetric. Since the conditions have been met, we can do a two sample ttest for the difference of means, with 32.757 degrees of freedom (from the approximation formula). Mechanics.

nD  20

yD  11.25

sD  3.193

nPJ  15

yPJ  6.53

sPJ  2.588

yD  yPJ  4.72

t

10.0

15.0

Brand D

6 4 2

2

df  32.757 (from technology)

6

10

Brand PJ

 yD  yPJ    D  PJ  P  2  P( yD  yPJ  4.72)

2 sD2 sPJ  nD nPJ (11.25  6.53)  0 t 3.1932 2.5882  20 15 t  4.823

 2  P(t32.8  4.823)  0.00003

Conclusion. Since the P-value is very low, we reject the null hypothesis. There is strong evidence to suggest that the two pizza brands have different mean saturated fat content. Brand D appears to have more saturated fat on average than Brand PJ. Follow-up. The conditions have been met, so we can create a two-sample t-interval for difference in means, with 95% confidence. yD  yPJ  t

* 32.8

 SE ( yD  yPJ )  4.72  t

* 32.8

 3.1932 2.5882      (2.73, 6.71)  20 15  

I am 95% confident that the average saturated fat content for Brand D is between 2.73 and 6.71 grams higher than the average saturated fat content for Brand PJ. 5. See Class Example 5.

.


17-4

Part V Inference for Relationships

Statistics Quiz A – Chapter 17

Name

The adoption of smart phones and social media apps has been much discussed. One aspect of these devices is that students tend to be distracted by their phones during class. A recent survey reported that 86% of 1004 randomly sampled females used their smart phones during class time, compared to 81% of 1009 randomly sampled males. Do these results confirm a higher smart phone usage by females? 1. Test an appropriate hypothesis and state your conclusions.

2. Find a 99% confidence interval for the difference in the proportion of males and females who use their phone during class. Interpret your interval.

.


Chapter 17 Comparing Groups

17-5

Statistics Quiz A – Chapter 17 – Key 1. H 0 : pM  pF  0 , H A : pM  pF  0 , where M = males and F = Females. * Randomization Condition: The males and females were randomly sampled. * 10% Condition: The number of males and females is greater than 10,090 (10 × 1009) and 10,040 (10 × 1004), respectively. * Independent samples condition: The two groups are independent of each other (as long as the random sample chose people who were not in class together). * Success/Failure Condition: Of the males, approximately 817 used their phones and 192 did not; of the females, approximately 863 used their phones and 141 did not. The observed number of both successes and failures in both groups is larger than 10. Because the conditions are satisfied, we can model the sampling distribution of the difference in proportions with a Normal model. We can perform a two-proportion z-test. We know: nM  1009 , pˆ M  0.81 , nF  1004 , pˆ F  0.86 . We estimate SD( pˆ M  pˆ F ) as

SE  pˆ M  pˆ F  

 0.81 0.19    0.86  0.14   0.0165

pˆ M qˆ M pˆ F qˆ F   nM nF

1009

1004

The observed difference in sample proportions = pˆ B  pˆ C = 0.81 – 0.86 = –0.05. z

 pˆ M  pˆ F   0  0.05  0  3.03 , so the P-value  P( z  3.03)  0.0012

SE ( pˆ M  pˆ F ) 0.0165 The P-value of 0.0012 is low, so we reject the null hypothesis. There is strong evidence that the percentage of males who used their phones in class is less than the percentage of females who did so. 2. With the conditions satisfied (from Problem 1), the sampling distribution of the difference in proportions is approximately Normal with a mean of pM  pF , the true difference between the population proportions. We can find a two-proportion z-interval. We know: nM  1009 , pˆ M  0.81 , nF  1004 , pˆ F  0.86 . We estimate SD( pˆ M  pˆ F ) as

SE ( pˆ M  pˆ F ) 

 0.81 0.19    0.86  0.14   0.0165

pˆ M qˆ M pˆ F qˆ F   nM nF

1009

1004

ME  z *  SE ( pˆ M  pˆ F )  2.576(0.0165)  0.0425 The observed difference in sample proportions = pˆ M  pˆ F = 0.81 – 0.86 = –0.05, so the 99% confidence interval is 0.05  0.0425 , or –9.3% to –0.8%. We are 99% confident that the proportion of males who used their phones in class is between 0.8-percentage points and 9.3-percentage points lower than the proportion of females who did so.

.


17-6

Part V Inference for Relationships

Statistics Quiz B – Chapter 17

Name

In 2010, the United Nations claimed that there was a higher rate of illiteracy in men than in women from the country of Qatar. A humanitarian organization went to Qatar to conduct a random sample. The results revealed that 45 out of 234 men and 42 out of 251 women were classified as illiterate on the same measurement test. Do these results indicate that the United Nations findings were correct? 1. Test an appropriate hypothesis and state your conclusion.

2. Find a 95% confidence level for the difference in the proportions of illiteracy in men and women from Qatar. Interpret your interval.

.


Chapter 17 Comparing Groups

17-7

Statistics Quiz B – Chapter 17 – Key

1. H 0 : pm  pw  0 , H A : pm  pw  0 , where m = men and w = women. * Randomization Condition: The men and women in the sample were randomly selected by the humanitarian organization. * 10% Condition: The number of men and women in Qatar is greater than 2340 (10 × 234) and 2510 (10 × 251), respectively. * Independent samples condition: The two groups are independent of each other because the samples were selected at random. * Success/Failure Condition: In men, 45 were illiterate and 189 were not. In women, 42 were illiterate and 209 were not. The observed number of both successes and failures in both groups is larger than 10 both groups. Because the conditions are satisfied, we can model the sampling distribution of the difference in proportions with a Normal model. We can perform a two-proportion z-test. We know: nm  234 , , nw  251 , pˆ w  0.167 . pˆ m qˆm pˆ wqˆw   nm nw

SE ( pˆ m  pˆ w ) 

 0.192  0.808   0.167  0.833  0.0349 234

251

The observed difference in sample proportions = pˆ m  pˆ w = 0.192 – 0.167 = 0.025.

 pˆ m  pˆ w   0

0.025  0 P  P( z  0.716)  0.24  0.716 0.0349 SE pooled ( pˆ m  pˆ w ) The P-value of 0.24 is high, so we fail to reject the null hypothesis. There is insufficient evidence to conclude that illiteracy rate in men is higher than for women in Qatar. z

2. With the conditions satisfied (from Problem 1), the sampling distribution of the difference in proportions is approximately Normal with a mean of pm  pw , the true difference between the population proportions. We can find a two-proportion z-interval. We know: nm  234 , pˆ m  0.192 , nw  251 , and pˆ w  0.167 . We estimate SD( pˆ m  pˆ w ) as SE ( pˆ m  pˆ w ) 

pˆ m qˆm pˆ wqˆw   nm nw

 0.192  0.808   0.167  0.833  0.0349 234

251

ME  z  SE ( pˆ m  pˆ w )  1.96(0.0349)  0.0684 The observed difference in sample proportions = pˆ m  pˆ w = 0.192 – 0.167 = 0.025, so the 95% confidence interval is 0.025  0.0684 , or –4% to 9%. *

We are 95% confident that the proportion of illiterate men in Qatar is between 4-percentage points lower and 9-percentage points higher than the proportion of illiterate women.

.


17-8

Part V Inference for Relationships

Statistics Quiz C – Chapter 17

Name

A total of 23 Gossett High School students were admitted to State University. Of those students, 7 were offered athletic scholarships. The school’s guidance counselor looked at their composite ACT scores (shown in the table), wondering if State U. might admit people with lower scores if they also were athletes. Assuming that this group of students is representative of students throughout the state, what do you think? 1. Test an appropriate hypothesis and state your conclusion.

2. Create and interpret a 90% confidence interval.

.

Composite ACT Score Non-athletes Athletes 25 21 22 22 27 21 19 29 24 25 26 27 24 30 19 25 27 23 24 26 17 23 23


Chapter 17 Comparing Groups

17-9

Statistics Quiz C – Chapter 17 – Key

1. Let: mean ACT score of non-athletes = 1 and mean ACT score of athletes = 2 H 0 : 1  2  0 H A : 1  2  0 5 * Independent group assumption: Non-athletes and athletes 4 are definitely independent groups. 3 * Randomization condition: These are not a random 2 samples, but they should be representative of athletes and non-athletes. 1 * 10% condition: The samples represent less than 10% of all 15 19 23 27 possible non-athletes and less than 10% of all possible Non-athletes athletes who could be admitted to State U. from Gossett High School. 2 * Nearly Normal condition: The histograms show that both sets are unimodal and roughly symmetric. 1 Because the conditions are satisfied, it is appropriate to model the sampling distribution of the difference in the means with a Student’s t-model, and we will perform a twosample t-test. 15 19 23 Athletes n1  16 y1  24.75 s1  2.84 n2  7 y2  21.86 s2  3.29 SE ( y1  y2 )  SE 2  y1   SE 2  y2  

31

27

s12 s22 2.842 3.292     1.432 n1 n2 16 7

The observed difference is y1  y2 = 24.75 – 21.86 = 2.89 t

 y1  y2   0  2.89  0  2.02

SE ( y1  y2 ) 1.432 The sampling distribution model has 10.13 degrees of freedom. P(t10.13  2.02)  0.0352 . Since the P-value is low, we reject the null hypothesis. There is evidence of a difference between the mean ACT scores for non-athletes and athletes. 2. We wish to create an interval that is likely to contain the true difference between the mean ACT score for non-athletes and athletes. Under these conditions, the sampling model of the difference in the sample means can be modeled by a Student’s t-model with about 10 degrees of freedom (from calculator). We will use a two-sample t-interval. The 90% confidence interval is: (0.30, 5.48).

We are 90% confident that non-athletes average between 0.30 points and 5.48 points higher than athletes on the ACT.

.


17-10

Part V Inference for Relationships

Statistics Quiz D – Chapter 17

Name

1. A survey of randomly chosen adults found that 38 of the 61 women and 46 of the 83 men follow regular exercise programs. Construct a 95% confidence interval for the difference in the proportions of women and men who have regular exercise programs. A. (–0.124, 0.786) B. (–0.093, 0.231) C. (–0.124, 0.815) D. (0.460, 0.786) E. (0.430, 0.815) 2. Suppose the proportion of women who follow a regular exercise program is pw and the proportion of men who follow a regular exercise program is pm . A study found a 98% confidence interval for pw  pm is (–0.024, 0.115). Give an interpretation of this confidence interval. A. We are 98% confident that the proportion of women who follow a regular exercise program is between 2.4% less and 11.5% more than the proportion of men who follow a regular exercise program. B. We are 98% confident that the proportion of men who follow a regular exercise program is between 2.4% less and 11.5% more than the proportion of women who follow a regular exercise program. C. We know that 98% of men exercise between 2.4% less and 11.5% more than women. D. We know that 98% of all random samples done on the population will show that the proportion of women who follow a regular exercise program is between 2.4% less and 11.5% more than the proportion of men who follow a regular exercise program. E. We know that 98% of women exercise between 2.4% less and 11.5% more than men. 3. A two-sample z-test for two population proportions is to be performed using the P-value approach. The null hypothesis is H 0 : p1  p2 and the alternative is H A : p1  p2 . Use the given sample data to find the p-value for the hypothesis test. Give an interpretation of the p-value. n1  200, x1  11, n2  100, and x2  8 A. P-value = 0.402; If there is no difference in the proportions, there is a 40.2% chance of seeing the exact observed difference by natural sampling variation. B. P-value = 0.1011; There is about a 10.11% chance that the two proportions are equal. C. P-value = 0.0201; There is about a 2.01% chance that the two proportions are equal D. P-value = 0.402; If there is no difference in the proportions, there is about a 40.2% chance of seeing the observed difference or larger by natural sampling variation. E. P-value = 0.0201; If there is no difference in the proportions, there is about a 2.01% chance of seeing the observed difference or larger by natural sampling variation.

.


Chapter 17 Comparing Groups

17-11

4. A study was conducted to determine if patients recovering from knee surgery should receive physical therapy two or three times per week. Suppose p1 represents the proportion of patients who showed improvement after one month of therapy three times a week and p2 represents the proportion of patients who showed improvement after one month of therapy twice a week. A 90% confidence interval for p1  p2 is (0.12, 0.38). Give an interpretation of this confidence interval. A. We are 90% confident, based on the data, that patients who receive therapy three times per week show between 12% and 38% more improvement after one month than patients who receive therapy twice a week. B. We know that 90% of all knee-surgery patients will show between 12% and 38% more improvement with therapy three times per week instead of twice a week. C. We are 90% confident, based on the data, that the proportion of patients who show improvement with therapy three times per week is between 12% and 38% higher than the proportion who show improvement with therapy twice a week. D. We know that 90% of all knee-surgery patients will show between 12% and 38% more improvement with therapy twice a week instead of three times per week. E. We are 90% confident, based on the data, that the proportion of patients who show improvement with therapy three times per week is between 12% and 38% lower than the proportion who show improvement with therapy twice a week. 5. Two types of flares are tested for their burning times (in minutes) and sample results are given below. Brand A: n  35, Y  19.4, s  1.4 Brand B: n  40, Y  15.1, s  0.8 Construct a 95% confidence interval for the difference  A   B based on the sample data. A. (3.8, 4.8) B. (3.6, 5.0) C. (3.2, 5.4) D. (3.5, 5.1) E. (–4.7, –3.9)

.


17-12

Part V Inference for Relationships

6. A grocery store is interested in determining whether or not a difference exists between the shelf life of Tasty Choice doughnuts and Sugar Twist doughnuts. A random sample of 100 boxes of each brand was selected and the mean shelf life in days was determined for each brand. A 90% confidence interval for the difference of the means, TC   ST , was determined to be (1.3, 2.5). A. We know that 90% of Tasty Choice doughnuts last between 1.3 and 2.5 days longer than Sugar Twist doughnuts. B. Based on this sample, we are 90% confident that Sugar Twist doughnuts will last on average between 1.3 and 2.5 days longer than Tasty Choice doughnuts. C. We are 90% confident that a randomly selected box of Tasty Choice doughnuts will have a shelf life that is between 1.3 and 2.5 days longer than a randomly selected box of Sugar Twist doughnuts. D. We know that 90% of all random samples done on the population will show that the mean shelf life of Tasty Choice doughnuts is between 1.3 and 2.5 days longer than the mean shelf life of Sugar Twist doughnuts. E. Based on this sample, we are 90% confident that Tasty Choice doughnuts will last on average between 1.3 and 2.5 days longer than Sugar Twist doughnuts. 7. A researcher was interested in comparing the number of hours of television watched each day by two-year-olds and three-year-olds. A random sample of 18 two-year-olds and 18 three-year-olds yielded the follow data. 2y 3y

0.5 2.0

1.5 3.0

1.5 1.5

2.0 1.5

1.5 1.5

0.0 2.0

1.0 1.0

1.0 0.0

1.0 0.0

0.0 1.5

2.0 1.5

1.5 2.0

2.5 2.5

2.0 2.0

0.5 3.0

0.0 1.0

1.5 1.5

2.5 0.5

Find a 98% confidence interval for the difference, 2  3 , between the mean number of hours for two-year-olds (2y) and the mean number of hours for three-year-olds (3y). A. (1.25, 1.56) B. (–1.56, –1.25) C. (–0.98, –0.37) D. (–0.37, 0.98) E. (–0.98, 0.37) 8. A researcher is interested in the academic performance differences between individuals using an optimistic versus a pessimistic approach to their studies. If the researcher fails to find a significant difference, when in fact one exists in the population: A. the null hypothesis was correctly accepted. B. the research hypothesis was correctly accepted. C. the null hypothesis was correctly rejected. D. a Type 2 error has been made. E. a Type 1 error has been made.

.


Chapter 17 Comparing Groups

17-13

9. Results of a small experiment show that people are likely to offer a different amount for used exercise equipment when bargaining with a friend than when bargaining with a stranger. The p-value from testing the difference in mean offers was equal to 0.00162. At an   0.05 , the correct conclusion is to A. reject the null hypothesis. B. fail to reject the null hypothesis. C. conclude that there is no difference between bargaining with a stranger versus a friend. D. Both A and C. E. Both B and C. 10. A consumer group was interested in comparing the operating time of cordless toothbrushes manufactured by two different companies. Group members took a random sample of 18 toothbrushes from Company A and 15 from Company B. Each was charged overnight and the number of hours of use before needing to be recharged was recorded. Company A toothbrushes operated for an average of 119.7 hours with a standard deviation of 1.74 hours; Company B toothbrushes operated for an average of 120.6 hours with a standard deviation of 1.72 hours. Do these samples indicate that Company B toothbrushes operate more hours on average than Company A toothbrushes? The correct hypotheses to address this question are A. H 0 : µA  µB  0 and H A : µA  µB  0. B. H 0 : µA  µB  0 and H A : µA  µB  0. C. H 0 : µA  µB  0 and H A : µA  µB  0. D. H 0 : µA  µB  0 and H A : µA  µB  0. E. None of the above.

.


17-14

Part V Inference for Relationships

Statistics Quiz D – Chapter 17 – Key

1. B 2. A 3. D 4. C 5. A 6. E 7. E 8. D 9. A 10. B

.


Chapter 17 Comparing Groups

Statistics Quiz E – Chapter 17

17-15

Name

1. A survey of randomly selected college students found that 45 of the 99 freshmen and 57 of the 100 sophomores surveyed had purchased used textbooks in the past year. Construct a 98% confidence interval for the difference in the proportions of college freshmen and sophomores who purchased used textbooks. A. (0.317, 0.593) B. (0.291, 0.593) C. (–0.279, 0.049) D. (0.291, 0.619) E. (–0.253, 0.593) 2. Suppose the proportion of sophomores at a particular college who purchased used textbooks in the past year is ps and the proportion of freshmen at the college who purchased used textbooks in the past year is p f . A study found a 90% confidence interval for ps  p f is (0.234, 0.421). Give an interpretation of this confidence interval. A. We are 90% confident that at this college the proportion of sophomores who bought used textbooks is between 23.4% and 42.1% higher than the proportion of freshmen who bought used textbooks. B. We are 90% confident that at this college the proportion of freshmen who bought used textbooks is between 23.4% and 42.1% higher than the proportion of sophomores who bought used textbooks. C. We are 90% confident that sophomores at this college bought between 23.4% and 42.1% more used textbooks than freshmen. D. We know that 90% of all random samples done on the population will show that the proportion of sophomores who bought used textbooks is between 23.4% and 42.1% higher than the proportion of freshmen who bought used textbooks. E. We know that 90% of sophomores bought between 23.4% and 42.1% more used textbooks than freshmen at this college. 3. A two-sample z-test for two population proportions is to be performed using the P-value approach. The null hypothesis is H 0 : p1  p2 and the alternative is H A : p1  p2 . Use the given sample data to find the p-value for the hypothesis test. Give an interpretation of the p-value. n1  50, x1  20, n2  75, and x2  15 A. P-value = 0.0032; If there is no difference in the proportions, there is about a 0.32% chance of seeing the observed difference or larger by natural sampling variation. B. P-value = 0.0147; If there is no difference in the proportions, there is about a 1.47% chance of seeing the observed difference or larger by natural sampling variation. C. P-value = 0.0292; If there is no difference in the proportions, there is a 2.92% chance of seeing the exact observed difference by natural sampling variation. D. P-value = 0.0032; There is about a 0.32% chance that the two proportions are equal. E. P-value = 0.0147; There is about a 1.47% chance that the two proportions are equal.

.


17-16

Part V Inference for Relationships

4. Suppose the proportion of women who favor stricter gun control legislation is pw and the proportion of men who favor stricter gun control legislation is pm. The survey found a 90% confidence interval for pw  pm is (–0.08, 0.24). Give an interpretation of this confidence interval. A. We know that 90% of all random samples done on the population will show that the proportion of men who favor stricter gun control legislation is between 8% lower and 24% higher than the proportion of women who favor stricter gun control legislation. B. We know that 90% of all random samples done on the population will show that the proportion of women who favor stricter gun control legislation is between 8% lower and 24% higher than the proportion of men who favor stricter gun control legislation. C. We are 90% confident that the proportion of men who favor stricter gun control legislation is between 8% lower and 24% higher than the proportion of women who favor stricter gun control legislation. D. We are 90% confident that women favor gun control legislation that is between 8% less strict and 24% more strict than legislation favored by men. E. We are 90% confident that the proportion of women who favor stricter gun control legislation is between 8% lower and 24% higher than the proportion of men who favor stricter gun control legislation. 5. A researcher wishes to determine whether people with high blood pressure can reduce their blood pressure by following a particular diet. Use the sample data below to construct a 99% confidence interval for 1  2 where 1 and 2 represent the mean for the treatment group and the control group respectively. Treatment Group: n1  85, Y 1  189.1, s1  38.7 Control Group: n2  75, Y 2  203.7, s2  39.2 A. (–29.0, –0.2) B. (–1.5, 30.7) C. (–30.7, 1.5) D. (–26.8, –2.4) E. (–1.3, 30.5)

.


Chapter 17 Comparing Groups

17-17

6. A researcher was interested in comparing the salaries of female and male employees of a particular company. Independent random samples of female employees (sample 1) and male employees (sample 2) were taken to calculate the mean salary, in dollars per week, for each group. A 95% confidence interval for the difference, 1  2 , between the mean weekly salary of all female employees and the mean weekly salary of all male employees was determined to be (–$180, $60). A. We know that 95% of female employees at this company make between $180 less and $60 more than the male employees. B. We know that 95% of all random samples done on the employees at this company will show that the average female salary is between $180 less and $60 more per week than the average male salary. C. We are 95% confident that a randomly selected female employee at this company makes between $180 less and $60 more per week than a randomly selected male employee. D. Based on these data, with 95% confidence, female employees at this company average between $180 less and $60 more per week than the male employees. E. Based on these data, with 95% confidence, male employees at this company average between $180 less and $60 more per week than the female employees. 7. A consumer advocate decided to investigate the average wait time for a table for one at two local restaurants. Eighteen customers were sent to each restaurant at the same randomly selected times and the time they waited for a table was recorded in minutes. The following sample data was obtained. A B

10 12

15 17

7 5

9 12

20 8

5 2

15 10

5 25

7 6

0 35

10 10

12 14

19 9

6 22

0 20

5 18

22 5

18 13

Find a 90% confidence interval for the difference,  A   B , between the mean wait time for restaurant A and the mean wait time for restaurant B. A. (10.28, 13.5) B. (-7.45, -1) C. (-1, 7.45) D. (-7.45, 1) E. (-13.5, -10.28) 8. A researcher is interested in the academic performance differences between individuals using an optimistic versus a pessimistic approach to their studies. If the researcher claims a significant difference between groups, when in fact none exists: A. the research hypothesis was correctly rejected. B. the research hypothesis was correctly accepted. C. a Type 1 error is made. D. the null hypothesis was correctly accepted. E. a Type 2 error is made.

.


17-18

Part V Inference for Relationships

9. Data were collected on annual personal time (in hours) taken by a random sample of 16 women (group1) and 7 men (group 2) employed by a medium sized company. The women took an average of 24.75 hours of personal time per year with a standard deviation of 2.84 hours. The men took an average of 21.89 hours of personal time per year with a standard deviation of 3.29 hours. The Human Resources Department believes that women tend to take more personal time than men because they tend to be the primary child care givers in the family. The correct null and alternative hypotheses to test this belief are A. H 0 : µ1  µ2  0 and H A : µ1  µ2  0. B. H 0 : µ1  µ2  0 and H A : µ1  µ2  0. C. H 0 : µ1  µ2  0 and H A : µ1  µ2  0. D. H 0 : µ1  µ2  0 and H A : µ1  µ2  0. E. None of the above. 10. Data were collected on annual personal time (in hours) taken by a random sample of 16 women (group1) and 7 men (group 2) employed by a medium sized company. The women took an average of 24.75 hours of personal time per year with a standard deviation of 2.84 hours. The men took an average of 21.89 hours of personal time per year with a standard deviation of 3.29 hours. The Human Resources Department believes that women tend to take more personal time than men because they tend to be the primary child care givers in the family. The results of the test are t  2.02 with an associated P-value of 0.0352. The correct conclusion at   0.05 is to A. reject the null hypothesis. B. fail to reject the null hypothesis. C. conclude that women take a higher average number of hours of personal time per year compared to men. D. Both B and C. E. Both A and C.

.


Chapter 17 Comparing Groups

Statistics Quiz E – Chapter 17 – Key

1. C 2. A 3. B 4. E 5. C 6. D 7. D 8. C 9. B 10. E

.

17-19



Chapter 18 Paired Samples and Blocks Solutions to Class Examples 1. See Class Example 1. 2. See Class Example 2. 3. See Class Example 3. Investigative Task This task is a continuation of the task in Chapter 14. Again, students draw a sample from a large dataset, perhaps the data we provided. They do two hypothesis tests, one based on independent samples and the other on matched pairs. Supplemental Resources After the Chapter Quizzes, you will find a set of mixed inference questions. We have found it beneficial to have students work through this in small groups.

.


18-2

Part V Inference for Relationships

Statistics Quiz A – Chapter 18

Name

Before you took this course, you probably heard many stories about Statistics courses. Oftentimes parents of students have had bad experiences with Statistics courses and pass on their anxieties to their children. To test whether actually taking Introductory Statistics decreases students’ anxieties about the field of Statistics, a Statistics instructor gave a test to rate student anxiety at the beginning and end of his course. Anxiety levels were measured on a scale of 0-10. Here are the data for 16 randomly chosen students from a class of 180 students: Pre-course anxiety level 7 6 9 5 6 7 5 7 6 4 3 2 1 3 4 2 Post-course anxiety level 4 3 7 3 4 5 4 6 5 3 2 2 1 3 4 3 Difference (Post – Pre) -3 -3 -2 -2 -2 -2 -1 -1 -1 -1 -1 0 0 0 0 1 1. Do the data indicate that anxiety levels about the field of Statistics decreases after students take Introductory Statistics? Test an appropriate hypothesis and state your conclusion.

2. Create and interpret a 90% confidence interval.

.


Chapter 18 Paired Samples and Blocks

18-3

Statistics Quiz A – Chapter 18 – Key 1. Let d = Post-course anxiety level – Pre-course anxiety level. H 0 : d  0 The mean difference in the anxiety levels is zero. 3 H A : d  0 The mean difference in the anxiety levels is less 2 than zero. 1 * Paired data: The data are paired because they are measurements on the same individuals both before and after the -3 -1 1 intro stats course. Difference (Post-Pre) * Independence: The anxiety level of any student is independent of the anxiety level of any other student, so the differences are independent. * Randomization: We are told this is a random sample from the class. * 10% condition: Our sample of 16 students is less than 10% of the students who take Introductory Statistics. * Nearly Normal condition: The histogram of the differences is unimodal and roughly symmetric. Under these conditions the sampling distribution of the differences can be modeled by a Student’s t-model with (n – 1) = 15 degrees of freedom, and we will use a paired t-test. We find from the data: n  16 , d  1.125 , and sd  1.1475 . s 1.15 We estimate the standard deviation of d using: SE (d )  d   0.287 . n 16 d  0 1.125  0 t15    3.92 , so our P-value is P(t15  3.82)  0.0007 0.287 SE (d ) With a P-value this small, we can reject the null hypothesis. We have strong evidence that taking Introductory Statistics class reduces the anxiety level of students. 5 4

2. Under these conditions the sampling distribution of the differences can be modeled by a Student’s t-model with (n – 1) = 15 degrees of freedom, and we will use a paired t-interval. The 90% critical value for t15 is 1.753 (from table). The margin of error: ME  t15*  SE (d )  1.753(0.287)  0.503 So the 90% confidence interval is 1.125  0.503 , or an interval of (-1.63, -0.62). We are 90% confident that taking An Introductory Statistics will decrease a student’s anxiety between 0.6 and 1.6 points (on our scale).

.


18-4

Part V Inference for Relationships

Statistics Quiz B – Chapter 18

Name

Most people are definitely dominant on one side of their body – either right or left. For some sports being able to use both sides is an advantage, such as batting in baseball or softball. In order to determine if there is a difference in strength between the dominant and non-dominant sides, a few switch-hitting members of some school baseball and softball teams were asked to hit from both sides of the plate during batting practice. The longest hit (in feet) from each side was recorded for each player. The data are shown in the table at the right. Does this sample indicate that there is a difference in the distance a ball is hit by batters who are switch-hitters? 1. Test an appropriate hypothesis and state your conclusion. Batter

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

2. Create and interpret a 95% confidence interval.

.

Dominant Non-dominant Side Side

142 144 153 148 146 149 138 145 153 160 163 170 169 151 152 167 164 165 163

119 118 126 119 121 125 116 120 124 138 135 144 142 128 131 141 140 140 138


Chapter 18 Paired Samples and Blocks

18-5

Statistics Quiz B – Chapter 18 – Key

1. Let d = difference in the hit lengths between the dominant side and the non-dominant side of the plate. H 0 : d  0 The mean difference in the hit lengths (in feet) is zero. H A : d  0 The mean difference in the hit lengths (in feet) is greater then zero. * Paired data: The data are paired because they are measurements on the same individual from the dominant side and the non-dominant side of the plate. * Independence: The hit length by any individual is independent of the hit length of other individuals, so the differences are independent. * Randomization: We assume these switch-hitters, though not randomly selected, are representative of all switchhitters. * 10% condition: Our inference is about hit lengths from dominant and non-dominant sides, not about individuals, so we do not need to check this condition. * Nearly Normal condition: The histogram of the differences is unimodal and symmetric. Under these conditions the sampling distribution of the differences can be modeled by a Student’s t-model with (n – 1) = 18 degrees of freedom. We will use a paired t-test. We find from the data: n  19 pairs, d  25.1 feet, and sd  2.31 feet. s 2.31 We estimate the standard deviation of d using: SE (d )  d   0.53 . n 19 d  0 25.1  0 P  P(t18  47.4)  0.0001 t18    47.4 0.53 SE (d ) With a P-value this small, we can reject the null hypothesis. We have strong evidence that hits tend to be longer from the dominant side of the plate. 2. Create and interpret a 95% confidence interval. Under these conditions the sampling distribution of the differences can be modeled by a Student’s t-model with (n – 1) = 18 degrees of freedom. We will use a paired t-interval. We find from the data: n  19 pairs, d  25.1 feet, and sd  2.31 feet. s 2.31 We estimate the standard deviation of d using: SE (d )  d   0.53 . n 19 The 95% critical value for t18 is 2.101 (from table). The margin of error: ME  t18*  SE (d )  2.101(0.53)  1.11 So the 95% confidence interval is 25.1  1.11 , or an interval of (23.99, 26.22) feet.

We are 95% confident that hits from the dominant side will average between 24.0 and 26.2 feet longer than hits from the non-dominant side of the plate. .


18-6

Part V Inference for Relationships

Statistics Quiz C – Chapter 18

Name

A cookie company has developed a new recipe that is cost saving in production. However, they are worried that they may have sacrificed flavor in the process of lowering costs. Participants are asked to rate each cookie on a scale of 1 to 10. The order is chosen randomly. Thus, the company wishes to test if the new cookie rates lower than the old recipe. The ratings were recorded for each participant. The data are shown in the table below. Does this sample indicate that there is a decrease in the ratings? Test an appropriate hypothesis and state your conclusion.

.

Rater

New

Old

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

4 7 3 8 5 9 7 5 6 9 8 3 6 8 9 7

6 5 6 9 2 9 10 4 8 7 8 7 5 8 10 6


Chapter 18 Paired Samples and Blocks

18-7

Statistics Quiz C – Chapter 18 – Key

A cookie company has developed a new recipe that is cost saving in production. However, they are worried that they may have sacrificed flavor in the process of lowering costs. Participants are asked to rate each cookie on a scale of 1 to 10. The order is chosen randomly. Thus, the company wishes to test if the new cookie rates lower than the old recipe. The ratings were recorded for each participant. The data are shown in the table below. Does this sample indicate that there is a decrease in the ratings? Test an appropriate hypothesis and state your conclusion. H 0 : d  0 H A : d  0 Where d represents the population mean difference in the ratings. A random sample was selected, and the order of the cookies was chosen randomly. A histogram of the differences of the ratings is approximately normal. The data are paired by participant. A matched pair t-test for the mean difference is appropriate. 3 2 1 -4

-2

0 diff

2

4

0.375  0  0.739 2.029 16 df  15 P-value = 0.236 t

The P-value of 0.236 is large. We fail to reject the null hypothesis. There is no evidence that the new recipe has decreased the ratings.

.

Rater

Red

Blue

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

4 7 3 8 5 9 7 5 6 9 8 3 6 8 9 7

6 5 6 9 2 9 10 4 8 7 8 7 5 8 10 6


18-8

Part V Inference for Relationships

Statistics – Investigative Task

Chapter 18

SAT Performance (Part II) In this Task you will continue your investigation of the four questions we posed about one high school’s SAT scores in Chapter 14. 1. What is the mean SAT-Math score at this high school? 2. How do these students’ SAT Math scores stack up against the rest of the state? 3. Is there a significant difference between Verbal and Math scores for these students? 4. Nationally males tend to have higher Math scores than females. Is that true at this high school, too? You should still have your copy of data from the SAT Score Roster sent to the high school by the College Board. It shows gender, Verbal scores, and Math scores, for 303 students in a recent graduating class. Again you will not use all of these data, just a sample of 20 – 30 students. Your new assignment: investigate the last two questions posed above.  Use a sample of these students. It may be the same sample you used previously, or you may draw a new one, carefully explaining your procedure.  Determine whether your sample provides statistical evidence that Verbal and Math scores differ for students at this high school.  Determine whether your sample provides statistical evidence that males have higher Math scores than females at this school.  Prepare a written report that includes complete demonstrations of the statistical procedures you used and your conclusions (in context, of course).

.


Chapter 18 Paired Samples and Blocks

Statistics Task – Investigative Task

18-9

Chapter 21

Each hypothesis test has 4 components scored as Essentially correct, Partially correct, or Incorrect. V vs M

B vs G

Component The Hypotheses E – Writes the correct hypotheses using proper notation. P – Uses incorrect notation or tails. I – Hypotheses are missing or incorrect. The Test E – Cites random sample <10% of population, discusses independence, and checks normality with proper plots (two plots for boys vs girls, plot of differences for verbal vs math). P – Discussion of either independence or normality is incomplete. I – Misunderstands or omits the conditions. The Mechanics E – Identifies the correct procedure, shows the sample statistics and df, uses correct notation, shows the formula, and finds t and the P-value (perhaps with minor arithmetic errors). P – Appears to be doing the proper procedure but omits important information, uses the wrong notation, or makes major errors in calculations. I – Uses the wrong procedure or shows no work or makes several major mistakes. The Conclusion E – Makes correct decision w/linkage to the P-value and states correct conclusion in context. P – Writes a conclusion that is correct but incomplete. I – Conclusion is incorrect or missing.

Comments:

Scoring  E’s count 1 point, P’s are 1/2  Grade: based on the sum of the components - 8 = A, 6 = B, etc.

(Odd scores round up or down depending on overall thoroughness and quality of responses.)

Name ______________________________ ____ Grade ____ .


18-10

Part V Inference for Relationships

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Model Solution - Investigative Task – SAT Performance (Part II)

Plan. We want to know whether there is a difference between Math and Verbal scores at this college. To do this, we will examine the mean of the paired differences between Math and Verbal scores. This approach is strong, since it accounts for variability in scores from student to student. Each student’s Math score is compared to that student’s Verbal score. Hypotheses. The null hypothesis is that there is no significant mean difference between the Math scores and the Verbal scores of students at this high school. The alternative hypothesis is that there is a significant mean difference between the Math scores and Verbal scores at this high school. H 0 :  M V  0 H A :  M V  0 Model. Paired Data Assumption: The scores are paired by student. Independence Assumption: The scores of any student are independent of the scores of other students, so the differences are mutually independent. Randomization Condition: Students were randomly selected. 10% Condition: 25 students represent less than 10% of the 300 students at this college. Nearly Normal Condition: The distribution of the differences is unimodal and symmetric.

8 6 4 2 -90

0

90

Mth-Vrl

The conditions are met, so I’ll use a Student’s t-model with (n – 1) = 24 degrees of freedom, and perform a paired t-test. Mechanics.

d 0 sd n 4.8  0 t 50.09 24 t  0.479 t

n  25 pairs

d  4.8

sd  50.09

P-value  2  P(d  4.8)  2  P(t24  0.479)  0.636

Conclusion. Since the P-value is high, fail to reject the null hypothesis. There is no evidence to suggest that the mean difference in Math and Verbal scores is significantly different from zero.

.


Chapter 18 Paired Samples and Blocks

18-11

Plan. We want to know if males at this college score higher than females on the SAT-Math, as happens nationally. To answer the question, we will look at the mean Math score for females and the mean math score for males. Hypotheses. The null hypothesis is that the mean SAT-Math score for males at this college is not significantly different from the mean SAT-Math score for females at this college. The alternative hypothesis is that the mean SAT-Math score for males is higher than the mean SAT-Math score for females. H 0 :  M   F or  M   F  0 H A :  M   F or  M   F  0 Model.

6 4 2

Independent Groups Assumption – The two groups of students are independent of one another.

375

Nearly Normal Condition: Both distributions of SAT-Math scores are roughly unimodal and symmetric.

Since the conditions have been met, we can do a two sample t-test for the difference of means, with 22 degrees of freedom (from the approximation formula). nM  11 nF  14

Mechanics.

t

t

 yM  y F     M   F  2 M

2 F

s s  nM nF (586.36  615.71)  0

695

M:Math

Randomization Condition: Students were chosen randomly. 10% Condition: The 11 males and 14 females represent less than 10% of all males and females at this college.

535

5 4 3 2 1 500

600

700

F:Math

yM  586.36 sM  72.01 yF  615.71 sF  76.13 yM  yF  29.35

P-value  P( yM  yF  29.35)  P(t22  0.986)  0.833

72.012 76.132  11 14 t  0.986

Conclusion. Since the P-value is high, fail to reject the null hypothesis. There is no evidence that the mean SAT-Math score is higher for males than for females at this college. (NOTE: This sample actually had a higher mean female score. This is OK. This is a possible consequence of writing hypotheses BEFORE sampling, as we should.)

.


18-12

Part V Inference for Relationships

Statistics – Inference Review

Group Project

The following problems cover all the types of inference we have studied thus far: hypothesis tests and confidence intervals, for means and proportions, involving one or two samples. 1. Survey Responses Few people who receive questionnaires in the mail actually fill them out and return them – often fewer than 10%! One researcher thinks he can improve the response rate by including a coupon good for a free pint of ice cream along with the questionnaire. The researcher believes that people will want the ice cream, and feel guilty if they don’t return the questionnaire. To test this conjecture he mails questionnaires with ice cream coupons to 150 randomly selected people. After two weeks 41 of the surveys have been returned. a) Create a 95% confidence interval for the return rate. b) Encouraged by this response rate, this researcher now plans to replicate the study in hopes of estimating the return rate this strategy might achieve to within 5%. How many new questionnaires must he mail out? 2. Scoring The table shows the average number of points scored in home and away games by 8 randomly selected NFL teams during the 2002 season. Assuming that the offensive performance of these teams is representative of other teams during this season and others, do these data provide evidence of a home field advantage when it comes to scoring?

Ave. points scored Team

Home

Away

Jets Ravens Texans 49ers Giants Bucs Eagles Packers

24.3 19.8 13.5 25.2 18.9 26.5 24.1 24.6

21.2 19.8 13.1 20.6 23.0 20.8 25.5 23.0

3. Pets A National Cancer Institute study published in 1991 examined the incidence of cancer in dogs. Of 827 dogs whose owners used the weed killer 2-4-D on their lawns or gardens, 473 were found to have cancer. Only 19 of the 130 dogs that had not been exposed to this herbicide had cancer. Construct a 95% confidence interval for the difference in pets’ cancer risk. 4. Grapes An agronomist hopes that a new fertilizer she has developed will enable grape growers to increase the yield of each grapevine by more than 5 pounds. To test this fertilizer she applied it to 44 vines and used the traditional growing strategies on 47 other vines. The fertilized vines produced a mean of 58.4 pounds of grapes with standard deviation 3.7 pounds, while the unfertilized vines yielded an average of 52.1 pounds with standard deviation 3.4 pounds of grapes. Do these experimental results confirm the agronomist’s expectations?

.


Chapter 18 Paired Samples and Blocks

18-13

5. Car Insurance An insurance company advertises that 90% of their accident claims are settled within 30 days. A consumer group randomly selects 104 of last year’s claims from the company’s files, and finds that only 89 of them were settled within 30 days. Is the company guilty of false advertising? 6. Cigarettes Some of the cigarettes sold in the US claim to be “low tar.” How much less tar would a smoker get by smoking low tar brands instead of regular cigarettes? Samples of 15 brands of each type were randomly chosen from the 1206 varieties (no kidding) that are marketed. Their tar contents (mg/cig) are listed in the table below. Find a 95 % confidence interval for the difference between regular and low tar brands. Type

Milligrams of tar per cigarette

Regular

18

10

14

15

15

12

17

11

14

17

12

14

15

15

12

Low Tar

9

5

10

4

8

9

9

3

7

12

6

10

8

11

8

7. Brownies Wegman’s (a food market chain) has developed a new store-brand brownie mix. Before they start selling the mix they want to compare how well people like their brownies to brownies made from a popular national brand mix. In order to see if there was any difference in consumer opinion, Wegman’s asked 124 shoppers to participate in a taste test. Each was given a brownie to try. Subjects were not told which kind of brownie they got - that was determined randomly. 58% of the 62 shoppers who tasted a Wegman’s brownie said they liked it well enough to buy the mix, compared to 66% of the others who said they would be willing to buy the national brand. Does this result indicate that consumer interest in the Wegman’s mix is lower than for the national brand? 8. Taxes Waiters are expected to keep track of their income from tips and report it on their income tax forms. The Internal Revenue Service suspects that one waiter (we’ll call him “Jared”) has been under-reporting his income, so they are auditing his tax return. An IRS agent goes through the restaurant’s files and obtains a random sample of 80 credit card receipts from people Jared served. The average tip size shown on these receipts was $9.68 with a standard deviation of $2.72. a) Create a 90% confidence interval for the mean size of all of Jared’s credit card tips. b) On his tax return Jared had claimed that his tips averaged $8.73. Based on their confidence interval, does the IRS have a case against him? Explain.

.


18-14

Part V Inference for Relationships

Inference Review Key

1. z-interval for a proportion; 95% sure of return rate between 20.2% and 34.5%; n = 305 2. matched pairs t-test, upper tail; t = 1.09, P = 0.16; no evidence of higher scores at home 3. 2-proportion z-interval; 95% confident of 36%–49% higher rate of cancer in pets exposed to 2-4-D. 4. 2 independent sample t-test, upper tail; t = 1.74, P = 0.043; strong evidence that the yield will average at least 5 pounds more for vines treated with this fertilizer. 5. 1-proportion z-test, lower tail; z = –1.5, P = 0.066; some evidence that the actual settlement rate is worse than advertised (some students may not reject – that’s okay) 6. 2 independent sample t-interval; 95% confident that regular cigarettes average 4.31 – 7.95 mg more tar. 7. 2-proportion z-test, 2-tailed; z = 0.93, P = 0.36; no evidence of a difference in consumer opinion 8. One sample t-interval; 90% confident that Jared’s tips averaged between $9.18 and $10.19; IRS can accuse him of under-reporting income, because his claim of a mean of $8.73 is not in the interval.

.


Chapter 18 Paired Samples and Blocks

Statistics Quiz D – Chapter 18

18-15

Name

1. An agricultural company wanted to know if a new insecticide would increase corn yields. Eight test plots showed an average increase of 3.125 bushels per acre. The standard deviation of the increases was 2.911 bushels per acre. Determine a 90% confidence interval for the mean increase in yield. A. (2.435, 3.815) B. (1.175, 5.075) C. (2.435, 5.075) D. (1.175, 3.815) E. (–5.075, –3.815) 2. Using the sample paired data below, construct a 90% confidence interval for the population mean of all differences x-y x y

5.0 4.7

4.7 3.5

4.4 4.0

5.1 5.8

6.4 4.1

A. (–0.69, 2.09) B. (0.22, 7.48) C. (–0.37, 1.77) D. (–0.07, 1.47) E. (–0.48, 1.88) 3. A high school coach uses a new technique in training middle distance runners. He records the times for 4 different athletes to run 800 meters before and after this training. A 95% confidence interval for the difference of the means before and after the training,  B   A , was determined to be (2.6, 4.3). A. We are 95% confident that a randomly selected middle distance runner at this high school will have a time for the 800-meter run that is between 2.6 and 4.3 seconds shorter after the training than before the training. B. We know that 95% of the middle distance runners shortened their times between 2.6 and 4.3 seconds after the training. C. Based on this sample, with 95% confidence, the average time for the 800-meter run for middle distance runners at this high school is between 2.6 and 4.3 seconds longer after the new training. D. Based on this sample, with 95% confidence, the average time for the 800-meter run for middle distance runners at this high school is between 2.6 and 4.3 seconds shorter after the new training. E. We know that 95% of all random samples done on runners at this high school will show that the mean time difference before and after the training is between 2.6 and 4.3 seconds.

.


18-16

Part V Inference for Relationships

4. A mid-sized company has decided to implement an enterprise resource planning (ERP) system, and management suspects that many of its employees are concerned about the planned implementation. Managers are considering holding informational workshops to help decrease anxiety levels among employees. To determine whether such an approach would be effective, they randomly select 16 employees to participate in a pilot workshop. These employees were given a questionnaire to measure anxiety levels about ERP before and after participating in the workshop. To determine if anxiety levels about ERP decreases as a result of the workshop, they should use a A. one-tailed test of two proportions. B. two-tailed test of two independent means. C. one-tailed test of two independent means. D. two-tailed paired t-test. E. one-tailed paired t-test. 5. A mid-sized company has decided to implement an enterprise resource planning (ERP) system, and management suspects that many of its employees are concerned about the planned implementation. Managers are considering holding informational workshops to help decrease anxiety levels among employees. They randomly select 16 employees to participate in a pilot workshop. These employees were given a questionnaire to measure anxiety levels about ERP before and after participating in the workshop. If we let d = post-workshop anxiety level – pre-workshop anxiety level, the correct alternative hypothesis to determine if this approach was successful is A. µd  0 B. µd  0 C. µd  0 D. µd  0 E. µd  0 6. Managers are considering holding informational workshops to help decrease anxiety levels among employees. They randomly select 20 employees to participate in a pilot workshop. These employees were given a questionnaire to measure anxiety levels before and after participating in the workshop. A test was performed to determine if the workshop was successful in decreasing anxiety levels. The test results yielded a P-value of 0.008. The correct conclusion at   .05 is A. to accept the null hypothesis. B. to reject the null hypothesis. C. to conclude that participating in the workshop decreases employee anxiety about ERP. D. both A and C. E. both B and C.

.


Chapter 18 Paired Samples and Blocks

18-17

7. A mid-sized company has decided to implement an enterprise resource planning (ERP) system, and management suspects that many of its employees are concerned about the planned implementation. Managers are considering holding informational workshops to help decrease anxiety levels among employees. They randomly select 16 employees to participate in a pilot workshop. These employees were given a questionnaire to measure anxiety levels about ERP before and after participating in the workshop. At the 90% confidence level, what is the margin of error for the mean difference in anxiety levels pre- and post-workshop? Pre-workshop anxiety level Post-workshop anxiety level Difference (Post – Pre)

7 4 –3

6 3 –3

9 7 –2

5 3 –2

6 4 –2

7 5 –2

5 4 –1

7 6 –1

6 5 –1

4 3 –1

3 2 –1

2 2 0

1 1 0

3 3 0

4 4 0

2 3 1

A. 1.753 B. 1.147 C. 0.287 D. 0.503 E. 2.010 8. Before being released to market, a drug company tests a new allergy medication for potential side effects. A random sample of 160 individuals with allergies was selected for the study. The new allergy medication was randomly assigned to 80 of them, and another popular allergy medication already on the market (Brand C) was assigned to the rest. Out of the 80 given the new allergy medication, 14 reported drowsiness; 22 of the 80 taking Brand C reported drowsiness. The 95% confidence interval for the difference in proportions reporting drowsiness is -0.028 to 0.228. Which of the following is correct? A. We are 95% confident that there is no difference between the proportions of patients reporting drowsiness for the two allergy medications. B. We are 95% confident that there is a difference between the proportions of patients reporting drowsiness for the two allergy medications. C. We are 95% confident that the proportion of patients reporting drowsiness is higher for the new allergy medication. D. We are 95% confident that the proportion of patients reporting drowsiness is lower for the new allergy medication. E. There will be a significant difference between the proportions 95% of the time.

.


18-18

Part V Inference for Relationships

Statistics Quiz D – Chapter 18 – Key

1. B 2. C 3. D 4. E 5. B 6. E 7. D 8. A

.


Chapter 18 Paired Samples and Blocks

Statistics Quiz E – Chapter 18

18-19

Name

1. Ten different families are tested for the number of gallons of water a day they use before and after viewing a conservation video. Construct a 90% confidence interval for the mean of the differences. Before 33 33 38 33 35 35 40 40 40 31 After 34 28 25 28 35 33 31 28 35 33 A. (1.8, 7.8) B. (2.5, 7.1) C. (3.8, 5.8) D. (1.5, 8.1) E. (2.1, 7.5) 2. Ten different families were tested for the average number of gallons of water they used per day before and after viewing a conservation video. A 90% confidence interval for the difference of the means after and before the training,  A   B was determined to be (–10.8, –4.2). A. We know that 90% of families will use between 4.2 and 10.8 fewer gallons of water each day after viewing the conservation video. B. Based on this sample, we are 90% confident that the average decrease in daily water consumption after viewing the conservation video is between 4.2 and 10.8 gallons. C. We are 90% confident that a randomly selected family who has viewed the video will use between 4.2 and 10.8 more gallons of water per day compared to a randomly selected family who has not viewed the video. D. Based on this sample, we are 90% confident that the average increase in daily water consumption after viewing the conservation video is between 4.2 and 10.8 gallons. E. We are 90% confident that a randomly selected family who has viewed the video will use between 4.2 and 10.8 fewer gallons of water per day compared to a randomly selected family who has not viewed the video. 3. An army depot that overhauls ground mobile radar systems is interested in improving its processes. One problem involves troubleshooting a particular component that has a high failure rate after it has been repaired and reinstalled in the system. The shop floor supervisor believes that having standard work procedures in place will reduce the time required for troubleshooting this component. Time (in minutes) required troubleshooting this component without and with the standard work procedure is recorded for a sample of 19 employees. In order to determine if having a standard work procedure in place reduces troubleshooting time, they should use A. a test of two proportions. B. a two-tailed test of two independent means. C. a one-tailed test of two independent means. D. a two-tailed paired t-test. E. a one-tailed paired t-test.

.


18-20

Part V Inference for Relationships

4. An army depot that overhauls ground mobile radar systems is interested in improving its processes. One problem involves troubleshooting a particular component that has a high failure rate after it has been repaired and reinstalled in the system. The shop floor supervisor believes that having standard work procedures in place will reduce the time required for troubleshooting this component. Time (in minutes) required troubleshooting this component without and with the standard work procedure is recorded for a sample of 19 employees. The standard error of the mean difference is Without Standard Work Procedure 142 144 153 148 146 149 138 145 153 160 163 170 169 151 152 167 164 165 163

With Standard Work Procedure 119 118 126 119 121 125 116 120 124 138 135 144 142 128 131 141 140 140 138

Difference Without - With 23 26 27 29 25 24 22 25 29 22 28 26 27 23 21 26 24 25 25

A. 25.105 B. 2.307 C. 0.529 D. 4.3589 E. None of the above.

.


Chapter 18 Paired Samples and Blocks

18-21

5. An army depot that overhauls ground mobile radar systems is interested in improving its processes. One problem involves troubleshooting a particular component that has a high failure rate after it has been repaired and reinstalled in the system. The shop floor supervisor believes that having standard work procedures in place will reduce the time required for troubleshooting this component. Time (in minutes) required troubleshooting this component without and with the standard work procedure is recorded for a sample of 19 employees. Assuming that we define our differences as Time without standard work procedure – Time with standard work procedure, the correct alternative hypothesis is A. µd  0. B. µd  0. C. µd  0. D. µd  0. E. None of the above. 6. An army depot that overhauls ground mobile radar systems is interested in improving its processes. One problem involves troubleshooting a particular component that has a high failure rate after it has been repaired and reinstalled in the system. The shop floor supervisor believes that having standard work procedures in place will reduce the time required for troubleshooting this component. Time (in minutes) required troubleshooting this component without and with the standard work procedure is recorded for a sample of 19 employees. The P-value associated with the calculated test statistic is < 0.001. At   0.05 , the correct conclusion is to A. reject the null hypothesis. B. fail to reject the null hypothesis. C. conclude that having standard work procedures in place reduces troubleshooting time for this component. D. Both A and C. E. Both B and C. 7. Which of the following is not an assumption and/or condition for the paired t-test? A. Nearly Normal Condition B. Independent Groups Assumption C. Paired Data Assumption D. Randomization Condition E. All of the above 8. Previous surveys reported that more men than women trade stocks online. A local brokerage firm randomly selected a sample of investors. They found that 45 out of 234 men traded online and 42 out of 251 women traded online. The 95% confidence interval for the difference in proportions is A. 4% to 9% B. –4% to 9% C. 2% to 12% D. –2% to 12% E. –2% to –4% .


18-22

Part V Inference for Relationships

Statistics Quiz E – Chapter 18 – Key

1.

A

2.

B

3.

E

4.

C

5.

C

6.

D

7.

B

8.

B

.


Chapter 19 Comparing Counts Solutions to Class Examples 1. See Class Example 1. 2. See Class Example 2. 3. See Class Example 3. Investigative Task The Investigative Task uses test results from Ithaca High School as an example; but you can substitute more meaningful data. We ask students to test three hypotheses, one for goodness-offit, one for homogeneity, and one for independence.

.


19-2

Part V Inference for Relationships

Statistics Quiz A – Chapter 19

Name

1. There is a fruity, sugary, delicious cereal that many children love to snack on. There are six different colors of cereal: red, orange, purple, yellow, blue, and green. A curious student wonders if the six colors are manufactured uniformly. He buys a box of cereal and this is his data: Color red orange purple yellow blue green Frequency 42 12 17 23 15 2 a. Test an appropriate hypothesis to decide if the cereal is made uniformly. Give statistical evidence to support your conclusion.

b. Which color(s) affected your decision the most? Explain what this means in the context of the problem.

.


Chapter 19 Comparing Counts

2. As part of a survey, students in a large statistics class were asked whether or not they ate breakfast that morning. The data appears in the following table: Breakfast Yes No Total Male 66 66 132 Sex Female 125 74 199 Total 191 140 331 Is there evidence that eating breakfast is independent of the student’s sex? Test an appropriate hypothesis. Give statistical evidence to support your conclusion.

.

19-3


19-4

Part V Inference for Relationships

Statistics Quiz A – Chapter 19 – Key 1. There is a fruity, sugary, delicious cereal that many children love to snack on. There are six different colors of cereal: red, orange, purple, yellow, blue, and green. A curious student wonders if the six colors are manufactured uniformly. He buys a box of cereal and this is his data: Color red orange purple yellow blue green Frequency 42 12 17 23 15 2 a. Test an appropriate hypothesis to decide if the cereal is made uniformly. Give statistical evidence to support your conclusion. H 0 : The cereal is made uniformly. H A : The cereal is not made uniformly. Conditions: *Counted data: We have the counts of the number of cereal pieces of each color. *Randomization: We have a convenience sample of one box, but no reason to suspect bias. *Expected cell frequency: There are a total of 111 cereal pieces. We expect 111(1/6) = 18.5 of each color. Since 18.5 exceeds 5, so the condition is satisfied.

Under these conditions, the sampling distribution of the test statistic is  2 with 6 – 1 = 5 degrees of freedom, and we will perform a chi-square goodness-of-fit test. Color Observed Frequency Expected Frequency  2 component

2  

 Obs  Exp 

orange 12 18.5 2.28

purple 17 18.5 0.12

yellow 23 18.5 1.09

blue 15 18.5 0.66

green 2 18.5 14.72

2

Exp

 42  18.5   12  18.5   17  18.5   23  18.5  15  18.5   2  18.5 2

red 42 18.5 29.85

18.5  48.73

2

18.5

2

18.5

2

18.5

2

18.5

2

18.5

The P-value is the area in the upper tail of the  2 model with 5 degrees of freedom above the computed  2 value. P-value = P   2  48.73  2.5 108  0.0001

The P-value is very small, so we reject the null hypothesis. There is strong evidence that the cereal is not made uniformly.

b. Which color(s) affected your decision the most? Explain what this means in the context of the problem. The  2 component for red and green are the largest, (29.85 and 14.72): There are a lot more red and a lot less green than should happen if the cereal is made uniformly.

.


Chapter 19 Comparing Counts

19-5

2. As part of a survey, students in a large statistics class were asked whether or not they ate breakfast that morning. The data appears in the following table: Breakfast Yes No Total Male 66 66 132 Sex Female 125 74 199 Total 191 140 331 Does it appear eating breakfast is independent of the student’s sex? Test an appropriate hypothesis. Give statistical evidence to support your conclusion. We want to know whether the categorical variables “eating breakfast” and “student’s sex” are statistically independent. H 0 : Eating breakfast and student’s sex are independent. H A : There is an association between eating breakfast and student’s sex. Conditions: *Counted data: We have the counts of individuals in categories of two categorical variables. *Randomization: We have a convenience sample of students, but no reason to suspect bias. *Expected cell frequency: The expected values (shown in parenthesis in the table) are all greater than 5, so the condition is satisfied. Under these conditions, the sampling distribution of the test statistic is  2 with

 r  1 c  1   2  1 2  1  1 degree of freedom, and we will perform a chi-square test of independence.

Sex

  2

Breakfast Yes No 66 (76.169) 66 (55.831) 125 (114.83) 74 (84.169) 191 140

Male Female Total

Total 132 199 331

 Obs  Exp    66  76.169    66  55.831  125  114.83   74  84.169   5.339 2

Exp

2

76.169

The P-value is P    5.339   0.0209 .

2

55.831

2

114.83

2

84.169

2

The P-value of 0.0209 is pretty small, so we reject the null hypothesis. There is evidence of an association between the student’s sex and whether or not breakfast is eaten. It appears that females may be more likely to eat breakfast.

.


19-6

Part V Inference for Relationships

Statistics Quiz B – Chapter 19

Name

1. In a local school, vending machines offer a range of drinks from juices to sports drinks. The purchasing agent thinks each type of drink is equally favored among the students buying drinks from the machines. The recent purchasing choices from the vending machines are shown in the table. Drink Type/Flavor Frequency

Lemon Lime Kiwi Tropical Grape Sports Drink Strawberry Punch Sports Drink 159 198 174 149

a. Test an appropriate hypothesis to decide if the purchasing agent is correct. Give statistical evidence to support your conclusion.

b. Which type of drink affected your decision the most? Explain what this means in the context of the problem.

.


Chapter 19 Comparing Counts

19-7

2. A manufacturing plant for recreational vehicles receives shipments from three different parts vendors. There has been a defect issue with some of the electrical wiring in the recreational vehicles manufactured at the plant. The plant manager wonders if all of the vendors might be contributing equally to the defect issue. The plant manager reviews a sample of quality assurance inspections from the last six months. The data are shown in the table below. Purrfect Parts Co. Made-4-U Co. 25 Hour Parts Co. Rejected 53 48 70 Perfect 93 71 88 Not perfect but acceptable 22 31 22 Test an appropriate hypothesis to decide if the plant manager is correct. Give statistical evidence to support your conclusion.

.


19-8

Part V Inference for Relationships

Statistics Quiz B – Chapter 19 – Key 1. In a local school, vending machines offer a range of drinks from juices to sports drinks. The purchasing agent thinks each type of drink is equally favored among the students buying drinks from the machines. The recent purchasing choices from the vending machines are shown in the table. Lemon Lime Kiwi Tropical Grape Sports Drink Strawberry Punch Sports Drink Order Frequency 159 198 174 149 Expected Frequencies 170 170 170 170 2  components 0.712 4.612 0.094 2.594 Drink Type/Flavor

a. Test an appropriate hypothesis to decide if the purchasing agent is correct. Give statistical evidence to support your conclusion. We want to know if the types of drinks are uniformly distributed (equally favored) among the students buying drinks from the machines. H 0 : The types of drinks are uniformly distributed (equally favored) among the students buying drinks. H A : The types are not uniformly distributed (equally favored) among the students buying drinks. Conditions: * Counted data: We have the counts from a sample of purchasing data. * Randomization: We don’t want to make inferences about other high schools, so we don’t need to check this condition. * Expected cell frequency: The null hypothesis expects 25% of the 680 drinks, or 170, should occur in each flavor. These expected values are all greater than 5, so the condition is satisfied. Under these conditions, the sampling distribution of the test statistic is  2 with 4 – 1 = 3 degrees of freedom. We will perform a chi-square goodness of fit test. 2 2 2 2 2  Obs  Exp   159  170   198  170   174  170   149  170   8.012 2   Exp 170 170 170 170

The P-value is the area in the upper tail of the  2 model for 3 degrees of freedom above the computed  2 value. P  P   2  8.012   0.0458 The P-value of 0.0458 is low, so we reject the null hypothesis. These data show evidence that the drink flavors are not uniformly chosen (equally favored) by the students.

b. Which type of drink affected your decision the most? Explain what this means in the context of the problem. Kiwi Strawberry; its standardized residual is the largest

 Obs  Exp  198  170  Exp

170

 2.15 , and its

 2 component was 4.612. More students preferred the Kiwi Strawberry compared to what was expected by the purchasing agent. The purchasing agent should order more Kiwi Strawberry than other flavors.

.


Chapter 19 Comparing Counts

19-9

2. A manufacturing plant for recreational vehicles receives shipments from three different parts vendors. There has been a defect issue with some of the electrical wiring in the recreational vehicles manufactured at the plant. The plant manager wonders if all of the vendors might be contributing equally to the defect issue. The plant manager reviews a sample of quality assurance inspections from the last six months. The data are shown in the table below. Purrfect Parts Co. Made-4-U Co. 25 Hour Parts Co. Rejected 53 (57.69) 48 (51.51) 70 (61.81) Perfect 93 (85.01) 71 (75.90) 88 (91.08) Not perfect but acceptable 22 (25.30) 31 (22.59) 22 (27.11) Test an appropriate hypothesis to decide if the plant manager is correct. Give statistical evidence to support your conclusion. We want to know whether if all of the vendors might be contributing equally to the defect issue. H 0 : The type of defects in vehicles made by the three vendors has the same distribution (are homogeneous). H A : The type of defects in vehicles made by the three vendors does not have the same distribution (are not homogeneous). Conditions: * Counted data: We have the counts from a sample of quality assurance inspections from the last six months. * Randomization: The data are from a sample of quality assurance inspections from the last six months. * Expected cell frequency: The expected values (shown in parenthesis in the table) are all greater than five. Under these conditions, the sampling distribution of the test statistic is  2 with (3 – 1)(3 – 1) = 4 degrees of freedom. We will perform a chi-square test of homogeneity. 2 2 2  Obs  Exp    53  57.69    48  51.51  ...  7.40 2   Exp 57.69 51.51 P  P(  2  7.40)  0.1161

The P-value of 0.1161 is rather high, so we fail to reject the null hypothesis. There is little statistical evidence to indicate that the types of defects vary by vendor.

.


19-10

Part V Inference for Relationships

Statistics Quiz C – Chapter 19

Name

1. When two competing teams are equally matched, the probability that each team wins any game is 0.5. The NBA championship goes to the team that wins four games in a best-ofseven series. If the teams were equally matched, the probability that the final series ends with one of the teams sweeping four straight games would be 2(0.5) 4  0.125 . Further probability calculations indicate that 25% of these series should last five games, 31.25% should last six games, and the other 31.25% should last the full seven games. The table shows the number of games it took to decide each of the last 57 NBA champs. Do you think the teams are usually equally matched? Give statistical evidence to support your conclusion. Length of series 4 games 5 games 6 games 7 games NBA finals 7 13 22 15

.


Chapter 19 Comparing Counts

19-11

2. Could eye color be a warning signal for hearing loss in patients suffering from meningitis? British researcher Helen Cullington recorded the eye color of 130 deaf patients, and noted whether the patient’s deafness had developed following treatment for meningitis. Her data are summarized in the table below. Test an appropriate hypothesis and state your conclusion. Deafness related to… Eye color meningitis other Light 30 72 Dark 2 26

.


19-12

Part V Inference for Relationships

Statistics Quiz C – Chapter 19 – Key 1. When two competing teams are equally matched, the probability that each team wins any game is 0.5. The NBA championship goes to the team that wins four games in a best-ofseven series. If the teams were equally matched, the probability that the final series ends with one of the teams sweeping four straight games would be 2(0.5) 4  0.125 . Further probability calculations indicate that 25% of these series should last five games, 31.25% should last six games, and the other 31.25% should last the full seven games. The table shows the number of games it took to decide each of the last 57 NBA champs. Do you think the teams are usually equally matched? Give statistical evidence to support your conclusion. Length of series 4 games 5 games 6 games 7 games NBA finals 7 13 22 15 Expected counts 7.125 14.25 17.8125 17.8125 H 0 : The distribution of length of series is consistent with a 50-50 chance for each team to win each game. H A : The distribution of lengths of series isn’t consistent with a 50-50 chance for each team to win each game. Conditions: * Counted data: These are counts of categorical data. * Randomization: We will assume that these 57 series are typical of past and future NBA championships. * Expected cell frequency: The expected values are all greater than five. Under these conditions, we can perform a  2 goodness-of-fit test with 3 degrees of freedom.

 Obs  Exp    7  7.125  13  14.25   22  17.8125  15  17.8125  1.54 2

2  

Exp

2

7.125

2

2

14.25

17.8125

2

17.8125

P = 0.67. The high P-value means we do not reject the null hypothesis; there is no evidence that the NBA championship series are inconsistent with the conjecture that the teams are evenly matched.

.


Chapter 19 Comparing Counts

19-13

2. Could eye color be a warning signal for hearing loss in patients suffering from meningitis? British researcher Helen Cullington recorded the eye color of 130 deaf patients, and noted whether the patient’s deafness had developed following treatment for meningitis. Her data are summarized in the table below. Test an appropriate hypothesis and state your conclusion. Deafness related to… Eye color meningitis other Light 30 72 Dark 2 26 H 0 : Deafness and eye color are independent. H A : There is an association between deafness and eye color. These are counts of categorical data, assumed to be representative of deaf patients in Britain, with expected counts (25.1, 76.9, 6.9, and 21.1) all at least 5. OK to do a  2 test for independence, with df = 1

 Obs  Exp    30  25.1   72  76.9    2  6.9    26  21.1  5.87 2

  2

Exp

2

25.1

2

2

76.9

6.9

2

21.1

P = 0.015 Since P = 0.015 is low, I reject the null hypothesis. There is strong evidence that hearing loss is associated with eye color. It appears that people with dark-colored eyes are at less risk of suffering deafness from meningitis.

.


19-14

Part V Inference for Relationships

Statistics - Investigative Task

Chapter 19

‘97 AP* Stat Scores

Statistics

5 4 3 2 1 Candidates

Grade Distributions Global Group 15.6% 49.0% 22.1% 30.6% 24.5% 16.5% 19.6% 4.1% 18.1% 0.0% 49

Calculus BC

AP* Stat Grade

AP* Grade

1997 was the first year for Advanced Placement Statistics nationwide. How well did Ithaca High School students do on the AP* exam? Here are the summary statistics reported by the College Board for that year’s AP* Statistics and BC Calculus exams, as well as a breakdown of the AP* Stat scores by gender.

5 4 3 2 1 Total

29 8 6 1

24 15 8 2

44

49

AP* Stat Grade 5 4 3 2 1

Gender M F 19 5 8 7 5 3 1 1 0 0

Some questions to consider: 1. How did IHS AP* Stat scores compare to the national results? 2. How did AP* Stat students do compared to the BC Calc students? 3. Do these results indicate that IHS males and females may perform differently on the AP* Statistics exam? Write a report assessing the success of this first group of AP* Statistics students with respect to the questions above. Indicate the hypotheses you test and the types of tests you do (goodness of fit, homogeneity, independence), show appropriate calculations and results, and clearly state your conclusions.

.


Chapter 19 Comparing Counts

Statistics – Investigative Task

19-15

Chapter 19

This Task consists of three hypothesis tests using  2 procedures. Each test has four components: the hypotheses, the plan, the mechanics, and the conclusion. Components are scored as Essentially correct, Partially correct, or Incorrect. Question number 1 Hypotheses

Think

E

Chooses correct test Writes hypotheses in context Hypotheses are consistent w/type of test Model

Show

I

E

P

I

I

E

P

I

E

P

I

I

E

P

I

E

P

I

P

Indep

NA P

by hand E

P

I

Computer

Computer

E

E

Links P to correct decision States correct conclusion in context Conclusion is consistent with hypotheses Conclusion analyzes the residuals AP* SCORE

Comments:

Scoring  For each component E = 1, P = 1/2, and I = 0.  Grades: Sum of 12 = A, 11 = A-, 10 = B+, etc.

Name ___________________________________ Grade _____

.

P

E

Shows the mechanics Shows df,  2 , P-value

Tell

E

Homogen

E

Conclusion

I

3

Good Fit

Checks conditions Shows expected counts Combines cells so expected counts ≥ 5 Mechanics

P

2

P

NA

I

P

I


19-16

Part V Inference for Relationships

NOTE: We present a model solution with some trepidation. This is not a scoring key, just an example. Many other approaches could fully satisfy the requirements outlined in the scoring rubric. That (not this) is the standard by which student responses should be evaluated. Model Solution - Investigative Task – ’97 AP* Stat Scores Question #1 Plan – We want to compare the distribution of IHS AP* Stat scores to the national results. We have the scores of 49 students. We will test goodness-of-fit. Hypotheses:

H 0 : The distribution of AP* Stat scores at IHS doesn’t differ significantly from the national distribution of AP* Stat scores. H A : The distribution of AP* Stat scores at IHS differs from the national distribution of AP* Stat scores.

Model – Counted Data Condition: The percentages given must be converted to counts before proceeding, by multiplying the percentages by the total number of students. Randomization Condition: We don’t want to draw inferences about other high schools, so we don’t need a random sample. Expected Cell Frequency Condition: All expected cells (shown below) are greater than 5. The conditions are met, so we can use a  2 model, with (5 – 1) = 4 degrees of freedom to do a

 2 goodness-of-fit test. Mechanics – Score

Observed

Expected

Residual = (Obs – Exp)

(Obs – Exp)2

5 4 3 2 1

24 15 8 2 0

7.664 10.829 12.005 9.604 8.869

16.336 4.171 –4.005 –7.604 –8.869

266.86 17.397 16.04 57.821 78.659

Component =

(Obs  Exp )

2

Exp

34.997 1.6065 1.3361 6.0205 8.869

  52.83 P-value  P(  2  52.83)   0.0001 Conclusion – Since the P-value is so low, reject the null hypothesis. There is strong evidence that the distribution of scores at IHS differs from the national distribution. Students at IHS have more scores of 5 than expected, and fewer scores of 1 and 2.

.


Chapter 19 Comparing Counts

19-17

Question #2 Plan – We want to compare the distribution of AP* scores of BC Calculus students to the distribution of scores for AP* Stat students at IHS. There are two distinct groups of students, so we will use a  2 test for homogeneity. Hypotheses – H 0 : The distribution of BC Calc scores at IHS doesn’t differ significantly from the distribution of AP* Stat scores. H A : The distribution of BC Calc scores at IHS differs from the distribution of AP* Stat scores. Model – Counted Data Condition: The distributions of scores are reported as counts. Randomization Condition: We don’t want to draw inferences about other high schools, so we don’t need a random sample. Expected Cell Frequency Condition: After combining scores of 3, 2, and 1 in to a single cell, all expected cells (shown below) are greater than 5. The conditions are met, so we can use a  2 model, with (3 – 1)(2 – 1) = 2 degrees of freedom to do a  2 test for homogeneity. Mechanics – Course Score 5 4 3,2,1

Calc BC (Obs / Exp) 29 / 25.075 8 / 10.882 7 / 8.043

(Obs  Exp ) 2 (29  25.075)  Exp 25.075 all cells

2  

Stat

(Obs / Exp) 24 / 27.925 15 / 12.118 10 / 8.957 2

(24  27.925) 2 27.925

(8  10.882) 2 (15  12.118) 2   10.882 12.118 2 (7  8.043) (10  8.957) 2    2.87 8.043 8.957

P-value  P(  2  2.87)  0.2380 Conclusion – Since the P-value is high, we fail to reject the null hypothesis. There is no evidence that the distribution of BC Calc Scores is different than the distribution of AP* Stat scores. .


19-18

Part V Inference for Relationships

Question #3 Plan – We want to know if there is an association between gender and score on the AP* Stat exam at IHS. There is only one group of students, categorized by gender and score, so we will use a  2 test for independence. Hypotheses – H 0 : Gender and score on the AP* Stat exam are independent. H A : There is an association between gender and score on the AP* Stat exam. Model – Counted Data Condition: The distributions of scores are reported as counts. Randomization Condition: We don’t want to draw inferences about other high schools, so we don’t need a random sample. Expected Cell Frequency Condition: After combining scores of 4, 3, 2, and 1 in to a single cell, all expected cells (shown below) are greater than 5. The conditions are met, so we can use a  2 model, with (2 – 1)(2 – 1) = 1 degree of freedom to do a  2 test for independence. Mechanics – Gender Score

Male (Obs / Exp) 19 / 16.163 14 / 16.837

5 4,3,2,1

   2

all cells

 Obs  Exp 2 Exp

Female

(Obs / Exp) 5 / 7.8367 11 / 8.1633

(19  16.163) 2 (5  7.8367)2   16.163 7.8367 

(14  16.837) 2 (11  8.1633) 2   2.988 16.837 8.1633

P-value  P(  2  2.988)  0.0839 Conclusion – Since the P-value is fairly high, we fail to reject the null hypothesis. There is little evidence that gender and score at IHS are associated.

(NOTE: Students may decide to reject the null hypothesis, based on a P-value that is fairly low. Examination of the components of shows evidence that males are more likely to score 5s and less likely to score below 5 than females.)

.


Chapter 19 Comparing Counts

Statistics Quiz D – Chapter 19

19-19

Name

1. Tests for adverse reactions to a new drug yielded the results given in the table. The data will be analyzed to determine if there is sufficient evidence to conclude that an association exists between the treatment (drug or placebo) and the reaction (whether or not headaches were experienced). Drug Placebo 11 7 73 91

Headaches No headaches A. Goodness-of-fit B. Homogeneity C. Independence

2. Tests for adverse reactions to a new drug yielded the results given in the table. The data will be analyzed to determine if they provide sufficient evidence to conclude that an association exists between the treatment (drug or placebo) and the reaction (whether or not headaches were experienced). Drug Placebo 11 7 73 91

Headaches No headaches

A. H 0 : The occurrence of headaches is dependent upon the drug. H A : The occurrence of headaches is not dependent upon the drug. B. H 0 : The distribution of headaches is different for the drug and the placebo. H A : The distribution of headaches is the same for the drug and the placebo. C. H 0 : The drug is independent from the occurrence of headaches. H A : The drug is not independent from the occurrence of headaches. D. H 0 : There is a relationship between the drug and occurrence of headaches. H A : There is no relationship between the drug and occurrence of headaches. E. H 0 : The drug is related to the occurrence of headaches. H A : The drug is not related to the occurrence of headaches.

.


19-20

Part V Inference for Relationships

3. Tests for adverse reactions to a new drug yielded the results given in the table. The data will be analyzed to determine if they provide sufficient evidence to conclude that an association exists between the treatment (drug or placebo) and the reaction (whether or not headaches were experienced). Drug Placebo 11 7 73 91

Headaches No headaches

 2  1.798; P-value  0.1799 A. Reject the null hypothesis. Report that there is insufficient evidence to conclude that drug and headaches are dependent. B. Do not reject the null hypothesis. Report that there is insufficient evidence to conclude that the distribution of headaches is uniform for the drug and placebo. C. Do not reject the null hypothesis. Report that there is insufficient evidence to conclude that drug and headaches are dependent. D. Reject the null hypothesis. Report that there is sufficient evidence to conclude that drug and headaches are dependent. E. Do not reject the null hypothesis. Report that there is sufficient evidence to conclude that drug and headaches are dependent. 4. The Book Industry Study Group, Inc. performs sample surveys to obtain information on characteristics of book readers. A book reader is defined to be one who read one or more books in the six months prior to the survey; a non-book reader is defined to be one who read newspapers or magazines but no books in the six months prior to the survey; a nonreader is defined to be one who did not read a book, newspaper, or magazine in the six months prior to the survey. The following data were obtained from a random sample of 1429 persons 16 years old and over in an effort to determine whether or not the proportions of book readers, non-book readers, and non-readers are the same for each income bracket.

Household Less than $15000 income $15000 to $25000 $25000 to $40000 $40000 and over Total

Classification Non-book reader Book reader 173 267 168 130 160 144 213 88 714 629

A. Goodness-of-fit B. Homogeneity C. Independence

.

Non Reader 55 19 9 3 86

Total 495 317 313 304 1429


Chapter 19 Comparing Counts

19-21

5. The Book Industry Study Group, Inc. performs sample surveys to obtain information on characteristics of book readers. A book reader is defined to be one who read one or more books in the six months prior to the survey; a non-book reader is defined to be one who read newspapers or magazines but no books in the six months prior to the survey; a nonreader is defined to be one who did not read a book, newspaper, or magazine in the six months prior to the survey. The following data were obtained from a random sample of 1429 persons 16 years old and over in an effort to determine whether or not the proportions of book readers, non-book readers, and non-readers are the same for each income bracket.

Household Less than $15000 income $15000 to $25000 $25000 to $40000 $40000 and over Total

Classification Non-book reader Book reader 173 267 168 130 160 144 213 88 714 629

Non Reader 55 19 9 3 86

Total 495 317 313 304 1429

A. H 0 : The classifications have the same distribution for each household income bracket. H A : The classifications do not have the same distribution for each household income bracket. B. H 0 : The classifications do not have the same distribution for each household income bracket. H A : The classifications have the same distribution for each household income bracket. C. H 0 : There is a relationship between household income and book readership. H A : Household income and book readership are not related. D. H 0 : Household income and book readership are independent. H A : Household income and book readership are not dependent. E. H 0 : There is no relationship between household income and book readership. H A : Household income and book readership are not related.

.


19-22

Part V Inference for Relationships

6. The Book Industry Study Group, Inc. performs sample surveys to obtain information on characteristics of book readers. A book reader is defined to be one who read one or more books in the six months prior to the survey; a non-book reader is defined to be one who read newspapers or magazines but no books in the six months prior to the survey; a nonreader is defined to be one who did not read a book, newspaper, or magazine in the six months prior to the survey. The following data were obtained from a random sample of 1429 persons 16 years old and over in an effort to determine whether or not the proportions of book readers, non-book readers, and non-readers are the same for each income bracket.

Household Less than $15000 income $15000 to $25000 $25000 to $40000 $40000 and over Total

Classification Non-book reader Book reader 173 267 168 130 160 144 213 88 714 629

Non Reader 55 19 9 3 86

Total 495 317 313 304 1429

 2  114.534; P-value  2.287  1022 A. There is sufficient evidence to reject the null hypothesis and conclude that income and classification are independent. B. There is sufficient evidence to reject the null hypothesis and conclude that income and classification are dependent. C. There is sufficient evidence to reject the null hypothesis and conclude that the classifications are not uniformly distributed for each income bracket. D. There is not sufficient evidence to reject the null hypothesis and conclude that income and classification are dependent. E. There is not sufficient evidence to reject the null hypothesis and conclude that income and classification are independent. 7. A candy company claims that its bags of mixed suckers are 20% strawberry, 30% cherry, 15% apple, 10% lemon, and 25% grape. A bag was purchased, and the number of each type of flavor was recorded in the chart below. Flavour Strawberry Cherry Apple Lemon Grape Count 31 15 17 15 22 A. Goodness-of-fit B. Homogeneity C. Independence

.


Chapter 19 Comparing Counts

19-23

8. A candy company claims that its bags of mixed suckers are 20% strawberry, 30% cherry, 15% apple, 10% lemon, and 25% grape. A bag was purchased, and the number of each type of flavor was recorded in the chart below. Flavour Strawberry Cherry Apple Lemon Grape Count 31 15 17 15 22 A. H 0 : The distribution of flavors is uniform. H A : The distribution of flavors is not uniform. B. H 0 : The flavors and counts are independent. H A : The flavors and counts are not independent. C. H 0 : The distribution of flavors is the same as the distribution claimed by the company. H A : The distribution of flavors is not the same as the distribution claimed by the company. D. H 0 : The distribution of flavors is not uniform. H A : The distribution of flavors is uniform. E. H 0 : The distribution of flavors is not the same as the distribution claimed by the company. H A : The distribution of flavors is the same as the distribution claimed by the company. 9. Vending machines on a college campus offer a variety of snacks. The purchasing agent believes that each type of snack is equally preferred by students and consequently orders equal quantities. The number of snacks sold from vending machines on this campus for the last six months is shown in the following table. If the purchasing agent is correct, how many candy bars would we expect to have been sold? Snack Type Purchase Frequency

Chips 159

Candy Bars 198

A. 170 B. 198 C. 125 D. 180 E. 680

.

Crackers 174

Nuts 149


19-24

Part V Inference for Relationships

10. Vending machines on a college campus offer a variety of snacks. The purchasing agent believes that each type of snack is equally preferred by students and consequently orders equal quantities. The number of snacks sold from vending machines on this campus for the last six months is shown in the following table. The correct value of the test statistic for determining if the purchasing agent’s belief if supported is Snack Type Purchase Frequency

Chips 159

Candy Bars 198

A.  2  8.012 B.  2  12.019 C.  2  0.984 D.  2  45.014 E. None of the above.

.

Crackers 174

Nuts 149


Chapter 19 Comparing Counts

Statistics Quiz D – Chapter 19 – Key

1. C 2. C 3. C 4. B 5. A 6. B 7. A 8. C 9. A 10. A

.

19-25


19-26

Part V Inference for Relationships

Statistics Quiz E – Chapter 19

Name

1. The data will be analyzed to determine if they provide sufficient evidence to conclude that an association exists between car color and the likelihood of being in an accident. Car has been in an accident Car has not been in an accident

Red 28 23

Blue 33 22

White 36 30

A. Goodness-of-fit B. Homogeneity C. Independence 2. The data will be analyzed to determine if they provide sufficient evidence to conclude that an association exists between car color and the likelihood of being in an accident. Car has been in an accident Car has not been in an accident

Red 28 23

Blue 33 22

White 36 30

A. H 0 : Accidents and car color are independent. H A : Accidents and car color are not related. B. H 0 : Accidents and car color are related. H A : Accidents and car color are not related. C. H 0 : Accidents and car color are not related. H A : Accidents and car color are not dependent. D. H 0 : Accidents are uniformly distributed over the car colors. H A : Accidents are not uniformly distributed over the car colors. E. H 0 : Accidents and car color are independent. H A : Accidents and car color are not independent.

.


Chapter 19 Comparing Counts

19-27

3. The following data were collected to determine whether or not the retirement age is the same for people in different professions.

Career

Attorney College professor Secretary Store clerck Total

Age at retirement 56–60 61–65 40 75 30 80 45 63 44 70 159 288

50–55 10 5 21 18 54

Over 65 40 60 49 50 199

Total 165 175 178 182 700

A. Goodness-of-fit B. Homogeneity C. Independence 4. The following data were collected to determine whether or not the retirement age is the same for people in different professions.

Career

Attorney College professor Secretary Store clerck Total

50–55 10 5 21 18 54

Age at retirement 56–60 61–65 Over 65 40 75 40 30 80 60 45 63 49 44 70 50 159 288 199

Total 165 175 178 182 700

A. H 0 : The retirement age has the same distribution for people in the different professions. H A : The retirement age has a different distribution for people in the different professions. B. H 0 : Age at retirement and job are dependent. H A : Age at retirement and job are not dependent. C. H 0 : Age at retirement and job are related. H A : Age at retirement and job are unrelated. D. H 0 : The retirement age has a different distribution for people in the different professions. H A : The retirement age has the same distribution for people in the different professions. E. H 0 : Age at retirement and job are not independent. H A : Age at retirement and job are dependent.

.


19-28

Part V Inference for Relationships

5. The following data were collected to determine whether or not the retirement age is the same for people in different professions.

Career

Attorney College professor Secretary Store clerck Total

50–55 10 5 21 18 54

Age at retirement 56–60 61–65 Over 65 40 75 40 30 80 60 45 63 49 44 70 50 159 288 199

Total 165 175 178 182 700

 2  20.77; P-value  0.0137 A. Reject the null hypothesis and conclude that the age at retirement is uniformly distributed for each career. B. Reject the null hypothesis and conclude that career and retirement age are related. C. Do not reject the null hypothesis and conclude that career and retirement age are independent. D. Reject the null hypothesis and conclude that career and retirement age are independent. E. Do not reject the null hypothesis and conclude that career and retirement age are related. 6. The following data were collected to determine whether or not the retirement age is the same for people in different professions.

Career

Attorney College professor Secretary Store clerck Total

50–55 10 5 21 18 54

Age at retirement 56–60 61–65 40 75 30 80 45 63 44 70 159 288

Over 65 40 60 49 50 199

How many degrees of freedom are there for the chi-square test statistic? A. 12 B. 25 C. 16 D. 9 E. 20

.

Total 165 175 178 182 700


Chapter 19 Comparing Counts

7. A die, suspected of being unfair, was rolled 50 times. The number of times each face appeared was recorded in the following table. Face Count

1 8

2 10

3 9

4 11

5 5

6 7

A. Goodness-of-fit B. Homogeneity C. Independence 8. A die, suspected of being unfair, was rolled 50 times. The number of times each face appeared was recorded in the following table. Face Count

1 8

2 10

3 9

4 11

5 5

A. H 0 : The counts are not uniformly distributed over the faces. H A : The counts are uniformly distributed over the faces. B. H 0 : The counts and faces are independent. H A : The counts and faces are not independent. C. H 0 : The counts are normally distributed over the faces. H A : The counts are not normally distributed over the faces. D. H 0 : The counts and faces are not independent. H A : The counts and faces are independent. E. H 0 : The counts are uniformly distributed over the faces. H A : The counts are not uniformly distributed over the faces.

.

6 7

19-29


19-30

Part V Inference for Relationships

9. A manufacturing plant for recreational vehicles receives shipments from three different parts vendors. There has been a defect issue with some of the electrical wiring in the recreational vehicles manufactured at the plant. The plant manager believes that the defect issue is dependent on the parts vendor. The plant manager reviews a sample of quality assurance inspections from the last six months. The expected number of perfect parts from Made-4-U Co. is Rejected Perfect Not perfect but acceptable

Perfect Parts Co. 53 93 22

Made-4-U Co. 48 71 31

25 Hour Parts Co. 70 88 22

A. 75.9 B. 91.08 C. 71 D. 61.81 E. 252 10. A manufacturing plant for recreational vehicles receives shipments from three different parts vendors. There has been a defect issue with some of the electrical wiring in the recreational vehicles manufactured at the plant. The plant manager believes that the defect issue is dependent on the parts vendor. The plant manager reviews a sample of quality assurance inspections from the last six months. Rejected Perfect Not perfect but acceptable

Perfect Parts Co. 53 93 22

Made-4-U Co. 48 71 31

25 Hour Parts Co. 70 88 22

The correct value of the test statistic for determining if the plant manager’s belief is supported is A.  2  8.10 B.  2  6.52 C.  2  5.03 D.  2  7.40 E.  2  9.89

.


Chapter 19 Comparing Counts

Statistics Quiz E – Chapter 19 – Key

1. C 2. E 3. B 4. A 5. B 6. D 7. A 8. E 9. A 10. D

.

19-31



Chapter 20 Inferences for Regression Solutions to Class Examples 1. Solution to Class Example 1: We want to examine the relationship between anxiety level and math test score. There is a weak, negative association between anxiety level and math test score. The correlation between the scores is only – 0.525. Generally, lower anxiety levels are associated with higher math test scores. There is one student who scored much lower on the math test than we would have suspected, based on the student’s anxiety level. This point is close to a typical anxiety level, so it shouldn’t prove to be particularly influential. Both variables are quantitative, and the scatterplot shows an association that is roughly linear. We will create a linear model to predict the math test score from the anxiety level.

M a t h T e s t

100 80 60 40

2

4

6

8

Anxiety Level

Residuals Plot

Residuals

20 0 -20 -40 50

60

70

80

predicted(M/A)

  91.658  4.486( Anxiety ) . This model only accounts for 26.6% of The linear model is Math the variability in math test score. For each additional unit of anxiety level, the model predicts a decrease of about 4.5 points in math test score. The residuals plot has random scatter, indicating an appropriate model, but the low R 2 makes it a poor predictor of math test score.

2. See Class Example 2. 3. Solution to Class Example 3: We wonder whether the weak relationship we see between anxiety level and math test score is evidence of an actual relationship, or simply the appearance of a relationship that is due merely to sampling error. Hypotheses. H 0 : There is no association between math test score and anxiety level. 1  0 H A : There is a linear association between math test score and anxiety level. 1  0

.


20-2

Part V Inference for Relationships

Model. Straight Enough Condition: There is no obvious bend in the scatterplot (from #1). Independence Assumption: There is no reason to believe that one student’s score affected any other student’s score. The residuals plot (from #1) shows random scatter. Does the Plot Thicken? Condition: The residuals plot shows relatively consistent spread. 8 Nearly Normal Condition: The histogram of the residuals is unimodal and symmetric, with the exception of one large negative 6 residual This residual isn’t that large, relative to the others, so it 4 shouldn’t be a problem. 2 Under these conditions, the sampling distribution of the regression slope can be modeled with a Student’s t-model, with (n – 2) = 22 -50.0 -12.5 25.0 degrees of freedom. We’ll do a regression slope t-test. residuals(M/A)

Mechanics.

  91.658  4.486( Anxiety ) . The linear model is Math

Conclusion. Since the P-value is low, we reject the null hypothesis. There is strong evidence that, on average, lower anxiety scores are associated with higher math scores. Follow-up. 95% Confidence Interval * b1  t22 ( SE (b1 ))  4.486  2.074(1.551)  (7.70, 1.27)

We are 95% confident that the interval (7.70, 1.27) contains the average math test score difference associated with difference in anxiety level. In other words, each additional unit of anxiety level is associated with an average decrease of between 1.27 and 7.70 points on the math test. 4. See Class Example 4. 5. Solution to Class Example 5:

b1  t16* ( SE (b1 ))  0.0786  2.120(0.0204)  (0.035, 0.122) We are 95% confident that the interval (0.035, 0.122) dollars per mile contains the average fare difference associated with difference in distance traveled. In other word, each additional 100 miles traveled is associated with an average fare increase of between $3.50 and $12.20.

.


Chapter 20 Inferences for Regression

20-3

6. See Class Example 6. Supplemental Resources

A worksheet that examines the association between average temperature and electrical usage is provided after the Chapter Quizzes.

.


20-4

Part V Inference for Relationships

Statistics Quiz A – Chapter 20

Name

A college admissions counselor was interested in finding out how well high school grade point averages (HS GPA) predict first-year college GPAs (FY GPA). A random sample of data from first-year students was reviewed to obtain high school and first-year college GPAs. The data are shown below: HS GPA 3.82 3.90 3.20 3.40 3.88 3.50 3.60 3.70 FY GPA 3.75 3.45 2.60 2.95 3.50 2.76 3.10 3.40 HS GPA 4.00 3.30 3.50 3.80 3.87 4.00 3.20 3.82 FY GPA 3.90 2.70 3.00 3.00 3.10 3.77 2.80 3.54 3.9 F Y G P A

3.6 3.3 3.0 2.7 3.2

3.4

3.6

3.8

HS GPA

Residual

1. Is there evidence of an association between high school and firstyear college GPAs? Test an appropriate hypothesis and state your conclusion in the proper context.

r e s 0.2 i d 0.0 u a -0.2 l s ( F / H

2.75

3.25

predicted(F/H)

6 4 2

-0.4 -0.2 0.0

0.2

residuals(F/H)

2. Create and interpret a 95% confidence interval for the slope of the regression line.

.

0.4


Chapter 20 Inferences for Regression

20-5

3. To improve the model the counselor adds a new predictor variable: Freshmen Credit Hours (FCH). She hopes that by using the number of credits a freshman chooses at enrollment, she will improve her predictions. However, she notes that the P-value for the slope of HS GPA stays below 5%, the P-value for FCH is 23%. Explain what this large P-value for FCH says about the new model.

.


20-6

Part V Inference for Relationships

Statistics Quiz A – Chapter 20 – Key

1. We have high school and first-year college GPAs for a random sample of first-year college students. We want to know if there is an association between high school and first-year college GPAs. H 0 : There is no association between high school and first-year college GPAs. 1  0 H A : There is an association between high school and first-year college GPAs. 1  0 * Straight enough: There is no obvious bend in the original scatterplot of the data or in the plot of residuals against predicted values. * Independence: These data were not collected over time, and there is no pattern in the residuals. * Does the plot thicken? Neither the original scatterplot nor the residuals plot shows any disturbing changes in the spread about the line. * Nearly Normal: A histogram of the residuals does not deviate too much from being unimodal and symmetric. Under these conditions the regression model is appropriate. We will use a linear regression t-test.  Regression equation: FY GPA  1.5641  1.3053  HS GPA  The R 2 for the regression is 75.4%. High school GPA seems to account for just over 75% of the variation in first-year college GPA. The regression equation indicates that each additional tenth of a point in high school GPA corresponds to an increase in first-year college GPA by an average of about 0.13 points. With a very small P-value  0.0001 , we reject the null hypothesis. There is strong evidence of an association between high school and first-year college GPAs. 2. Degrees of freedom = 16 – 2 = 14 95% confidence int. for 1 : b1  t14  SE  b1   1.3053  2.145  0.1993 or (0.88, 1.73). We are 95% confident that the first-year college GPA increases an average of between 0.88 and 1.73 points for each additional high school GPA point. 3. Because the P-value for FCH is so large, this new variable has not successfully improved the prediction model. The slope of FCH is not statistically significant. She’ll have to look for a different variable!

.


Chapter 20 Inferences for Regression

Statistics Quiz B – Chapter 20

20-7

Name

A high school counselor was interested in finding out how well student grade point averages (GPA) predict ACT scores. A sample of the senior class data was reviewed to obtain GPA and ACT scores. The data are shown in the table to the right.

GPA ACT 3.25 24 2.87 21 2.66 18 3.33 22 2.87 22 3.21 22 2.76 18 3.91 28 3.55 29 2.55 18 2.44 20 3.22 24 3.01 21 3.44 24 3.22 25

1. Is there evidence of an association between GPA and ACT score? Test an appropriate hypothesis and state your conclusion in the proper context. You can assume that the conditions for inference have been satisfied.

2. Create and interpret a 95% confidence interval for the slope of the regression line.

.


20-8

Part V Inference for Relationships

Statistics Quiz B – Chapter 20 – Key

1. We have GPA and ACT scores from a sample of students. We want to know if there is an association between GPA and ACT scores. H 0 : There is no association between GPA and ACT score. 1  0 H A : There is an association between GPA and ACT score. 1  0 * Straight enough condition: There is no obvious bend in the original scatterplot of the data or in the plot of the residuals against the predicted values. * Independence: These data were not collected over time and there is no pattern in the residuals plot. * Does the plot thicken? Neither the original scatterplot nor the residual plot shows any changes in the spread about the line. * Nearly Normal condition: A histogram of the residuals is roughly unimodal and symmetric. Under these conditions the regression model is appropriate. We will use the linear regression t-test.   0.427  7.397(GPA) Regression equation: PtDiff The R 2 for the regression is 78.1%. GPA seems to account for about 78% of the ACT score variation in the students. The regression equation indicates that each additional point of GPA corresponds to an increase in the student’s ACT score by 7.40 points, on average.

The P-value < 0.0001 is very small, so we reject the null hypothesis. There is strong evidence of an association between GPA and ACT score. 2. Degrees of freedom = 15 – 2 = 13 95% confidence interval for 1 is: b1  t13*  SE (b1 )  7.397   2.16 1.087  or (5.05, 9.74) We are 95% confident that the ACT score increases an average of between 5.05 and 9.74 points for every additional GPA point.

.


Chapter 20 Inferences for Regression

Statistics Quiz C – Chapter 20

Name

A sports analyst was interested in finding out how well a football team’s winning percentage (stated as a proportion) can be predicted based upon points scored and points allowed. She selects a random sample of 15 football teams. Each team played 10 games. She decided to use the point differential, points scored minus points allowed as the predictor variable. The data are shown in the table to the right, and regression output is given below. Residuals Plot

Win Percentage and Point Difference 1.0

0.2

0.8 Win Pct

Residual

0.1 0.0 -0.1 -0.2 -0.3

0.6 0.4 0.2

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Fitted Value

0.0

-50

0

50

100

Point Diff

Frequency

Residuals Histogram 3.0 2.5 2.0 1.5 1.0 0.5 0.0

20-9

Point Diff Win Pct –75 .100 –60 .300 –40 .200 –33 .300 –21 .400 –15 .500 5 .500 14 .700 20 .600 28 .700 31 .800 45 .900 56 .700 72 .600 80 1.000

Regression Analysis: Win Pct versus Point Diff Predictor Coef SE Coef T P Constant 0.51802 0.03071 16.87 0.000 Point Diff 0.0049500 0.0006682 7.41 0.000 S = 0.117511 -0.2

-0.1

0.0

R-Sq = 80.8%

R-Sq(adj) = 79.4%

0.1

Residual

Is there evidence of an association between Point Differential and Winning Percentage? Test an appropriate hypothesis and state your conclusion in the proper context.

.


20-10

Part V Inference for Relationships

Statistics Quiz C – Chapter 20 – KEY

We have Winning Percentage and Point Difference from a random sample of 15 teams. We want to know if there is an association between Winning Percentage and Point Difference. H 0 : There is no association between Winning Percentage and Point Difference. 1  0 H A : There is an association between Winning Percentage and Point Difference. 1  0 * Straight enough condition: There is no obvious bend in the original scatterplot of the data or in the plot of the residuals against the predicted values. * Independence: These data were not collected over time and there is no pattern in the residuals plot. * Does the plot thicken? Neither the original scatterplot nor the residual plot shows any changes in the spread about the line. * Nearly Normal condition: With the exception of a possible outlier, the histogram of the residuals is roughly unimodal and symmetric. Under these conditions the regression model is appropriate. We will use the linear regression t-test.   0.518  0.00495( PointDiff ) Regression equation: WinPerc The R 2 for the regression is 80.8%. Point Difference seems to account for about 81% of the variability in Winning Percentage. The regression equation indicates that each additional 10 points in Point Difference corresponds to an increase in the Winning Percentage of about 0.05, on average.

The P-value < 0.0001 is very small, so we reject the null hypothesis. There is strong evidence of an association between Winning Percentage and Point Difference.

.


Chapter 20 Inferences for Regression

Statistics – Regression Inference – Electricity

20-11

Chapter 20

Investigate the association between average monthly temperature (°F) and electrical usage (kilowatt hours) for a home. Avg. temp

Elec (kWh)

Month

Avg. temp

Elec (kWh)

Jan

39

1922

July

69

503

Feb

35

2123

Aug

70

377

Mar

41

2111

Sept

66

670

Apr

48

1728

Oct

57

679

May

55

878

Nov

51

1212

June

61

765

Dec

41

1589

Residual

Month

 

Is there evidence of an association between average monthly temperature and electrical usage?

Explain the association using a 95% confidence interval.

.


20-12

Part V Inference for Relationships

Statistics – Regression Inference – Electricity

KEY

Chapter 20

Hypotheses. H 0 : There is no association between average monthly temperature and electrical usage. 1  0 H A : There is a linear association between the variables. 1  0 Model. Straight Enough Condition: The association appears linear. Independence Assumption: The residuals plot shows random scatter. Does the Plot Thicken? Condition: The residuals plot shows fairly uniform spread. Nearly Normal Condition: The histogram of the residuals is unimodal and symmetric. Under these conditions, the sampling distribution of the regression slope can be modeled with a Student’s t-model, with (n – 2) = 10 degrees of freedom. We’ll do a regression slope t-test.

Mechanics.

df = 10

t = – 10.2

P-value  0.001

Conclusion. Since the P-value is so low, reject the null hypothesis. There is strong evidence that, as temperature increases, the average electrical usage decreases. Follow-up. 95% Confidence Interval: b1  t10* ( SE (b1 ))  50.9027  2.228(4.98)  (61.00, 39.81) I am 95% confident that each additional degree-day in average temperature is associated with an average decrease of between 39.81 and 61.00 kilowatt hours in electricity usage.

.


Chapter 20 Inferences for Regression

Statistics Quiz D – Chapter 20

20-13

Name

1. Applicants for a particular job, which involves extensive travel in Spanish-speaking countries, must take a proficiency test in Spanish. The sample data were obtained in a study of the relationship between the numbers of years applicants have studied Spanish and their score on the test. Dependent variable is: Score R squared = 83.0% s = 5.65138 with 10 - 2 = 8 degrees of freedom Variable Constant Number of years

Coefficient 31.53333 10.90476

s.e. of Coeff 6.360418 1.744054

t-ratio 4.957745 6.252538

prob 0.00111 0.000245

A. Score = 31.53333*Number of Years + 10.90476; The scores on the proficiency test for these applicants increase about 10.90476 points for each additional year spent studying Spanish. B. Number of Years = 31.53333 + 10.90476*Score; The number of years spent studying Spanish increases about 10.90476 years for each additional point scored on the test. C. Score = 31.53333*Number of Years + 10.90476; The scores on the proficiency test for these applicants increase about 31.53333 points for each additional year spent studying Spanish. D. Score = 31.53333 + 10.90476*Number of Years; The scores on the proficiency test for these applicants increase about 31.53333 points for each additional year spent studying Spanish. E. Score = 31.53333 + 10.90476*Number of Years; The scores on the proficiency test for these applicants increase about 10.90476 points for each additional year spent studying Spanish.

.


20-14

Part V Inference for Relationships

2. Applicants for a particular job, which involves extensive travel in Spanish-speaking countries, must take a proficiency test in Spanish. The sample data were obtained in a study of the relationship between the numbers of years applicants have studied Spanish and their score on the test. Use the regression analysis given below to find a 99% confidence interval for the slope of the regression line. Dependent variable is: Score R squared = 83.0% s = 5.65138 with 10 - 2 = 8 degrees of freedom Variable Constant Number of years

Coefficient 31.53333 10.90476

s.e. of Coeff 6.360418 1.744054

t-ratio 4.957745 6.252538

prob 0.00111 0.000245

A. (5.23, 16.57) B. (5.05, 16.76) C. (25.70, 37.40) D. (6.88, 14.93) E. (9.16, 12.65) 3. Applicants for a particular job, which involves extensive travel in Spanish-speaking countries, must take a proficiency test in Spanish. The sample data were obtained in a study of the relationship between the numbers of years applicants have studied Spanish and their score on the test. Use the regression analysis and summary statistics given below to find a 95% confidence interval for the mean score of all applicants who have studied Spanish for 3.1 years. Variable Years Score

Count 10 10

Mean 3.5 69.7

StdDev 0.341565 4.088058

Dependent variable is: Score R squared = 83.0% s = 5.65138 with 10 - 2 = 8 degrees of freedom Variable Constant Number of years

Coefficient 31.53333 10.90476

s.e. of Coeff 6.360418 1.744054

A. (60.9, 69.8) B. (63.6, 67.1) C. (60.9, 71.8) D. (51.6, 79.1)

.

t-ratio 4.957745 6.252538

prob 0.00111 0.000245


Chapter 20 Inferences for Regression

20-15

4. Applicants for a particular job, which involves extensive travel in Spanish-speaking countries, must take a proficiency test in Spanish. The sample data were obtained in a study of the relationship between the numbers of years applicants have studied Spanish and their score on the test. Use the regression analysis and summary statistics given below to find a 95% prediction interval for the score of an applicant who has studied Spanish for 3.1 years. Variable Years Score

Count 10 10

Mean 3.5 69.7

StdDev 0.341565 4.088058

Dependent variable is: Score R squared = 83.0% s = 5.65138 with 10 - 2 = 8 degrees of freedom Variable Constant Number of years

Coefficient 31.53333 10.90476

s.e. of Coeff 6.360418 1.744054

t-ratio 4.957745 6.252538

prob 0.00111 0.000245

A. 60.9 to 69.8 B. 61.3 to 69.4 C. 54.3 to 76.4 D. 51.6 to 79.1 5. Applicants for a particular job, which involves extensive travel in Spanish-speaking countries, must take a proficiency test in Spanish. The sample data were obtained in a study of the relationship between the numbers of years applicants have studied Spanish and their score on the test. A 99% confidence interval for the slope of the regression line was determined to be (5.05, 16.76). Give an interpretation of this interval. Dependent variable is: Score R squared = 83.0% s = 5.65138 with 10 - 2 = 8 degrees of freedom Variable Constant Number of years

Coefficient 31.53333 10.90476

s.e. of Coeff 6.360418 1.744054

t-ratio 4.957745 6.252538

prob 0.00111 0.000245

A. Based on these data, we are 99% confident that the average test score decreases between 5.05 and 16.76 points for every additional year spent studying Spanish. B. Based on these data, we are 99% confident that the number of years spent studying Spanish increases between 5.05 and 16.76 years for every additional point earned on the test. C. Based on these data, we are 99% confident that the mean test score is between 5.05 and 16.76 points D. Based on these data, we are 99% confident that a randomly selected applicant will score between 5.05 and 16.76 points on the test. E. Based on these data, we are 99% confident that the average test score increases between 5.05 and 16.76 points for every additional year spent studying Spanish.

.


20-16

Part V Inference for Relationships

6. Applicants for a particular job, which involves extensive travel in Spanish-speaking countries, must take a proficiency test in Spanish. The sample data were obtained in a study of the relationship between the numbers of years applicants have studied Spanish and their score on the test. A 95% confidence interval for the mean score of all applicants who have studied Spanish for 4.6 years was determined to be (75.6, 87.7). Give an interpretation of this interval. Dependent variable is: Score R squared = 83.0% s = 5.65138 with 10 - 2 = 8 degrees of freedom Variable Constant Number of years

Coefficient 31.53333 10.90476

s.e. of Coeff 6.360418 1.744054

t-ratio 4.957745 6.252538

prob 0.00111 0.000245

A. Based on this regression, we are 95% confident that a randomly selected applicant who has studied Spanish for 4.6 years will score between 75.6 and 87.7 points on the proficiency test. B. Based on this regression, we are 95% confident that applicants will score between 75.6 and 87.7 points higher on the proficiency test for each additional year spent studying Spanish. C. Based on this regression, we are 95% confident that all applicants who have studied Spanish for 4.6 years will average between 75.6 and 87.7 points on the proficiency test. D. Based on this regression, we are 95% confident an applicant who has studied Spanish for 4.6 years will score between 75.6 and 87.7 points on the proficiency test. E. Based on this regression, 95% of all random samples will have an average score between 75.6 and 87.7 points on the proficiency test.

.


Chapter 20 Inferences for Regression

20-17

7. Applicants for a particular job, which involves extensive travel in Spanish-speaking countries, must take a proficiency test in Spanish. The sample data were obtained in a study of the relationship between the numbers of years applicants have studied Spanish and their score on the test. A 95% prediction interval for the score of an applicant who has studied Spanish for 4.4 years was determined to be (65.4, 93.6). Give an interpretation of this interval. Dependent variable is: Score R squared = 83.0% s = 5.65138 with 10 - 2 = 8 degrees of freedom Variable Constant Number of years

Coefficient 31.53333 10.90476

s.e. of Coeff 6.360418 1.744054

t-ratio 4.957745 6.252538

prob 0.00111 0.000245

A. Based on this regression, we are 95% confident that the mean test score will increase between 65.4 and 93.6 points for each additional year spent studying Spanish. B. Based on this regression, we are 95% confident that the mean test score for applicants who have studied Spanish for 4.4 years will be between 65.4 and 93.6. C. Based on this regression, we know that 95% of all random samples will have a mean test score between 65.4 and 93.6 points. D. Based on this regression, we are 95% confident that the mean number of years spent studying Spanish will increase between 65.4 and 93.6 years for each additional point earned on the test. E. Based on this regression, an applicant who has studied Spanish for 4.4 years will score between 65.4 and 93.6 on the proficiency test, with 95% confidence. 8. Determine which plot shows the strongest linear correlation.

A

B

C

A. B. C.

.


20-18

Part V Inference for Relationships

9. A sample of 33 companies was randomly selected and data collected on the average annual bonus, turnover rate (%), and trust index (measured on a scale of 0 – 100). According to the output is shown below, what is the estimated multiple regression model? Dependent Variable is Turnover Rate Predictor Constant Trust Index Average Bonus

Coef 12.1005 -0.07149 -0.0007216

SE Coef 0.7826 0.01966 0.0001481

T 15.46 -3.64 -4.87

P 0.000 0.001 0.000

A. Turnover Rate = 0.01966 Trust Index + 0.0001481 Average Bonus B. Turnover Rate = 0.7826 + 0.01966 Trust Index + 0.0001481 Average Bonus C. Turnover Rate = 12.1005 – 0.07149 Trust Index – 0.0007216 Average Bonus D. Turnover Rate = 12.1005 + 0.07149 Trust Index + 0.0007216 Average Bonus E. Turnover Rate = 15.46 – 3.64 Trust Index – 4.87 Average Bonus 10. A sample of 33 companies was randomly selected and a multiple regression model was performed using average annual bonus and trust index (scale of 0 – 100) to explain turnover rate. According to the output below, what is the F statistic to determine the overall significance of the estimated is Analysis of Variance Source Regression Residual Error Total

DF 2 30 32

SS 262.73 67.27 330.00

MS 131.36 2.24

A. 58.64 B. 1.497 C. 131.36 D. 78.3 E. 2.24

.


Chapter 20 Inferences for Regression

Statistics Quiz D – Chapter 20 – KEY

1. E 2. B 3. A 4. D 5. E 6. C 7. E 8. C 9. C 10. A

.

20-19


20-20

Part V Inference for Relationships

Statistics Quiz E – Chapter 20

Name

1. A grass seed company conducts a study to determine the relationship between the density of seeds planted (in pounds per 500 sq ft) and the quality of the resulting lawn. Eight similar plots of land are selected and each is planted with a particular density of seed. One month later the quality of each lawn is rated on a scale of 0 to 100. The regression analysis is given below. Dependent variable is: Lawn Quality R squared =36.0% s = 9.073602 with 8 - 2 = 6 degrees of freedom Variable Constant Seed Density

Coefficient 31.14815 4.537037

s.e. of Coeff 7.510757 2.469522

t-ratio 4.413423 1.837213

prob 0.004503 0.115825

A. Lawn Quality = 33.14815 + 4.537037*Seed Density; The lawn quality of these plots increases about 4.537037 points for each 1 pound per 500 square foot increase in seed density. B. Lawn Quality = 33.14815 + 4.537037*Seed Density; The lawn quality of these plots increases about 33.14815 points for each 1 pound per 500 square foot increase in seed density. C. Seed Density = 33.14815 + 4.537037*Lawn Quality; The seed density of these plots increases about 4.537037 pounds per 500 square feet for each 1 point increase in lawn quality. D. Lawn Quality = 33.14815*Seed Density + 4.537037; The lawn quality of these plots increases about 4.537037 points for each 1 pound per 500 square foot increase in seed density. E. Lawn Quality = 33.14815*Seed Density + 4.537037; The lawn quality of these plots increases about 33.14815 points for each 1 pound per 500 square foot increase in seed density.

.


Chapter 20 Inferences for Regression

20-21

2. A grass seed company conducts a study to determine the relationship between the density of seeds planted (in pounds per 500 sq ft) and the quality of the resulting lawn. Eight similar plots of land are selected and each is planted with a particular density of seed. One month later the quality of each lawn is rated on a scale of 0 to 100. The regression analysis is given below. Find a 95% confidence interval for the slope of the regression line. Dependent variable is: Lawn Quality R squared =36.0% s = 9.073602 with 8 - 2 = 6 degrees of freedom Variable Constant Seed Density

Coefficient 31.14815 4.537037

s.e. of Coeff 7.510757 2.469522

t-ratio 4.413423 1.837213

prob 0.004503 0.115825

A. (–0.26, 9.34) B. (–1.51, 10.58) C. (–1.30, 10.38) D. (14.77, 51.53) E. (27.10, 39.18) 3. A grass seed company conducts a study to determine the relationship between the density of seeds planted (in pounds per 500 sq ft) and the quality of the resulting lawn. Eight similar plots of land are selected and each is planted with a particular density of seed. One month later the quality of each lawn is rated on a scale of 0 to 100. The regression analysis and summary statistics are given below. Find a 99% confidence interval for the mean lawn quality of all lawns sown with a seed density of 2.5. Variable Seed Density Lawn Quality

Count 8 8

Mean 2.75 45.625

StdDev 0.49099 3.712611

Dependent variable is: Lawn Quality R squared =36.0% s = 9.073602 with 8 - 2 = 6 degrees of freedom Variable Constant Seed Density

Coefficient 31.14815 4.537037

s.e. of Coeff 7.510757 2.469522

A. (31.1, 53.9) B. (30.4, 54.6) C. (42.0, 47.0) D. (6.7, 78.2)

.

t-ratio 4.413423 1.837213

prob 0.004503 0.115825


20-22

Part V Inference for Relationships

4. A grass seed company conducts a study to determine the relationship between the density of seeds planted (in pounds per 500 sq ft) and the quality of the resulting lawn. Eight similar plots of land are selected and each is planted with a particular density of seed. One month later the quality of each lawn is rated on a scale of 0 to 100. The regression analysis and summary statistics are given below. Determine a 99% prediction interval for the lawn quality of a lawn sown with a seed density of 2.9. Variable Seed Density Lawn Quality

Count 8 8

Mean 2.75 45.625

StdDev 0.49099 3.712611

Dependent variable is: Lawn Quality R squared =36.0% s = 9.073602 with 8 - 2 = 6 degrees of freedom Variable Constant Seed Density

Coefficient 31.14815 4.537037

s.e. of Coeff 7.510757 2.469522

t-ratio 4.413423 1.837213

prob 0.004503 0.115825

A. 34.3 to 58.3 B. 37.2 to 55.5 C. 10.6 to 82.0 D. 12.6 to 80.0 5. A grass seed company conducts a study to determine the relationship between the density of seeds planted (in pounds per 500 sq ft) and the quality of the resulting lawn. Eight similar plots of land are selected and each is planted with a particular density of seed. One month later the quality of each lawn is rated on a scale of 0 to 100. The regression analysis is given below. A 95% confidence interval for the slope of the regression line was determined to be (–1.51, 10.58). Give an interpretation of this interval. Dependent variable is: Lawn Quality R squared =36.0% s = 9.073602 with 8 - 2 = 6 degrees of freedom Variable Constant Seed Density

Coefficient 31.14815 4.537037

s.e. of Coeff 7.510757 2.469522

t-ratio 4.413423 1.837213

prob 0.004503 0.115825

A. We are 95% confident that each additional point on the lawn quality scale will increase the average seed density by –1.51 to 10.58 pounds per 500 square feet. B. We are 95% confident that a randomly selected plot will have a lawn quality between –1.51 and 10.58 points. C. Based on this regression, we are 95% confidence that the average lawn quality is between –1.51 and 10.58 points. D. Based on this regression, each additional pound of seed per 500 square feet will increase the average lawn quality by –1.51 to 10.58 points, with 95% confidence. E. We are 95% confident that the average lawn quality will increase between –1.51 and 10.58 points per year.

.


Chapter 20 Inferences for Regression

20-23

6. A grass seed company conducts a study to determine the relationship between the density of seeds planted (in pounds per 500 sq ft) and the quality of the resulting lawn. Eight similar plots of land are selected and each is planted with a particular density of seed. One month later the quality of each lawn is rated on a scale of 0 to 100. The regression analysis is given below. A 99% confidence interval for the mean lawn quality of all lawns sown with a seed density of 3.5 was found to be (33.3, 60.8). Give an interpretation of this interval. Dependent variable is: Lawn Quality R squared =36.0% s = 9.073602 with 8 - 2 = 6 degrees of freedom Variable Constant Seed Density

Coefficient 31.14815 4.537037

s.e. of Coeff 7.510757 2.469522

t-ratio 4.413423 1.837213

prob 0.004503 0.115825

A. Based on this regression, 99% of all random samples will have an average lawn quality between 33.3 and 60.8. B. Based on this regression, we are 99% confident that the average lawn quality will increase between 33.3 and 60.8 points for each additional pound of seed per 500 square feet. C. Based on this regression, we are 99% confident that the lawn quality for a lawn with a seed density of 3.5 is between 33.3 and 60.8. D. Based on this regression, we are 99% confident that the average seed density will increase between 33.3 and 60.8 pounds per 500 square feet for each additional one-point increase in lawn quality. E. Based on this regression, we are 99% confident that the average lawn quality for lawns with a seed density of 3.5 pounds per 500 square feet is between 33.3 and 60.8.

.


20-24

Part V Inference for Relationships

7. A grass seed company conducts a study to determine the relationship between the density of seeds planted (in pounds per 500 sq ft) and the quality of the resulting lawn. Eight similar plots of land are selected and each is planted with a particular density of seed. One month later the quality of each lawn is rated on a scale of 0 to 100. The regression analysis is given below. A 99% prediction interval for the lawn quality of a lawn sown with a seed density of 1.9 was determined to be (5.3, 78.3). Give an interpretation of this interval. Dependent variable is: Lawn Quality R squared =36.0% s = 9.073602 with 8 - 2 = 6 degrees of freedom Variable Constant Seed Density

Coefficient 31.14815 4.537037

s.e. of Coeff 7.510757 2.469522

t-ratio 4.413423 1.837213

prob 0.004503 0.115825

A. Based on this regression, we are 99% confident that the mean seed density will increase between 5.3 and 78.3 for each additional point of lawn quality over 1.9 pounds. B. Based on this regression, we are 99% confident that the mean lawn quality for lawns with a seed density of 1.9 pounds per 500 square feet will be between 5.3 and 78.3. C. Based on this regression, we are 99% confident that the mean lawn quality will increase between 5.3 and 78.3 for each additional pound of seed density over 1.9 pounds. D. Based on this regression, we know that 99% of all random samples will have a lawn quality between 5.3 and 78.3. E. Based on this regression, we are 99% confident that a lawn with a seed density of 1.9 pounds per 500 square feet will have a lawn quality between 5.3 and 78.3. 8. Determine which plot shows the strongest linear correlation.

A

B

C

A. B. C.

.


Chapter 20 Inferences for Regression

20-25

9. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. Using the output below, calculated F statistic to determine the overall significance of the estimated multiple regression model is Analysis of Variance Source Regression Residual Error Total

DF 2 27 29

SS 16477.3 3038.0 19515.4

MS 8238.7 112.5

A. 10.61 B. 73.23 C. 112.5 D. 3.60 E. None of the above 10. Selling price and amount spent advertising were entered into a multiple regression to determine what affects flat panel LCD TV sales. Use the output shown below, calculate the amount of variability in Sales is explained by the estimated multiple regression model. Analysis of Variance Source Regression Residual Error Total

DF 2 27 29

SS 16477.3 3038.0 19515.4

MS 8238.7 112.5

A. 15.57% B. 6.90% C. 84.43% D. 29% E. None of the above.

.


20-26

Part V Inference for Relationships

Statistics Quiz E – Chapter 20 – KEY

1. A 2. B 3. B 4. C 5. D 6. E 7. E 8. B 9. B 10. C

.


Statistics Test A – Part V

Name ________________________

__ 1.

Which of the following statements in NOT an assumption of inference for a regression model? A) The response variable is linearly related to the explanatory variable. B) The errors around the idealized regression line follow a Normal model. C) The errors around the idealized regression line have equal variability. D) The errors around the idealized regression line are independent of each other. E) The errors around the idealized regression line are linearly related.

__ 2.

A marketing survey involves product recognition in New York and California. Suppose the proportion of New Yorkers who recognized a product is p1 and the proportion of Californians who recognized the product is p2 . The survey found a 95% confidence interval for p1  p2 is (–0.29, –0.12). Give an interpretation of this confidence interval. A) We know that 95% of all random samples done on the population will show that the proportion of Californians who knew the product is between 12% and 29% higher than the proportion of New Yorkers who knew the product. B) We know that 95% of Californians recognized the product between 12% and 29% more often than New Yorkers. C) We are 95% confident that the proportion of New Yorkers who recognized the product is between 12% and 29% higher than the proportion of Californians who recognized the product. D) We know that 95% of New Yorkers recognized the product between 12% and 29% more often than Californians. E) We are 95% confident that the proportion of Californians who recognized the product is between 12% and 29% higher than the proportion of New Yorkers who recognized the product.

__ 3.

A poll reported that 41 of 100 men surveyed were in favor of increased security at airports, while 35 of 140 women were in favor of increased security. A) P-value = 0.0512; If there is no difference in the proportions, there is about a 5.12% chance of seeing the observed difference or larger by natural sampling variation. B) P-value = 0.0086; If there is no difference in the proportions, there is about a 0.86% chance of seeing the observed difference or larger by natural sampling variation. C) P-value = 0.4211; If there is no difference in the proportions, there is a 42.11% chance of seeing the exact observed difference by natural sampling variation. D) P-value = 0.0512; There is about a 5.12% chance that the two proportions are equal. E) P-value = 0.0086; There is about a 0.86% chance that the two proportions are equal.

__ 4.

A philosophy professor wants to find out whether the mean age of the men in his large lecture class is equal to the mean age of the women in his classes. After collecting data from a random sample of his students, the professor tested the hypothesis H 0 :  M  W  0 against the alternative H A :  M  W  0 . The P-value for the test was 0.003. Which is true? A) There is a 0.3% chance that the mean ages for the men and women are equal. B) There is a 0.3% chance that the mean ages for the men and women are different. C) It is very unlikely that the professor would see results like these if the mean age of men was equal to the mean age of women. D) There is a 0.3% chance that another sample will give these same results. E) There is a 99.7% chance that another sample will give these same results.

.


Part V-2

Part V Inference for Relationships

__ 5.

Absorption rates into the body are important considerations when manufacturing a generic version of a brand-name drug. A pharmacist read that the absorption rate into the body of a new generic drug (G) is the same as its brand-name counterpart (B). She has a researcher friend of hers run a small experiment to test H 0 : G   B  0 against the alternative H A : G   B  0 . Which of the following would be a Type I error? A) Deciding that the absorption rates are different, when in fact they are not. B) Deciding that the absorption rates are different, when in fact they are. C) Deciding that the absorption rates are the same, when in fact they are not. D) Deciding that the absorption rates are the same, when in fact they are. E) The researcher cannot make a Type I error, since he has run an experiment.

__ 6.

A Chi-square test is run on a table with 3 rows and 4 columns. How many degrees of freedom? A) 2 B) 3 C) 4 D) 6 E) 11

__ 7.

At one vehicle inspection station, 13 of 52 trucks and 11 of 88 cars failed the emissions test. Assuming these vehicles were representative of the cars and trucks in that area, a test is run to see if the failure rate is higher with trucks. What is the P-value? A) 0.029 B) 0.058 C) 0.000 D) 0.942 E) 0.971

__ 8.

At one SAT test site students taking the test for a second time volunteered to inhale supplemental oxygen for 10 minutes before the test. In fact, some received oxygen, but others (randomly assigned) were given just normal air. Test results showed that 42 of 66 students who breathed oxygen improved their SAT scores, compared to only 35 of 63 students who did not get the oxygen. Which procedure should we use to see if there is evidence that breathing extra oxygen can help test-takers think more clearly? B) 2-proportion z-test C) 1-sample t-test A) 1-proportion z-test D) 2-sample t-test E) matched pairs t-test

__ 9.

A survey asked people “On what percent of days do you get more than 30 minutes of vigorous exercise?” Using their responses we want to estimate the difference in exercise frequency between men and women. We should use a A) 1-proportion z-interval B) 2-proportion z-interval C) 1-sample t-interval D) 2-sample t-interval E) matched pairs t-interval

__10.

When evaluating the success of a multiple regression model, which of the following should be considered? I. The P-values for each slope should be significant. II. R 2 should increase. III. The predictor variables should be interrelated. A) I only B) II only C) III only D) I or II E) I or III

.


Review of Part V: Inference for Relationships

11. Cloning A random sample of 800 adults was asked the following question: “Do you think current laws concerning the use of cloning for medical research are too strict, too lenient, or about right?” The pollsters also classified the respondents with respect to highest education level attained: high school, 2-year college degree, 4-year degree, or advanced degree. We wish to know if attitudes on cloning are related to education level. (All the conditions are satisfied – don’t worry about checking them.)

High school 2-year 4-year Adv. degree Total

a. Write appropriate hypotheses. b. Suppose the expected counts had not been given. Show how to calculate the expected count in the first cell (106.01).

2 

Strict 93 106.01 27 28.31 82 75.48 20 12.21 222 1.60 0.06 0.56 4.97

Lenient 107 87.38 19 23.33 50 62.22 7 10.07 183 + + + +

4.40 0.80 2.40 0.93

+ + + +

Part V-3

Right 182 188.61 56 50.36 140 134.30 17 21.73 395

Total 382

0.23 0.63 0.24 1.03

+ + + = 17.86

102 272 44 800

P = 0.0066

c. How many degrees of freedom? Explain.

d. State your complete conclusion in context.

12. Exercise A random sample of 150 men found that 88 of the men exercise regularly, while a random sample of 200 women found that 130 of the women exercise regularly. a. Based on the results, construct and interpret a 95% confidence interval for the difference in the proportions of women and men who exercise regularly.

b. A friend says that she believes that a higher proportion of women than men exercise regularly. Does your confidence interval support this conclusion? Explain.

.


Part V-4

Part V Inference for Relationships

13. Bedrooms A random sample of 76 apartments is collected near a university. All of the apartments in the sample have between 1 and 6 bedrooms. The variables recorded for each apartment are Rent (in dollars) and the number of Bedrooms. The regression output is: The dependent variable is Rent R squared = 62.0% R squared (adjusted) = 61.5% s = 364.4 with 76 - 2 = 74 degrees of freedom Variable Constant Bedrooms

Coeff 357.795 400.554

SE(Coeff) 111.6 36.42

t-ratio 3.2 11.0

p-value 0.0020 < 0.0001

a. Write out the regression equation.

b. Compute a 95% confidence interval for the coefficient of Bedrooms. Explain your confidence interval in the context of the problem.

c. Based on your interval is the number of bedrooms a significant predictor of rent? Explain how you reached your answer.

14. Haircuts You need to find a new hair stylist and know that there are two terrific salons in your area, Hair by Charles and Curl Up & Dye. You want a really good haircut, but you do not want to pay too much for the cut. A random sample of costs for 10 different stylists was taken at each salon (each salon employs over 100 stylists). a. Indicate what inference procedure you would use to see if there is a significant difference in the costs for haircuts at each salon. Check the appropriate assumptions and conditions and indicate whether you could or could nor proceed. (Do not do the actual test.)

.


Review of Part V: Inference for Relationships

Part V-5

14. Haircuts (continued) b. A friend tells you that he has heard that Curl Up & Dye is the more expensive salon. i. Write hypotheses for your friend’s claim.

ii. The following are computer outputs. Which output is the correct one to use for this test? Explain.

Output A: Two-sample T for Hair by Charles vs Curl Up & Dye

Hair by Charles Curl Up & Dye

N 10 10

Mean 22.10 26.00

StDev 6.33 4.81

SE Mean 2.0 1.5

Difference = mu (Hair by Charles) - mu (Curl Up & Dye) Estimate for difference: -3.90000 95% CI for difference: (-9.22983, 1.42983) T-Test of difference = 0 (vs not =): T-Value = -1.55 P-Value = 0.140 DF = 16

Output B: Paired T for Hair by Charles - Curl Up & Dye

Hair by Charles Curl Up & Dye Difference

N 10 10 10

Mean 22.1000 26.0000 -3.90000

StDev 6.3325 4.8074 7.37036

SE Mean 2.0025 1.5202 2.33071

95% CI for mean difference: (-9.17244, 1.37244) T-Test of mean difference = 0 (vs not = 0): T-Value = -1.67 = 0.129

P-Value

iii. Use the appropriate computer output to make a conclusion about the hypothesis test based on the data. Make sure to state your conclusion in context.

.


Part V-6

Part V Inference for Relationships

Statistics Test A – Part V – Key 1. E

2. A

3. B

4. C

5. A

6. D

7. A

8. B

9. D

10. D

11. Cloning. a. H 0 : People’s opinions on cloning are independent of education level. H A : There is an association b. (382/800)(222) = 106.005 c. (4 – 1)(3 – 1) = 6 d. Reject null (P < 0.05); There is strong evidence that opinion varies with education level. It appears that high school grads are more likely to think regulations are too lenient, people with advanced degrees too strict. 12. Exercise: A random sample of 150 men found that 88 of the men exercise regularly, while a random sample of 200 women found that 130 of the women exercise regularly. a. Conditions: * Randomization Condition: We are told that we have random samples. * 10% Condition: We have less than 10% of all men and less than 10% of all women. * Independent samples condition: The two groups are independent of each other. * Success/Failure Condition: Of the men, 88 exercise regularly and 62 do not; of the women, 130 exercise regularly and 70 do not. The observed number of both successes and failures in both groups is at least 10. With the conditions satisfied, the sampling distribution of the difference in proportions is approximately Normal with a mean of pM  pW , the true difference between the population proportions. We can find a two-proportion z-interval. 88 130 We know: nM  150 , pˆ M   0.587 , nW  200 , pˆW   0.650 . 150 200 We estimate SD ( pˆ M  pˆW ) as

SE ( pˆ M  pˆW ) 

pˆ M qˆ M pˆ W qˆW   nM nW

 0.587  0.413   0.65 0.35  0.0525 150

200

ME  z *  SE ( pˆ M  pˆW )  1.96(0.0525)  0.1029 The observed difference in sample proportions = pˆ M  pˆW = 0.587 – 0.650 = -0.063, so the 95% confidence interval is 0.063  0.1029 , or –16.6% to 4.0%. We are 95% confident that the proportion of women who exercise regularly is between 4.0% lower and 16.6% higher than the proportion of men who exercise regularly. b. Since zero is contained in my confidence interval, I cannot say that a higher proportion of women than men exercise regularly. My confidence interval does not support my friend’s claim.

.


Review of Part V: Inference for Relationships

Part V-7

13. Bedrooms a. b.

  357.80  400.55 Bedrooms Rent * 2. b1  tn* 2  SE (b1 ) The degrees of freedom are n – 2 = 74. For a 95% C.I., t73 400.55  2(36.42)  (327.71, 473.39) dollars per bedroom We are 95% confident that, on average, each additional bedroom is associated with an increase of between $327.71 and $473.39 in the rent of an apartment.

c. To test H 0 : 1  0 vs. H A : 1  0 , we can compare our confidence interval from part b to the hypothesized value of zero. Since zero is below our interval, we conclude that there is strong evidence that the number of bedrooms is positively associated with the amount of rent charged. 14. Haircuts a. I would use a two-sample t-test for the difference of means. Conditions: * Independent group assumption: Stylists from two different salons are definitely independent groups. * Randomization condition: We are told that these are random samples of stylists from each salon. * 10% condition: The sample represents less than 10% of all possible stylists from each salon. * Nearly Normal condition: We do not have the data, so we do not know about this condition. We would proceed with caution. b. A friend tells you that he has heard that Curl Up & Dye is the more expensive salon. i.

Write hypotheses for your friend’s claim. Let H = Hair by Charles and C = Curl Up & Dye.

H 0 :  H  C  0 (There is no difference in the mean cost of haircuts at the two salons.) H A :  H  C  0 (The mean cost of haircuts is higher at Curl Up & Dye.)

ii. We would use Output A, since we are doing a two-sample t-test. (Output B is for a paired-t test.) iii. The P-value of 0.070 is high, so I fail to reject the null hypothesis. There is no evidence that Curl Up & Dye is any more expensive on average than Hair by Charles.

.


Part V-8

Part V Inference for Relationships

Statistics Test B– Part V __1.

Which of the following statements is true about multiple regression? I. We do not want our explanatory variables to be interrelated. II. At least one slope should be statistically significant. III. The slopes should not change if one variable is removed. A) I only

__2.

Name ________________________

B) II only

C) I and II only

D) I and III only

E) I, II, and III

A recent study of perfect pitch tested students in American music conservatories. It found the proportion of non-Asian students having perfect pitch, p1 , to differ from the proportion of Asian students having perfect pitch, p2 . The survey found that the 95% confidence interval for p1  p2 equals (–0.28, –0.22). Give an interpretation of this confidence interval. A) We know that 95% of all random samples done on the population will show that the proportion of Asian students having perfect pitch is between 22% and 28% higher than the proportion of non-Asian students having perfect pitch. B) We know that 95% of non-Asian students have perfect pitch between 22% and 28% more often than Asian students having perfect pitch. C) We are 95% confident that the proportion of non-Asian students having perfect pitch is between 22% and 28% higher than the proportion of Asian students having perfect pitch. D) We know that 95% of Asian students have perfect pitch between 22% and 28% more often than non-Asian students having perfect pitch. E) We are 95% confident that the proportion Asian students having perfect pitch is between 22% and 28% higher than the proportion of non-Asian students having perfect pitch.

__3.

A poll reported that 3 out of 50 college seniors surveyed did not have jobs, while 7 out of 50 college juniors did not have jobs during the academic year. A) P-value = 0.0613; There is about a 6.13% chance that the two proportions are equal. B) P-value = 0.1824; If there is no difference in the proportions, there is about a 18.24% chance of seeing the observed difference or larger by natural sampling variation. C) P-value = 0.0613; If there is no difference in the proportions, there is about a 6.13% chance of seeing the exact observed difference by natural sampling variation. D) P-value = 0.0072; If there is no difference in the proportions, there is about a 0.72% chance of seeing the observed difference or larger by natural sampling variation. E) P-value = 0.1824; There is about a 18.24% chance that the two proportions are equal.

__4.

A relief fund is set up to collect donations for the families affected by recent storms. A random sample of 400 people shows that 28% of those 200 who were contacted by telephone actually made contributions compared to only 18% of the 200 who received first class mail requests. Which formula calculates the 95% confidence interval for the difference in the proportions of people who make donations if contacted by telephone or first class mail? A)  0.28  0.18   1.96

C)  0.28  0.18   1.96 E)  0.28  0.18   1.96

 0.23 0.77  200

 0.23 0.77  400

B)  0.28  0.18   1.96

 0.23 0.77    0.23 0.77 

D)  0.28  0.18   1.96

 0.28 0.72    0.18 0.82 

 0.28 0.72    0.18 0.82  400

400 .

200

200

200

200


Review of Part V: Inference for Relationships

__5.

Part V-9

Doctors at a technology research facility randomly assigned equal numbers of people to use computer keyboards in two rooms. In one room a group of people typed a manuscript using standard keyboards, while in the other room people typed the same manuscript using ergonomic keyboards to see if those people could type more words per minute. After collecting data for several days the researchers tested the hypothesis H 0 : 1  2  0 against the one-tail alternative and found P = 0.22. Which is true? A) The people using ergonomic keyboards type 22% more words per minute. B) There’s a 22% chance that people using ergonomic keyboards type more words per minute. C) There’s a 22% chance that there’s really no difference in typing speed. D) There’s a 22% chance another experiment will give these same results. E) None of these.

__6.

It’s common for a movie’s ticket sales to open high for the first couple of weeks, then gradually taper off as time passes. Hoping to be able to better understand how quickly sales decline, an industry analyst keeps track of box office revenues for a new film over its first 20 weeks. What inference method might provide useful insight? A) 1-proportion z-test B) t-Interval for a mean C)  2 goodness-of-fit test D) t-test for linear regression E) t-Interval for slope

__7.

A researcher wishes to determine whether people with high blood pressure can reduce their blood pressure by following a particular diet. Subjects were randomly assigned to a treatment group and a control group. The mean blood pressure was determined for each group, and a 95% confidence interval for the difference in the mean for the treatment group and the control group, t  c , was found to be (–21, –6). A) We know that 95% of all people who follow the particular diet will have a blood pressure between 6 and 21 points lower than people who do not follow the diet. B) With 95% confidence, people who follow the diet have an average blood pressure between 6 and 21 points higher than people who do not follow the diet. C) We know that 95% of all experiments done on the population will show that the average blood pressure for people who follow the diet is between 6 and 21 points lower than the average blood pressure for people who do not follow the diet. D) We are 95% confident that a randomly selected subject in the treatment group had a blood pressure between 6 and 21 points lower than a randomly selected subject in the control group. E) With 95% confidence, people who follow the diet have an average blood pressure between 6 and 21 points lower than people who do not follow the diet.

__8.

Based on data from two very large independent samples, two students tested a hypothesis about equality of population means using   0.02 . One student used a one-tail test and rejected the null hypothesis, but the other used a two-tail test and failed to reject the null. Which of these might have been their calculated value of t? A) 1.22 B) 1.55 C) 1.88 D) 2.22 E) 2.66

__9.

A regression model is built using the heights and body fat percentages to predict the weights of 30 volunteers. How many degrees of freedom does this model have? A) 3 B) 27 C) 28 D) 29 E) 30

.


Part V-10

__10.

Part V Inference for Relationships

A contact lens wearer read that the producer of a new contact lens boasts that their lenses are cheaper than contact lenses from another popular company. She collected some data, then tested the null hypothesis H 0 : old  new  0 against the alternative H A : old  new  0 . Which of the following would be a Type II error? A) Deciding that the new lenses are cheaper, when in fact they really are. B) Deciding that the new lenses are cheaper, when in fact they are not. C) Deciding that the new lenses are not really cheaper, when in fact they are. D) Deciding that the new lenses are not really cheaper, when in fact they are not. E) Applying these results to all contact lenses, old and new.

11. College admissions According to information from a college admissions office, 62% of the

students there attended public high schools, 26% attended private high schools, 2% were home schooled, and the remaining students attended schools in other countries. Among this college’s Honors Graduates last year there were 47 who came from public schools, 29 from private schools, 4 who had been home schooled, and 4 students from abroad. Is there any evidence that one type of high school might better equip students to attain high academic honors at this college? Test an appropriate hypothesis and state your conclusion.

12. Gas mileage Hoping to improve the gas mileage of their cars, a car company has made an adjustment in the manufacturing process. Random samples of automobiles coming off the assembly line have been measured each week that the plant has been in operation. The data from before and after the manufacturing adjustments were made are in the table. It is believed that measurements of gas mileage are normally distributed. Write a complete conclusion about the manufacturing adjustments based on the statistical software printout shown below. SET M1

24 21 26 25 23 24 19 22 20 24 20 21 27 22

SET M2

22 24 28 28 27 24 22 24 27 25 27 23 28

Two Sample T for M1 vs M2 N Mean M1 14 22.71 M2 13 25.31

StDev 2.40 2.29

SEMean 0.64 0.64

P = 0.0041

DF = 24.98

95% CI for  2  1 (0.74, 4.45) T-Test

1   2 (vs. 1  2 ): T= 2.88

.


Review of Part V: Inference for Relationships

Part V-11

13. Test identification (NOTE: Do not do these problems!) For each, indicate which procedure you would use, the test statistic (z, t, or  2 “chi-squared”), and, if t or  2 , the number of degrees of freedom. A choice may be used more than once. Type

a. b. c. d. e. f. g. h

z / t / 

1. 2. 3. 4. 5. 6. 7. 8. 9.

df

proportion – 1 sample difference of proportions – 2 samples mean – 1 sample difference of means – independent samples mean of differences – matched pairs goodness of fit homogeneity independence regression, inference for slope

a. A union organization would like to represent the employees at the local market. A sample of the employees revealed 74 of 120 were in favor of the union. Does the union have the required 3 to 2 majority? b. An oral surgeon is interested in estimating how long it takes to extract all four wisdom teeth. The doctor records the times for 24 randomly chosen surgeries. Estimate the time it takes to perform the surgery with a 95% confidence interval. c. A microwave manufacturing company receives large shipments of thermal shields from two suppliers. A sample from each supplier’s shipment is selected and tested for the rate of defects. The microwave manufacturing company’s contract with each supplier states the shipment with the smallest rate of defect will be accepted. Do the shipments’ defect rates vary from each other? d. The owner of a construction company would like to know if his current work teams can build room additions quicker than the time allotted for by the contract. A random sample of 15 room additions completed recently revealed an average completion time of 0.32 days faster than contracted. Is this strong evidence that the teams can complete room additions in less than the contract times? e. A farmer would like to know if a new fertilizer increases his crop yield. In an effort to decide this, the farmer recorded the yield for 10 different fields prior to adding fertilizer and after adding the fertilizer. The farmer assumes the crop yields are approximately normal. Does the fertilizer work as advertised? f. A manufacturer gets parts from four suppliers (call them A, B, C, and D). A random sample of 1000 parts is selected from shipments by each supplier. In the samples, Supplier A has 21 defects, Supplier B has 14 defects, Supplier C has 8 defects, and Supplier D has 17 defects. Does this suggest any difference between the quality of parts provided by these suppliers? g. In a study to determine whether there is a difference between the average jail time convicted bank robbers and car thieves are sentenced to, the law students randomly selected 20 cases of each type that resulted in jail sentences during the previous year. A 90% confidence interval was created from the results. h. Doctors offer small candies to sixty teenagers, recording the number of candies consumed by each. One hour later they test the blood sugar level for each person. Is there evidence that blood sugar levels in teenagers are related to the amount of candy eaten?

.


Part V-12

Part V Inference for Relationships

14. Improving productivity A packing company considers hiring a national training consultant in hopes of improving productivity on the packing line. The national consultant agrees to work with 18 employees for one week as part of a trial before the packing company makes a decision about the training program. The training program will be implemented if the average product packed increases by more than 10 cases per day per employee. The packing company manager will test a hypothesis using   0.05. a. Write appropriate hypotheses (in words and in symbols).

b. In this context, which do you consider to be more serious – a Type I or a Type II error? Explain briefly.

c. After this trial produced inconclusive results the manager decided to test the training program again with another group of employees. Describe two changes he could make in the trial to increase the power of the test, and explain the disadvantages of each.

.


Review of Part V: Inference for Relationships

Part V-13

15. Student progress The Comprehensive Test of Basic Skills (CTBS) is used by school district to assess student progress. Two of the areas tested are math and reading. A random sample of student results was reviewed to determine if there is an association between math and reading scores on the CTBS. Here are the scatterplot, the residuals plot, a histogram of the residuals, and the regression analysis of the data. Use this information to analyze the association between the math and reading scores on the CTBS.

a. Is there an association? Write appropriate hypotheses.

b. Are the assumptions for regression satisfied? Explain.

c. What do you conclude?

d. Create a 95% confidence interval for the true slope and interpret it.

e. In addition to using Math scores to predict Reading scores, the variable of Grade Point Average (GPA) is added to the model. R 2 increases to 82.1% and both slopes have P-values less than 5%. Is this new variable helpful? Explain.

.


Part V-14

Part V Inference for Relationships

Statistics Test B – Part V – Key 1. A

2. E

3. B

4. D

5. E

6. D

7. E

8. D

9. B

10. C

11. College admissions H 0 : Distribution of school type among honors grads is the same as for whole college. H A : Distribution of school type among honors grads is different. These are counts; we assume this group is representative of other years; after combining home schoolers and students from abroad as “other,” expected counts of 52.08, 21.84, and 10.08 are all ≥ 5. OK to do a chi-square goodness of fit test with 2 df.

2  

 Obs  Exp    47  52.08   29  21.84   8  10.08  3.27 2

2

2

2

Exp 52.08 21.84 10.08 P = 0.195. With such a large P-value we do not reject the null hypothesis. There is no evidence that students who graduate with honors came from different high school backgrounds than others.

12. Gas mileage P = 0.0041 is strong evidence that the gas mileage of automobiles coming off the assembly line after the manufacturing adjustment has been increased. We are 95% confident that the mean gas mileage has increased between 0.74 and 4.45 miles per gallon. 13. Test identification a. b. c. d. e f g h

z / t /  z or  t z or  t t

Type 1 or 6 3 2 or 7 5 5 7 4 9

2 t t

df n/a or 1 23 n/a or 1 14 9 3 19 (or tech) 58

14. Improving productivity a. H 0 :  d  10 The mean difference in the number of cases packed is (not more than) 10. H A :  d  10 The mean difference in the number of cases packed is more than 10. b. A Type I error would be very expensive for the packing company. A Type I error would mean that the manager rejected the null hypothesis when in fact the null hypothesis is true. In this situation, by rejecting the null hypothesis the company thought the training improved productivity, so they paid for the consultant to train all employees. In reality, the training did not improve productivity so the company wasted money on training that did not help. c. To increase the power of the test, we could increase the level of significance (  ), or increase the sample size. Increasing the level of significance, could lead to adopting a training method that actually does not improve productivity. By increasing the sample size, the trial cost would increase and the trial might take more time.

.


Review of Part V: Inference for Relationships

Part V-15

15. Student progress a. H 0 : There is no association between Math and Reading CTBS scores. 1  0 H A : There is an association between Math and Reading CTBS scores. 1  0 b. * Straight Enough Condition: There is no obvious bend in the scatterplot. * Independence Condition: The residuals show no clear pattern. * Does the Plot Thicken? Condition: The residual plot shows reasonably consistent spread. * Nearly Normal condition: A histogram of the residuals is unimodal and roughly symmetric. c. The P-value is very small, so we reject the null hypothesis. There is strong evidence of a positive association between CTBS scores Math and Reading. d. A 95% confidence interval for 1 is: 1  t18*  SE  b1   0.866  2.101(0.1045) or (0.646, 1.086). We are 95% confident that the Reading CTBS score will be higher, on average, between 0.646 and 1.086 points for each additional CTBS point scored on the Math CTBS test. e. Since R 2 has increased and both slopes are statistically significant, we can reject Ho and conclude that we do have two significant predictors. This new variable has improved our model.

.


Part V-16

Part V Inference for Relationships

Statistics Test C – Part V

Name ________________________

__1.

Which statement is true about a matched pairs t-test? I. A matched pairs experiment usually results in this procedure. II. If two samples are independent, a matched pairs test is the wrong choice. III. This procedure has n -2 degrees of freedom. A) I only B) II only C) I and II only D) I and III only E) I, II, and III

__2.

A survey was conducted to determine the difference in gasoline mileage for two types of trucks. A random sample was taken for each model of truck, and the mean gasoline mileage, in miles per gallon, was calculated. A 95% confidence interval for the difference in the mean mileage for model A trucks and the mean mileage for model B trucks,  A   B , was determined to be (2.2, 4.5). A) Based on this sample, we are 95% confident that the average mileage for model B trucks is between 2.2 and 4.5 miles per gallon higher than the average mileage for model A trucks. B) We know that 95% of model A trucks get mileage that is between 2.2 and 4.5 miles per gallon higher than model B trucks. C) We are 95% confident that a randomly selected model A truck will get mileage that is between 2.2 and 4.5 miles per gallon higher than a randomly selected model B truck. D) We know that 95% of all random samples done on the population of trucks will show that the average mileage for model A trucks is between 2.2 and 4.5 miles per gallon higher than the average mileage for model B trucks. E) Based on this sample, we are 95% confident that the average mileage for model A trucks is between 2.2 and 4.5 miles per gallon higher than the average mileage for model B trucks.

__3.

A poll reported that 30% of 60 Americans between the ages of 25 and 29 had started saving money for retirement. Of the 40 Americans surveyed between the ages of 21 and 24, 25% had started saving for retirement. A) P-value = 0.2496; There is about a 24.96% chance that the two proportions are equal. B) P-value = 0.5854; If there is no difference in the proportions, there is about a 58.54% chance of seeing the observed difference or larger by natural sampling variation. C) P-value = 0.2496; If there is no difference in the proportions, there is about a 24.96% chance of seeing the exact observed difference by natural sampling variation. D) P-value = 0.1812; If there is no difference in the proportions, there is about a 18.12% chance of seeing the observed difference or larger by natural sampling variation. E) P-value = 0.5854; There is about a 58.54% chance that the two proportions are equal.

__4.

A researcher is comparing the proportion of rats that will successfully find food in a maze with or without negative stimulus. Out of the 150 rats who received a small electric shock, (when they chose the wrong direction) 40% found the food compared to only 30% of the 150 who received no negative stimuli to guide them. Which formula calculates the 98% confidence interval for the difference in thes proportions? A) (0.40  0.30)  2.33

(0.35)(0.65) 150

B) (0.40  0.30)  2.33

(0.35)(0.65) (0.35)(0.65)  150 150

C) (0.40  0.30)  2.33

(0.35)(0.65) 300

D) (0.40  0.30)  2.33

(0.40)(0.60) (0.30)(0.70)  150 150

E) (0.40  0.30)  2.33

(0.40)(0.60) (0.30)(0.70)  300 300 .


Review of Part V: Inference for Relationships

__5.

Part V-17

Investigators at an agricultural research facility randomly assigned equal numbers of chickens to be housed in two rooms. In one room, the chickens experienced normal day/night cycles, while in the other room lights were left on 24 hours a day to see if those chickens would lay more eggs. After collecting data for several days the researchers tested the hypothesis H 0 : 1  2  0 against the one-tail alternative and found P = 0.22. Which is true? A) The chickens in the lighted room averaged 0.22 more eggs per day B) There’s a 22% chance that chickens housed in a lighted room produce more eggs. C) There’s a 22% chance that there’s really no difference in egg production. D) There’s a 22% chance another experiment will give these same results. E) None of these

__6.

When we cannot reject the null hypothesis, we A) have obtained a t-value greater than our critical t-value. B) conclude that sampling error is responsible for our obtained difference. C) claim that a significant difference exists between groups. D) have committed a Type II error. E) have committed a Type I error.

__7.

160 students who were majoring in either math or English were asked a test question, and the researcher recorded whether they answered the question correctly. The sample results are given below. What is the correct report? Math English

Correct 27 43

Incorrect 53 37

 2  6.502; P -value  0.011 A) Report that response is dependent on major because the null hypothesis is rejected. B) Report that the distribution of correct answers is uniform for both majors because the null hypothesis is rejected. C) Report that response is not dependent on major because the null hypothesis is rejected. D) Report that response is dependent on major because the null hypothesis is not rejected. E) Report that response is not dependent on major because the null hypothesis is not rejected. __8.

Based on data from two very large independent samples, two students tested a hypothesis about equality of population means using   0.05 . One student used a one-tail test and rejected the null hypothesis, but the other used a two-tail test and failed to reject the null. Which of these might have been their calculated value of t? A) 1.22 B) 1.55 C) 1.88 D) 2.22 E) 2.66

__9.

A matched pairs experiment is run comparing the answers of 25 married couples. How many degrees of freedom do this t-test have? A) 22 B) 23 C) 24 D) 25 E) 50

.


Part V-18

__10.

Part V Inference for Relationships

Ten students in a graduate program were randomly selected. Their grade point averages (GPAs) when they entered the program were between 3.5 and 4.0. The students' GPAs on entering the program and their current GPAs were recorded. Use the given regression analysis to find and interpret the equation of the regression line. Dependent variable is: Current GPA R squared = 0.677968 s =0.024768 with 10 - 2 = 8 degrees of freedom

variable Constant Entering GPA

Coefficient 3.584756 0.090953

s.e. of Coefficient 0.078183 0.022162

t-ratio 45.85075 4.193932

prob 5.66*10-11 0.003419

A) Current GPA = 3.584756*Entering GPA + 0.090953; The current GPA for these students increases by 3.584756 points for each 1 point increase in their entering GPA. B) Entering GPA = 3.584756 + 0.090953*Current GPA; The entering GPA for these students increases by 0.090953 points for each 1 point increase in their current GPA. C) Current GPA = 3.584756*Entering GPA + 0.090953; The current GPA for these students increases by 0.090953 points for each 1 point increase in their entering GPA. D) Current GPA = 3.584756 + 0.090953*Entering GPA; The current GPA for these students increases by 0.090953 points for each 1 point increase in their entering GPA. E) Current GPA = 3.584756 + 0.090953*Entering GPA; The current GPA for these students increases by 3.584756 points for each 1 point increase in their entering GPA. 11. Peanut M&Ms According to Mars, Incorporated, peanut M&M’s are 12% brown, 15% yellow, 12% red, 23% blue, 23% orange, and 15% green. On a Saturday when you have run out of statistics homework, you decide to test this claim. You purchase a medium bag of peanut M&M’s and find 39 browns, 44 yellows, 36 red, 78 blue, 73 orange, and 48 greens. Test an appropriate hypothesis and state your conclusion.

.


Review of Part V: Inference for Relationships

Part V-19

12. Test identification Suppose you were asked to analyze each of the situations described below. (NOTE: Do not do these problems!) For each, indicate which procedure you would use (pick the appropriate number from the list), the test statistic (z, t, or  2 “chi-squared”), and, if t or c2, the number of degrees of freedom. A procedure may be used more than once. Type

z / t / 

1. 2. 3. 4. 5. 6. 7. 8.

df

a. b. c. d. e. f. g.

proportion – 1 sample difference of proportions – 2 samples mean – 1 sample difference of means – independent samples mean of differences – matched pairs goodness of fit homogeneity independence

a. Among randomly selected pets, 27% of the 188 dogs and 18% of the 167 cats had fleas. Does this indicate a significant difference in rates of flea problems for these two pets? b. Are there more broken bones in summer or winter? We get records about the number of fractures treated in January and July at a random sample of 25 emergency rooms. c. A random sample of 600 high school seniors reported their grade point averages and the amount of financial aid offered them by colleges. We wonder if there is an association between academic success and college aid. d. For a random sample of 200 drivers at a gas station, we record the driver’s gender (male or female) and the type of gasoline purchased (regular, plus, or premium). We wonder if there is an association between a driver’s gender and the type of gasoline they buy. e. The school newspaper wants a 95% confidence interval for the road test failure rate. In a random sample of 65 student drivers, 37 said they failed their driver’s test at least once. f.

A supermarket chain wants to know which of two merchandise display methods is more effective. They randomly assign 15 stores to use display type A and 15 others to use display type B, then collect data about the number of items sold at each store.

g. Tags placed on garbage cans allow the disposal of up to 30 pounds of garbage. A random sample of 22 cans averaged 33.2 pounds with a standard deviation of 3.2 pounds. Is this strong evidence that residents overload their garbage cans? h. Researchers offer small cookies to nine nursery school children and record the number of cookies consumed by each. Forty-five minutes later they observe these children during recess, and rate each child for hyperactivity on a scale from 1 – 20. Is there any evidence that sugar contributes to hyperactivity in children?

.


Part V-20

Part V Inference for Relationships

13. Scrubbers A factory recently installed new pollution control equipment (“scrubbers”) on its smokestacks in hopes of reducing air pollution levels at a nearby national park. Randomly timed measurements of sulfate levels (in micrograms per cubic meter) were taken before (Set C1) and after (Set C2) the installation. We believe that measurements of sulfate levels are normally distributed. Write a complete conclusion about the effectiveness of these scrubbers based on the statistical software printout shown. SET C1

10.0 8.0 8.0 7.0 6.0 9.0 11.5 8.0 9.5 7.5 5.0 10.0

SET C2

5.0 7.0 1.0 9.0 1.5 5.0 2.5 4.0 9.0 6.0

Two Sample T for C1 vs C2 N Mean C1 12 8.29 C2 10 5.00

StDev 1.83 2.84

SEMean 0.53 0.90

95% CI for mu1 – mu2: (1.20, 5.38) T-Test

mu1 = mu2 (vs. not =):

T= 3.29

P = 0.0037

DF = 20

14. Blood pressure Researchers developing new drugs must be concerned about possible side effects. They must check a new medication for arthritis to be sure that it does not cause an unsafe increase in blood pressure. They measure the blood pressures of a group of 12 subjects, then administer the drug and recheck the blood pressures one hour later. The drug will be approved for use unless there is evidence that blood pressure has increased an average of more than 20 points. They will test a hypothesis using   0.05 . a. Write appropriate hypotheses (in words and in symbols).

b. In this context, which do you consider to be more serious – a Type I or a Type II error? Explain briefly.

c. After this experiment produced inconclusive results the researchers decided to test the drug again another group of patients. Describe two changes they could make in their experiment to increase the power of their test, and explain the disadvantages of each.

.


Review of Part V: Inference for Relationships

15. Auto repairs An insurance company hopes to save money on repairs to autos involved in accidents. Two body shops in town seem to do most of the repairs, and the company wonders whether one of them is generally cheaper than the other. From their files of payments made during the past year they select a random sample of ten bills they paid at each repair shop. The data are shown in the table. Indicate what inference procedure you would use to see if there is a significant difference in the costs of repairs done at these two body shops, then decide if it is okay to actually perform that inference procedure. (Check the appropriate assumptions and conditions and indicate whether you could or could not proceed. You do not have to do the actual test.)

Bodies by Bock 2130 980 3400 2190 1600 1450 4290 3090 1050 2530 3870 2760 2290 2640

Part V-21

Velleman’s Automagic 2570 1120 2950 1880 1660 1700 4030 3970 1130 3660 4650 3280 2390 2180

16. Height and weight Height and weight data was collected from a group of randomly selected male students. Dependent variable is: WT(lb) R squared = 56.6% s =

14.16

with

25 - 2 = 23

Variable

Coefficient

s.e. of Coeff

t-ratio

prob

Const

-364.403

94.61

-3.85

0.0008

HT(in)

7.29993

1.333

5.48

≤ 0.0001

a. Is there an association? Write appropriate hypotheses.

b. Are the assumptions for regression satisfied? Explain.

c. What do you conclude?

.


Part V-22

Part V Inference for Relationships

Statistics Test C – Part V – Key 1. C

2. E

3. B

4. D

5. E

6. B

7. A

8. C

9. C

10. D

7. Peanut M&Ms We want to know if the distribution of colors in the bag matches the distribution stated by Mars, Incorporated. H 0 : The distribution of colors in the bag matches the distribution stated by Mars, Incorporated. H A : The distribution of colors in the bag does not match the distribution stated by Mars, Incorporated. Conditions: * Counted data: We have the counts of the number of peanut M&Ms of each color. * Randomization: We will assume that each bag of peanut M&Ms represents a random sample of peanut M&Ms. * Expected cell frequency: There are a total of 318 peanut M&Ms. The smallest percentage of any particular color is 12% (brown and red), and we expect 318(0.12) = 38.16. Since the smallest expected count exceeds 5, all expected counts will exceed 5, so the condition is satisfied. Under these conditions, the sampling distribution of the test statistic is  2 with 6 – 1 = 5 degrees of freedom, and we will perform a chi-square goodness-of-fit test. Color brown yellow red blue orange green Observed Frequency 39 44 36 78 73 48 Expected Frequency 38.16 47.7 38.16 73.14 73.14 47.7

2  

 Obs  Exp    39  38.16    44  47.7   ...  0.7528 2

2

Exp

38.16

2

47.7

P-value = P    0.7528   0.980 2

A P-value this large says that if the distribution of colors in the bag matches the distribution stated by Mars, Incorporated., an observed chi-square value of 0.7528 would happen about 98% of the time. Thus, we fail to reject the null hypothesis. These data do not show evidence that the distribution of colors in the bag differs from the distribution stated by Mars, Incorporated. 12. Test identification a. b. c. d. e. f. g. h.

Type 2 or 7 5 6 8 1 or 6 4 3 9

z / t /  z or  t n/a



z or  t t t

df n/a or 1 24 n/a 2 n/a or 1 14 (or tech) 21 7

.


Review of Part V: Inference for Relationships

Part V-23

13. Scrubbers P < 0.05 is strong evidence that the scrubbers have changed the mean pollution level. We are 95% confident that the mean sulfate level has decreased between 1.20 and 5.38 micrograms per cubic meter. 14. Blood pressure a. H 0 : d  20 The mean increase in blood pressure is safe. H A : d  20 The mean increase in blood pressure exceeds the safe limit. b. Type II is dangerous; the medication is approved even though blood pressure increases too much. Type I means that an acceptable medication is not approved; that’s too bad, but not dangerous. c. Increase alpha; could lead to rejecting a medication that’s actually okay. Increase n; more costly and difficult, and could endanger more subjects. 15. Auto repairs t-test for difference of means – the samples are independent, each is an SRS from its body shop and less than 10% of all repairs done there. The Bock data are unimodal and roughly symmetric, but a histogram of the Velleman bills is too skewed. We cannot proceed. 16. Height and weight a. H 0 : There is no association between height and weight. H A : There is an association between height and weight. b. The scatterplot looks straight enough, residuals are random and display consistent spread, the histogram of residuals looks roughly unimodal and symmetric. c. Reject H 0 because of the small P-value; there is strong evidence of an association between height and weight.

.


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.