CHAPTER IV – HYPOTHESIS TESTING
Hypothesis Testing Objectives

After reading this chapter, the student should be able to:

1. Define what is meant by the terms correlation, negative correlation, positive correlation, and zero correlation and be able to give two examples of each.
2. Construct a correlation coefficient for a set of scores.
3. Convert raw scores to percentiles and T scores.
4. Interpret a correlation coefficient.
5. Determine how to construct a coefficient of determination.
6. Determine how to construct a test for the significance of a correlation coefficient.
7. Define what is meant by reliability, objectivity and validity.
8. Understand how to compare the score on one or more items with a criterion T-score of several items in a battery.
9. Interpret a reliability correlation coefficient.
10. Interpret an objectivity correlation coefficient.
11. Interpret a validity correlation coefficient.
12. Follow the formulas to calculate the various values presented in this chapter.
Key Terms

Population: A population is all the members of a specified group.

Sample: A sample is a group of subjects that have been selected from a population.

Random sample: With a random sample, each member of the population has an equal chance of being selected.

Cluster sample: A cluster sample is a variation of the random sample, particularly appropriate when the population is large or when its geographic distribution is widely scattered.

Systematic sample: A systematic sample consists of the selection of every nth term from a list.

Stratified sample: When sub-populations vary considerably, it is advantageous to sample each sub-population (stratum) independently. Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling. The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Random or systematic sampling is then applied within each stratum. This often improves the representativeness of the sample by reducing sampling error.
Hypothesis testing: Hypothesis testing is an organized method for drawing conclusions and determining how likely they are to be true.

Intervening variables: Intervening variables are variables which cannot be controlled or measured directly.

Type I error: A Type I error is made when you reject a true null hypothesis.

Type II error: A Type II error is made when you accept a false null hypothesis.

Matched pairs: Matched pairs means selecting pairs of subjects with identical characteristics and assigning one member of each pair to an experimental group and the other to a control group in order to ensure that the groups are equal.

Parametric statistics: Parametric tests require that we estimate from the sample data the value of at least one population characteristic (parameter), such as its SD. One of the assumptions we make in applying these parametric techniques to sample data is that the variable we have measured is normally distributed in the populations from which the samples were obtained.

Non-parametric statistics: Non-parametric (or distribution-free) statistical procedures are used when little is known about the distribution of the population or when it is known that the distribution differs markedly from a normal distribution.

Null hypothesis (HO): A type of hypothesis used in statistics that proposes that no statistical significance exists in a set of given observations. The null hypothesis attempts to show that no variation exists between variables, or that a single variable is no different from zero. It is presumed to be true until statistical evidence nullifies it in favor of an alternative hypothesis.

Alternate hypothesis (HA): The alternate hypothesis (denoted HA) is the opposite of what is stated in the null hypothesis. The alternate hypothesis attempts to show that variation exists between variables, or that a single variable is different from zero.

Independent scores: Independent scores means that the scores made by one group do not affect those made by the other group.
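The .05 significance level that shows up throughout this chapter is exactly the Type I error rate. A quick way to convince yourself of this is simulation: draw two samples from the same population (so the null hypothesis is true by construction) and count how often a test at the .05 level rejects it anyway. Here is a hedged sketch in Python, assuming SciPy is available; all of the numbers are invented for illustration.

import random
from scipy import stats   # assumes SciPy is installed

trials = 1000
rejections = 0
for _ in range(trials):
    # Both groups come from the same population, so HO is true.
    group1 = [random.gauss(100, 15) for _ in range(30)]
    group2 = [random.gauss(100, 15) for _ in range(30)]
    t, p = stats.ttest_ind(group1, group2)
    if p < 0.05:
        rejections += 1   # a Type I error: rejecting a true null hypothesis

print(rejections / trials)   # should hover around 0.05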
Hypothesis Testing So far we have been primarily concerned with statistical measures which describe a set of data. The mean and the standard deviation are such statistics. However, we frequently wish to do more than describe our data. There are times when we wish to draw conclusions and make generalizations about classes, teaching methods, training programs, etc. In general, we often wish to predict characteristics of a larger group as a result of our investigation with a smaller group.
Populations and Samples

The term population refers to all the members of a specified group. The population of a city is made up of all the people living there. The population of students at a high school is all the students enrolled in that school. A population may be very large (all tenth grade boys in the U.S.) or fairly small (all members of the Central High School varsity golf team). It is not always convenient or possible to examine each member of a population in order to draw a conclusion. To make the investigation easier, some members of a population are selected and called a sample of the population. Conclusions about the population are then based upon results obtained from the sample. If the conclusions based upon the examination of a sample are to be valid for the entire population, that sample must be representative of the population. In other words, to make an inference or generalization from a sample to the population from which it was drawn, the sample has to mirror the population.

Suppose a teacher at a large all-boys high school received a questionnaire from the state physical education association requesting mean measurements of his students' height, weight, arm strength, and leg strength. Let's say he does not have the time to measure each member of the population and then compute the means, so he decides to select a sample of 25 boys and measure them. How should he select the 25 members of the sample so that they are representative, a mirror image of the population? The varsity football team is practicing outside his office. Should he ask the coach to send him 25 players? A ninth grade physical education class is available. Should he select 25 members from it? If he chooses either of these methods, will his sample be representative of the population? Of course not! Do twenty-five football players or twenty-five ninth graders represent the population of the entire school? Heck, no! Those samples (assuming that they were randomly selected) would only represent the football team and the ninth grade physical education classes respectively.

If the teacher wants measurements that are most likely to be representative of the population of boys in his school, he should select a random sample from the population. A random sample is one in which each member of the population has an equal chance of being selected. There are several techniques which can be used in selecting a random sample. For instance, if we wished to draw a random sample of 50 individuals from a population of 600 students enrolled in a high school, we could place the 600 names in a container and, blindfolded, draw one name at a time until the sample of 50 names was selected. This procedure is cumbersome and is not often used unless you are going to draw a small sample from the population. A more convenient way of choosing a random sample, especially if the sample is going to be fairly large, is by the use of a table of random numbers. Such tables have been compiled by the RAND Corporation and the Bureau of Transport Economics and Statistics of the Interstate Commerce Commission. When using a table it is first necessary to assign consecutive numbers to each individual in the population from which the sample is to be drawn. Then, starting at any point on the table of
random numbers, corresponding numbers are taken from the published list until the desired number of individuals is obtained. It is essential that the table order be followed meticulously to avoid bias in the process.

Now let me try to confuse you a little more. If a population has been accurately listed, a type of systematic selection will provide what approximates a random sample. A systematic sample consists of the selection of every nth term from a list. For example, if a sample of 200 were to be selected from a telephone directory with 200,000 listings, one would select the first name at random from a randomly selected page. Then every thousandth name would be selected until the sample of 200 names was complete. If the last page were reached before the desired number had been selected, the count would continue from the first page of the directory. Systematic samples of automobile owners could be selected in similar fashion from a state licensing bureau list or file, or a sample of eighth-grade students from a school attendance roll.

Now I am really going to add chaos to confusion. I just love playing with your itty bitty brain like this. At times it is advisable to subdivide the population into smaller homogeneous groups in order to get more accurate representation. For example, in making an income study of wage earners in a community, a true sample would approximate the same relative number from each socioeconomic level of the whole community. If the proportions were 15 percent professional workers, 10 percent managers, 20 percent skilled workers and 55 percent unskilled workers, the sample should include approximately the same proportions in order to be considered representative. Within each subgroup some process of probability selection is generally used. This process gives the researcher a more representative sample than one selected from the entire community, which might be unduly weighted by a preponderance of unskilled workers. In addition to, or instead of, socioeconomic status, such characteristics as age, sex, extent of formal education, racial origin, religious or political affiliation, or rural-urban residence might provide a basis for the choice of a stratified sample. It is evident that the characteristics of the entire population must be carefully considered, together with the purposes of the study, before a stratified sample is decided upon. This is called stratified random sampling. It is the best method of random sampling you can use.

There is also what is called a cluster sample. The cluster sample is a variation of the random sample, particularly appropriate when the population is large or when its geographic distribution is widely scattered. For example, suppose you wish to select a random sample of teachers in a large state. In terms of preparation, cost, and administration the survey would be almost prohibitive. It would be more feasible to select a random sample of counties and, from each of the sample counties, a random sample of school districts. Using the selected school districts it would not be difficult to gather the information desired from the teachers in those districts. The reduction in cost and ease of administration would allow the selection of a larger sample than would be the case if a simple random sample of all teachers in the state were selected.
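To make these schemes concrete, here is a minimal sketch in Python of simple random, systematic, and stratified selection. Everything below is a toy stand-in: the 600 numbered students echo the earlier example, and the stratum sizes mirror the 15/10/20/55 occupational split from the wage-earner example; none of it is real data.

import random

# Simple random sample: 50 names drawn from 600, every member
# having an equal chance of selection.
population = list(range(1, 601))
random_sample = random.sample(population, 50)

# Systematic sample: every nth name from the list, starting at a
# randomly chosen point (here n = 600 / 50 = 12).
n = len(population) // 50
start = random.randrange(n)
systematic_sample = population[start::n]

# Stratified sample: each stratum contributes in proportion to its
# share of the population (15% professional, 10% managers,
# 20% skilled, 55% unskilled, as in the example above).
strata = {
    "professional": ["prof_%d" % i for i in range(150)],
    "managers":     ["mgr_%d" % i for i in range(100)],
    "skilled":      ["skill_%d" % i for i in range(200)],
    "unskilled":    ["unskill_%d" % i for i in range(550)],
}
population_size = sum(len(members) for members in strata.values())
sample_size = 100
stratified_sample = []
for stratum, members in strata.items():
    k = round(sample_size * len(members) / population_size)
    stratified_sample.extend(random.sample(members, k))

print(len(random_sample), len(systematic_sample), len(stratified_sample))
# 50 50 100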
Again, in selecting a sample, the individuals or observations are chosen in such a way that any individual or observation in the population has an equal chance of being selected, and each choice is independent of any other choice. Only when these conditions are met can a sample be said to be randomly selected. Also, it should be apparent that in order to select a random sample by any method, conscious selection of particular individuals must not enter the process. Educational researchers, because of the administrative difficulties in applying randomizing procedures, often use available classes or intact groups as samples. This is a questionable procedure because the ordinary methods of statistical inference are not applicable to such groups. Just between us, this type of biased sampling is the worst way to draw a sample. With this "method," we put together
our sample by using naturally occurring or artificially constructed groups of subjects without the benefit of random selection. For example, if a physical education instructor wants to determine how many students lift weights, he could ask the 30 students taking his introductory weightlifting course this question. If he tried to generalize beyond these 30 students, he would have a biased sample. There is no good reason to believe that these 30 students are typical of the 10,000 other students at his university. Likewise, if a researcher stood at a football game and questioned whoever walked past, he would be getting a biased sample. How would he know that the people who walked past were representative of anyone about whom he would wish to generalize? And furthermore, what makes him think that the person who is willing to talk to him is similar to the person who looks down and avoids eye contact in order to evade being interviewed? Some, let me rephrase that, a lot of the weakness of educational research can be attributed to the drawing of random sampling inferences from non-random samples. Unless appropriate and more complex experimental designs are used to nullify uncontrolled factors, the research cannot be considered sound and it certainly cannot be considered valid.

Let me explain something to you about research. There is the gospel and there is garbage. In the research game, many self-proclaimed experts are quick to pass off the latter for the former. A research expert, of course, is anyone who's been to a research library more than once. And if that person happens to know how to use a computer database we are talking research guru. How then do you separate the good advice from the bullshit? I have my own favorite expert on that subject. In the profound words of Marvin Gaye: "Believe half of what you see and none of what you hear." Did I mention that some researchers actually fabricate their data? Okay, some researchers actually fabricate their data. Another good reason to read all the research you can, and then believe half of it. What half should you believe? Well, the half that makes perfect sense to you after reading this book.

Moving right along, some of the relationships between samples and the population from which they are drawn can be illustrated by a specific example. Table 4-1 gives data about each member of the population of varsity and freshman basketball players at a certain college. The population mean has been computed for these groups. Based upon your examination of Table 4-1, would you say the sample mean is always equal to the population mean?

Table 4-1. Some random samples and their means for the population of varsity and freshman basketball players
Sample A: 70, 69, 74                                M = 71
Sample B: 68, 78, 77                                M = 74.3
Sample C: 81, 77, 73, 76                            M = 76.75
Sample D: 70, 81, 77, 77                            M = 76.25
Sample E: 72, 78, 81, 69, 74, 75, 73                M = 74.6
Sample F: 70, 78, 75, 74, 74, 74, 77                M = 74.6
Sample G: 70, 68, 78, 75, 69, 75, 75, 71, 77        M = 73.5
Sample H: 77, 74, 74, 71, 75, 70, 69, 76, 72, 77    M = 73.7
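You can verify the sample means in Table 4-1 in a couple of lines of Python. A minimal sketch (only samples A through D are transcribed here):

from statistics import mean

# Scores transcribed from Table 4-1.
samples = {
    "A": [70, 69, 74],
    "B": [68, 78, 77],
    "C": [81, 77, 73, 76],
    "D": [70, 81, 77, 77],
}
for name, scores in samples.items():
    print(name, round(mean(scores), 2))
# A 71, B 74.33, C 76.75, D 76.25 -- no two sample means agree.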
Please tell me you didn't say yes! If you did, go sit in the corner until I tell you to come back. Get back here…I was only kidding. Obviously, the sample means are not the same as the population means. You knew that…right? Now put your thinking cap on. That's right, your Yankee cap! Other factors being equal, which would be better…a large sample or a small sample? Clearly, the larger the sample the more it is going to reflect the population. You didn't even need your thinking cap to figure that out…right?

Here is something you need to understand though. The size of the sample may or may not be significantly related to its adequacy. Although size is a factor that may affect accuracy, a large sample, carelessly selected, may be biased and imprecise, whereas a smaller one, carefully selected, may be relatively unbiased and accurate enough to make satisfactory inference possible. Let me give you a good example. It is an old example but a great illustration of how poor sampling can screw everything up. When the now defunct Literary Digest drew its sample for the purpose of predicting the results of the 1936 presidential election, subjects were chosen from the pages of telephone directories and from official automobile registration lists. The prediction that Alfred Landon was going to be elected president proved to be wrong. I mean really wrong! Heck, he got beat by a landslide. It wasn't even close. A post-election analysis revealed that the sample was not truly "at random." Large numbers of voters did not own automobiles at the time and were not telephone subscribers, and consequently were not included in the sample. On top of that, the response rate was far from complete, and the population for which the prediction was made was not the population that was sampled. No wonder the prediction was totally inaccurate. This is probably one reason why the Literary Digest is now defunct.

Here is something else you need to write on your sleeve. If you don't know what the heck you are doing, don't do it. In other words, don't do something you know nothing about just because you have the opportunity to do it. Be careful. Having electrical tape doesn't make someone an electrician. Having a degree doesn't make someone a researcher. Now let's look at what we can do once we have a sample group.
Hypothesis Testing

Suppose a teacher buys a baseball bat from the Reliable Baseball Bat Manufacturing Co. The first time the bat is used it cracks. Should the teacher conclude that the Reliable Baseball Bat Manufacturing Co. makes inferior baseball bats? Suppose the teacher bought 10 baseball bats and five of them cracked the first time they were used. Would he then be justified in concluding that the Reliable Baseball Bat Manufacturing Co. makes inferior bats? Watch what you say here…you could get sued. Lucky for you there is no such company. I just made that name up. Of course, you want to be reasonably sure that the conclusion you are drawing is a valid one. Using statistical procedures to make generalizations or draw conclusions is one way of minimizing the possibility of drawing incorrect conclusions, and a good method for not getting sued. The process called hypothesis testing is an organized method for drawing conclusions and determining how likely they are to be true. The following are situations in which hypothesis testing could be used:

1. You could determine if a class improved its fitness level during a semester.
2. You could ascertain if an isometric or an isotonic training program is more effective in developing strength.
3. You could compare two teaching methods to see which is more effective in developing softball skills.
4. You could ascertain if sprinters get a faster start using the bunch start or the elongated start.
5. You could determine if a health class had fewer misconceptions about health at the end of a particular course.
6. Heck! There are a million and one things you could do with hypothesis testing. You could ascertain if Viagra will increase sexual endurance…I'll even be one of your subjects.

The need for a procedure to help answer such questions can be illustrated by a specific example. Suppose it is reported that the mean strength score on a 1-RM bench press of a group who participated in an isometric training program is 285 pounds and the mean of a group who participated in an isotonic training program is 294 pounds. Would you say that the effects of the two training programs were different? There are two possible explanations for the difference between the means of the two groups. One explanation is that the two training programs differ in the way they affect strength. The other is that chance factors, not the training programs, are responsible for any difference between the means. A variety of causes are grouped together and called "chance factors." One of these is individual variation from day to day. Another is errors in measurement caused by variations in measuring instruments or inaccuracies in reading the instruments. Another possible source of variation is sample selection. Suppose the groups participating in the two strength training programs were not approximately equal in mean strength scores before the programs began. For instance, suppose the mean of the isometric group was 283 pounds before the training program was implemented and the isotonic group's pre-training mean was 284 pounds. How could you interpret any differences in means after the training program? A teacher or researcher who plans to use hypothesis testing in answering a question like the aforementioned must try to minimize the influence of chance factors by careful selection of samples, by using accurate instruments, and by keeping the testing conditions as similar as possible for each subject from day to day.
General Procedure for Hypothesis Testing

Suppose a strength coach wanted to compare an isometric strength training program with an isotonic strength training program on squat strength. Let's say the coach believed that isometric training was at least as effective as isotonic training and hoped to show that it is actually superior to isotonic strength training. In order to test his theory he selected two groups which were both fairly large and had just about equal mean squat strength scores. In other words, the two groups were initially equal in reference to strength. Use your imagination here. He then had one group use an isometric training program and the other an isotonic training program for a number of months. At the end of the training period the mean strength for each group was again computed. Now let's say he found that the isometric group had a mean of 400 pounds on the squat and the isotonic group a mean of 390 pounds on the squat. Can the strength coach, by looking at the means, decide if the difference of ten pounds between the two means indicates that the isometric program was superior? Could the difference be due to daily fluctuations or due to some other chance factor? As I mentioned before, such a difference could be a product of chance. If, however, the coach examines the data by using a statistical procedure called a test of significance, he could determine whether the difference is probably due to chance factors or due to an actual difference in the two training programs. If the test indicates that the difference is probably due to chance factors, he would have to conclude that the training programs are about the same for developing squat strength. If the test shows that it is very unlikely that chance factors alone are
responsible for the difference between the means of the two groups, he would then be able to conclude that isometric training is superior in developing squat strength. Let me try to make this a little clearer. The coach's reasoning in this example can be divided into four steps. These steps are basic to all hypothesis testing and must be followed precisely.

1. Form a hypothesis about the population or populations. In this case, the coach is dealing with two populations…all people who participate in isometric training programs, and all who participate in isotonic programs. His initial assumption is that the two training programs are equally good. He will retain this assumption unless some evidence is produced which supports his belief that isometrics are really superior. The initial assumption is called the null hypothesis (HO) and is stated in terms of population means. If the two training programs are equally good, the mean of the isometric training population (M1) will be equal to the mean of the isotonic training population (M2). The null hypothesis is denoted in symbols as HO: M1 = M2. The coach's belief as to which program is superior is the basis of the alternate hypothesis (HA). If isometric training is better, the mean strength of the isometric training population (M1) should be greater than the mean strength of the isotonic training population (M2). This is denoted HA: M1 > M2. Both the null and the alternate hypotheses are statements about the populations and are formed before a sample is selected and the research is begun. By rule, the null hypothesis is always a statement that no difference exists in the populations. For example, the null hypothesis in this study would be: there is no significant difference between the mean scores of the isometric and isotonic training groups. The alternate hypothesis in this case is directional, a statement that one population mean will be greater than the other population mean. For example: there is a significant difference between the mean scores…yatta, yatta, yatta. An alternate hypothesis may also be non-directional, simply a statement that the two means differ. In the brief treatment in this text, however, only directional alternate hypotheses will be discussed, due to the author's (that's me) belief that the student (that's you) is more likely to encounter or wish to use them.

2. Use the data obtained from the samples to compute the statistic for the appropriate test of significance. In other words, collect your data for each group, and then use a statistical procedure to analyze it. This statistic, the result of the statistical procedure, is a measure of the difference between the samples.

3. Compare the computed statistic with established tables to determine how likely it is that a statistic of the obtained value would occur if the null hypothesis were true. Physical educators commonly use the .05 level of significance as the arbitrary dividing line between differences due to chance effects and those due to the effects of the treatments. This may be interpreted by saying that if the null hypothesis is true, a measure of the difference of the size obtained would occur by chance alone no more than 5% of the time. In other words, if you ran this exact study 100 times and the null hypothesis were true, you would expect to find a difference this large by chance in only about 5 of them.
4. Accept or reject the null hypothesis on the basis of the significance of the computed statistic. If the statistic is not significant at the .05 level (that is, it would occur due to chance more than 5% of the time when the null hypothesis is true), we accept the null
hypothesis. This means that the evidence is not sufficient to make us doubt that the population means are equal. If the difference is significant at the .05 level (it would occur due to chance less than 5% of the time if the null hypothesis is true), we reject the null hypothesis and accept the alternate hypothesis. In other words, the evidence makes us doubt that the population means are equal and makes us believe instead that one is greater than the other. A statistic significant at the .05 level would occur less than 5% of the time due to chance. (In other words, you would make a false conclusion by rejecting a true null hypothesis about 5 times in 100.) I know we have been through all this before, but just humor me. Some things are better mentioned more than once. Often it is desirable to reduce the possibility of this error. To this end, the .01 level of significance is sometimes used. Values significant at the .01 level occur by chance less than 1% of the time. This reduces the possibility of rejecting a true null hypothesis to 1 chance in 100.

Note that both the null and alternate hypotheses and the conclusion are statements about populations. Furthermore, the data on which we base our conclusion are obtained from a representative sample of the population. Also, we arbitrarily choose the point at which we differentiate between differences due to chance and differences in the populations due to treatments. Thus, hypothesis testing is a logical, orderly way of drawing conclusions which are very likely to be true, but no one can be absolutely certain that his conclusion is correct.
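The four steps map directly onto a few lines of code. Here is a minimal sketch of the coach's squat comparison as a two-sample t test, assuming SciPy is available; the individual scores are invented so that the group means come out to the 400 and 390 pounds used in the example.

from scipy import stats   # assumes SciPy is installed

# Hypothetical squat scores; means are 400 and 390 pounds, as in the text.
isometric = [405, 395, 410, 400, 390, 415, 385, 400]
isotonic  = [390, 385, 395, 400, 380, 395, 385, 390]

# Step 1: HO: M1 = M2 versus the directional HA: M1 > M2.
# Step 2: compute the statistic for the test of significance.
t, p_two_sided = stats.ttest_ind(isometric, isotonic)
p_one_sided = p_two_sided / 2   # halve it for a directional (one-tailed) test

# Steps 3 and 4: compare against the .05 level, then accept or reject HO.
if t > 0 and p_one_sided < 0.05:
    print("Reject HO: isometric training appears superior.")
else:
    print("Accept HO: the difference could well be due to chance.")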