CHAPTER V – BASIC CONCEPTS IN TEST EVALUATION
Objectives
After reading this chapter, the student should be able to:
1. Identify sound criteria for the evaluation and selection of tests in physical education
2. Describe and apply the steps involved in the construction of a motor performance test
3. Apply the principles of test administration in measuring motor performance
Key Terms
Validity: the degree to which a test measures what it was designed to measure.
Face validity: a test has face validity if it simply appears to measure the ability in question.
Logical validity: a term often used when a test obviously involves the skill or ability being evaluated.
Content validity: the degree to which a test measures what students have learned in a class.
Construct validity: the degree to which performances on a test correspond to the abilities or traits that the test purports to measure.
Criterion validity: the validity established when a test is used to predict future performance, or is used in place of another test that is perhaps longer or requires more elaborate equipment or facilities.
Reliability: the repeatability of test results.
Objectivity: a test has high objectivity when it yields the same results no matter who administers it.
Basic Concepts in Test Evaluation
The overall purpose of testing is to improve student learning. Tests and measurements provide the student and the teacher with valid information concerning student progress and attainment of the expected goals of the class. Testing should always be viewed as information to improve student achievement, with the understanding that success is simply the manipulation of error. It goes without saying that testing is extremely important. For assessments to be sound, the testing must be free of bias and distortion. Reliability, objectivity, and validity are three concepts that are important for defining and measuring bias and distortion. For physical education teachers to successfully evaluate tests that are available for use in their programs, they should have knowledge about the construction of tests. Included in the procedures for test construction are the criteria by which a test may be judged. This information will enable you to acquire confidence in your efforts to select the appropriate tests or even establish your own tests…just like the big boys do.
Validity
Validity refers to the degree to which a test measures what it was designed to measure. To the beginner in the field of tests and measurements, this concept may seem so basic that it scarcely deserves mentioning. Nevertheless, many tests are found to be rather weak in this most basic consideration. Maybe you have had an experience where you were given a test and wondered what in the heck it had to do with the class. I know I encountered this type of unnerving incident when I was in college. I won't tell you where…the University of Georgia. At the time, I was the number one ranked powerlifter in the world, and I was taking a weightlifting course designed to enhance muscular strength and muscular endurance. I ended up with a C in the course because the teacher gave a skills test that measured little, if anything, to do with weight training. He gave us a freakin' shuttle run test. The shuttle run doesn't measure muscular strength or muscular endurance…not even close. It may have a little to do with cardiovascular endurance, very little, but it is generally used to measure agility. Now, I don't want to brag, but of course you know I will. I could easily outperform everyone in the class on any muscular strength or muscular endurance test, but my grade didn't reflect that fact, because I did poorly on that one skills test that really wasn't measuring weightlifting ability in the first place. This is the kind of stunt teachers pull that will make even their best students go "postal" on them. In this instance, the teacher, in his zeal to employ objective measures, overlooked the testing procedure's main function…to provide a test, or a battery of tests, that measures what it is supposed to be measuring…HELLO! In addition to the teacher's misuse of the test, the test itself did not represent an accurate measure of the ability to lift weights. I should hasten to say that, in many cases, teachers use tests that demonstrate high validity but use them to measure the wrong skills, or give a test too much emphasis when it measures only specific and limited aspects of the course work. For a test to be valid, a critical analysis must be made of the nature of the activity it is to measure and the skills and special abilities that are involved. Now, all this goes to show you that just because you have a degree does not mean you are a teacher. How else could you explain ninety percent of the gym teachers in our school systems?
Since testing has different purposes, and since validity refers to the degree to which a test achieves its purpose, it follows that there are different types of validity. We will briefly describe four kinds of validity and the ways by which each may be established. Note that there may be considerable overlap among the different types of validity, depending on the purpose or rationale for using a particular test.
Face Validity
A test has face validity if it appears to measure the ability in question. Some measurement specialists do not like the term face validity; the term logical validity is often used instead when a test obviously involves the skill or ability being evaluated. A test that calls for a student to walk along a balance beam is obviously a test of dynamic balance. A test that requires a subject to lift a maximum weight in a particular exercise can be considered to have face validity as a test of strength, and so on. To put this in layman's terms: if a test looks like a duck, walks like a duck, and quacks like a duck…it's a duck. How do you know that it is a duck? Face validity. You have been taught your whole life what a duck looks like, sounds like, and walks like, and consequently, when you see one you know what it is…a duck. That is face validity. While face validity does not lend itself to any type of statistical treatment, it is an important concept that unfortunately is often overlooked by testers in search of highly objective measures. A tester should always be cognizant of face validity, since it is of paramount importance from the student's point of view. More will be said about face validity as we discuss the other concepts of validity as well as the other functional criteria for judging the worth of a test.
Content Validity
I already mentioned content validity in the example above, but just in case you were daydreaming at the time, I will go over it again with you in a little more detail. So pay attention this time…darn it! Maybe you have had an experience where you were given a test and wondered where in the heck the questions came from. I encountered this type of unnerving incident when I was in college too. I won't tell you where…the University of Georgia. I studied for this psychology test like there was no tomorrow. I read the chapters at least three times, went over my notes six or seven times, and I attended every class. I was so prepared for that test I was sure I was going to get a flat-out 100. When the teacher handed me the test, I thought I was in the wrong class. I can tell you this: the test he gave seemed to pertain to some course other than the one in which I was enrolled. I don't know where the heck he got the questions, but they certainly didn't come from the book or anything he lectured about. In this case the test failed miserably in content validity, because its material was not related to the material covered in the course. In other words, the test wasn't valid…it did not measure what it was supposed to measure, content-wise. Whether a test possesses content validity is a subjective judgment. Content validity is, however, fundamental for tests that aim to measure what students have learned in a class. It must be kept in mind that a test represents a sampling of items from a so-called universe of possible test questions. Unfortunately, it sometimes becomes a guessing game on the part of the students, who hope that the teacher will select a sample of items that corresponds to what they have been taught and are studying. A big difference between the test and the students' expectations may indicate either poor communication with respect to the course objectives or poor test construction. Content validity applies to physical performance tests as well as to written tests. I had a wonderful experience with this too when I was in college. I won't tell you where…the University of Georgia. I was in a golf class, and 90% of the instruction dealt with driving and fairway play. When we got our skills test, about 90% of it dealt with chipping and putting. Now, if 90% of the instruction concentrated on driving and fairway play, the skills test should devote approximately 90% of its items to assessing those skills if it is going to demonstrate any semblance of content validity; a simple check of this idea is sketched below. It only stands to reason that the face validity of the test items should serve to reinforce content validity in skills testing. Of course, some instructors have major issues with reasoning ability, or so it would seem. Now, I am not going to tell you where I got my Doctorate…the University of Georgia. Don't you dare tell anyone…it will be our little secret.
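One informal way to run that check is to build a simple table of specifications, comparing how instruction time was spent against how the test items are weighted. The short Python sketch below uses made-up numbers based on the golf example; the activity labels and percentages are illustrative assumptions, not data from any actual course.

```python
# Hypothetical table-of-specifications check for content validity.
# Compares the proportion of instruction time devoted to each skill
# with the proportion of the skills test devoted to that skill.

instruction = {"driving/fairway play": 0.90, "chipping/putting": 0.10}
test_weight = {"driving/fairway play": 0.10, "chipping/putting": 0.90}

for skill in instruction:
    gap = test_weight[skill] - instruction[skill]
    print(f"{skill}: taught {instruction[skill]:.0%}, "
          f"tested {test_weight[skill]:.0%}, gap {gap:+.0%}")

# Large gaps (the +80% on chipping/putting here) signal weak content
# validity: the test samples a different "universe" of items than
# the one the students were actually taught.
```

In a real course you would list every instructional unit, not just two, but the principle is the same: the weights in the test should mirror the weights in the teaching.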
Criterion Validity
Sometimes a test is used to predict future performance, or it is used in place of another test that is perhaps longer or requires more elaborate equipment or facilities. The terms predictive validity and concurrent validity have also been used to identify these concepts. Both are based on a criterion that is considered to represent the performance or characteristic in question. The SAT examination is an example of a test that attempts to predict success in college; in this case, success in college would be the criterion. When I was in professional baseball they used running speed, bat velocity, and arm velocity to predict the success that a player would have in the major leagues. Obviously, the identification of the proper criterion is of critical importance. As a general rule, the use of more than one criterion is recommended. As indicated, many times tests that are valid require sophisticated equipment, trained administrators, and a lot of time to administer. Consequently, it can be a tremendous advantage for a physical educator to be able to use a shorter, less elaborate, or less rigorous test. For example, suppose that you wanted to measure cardiorespiratory fitness. It is generally acknowledged that the most valid test for this component of fitness is maximal oxygen consumption (MOC). Unfortunately, MOC requires expensive equipment for the necessary gas analysis, considerable time for individual testing, trained administrators, and subjects who are willing to exercise maximally. Therefore, if you had a test that measured basically the same thing as MOC, but one that was easy to administer and could be given to a large group of students, you would be better off…way better off. In street terms, you would be home free. Well, I have good news for you: they have such a test. Actually, they have a number of such tests. How did they get them? I am glad you asked. First, researchers devised tests that were less sophisticated than MOC, but ones they felt measured the same thing that MOC was measuring…cardiovascular fitness. What do you think they did next? They simply tested subjects on both MOC and the tests they had designed, and then they ran a correlation between the two. If the correlation was high, they made the assumption that the two tests were measuring the same thing and that they could be used interchangeably. Remember, we talked about this before in Chapter II…sure you do. In other words, they employed MOC as the criterion. I know what you are thinking now; I always know what you are thinking…so watch yourself. You are thinking: "How did they know that MOC measured cardiovascular fitness when there was no initial criterion to correlate it against?" That is really good thinking…so if you weren't thinking that, make believe you were. I'll tell you how they did it…they used judges' ratings in order to establish face validity. They got a number of experts in the field of cardiovascular fitness together. Then they got a large number of
subjects together. Once they had the subjects, they had them engage in activities that required cardiovascular fitness. While the subjects performed these activities, the judges rated their performance independently. Once they completed their independent ratings, the judges put their evaluations together and formed a hierarchy of ratings. In other words, the subject the judges rated highest got the number one ranking, the second-best subject was given the second-highest ranking, and so on and so forth. Once the rankings were completed, they tested all of the subjects on MOC and then ran a correlation between MOC and the judges' ratings. What they found was that the judges' ratings correlated almost perfectly (r = .99) with MOC, meaning that the subject who was ranked number one did the best on the MOC test, the subject who was ranked number two did second best, and so on and so forth. Obviously, it was not a perfect correlation, so a few of the subjects' rankings were not identical with their performance on MOC, but it was very close, and unlike in horseshoes, close does count here. This brings up a good question: what is an acceptable validity coefficient when one is calculated? Actually, there are no strict standards. Most experts in the field of tests and measurements, however, believe that if a test is being used as a substitute for a more sophisticated test, coefficients of .90 or larger are desirable, but values exceeding .85 are somewhat acceptable. Now, I have actually heard some individuals in the field, that's right, experts, say that in some situations values of .50 or .60 may be acceptable. They contend that when a predictive test is needed and none of the tests evaluated has a high validity coefficient, a test with a validity coefficient of .50 or .60 is better than nothing, and thus acceptable. "Is better than nothing and thus acceptable?" To me that is an absolutely ridiculous statement. That is like saying, "Well, we don't have a test to measure strength, but MOC, a cardiovascular fitness test, correlates .50 with strength, so let's use MOC to measure strength…it is better than nothing." I am being a little facetious here…not even my professor at Georgia would be that dumb. Then again, he did use a shuttle run test to measure strength. Can you tell that I hated that guy? Let me try harder here…I hate that guy, and hate is a strong word. In other words, according to these so-called experts, using a test that measures something other than what they are trying to evaluate is okay if they don't have anything else. Think about that for a second. How would you like to train for three months developing strength and then be given a cardiovascular test or a shuttle run test to measure your strength because…well, it is better than nothing? You don't have to answer that question. I wonder where these experts got their degrees…the University of Georgia? Now let me explain something to you. The square of the correlation coefficient indicates the amount of variance common to both tests and, thus, the degree to which the two tests measure the same thing…remember that from Chapter II? So a validity coefficient of .50 squared gives 25% (.50 × .50 = .25) common variance. In this case, r² = .25 means that only 25% of the variance of the two tests is in common, and 75% of the variance must be accounted for by other factors. And you are going to use a test like that to evaluate your students?
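To make the criterion procedure concrete, here is a minimal sketch of the whole calculation: correlate the candidate field test against the MOC criterion, then square the coefficient to see how much variance the two tests share. All of the scores below are fabricated for illustration only.

```python
from statistics import correlation  # Pearson's r; Python 3.10+

# Invented scores: an easy-to-give field test versus the MOC criterion.
field_test = [41, 38, 52, 47, 35, 55, 44, 49]   # field-test scores
moc        = [42, 37, 55, 50, 36, 58, 45, 51]   # ml/kg/min from gas analysis

r = correlation(field_test, moc)   # the validity coefficient
print(f"validity coefficient r   = {r:.2f}")
print(f"common variance      r^2 = {r * r:.0%}")

# By the rule of thumb above, the field test can stand in for MOC
# only if r comes out around .90 or better; at r = .50 the two tests
# share just 25% of their variance (.50 x .50 = .25).
```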
Hell with the validity coefficient; how about a little common sense? If your objective is to measure strength, then you need a test that measures strength…DUH! And if you don't have one, then don't measure it, or at least use a form of face validity to evaluate your students. It is a lot fairer to do that than to test them with an instrument that doesn't measure what it is supposed to measure. I am glad I got that off my chest. Here is something else you need to know: a test could be valid, but if used in the wrong situation it could be invalid. I think that goes without saying, but I am not sure with you guys, so I will say it anyway. Take the case previously mentioned, where MOC was used to evaluate strength. Maximal oxygen consumption is without question a valid test for measuring cardiovascular fitness, but it is not a valid test for measuring strength. Consequently, in that situation MOC is an invalid measure.
Again, for a test to be valid it has to measure what it is supposed to measure in that particular situation. Moving right along: as I previously mentioned, the degree of association of a measure with the criterion is usually evaluated statistically by a correlation. In Chapter II I also mentioned that you need to be cautious, because a correlation does not imply causation. A test may correlate with the criterion, but the reason may not be apparent. It could well be that another factor, or several factors, are associated with both variables and are the cause of the relationship. A former colleague who had been a sports psychologist for the Kansas City Royals for many years maintained that he could devise a psychological test that would correlate as high as, or higher than, some of the physical skills tests that were being used to predict baseball performance. Although he never followed through with this claim, he was undoubtedly basing it on the fact that there are probably some underlying psychological traits that are necessary to be successful in baseball. Another example is the Shape-o-ball test, which consists of placing different-shaped objects into a ball that has openings corresponding to the various shapes. The test is administered to children 6 years of age or younger. The designers of the test reported that it correlated very highly with reading ability. I am proud to say that I did extensive research on the Shape-o-ball. The correlations I got were very low in comparison to what the designers were reporting. Actually, the only thing I succeeded at with my research was driving the poor little kids loony trying to get those freakin' objects into the corresponding holes. You have heard the expression about putting a square object into a round hole. Well, by the time I completed my studies, some of these kids were actually doing that, which drove me loony. In this case, the choice-response factor that is probably being measured by the Shape-o-ball test is also a factor in reading ability, and this might account for the correlation that the designers got…or maybe they just lied like hell. Still, the test completely fails with respect to face validity and content validity and would likely not be accepted by experts in linguistics as a measure of reading ability. Face validity should never be ignored in test construction, and the potential user of a test must not be totally concerned with (or misled by) statistical evidence of validity.
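The Shape-o-ball story is easy to reproduce in miniature. In the hedged sketch below (all numbers invented), a third factor, maturation (age), drives both shape-sorting scores and reading scores, so the two measures correlate even though neither causes the other.

```python
import random
from statistics import correlation

random.seed(1)

# Hypothetical confounder: age in months drives both measures.
ages = [random.uniform(48, 84) for _ in range(200)]  # 4- to 7-year-olds

# Neither score depends on the other; each depends only on age plus noise.
shape_score   = [0.5 * a + random.gauss(0, 3) for a in ages]
reading_score = [0.8 * a + random.gauss(0, 4) for a in ages]

r = correlation(shape_score, reading_score)
print(f"r between shape sorting and reading = {r:.2f}")

# r comes out high even though stuffing blocks into holes does not
# cause reading ability; maturation accounts for both, which is why
# a high validity coefficient alone proves nothing about causation.
```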
Construct Validity
The term construct validity refers to the degree to which performances on a test correspond to the abilities or traits that the test purports to measure. Construct validity can be established in different ways, but when done methodically it is typically based on testing the theory that underlies the parameter. For example, the LSU Fitness Test consists of pulse counts taken before exercise, immediately after exercise, and 1, 2, and 3 minutes into recovery. The test is based on the premise that these pulse measures reflect differences in cardiovascular fitness; that is, a person with a high degree of cardiovascular fitness generally has a lower pulse rate at rest and after a standard workout, and his or her pulse recovers faster after exercise. Consequently, if the test is valid, conditioned individuals should have a lower profile of pulse counts than unconditioned subjects. An experimental study could be set up to establish construct validity based on the underlying theory as follows: first you would give all of the subjects the step test and then assign them to groups. One group would serve as your control. The subjects in the control group have it easy…they don't undergo any conditioning. The other subjects will serve as your experimental groups. These guys will engage in a physical conditioning program. If you want, you could include another experimental group in which the
subjects would engage in an exercise program of greater or lesser intensity or duration. The step test can be said to demonstrate construct validity if it reflects the expected gains in cardiovascular fitness as a result of the conditioning programs. The degree to which it can distinguish between levels of fitness may be evaluated by whether it reveals a difference between the two exercise programs. Another method by which construct validity can be estimated is by correlating the test with another test of its kind that is known to be valid; a cardiovascular test, for example, could be expected to correlate with other valid cardiovascular tests. Remember, I just explained that to you when we were talking about criterion validity. Construct validity may also be appropriate, however, when no measures of the ability in question are widely accepted as valid tests. An example might be perceptual motor ability, for which there is no acknowledged criterion measure. It would be logical, then, for a tester to examine the assumptions underlying perceptual motor performance and seek to evaluate his or her test in light of these assumptions.
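Here is a hedged sketch of the construct-validity experiment just described, assuming invented post-training step-test numbers and the scipy library for the significance test. If the test measures the construct, the conditioned group should show the lower pulse profile that the theory predicts.

```python
from statistics import mean
from scipy.stats import ttest_ind  # independent-samples t-test

# Invented post-program step-test results: total pulse counts summed
# across the counting periods (lower = better cardiovascular fitness).
control     = [342, 355, 337, 360, 348, 351, 344, 358]  # no conditioning
conditioned = [310, 318, 305, 322, 312, 316, 301, 319]  # training program

t, p = ttest_ind(conditioned, control)
print(f"control mean     = {mean(control):.1f} beats")
print(f"conditioned mean = {mean(conditioned):.1f} beats")
print(f"t = {t:.2f}, p = {p:.4f}")

# A significantly lower pulse profile for the conditioned group supports
# construct validity; adding a third group that trains at a different
# intensity tests whether the step test also separates levels of fitness.
```

A real study would of course use measured pulse counts from pre- and post-tests rather than invented numbers, but the logic is exactly the one described above: the test is valid to the extent that it behaves the way the underlying theory says it should.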