Developing Essay Questions

A lot of teachers have the idea that essay questions are easy to construct but a bear to grade. Well, as usual, they are wrong! Essay questions, when constructed properly, are not exactly a walk in the park. Asking a student simply to list the parts of the heart or to give the definition of important words in a chapter is not an essay question. There is not a specific category for such questions, but they fit the short-answer category more nearly than the essay category. Essay questions are supply-type, constructed-response questions designed to measure the higher-level cognitive skills of the student. With this type of testing the student may be asked to apply, organize, synthesize, integrate, and evaluate material that has been taught, while at the same time providing a clear and concise measure of writing skills.

As with other forms of assessment, essay tests or questions should be aligned with the objectives and instruction for the class. It goes without saying that instruction should prepare students for essay questions. If a student is asked to compare and contrast material on an essay question and the teacher has not taught what those terms mean, there is a good chance that the student's response will reflect his understanding and interpretation of the terms rather than his ability to demonstrate the higher-level skills involved. In other words, if the teacher is going to use this method of testing, instruction should prepare students for essay questions.

Since a substantial amount of time is necessary to answer and score essay questions, and because only a limited amount of material can be covered, it is probably best to use essay questions only when other types of questions cannot measure accomplishment of the objectives. They are particularly useful when there is little time to prepare the assessment but more time in which to grade it. One of the major advantages of essay questions is that they eliminate the possibility of students guessing the correct answer. On the flip side, essay questions are rather time consuming to grade in comparison to other testing procedures. There is also the fact that, for students to respond to an essay question, they have to have some semblance of writing skills. Fairly or unfairly (depending upon the objectives of the class), students could be penalized because of handwriting, spelling, grammar, neatness, vocabulary, sentence structure, and organization, even though they know the material. One way to address this issue is to give separate grades for essay content and writing skills.

There is also the risk that grading of essay responses can be subjective and unreliable. The concept of reliability is commonly applied to the results of tests and measurement instruments. When consistent results can be obtained with an assessment, we say that the instrument is reliable. In the case of essay questions, reliability depends on the scoring of the question. For the scoring to be reliable, there should be consistency across scoring occasions. In other words, if the teacher read the essays on two different occasions, he should come up with basically the same scores. If the scoring is not reliable, then obviously it cannot be valid. There is also the point of objectivity. For the scoring to be objective, there should be consistency among scorers: two individuals independently scoring the same set of papers should arrive at the same scores. There is also the risk of bias in grading.
The lack of validity in scoring may be evident in a teacher's awarding higher grades to students who have a history of performing at higher levels when their answers do not justify the better marks. The order in which papers are graded can also have an impact on the grades that are awarded. A teacher may grow more critical or more lenient after having read several papers, so the early papers receive lower or higher scores than papers of similar quality that are scored later. Also, if the scorer becomes tired, judgment can be affected. And when controversial issues are being addressed by the students in an essay question, the teacher must be careful to evaluate the essay based on merit, not on the position taken by the student. It should be noted that a number of scoring rubrics for essay tests are available online. One of the best rubrics was designed by Thomas Brookhart in 1999. The rubric is based on three criteria: thesis and organization, content knowledge, and writing style and mechanics.
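To make the idea of an analytic rubric concrete, here is a minimal sketch, in Python, of how the three criteria named above might be laid out and totaled. The point allocations and the scoring helper are illustrative only; they are not taken from Brookhart's rubric.

    # Hypothetical point allocations for the three criteria named above.
    # These weights are illustrative, not Brookhart's actual values.
    max_points = {
        "thesis and organization": 8,
        "content knowledge": 8,
        "writing style and mechanics": 4,
    }

    def score_essay(awarded):
        """Total the points awarded per criterion, capped at each criterion's maximum."""
        return sum(min(awarded.get(criterion, 0), cap)
                   for criterion, cap in max_points.items())

    # One student's essay, scored criterion by criterion: prints 16 (out of 20).
    print(score_essay({"thesis and organization": 6,
                       "content knowledge": 7,
                       "writing style and mechanics": 3}))

Keeping the criteria separate in this way also makes it easy to report content and writing-skill grades separately, as suggested earlier.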
General Guidelines for Writing Essay Questions

Guidelines suggested for writing essay items include the following:

1. Use essay questions only to measure complex learning outcomes. Knowledge that rests mainly on rote memory profits little from being measured by essay questions. Such outcomes can usually be measured more effectively by objective items, which lack the sampling and scoring problems that essay questions present.
2. Keep in mind the reading and writing levels of the students you are testing.
3. Construct questions that are explicit and easily addressed. Two or more shorter, more specific questions are preferable to a single longer question.
4. Suggest a reasonable time or page limit for each essay question.
5. Decide in advance what you are seeking in the answer. Write a model response and/or develop a scoring system that identifies the information you will reward in the answer.
6. Provide ample time for answering.
General Guidelines for Scoring Essay Questions

Guidelines suggested for scoring essay items include the following:

1. Review the text and your class notes; list the main points to be covered in the essay response.
2. Develop a model answer first to determine what you are looking for.
3. Evaluate all of the students' answers to one question before proceeding to the next question. It is a good idea to randomly rearrange the order of the papers after scoring each item.
4. To control for personal bias, evaluate answers to essay questions without knowing the identity of the writer.
5. Whenever possible, have two or more persons grade each answer (see the sketch below). The best way to check on the reliability and objectivity of the scoring of essay answers is to obtain two or more independent judgments. Although this may not be feasible for routine classroom testing, it might be done periodically with a fellow teacher who is equally competent in the area.
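As a rough illustration of guideline 5, the short Python sketch below compares two graders' scores on the same set of essays using a simple correlation. The scores are hypothetical, and the choice of statistic is mine; the text itself only calls for obtaining two or more independent judgments.

    from statistics import correlation  # available in Python 3.10 and later

    # Hypothetical scores that two graders independently gave the same six essays.
    grader_1 = [18, 14, 20, 9, 16, 12]
    grader_2 = [17, 15, 19, 10, 14, 12]

    # A value near +1 suggests the two graders rank the essays similarly;
    # a low value suggests the scoring guide needs to be tightened up.
    print(f"Inter-rater correlation: {correlation(grader_1, grader_2):.2f}")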
Item Analysis

Obviously, the primary objective of any assessment is to ascertain students' knowledge of the subject matter and to determine what they can do. Assessment also provides feedback on your teaching effectiveness, your teaching skills, and the validity of the testing instrument. In order to determine the validity and effectiveness of your testing instruments, you need to know how to analyze your tests. The best method for determining the effectiveness of the questions you are presenting to your students is an item analysis. With an item analysis you can see which questions are being answered correctly and which questions the students are having problems with. You can also tell which distractors are being chosen and which ones are not doing their job. With this information you can get a better idea of which questions are differentiating among students and which questions are not. You can then edit the questions so that they are more relevant to the material. With the item analysis you can also immediately determine which standards the students are having difficulty with and design review and subsequent quizzes to emphasize the areas where the students are weak.

Item difficulty is one form of item analysis. After quizzes or tests have been scored, the percentage of students who gave the correct answer to each question is called the item difficulty (the p value, in technical terms); it can vary from zero to 100%. Item difficulties can and should be calculated and analyzed for all types of questions, not just true-false and multiple-choice items. Let's look at our first example.

1. The most valid test to measure cardiovascular fitness is:
A. 12 minute walk/run test
B. Maximum Oxygen Consumption
C. 1.5 mile walk/run test
D. 1 mile walk/run test
E. None of them

Question covers objective(s): 19 and 20
ITEM ANALYSIS, QUESTION 1
Number answering correctly: 26
Item difficulty index (% of students answering correctly): 72.2%

DISTRACTOR ANALYSIS (Multiple-choice Items), Question 1
A: 8 (22.2%)    B: 26 (72.2%)    C: 0 (0%)    D: 2 (5.5%)    E: 0 (0%)
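The figures in the table above are easy to compute by hand, but if the responses are already in a file or spreadsheet, a few lines of Python will do it. This is a minimal sketch: the response list below is hypothetical (arranged to reproduce the counts for question 1, with B as the keyed answer), and in practice the data would come from your own grade book.

    from collections import Counter

    # Hypothetical responses for 36 students on question 1; the keyed answer is B.
    responses = ["B"] * 26 + ["A"] * 8 + ["D"] * 2
    key = "B"

    n = len(responses)
    correct = sum(1 for r in responses if r == key)
    print(f"Item difficulty: {100 * correct / n:.1f}%")   # 72.2%

    # Distractor analysis: how many students chose each alternative.
    counts = Counter(responses)
    for option in "ABCDE":
        chosen = counts.get(option, 0)
        print(f"{option}: {chosen} ({100 * chosen / n:.1f}%)")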
In the example above, 72.2% of the students answered the question correctly. Is that a good percentage for a question to be answered correctly? Good question. I am glad you asked it. Sometimes teachers believe that it is all right for some questions to be quite difficult so that they differentiate the better students from the others. This will depend on the grading philosophy of the teacher. If the teacher is using a norm-referenced interpretation of scores (grading on the curve), the teacher may find it desirable to have some items that are more difficult and answered correctly by only a few students. This allows the teacher to differentiate between students when assigning grades. Oosterhof (2001) presents some item difficulties to use as a guide if the teacher wants to gain the maximum information about student achievement, but he emphasizes that these are only guidelines: alternate-choice questions, 85%; multiple-choice questions with three alternatives, 77%; four alternatives, 74%; five alternatives, 69%; and short-answer or completion questions, 50%. Ebel (1986) proposes that the ideal difficulty for a forced-choice question should be halfway between a perfect score (100%) and what could be obtained by guessing. For a true-false question, where guessing yields 50%, that works out to 75%; for a multiple-choice question with four choices, it would be halfway between 100% and 25%, or 62.5%.

2. Cardiovascular fitness involves:
A. Heart and circulatory and respiratory systems
B. Heart and reproductive and respiratory systems
C. Circulatory, respiratory and skeletal systems
D. Heart and muscular and digestive systems
E. None of the above

Question covers objective(s): 13, 15, 19 and 20
ITEM ANALYSIS, QUESTION 2
Number answering correctly: 33
Item difficulty index (% of students answering correctly): 91.6%

DISTRACTOR ANALYSIS (Multiple-choice Items), Question 2
A: 33 (91.6%)    B: 0 (0%)    C: 1 (2.7%)    D: 0 (0%)    E: 2 (5.5%)
In the example above, 91.6% of the students got the answer correct. This would seem to indicate that the question is too easy and that it does not differentiate the good students from the others, unless, of course, you are teaching at a school like Harvard University, where the population is skewed toward the high end of intelligence.
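Ebel's rule of thumb described above is easy to turn into a quick check. The sketch below (my own helper, expressing only the rule stated in the text) computes the "ideal" difficulty for an item with a given number of alternatives, which can then be compared against the observed item difficulty.

    def ideal_difficulty(num_alternatives):
        """Halfway between a perfect score (100%) and the score expected from blind guessing."""
        chance = 100 / num_alternatives
        return (100 + chance) / 2

    print(ideal_difficulty(2))   # true-false: 75.0
    print(ideal_difficulty(4))   # four alternatives: 62.5
    print(ideal_difficulty(5))   # five alternatives: 60.0

    # Question 2 above came in at 91.6%, well above these targets,
    # which is another sign that the item is too easy.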
3. The path of deoxygenated blood returning to the heart is:
A. Vena Cava, Left Atrium, Left Ventricle, Lungs, Right Atrium, Right Ventricle, Aorta
B. Aorta, Right Atrium, Right Ventricle, Lungs, Left Ventricle, Left Atrium, Vena Cava
C. Vena Cava, Right Atrium, Right Ventricle, Lungs, Left Atrium, Left Ventricle, Aorta
D. Vena Cava, Right Ventricle, Left Ventricle, Lungs, Right Atrium, Left Atrium, Aorta
E. None of the above

Question covers objective(s): 13, 15, 19 and 20

ITEM ANALYSIS, QUESTION 3
Number answering correctly: 10
Item difficulty index (% of students answering correctly): 27.7%

DISTRACTOR ANALYSIS (Multiple-choice Items), Question 3
A: 10 (27.7%)    B: 6 (16.6%)    C: 15 (41.6%)    D: 1 (2.7%)    E: 4 (11.1%)
In the above example, only 27.7% of the students answered the question correctly, which is an indication of a problem. Perhaps the question is not well written, or it is misleading, which may have caused students to select an incorrect answer. Another possible explanation is that the material was not covered as thoroughly as the teacher thought or had wanted to cover it. There is that third possibility that the students are intellectually constipated, didn't study, or all of the above. Discussion of the question with the students can probably clarify the situation. If students do not know the content, further instruction or review is probably in order. For questions in which students construct their responses (short answer, completion, essay, etc.), examining the incorrect answers given by the students may also be helpful in determining the problem.

4. Stroke volume is the amount of blood the heart pumps per minute. _______ True or False

Question covers objective(s): 13, 15, 19 and 20
ITEM ANALYSIS, QUESTION 4
Number answering correctly: 28
Item difficulty index (% of students answering correctly): 77.7%
Remember that each student has a 50/50 chance of answering a true-false question correctly by guessing. If the students simply did not know the material, approximately half would select the correct answer by chance alone, resulting in an item difficulty close to 50%.

In addition to the item difficulty index (the percentage of students answering a question correctly), multiple-choice questions should be examined using a distractor analysis to determine the effectiveness of the various distractors that were provided. As you have probably noticed, a distractor analysis was given for the preceding examples. As was mentioned in relation to deciding what an acceptable level of item difficulty is, it is necessary to consider the philosophy of the teacher when evaluating the effectiveness of the individual distractors. If the teacher uses criterion-referenced grading, with the intent that all students achieve mastery, there is less pressure to make sure students select all of the distractors on a fairly equal basis than if the teacher grades on the curve (norm-referenced evaluation) and wants scores to be distributed more widely. Using example one again, look closely at the distractor analysis.

1. The most valid test to measure cardiovascular fitness is:
A. 12 minute walk/run test
B. Maximum Oxygen Consumption
C. 1.5 mile walk/run test
D. 1 mile walk/run test
E. None of them

Question covers objective(s): 13, 15, 19 and 20
ITEM ANALYSIS, QUESTION 1
Number answering correctly: 26
Item difficulty index (% of students answering correctly): 72.2%

DISTRACTOR ANALYSIS (Multiple-choice Items), Question 1
A: 8 (22.2%)    B: 26 (72.2%)    C: 0 (0%)    D: 2 (5.5%)    E: 0 (0%)
Note that no one selected option C or E. Obviously, these distractors are worthless…the kind of answers that even Mr. Potato Head and Mike Tyson would know straight away are not credible. Remember, all of the distractors should be plausible. Remember also what we said: two distractors are as effective as three if one of the three is not plausible. In other words, if you have a distractor that is highly unlikely to be selected, you might just as well use two distractors. If one or more distractors are not chosen, as occurs in this example, the unselected distractors probably are not plausible. If the teacher wants to make the test more difficult, those distractors should be replaced in subsequent tests.

Of all the examples I presented, question three is by far the best one. Each of the distractors was selected, and the question did seem to differentiate the good students from the others. Even with this question, though, distractor D, which was selected by only 2.7% of the students, should be reworked.
ITEM ANALYSIS, QUESTION 3
Number answering correctly: 10
Item difficulty index (% of students answering correctly): 27.7%

DISTRACTOR ANALYSIS (Multiple-choice Items), Question 3
A: 10 (27.7%)    B: 6 (16.6%)    C: 15 (41.6%)    D: 1 (2.7%)    E: 4 (11.1%)
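If you want a quick, automated pass over a question like this, the sketch below flags distractors that almost no one selects. The counts are those reported for question 3 in the table just above; the 5% cut-off is a hypothetical rule of thumb of my own choosing, since the text itself only points out that a distractor nobody (or almost nobody) picks is not plausible.

    n_students = 36
    key = "A"                                   # keyed answer for question 3
    counts = {"A": 10, "B": 6, "C": 15, "D": 1, "E": 4}

    for option, chosen in counts.items():
        if option == key:
            continue
        pct = 100 * chosen / n_students
        if pct < 5:                             # hypothetical cut-off; adjust to your own philosophy
            print(f"Distractor {option}: chosen by {pct:.1f}% of students - consider reworking it")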
It is also not desirable to have one of the distractors chosen more often than the correct answer, as occurred with question 3 above, where distractor C (41.6%) was selected more often than the keyed answer A (27.7%). This result indicates a potential problem with the question. Distractor C may be too similar to the correct answer, and/or there may be something in either the stem or the alternatives that is misleading.
Item Discrimination Index

Since you love statistics so much, we can go one step further and compute an item discrimination index. This will enable you to determine whether a question discriminates appropriately between lower scoring and higher scoring students. When students who earn high scores are compared with those who earn low scores, we would expect to find more students in the high scoring group answering a question correctly than students from the low scoring group. That just makes sense…right? In the case of very difficult items which no one in either group answered correctly, or fairly easy questions which even the students in the low group answered correctly, the numbers of correct answers might be equal for the two groups. What we would not expect to find is a case in which the low scoring students answered correctly more frequently than students in the high group. This usually occurs when the low scoring students steal the freakin' test.

The item discrimination index (labeled D in the table below) can vary from -1.00 to +1.00. A negative discrimination index (between -1.00 and zero) results when more students in the low group (you know, the ones who stole the test) answer a question correctly than students in the high group. A discrimination index of zero means equal numbers of high and low students answered correctly, so the item did not discriminate between the groups. A positive index occurs when more students in the high group answer a question correctly than students in the low group. If the students in the class are fairly homogeneous in ability and achievement, their test performance is also likely to be similar, resulting in little discrimination between high and low groups. Still, because we are taking the extremes (I will explain that in a moment) of lows and highs, there usually is a difference if the question is valid. Questions that have an item difficulty of 100% or 0% need not be included when calculating item discrimination indices. An item difficulty of 100% indicates that everyone answered correctly, while 0% means no one answered correctly. As already indicated, these types of questions do not differentiate between students, so they serve no purpose here; they simply confirm what the teacher probably already knows.

Okay, are you ready for this now? Put on your thinking cap. That's right, your Yankee cap…thank you! The generally accepted procedure in analyzing a test for item discrimination is to sort the test papers from the lowest score to the highest; in other words, put the scores into an array. Then identify two equal groups using the highest and lowest scores, with intermediate scores not being used. This means that the entire test must be graded before the high and low groups can be determined. Duh! Using natural breaks in the scores, select approximately one quarter of the papers at the highest end of the frequency distribution and an equal number beginning with the lowest score and counting upward. For example, if you have 100 papers and the lowest score on the test is 30, start there and count out 25 tests (1/4 of 100 is 25); then go to your highest score and count out 25 papers in the other direction. Once the two equal groups are selected, place their tests in two stacks, one for high scores and the other for low scores. Then examine the responses to each question. First, determine how many students in the top or high group (H) answered question 1 correctly.
Next, see how many students in the bottom or low group (L) answered the same question correctly. Subtract the number of correct responses in the low group from the number of correct responses in the high group. Then divide the difference by the number of students in each group…not the sum of students in both groups. In the example below, there are 25 students in the high group and 25 in the low group (with 50 students between the two groups who are not counted). For question 1, all 25 students in the high group answered the question correctly, while only 10 in the low group answered it correctly. Consequently, if we subtract the correct responses in the low group from the correct responses in the high group, we get 15 (25 - 10 = 15). The last step is to divide the 15 by the number in each scoring group (25) to yield +.60 (15 / 25 = .60). Simply looking at the numbers of correct answers in the two groups (25 in the high group and 10 in the low group) indicates that the question discriminates in the desired direction, and our positive index (+.60) verifies this.

Question    H (high group correct)    L (low group correct)    H - L    D = (H - L) / 25
1           25                        10                        15      +.60
2           20                        10                        10      +.40
3           15                        18                        -3      -.12
For question 2, 20 students in the high group answered correctly and 10 in the low group did. Dividing the difference of 10 by the number in each group (25) gives a result of +.40. Both questions one and two provide results in the expected direction: we would expect more of the students with high scores to answer correctly than students with low scores. When this situation occurs, we have a positive index. Question 3, however, shows an unexpected result, with more of the low scorers getting the question right than high scorers. Note that when we subtract the low group's correct responses from the high group's, the result is a negative number, and when we divide by the number in each group, we get a negative index (-3 / 25 = -.12). This is a cue that there may be a problem with the way the question was presented on the test or the way the material was taught (or not taught)…or the low group swiped the freakin' test.
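Finally, here is a minimal sketch of the discrimination calculation just described: take equal high and low groups (roughly the top and bottom quarters of the papers) and compute D as the difference in correct answers divided by the number of students in each group. The function name is my own; the numbers are taken from the worked example above.

    def discrimination_index(high_correct, low_correct, group_size):
        """D = (H - L) / n, where n is the number of students in each extreme group."""
        return (high_correct - low_correct) / group_size

    print(discrimination_index(25, 10, 25))   # question 1:  0.6
    print(discrimination_index(20, 10, 25))   # question 2:  0.4
    print(discrimination_index(15, 18, 25))   # question 3: -0.12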