Assessment Chapter 11


Improving Teacher-Developed Assessments
Zofia Drotlef, Tami Karakasis & Carmela Arfuso


Overview • There are two general improvement strategies: - Judgemental item improvement: use human judgement, your own and others', to improve assessments - Empirical item improvement: use student responses to improve assessments • Ideally, using both improvement procedures is most beneficial, depending on your time and motivation


Judgementally based improvement procedures

• Human judgement is a very useful tool • Judgemental approaches to test improvement can be carried out systematically and also informally • Judgemental assessment-improvement strategies mainly differ according to who is supplying the judgements • There are three sources of test-improvement judgements to consider – those supplied by: - yourself - your colleagues - your students


• Not only should you assess items on a test, but also test instructions • It is helpful to take a break from editing and assessing to “cool off” and return after a time period when it is more likely that you will notice deficits that were originally not obvious


• You will improve your assessment procedures even more if you also approach test-improvement systematically • You can create and/or use specific review criteria to help judge your assessment


5 systematic review criteria to consider: • Adherence to item-specific guidelines and general item-writing commandments (Chapter 6) • Contribution to score-based inference: does your assessment procedure really contribute to the kind of score-based inference about your students you wish to draw? • Accuracy of content: content can become inaccurate over time; make sure that content in your assessment instrument is still up to date and change your answer key accordingly • Absence of content lacunae (or gaps): when important content is overlooked in an assessment instrument; anything meaningful that is overlooked reduces inference accuracy • Fairness: eradicating bias; good to undertake another bias review


Collegial Judgements • Often helpful to ask someone you trust to review your assessment procedures - provide your co-worker with a brief description of review criteria - describe the key inferences you will base on the assessment procedure - identify the decisions that will be influenced by your inferences about students


• Collegial judgements are especially helpful to teachers who employ many performance tests or who use portfolio assessments; for these types of tests, judgemental approaches are often more useful, since empirical improvement procedures are intended for more traditional items, such as those found in multiple-choice tests • Your own views are apt to be more biased in favour of your tests than a colleague's • Larger school districts may provide access to central-office supervisory personnel; ask them to review your assessment procedures • Listen to what reviewers say, but be guided by your own judgement when it comes to their suggestions


Student Judgements • Students are a rich source of data and useful insights • Student judgements can help you spot shortcomings in particular items and in test directions or timing • Questionnaires can be given to students after they have completed an assessment procedure • Anticipate more negative feedback from low-scoring students; they tend to blame the test


An Illustrative Item-Improvement Questionnaire for Students
1. If any of the items seemed confusing, which ones were they?
2. Did any items have more than one correct answer? If so, which ones?
3. Did any items have no correct answers? If so, which ones?
4. Were there words in any items that confused you? If so, which ones?
5. Were the directions for the test, or for particular subsections, unclear? If so, which ones?
Refer to p. 255 in the text, Figure 11.1


What you should take away: • In the final analysis, you will be the decision maker regarding whether to modify your assessment procedure • Nonetheless, it is helpful to have others react to your tests


Empirically Based Improvement Procedures • Just as with bias reviews, empirically based improvements can be made to a test • Recall the difficulty formula: p = R / T, where R = the number of right (correct) responses to the item and T = the total number of item responses; p represents item difficulty and is dependent on who is taking the test (recall the weaknesses of CTT) • p values can range from 0 to 1.00 • Higher p values mean that more students answered the item correctly
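A minimal sketch of this calculation in Python; the 0/1 response vector below is hypothetical (1 = correct, 0 = incorrect):

```python
# Hypothetical responses to one item from ten students.
responses = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

def p_value(item_responses):
    """Proportion of students answering the item correctly (ranges 0 to 1.00)."""
    return sum(item_responses) / len(item_responses)

print(p_value(responses))  # 0.7 -> 70% of students answered this item correctly
```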


Empirically Based Improvement Procedures • A p value of 0.98 means that almost all of the students answered the question correctly • Conversely, a p value of 0.15 means that 85% of the students answered the question incorrectly • For teachers, such low p values suggest:
1. More practice or review prior to the test is needed
2. Alternate methods of teaching the information should be investigated; for example, presentation style alone may not be sufficient
3. The question could be replaced or improved
BUT


Item-Discrimination Indices (IDI) • Low p values alone do not determine the course of action • The IDI compares how frequently an item is answered correctly by students who performed well on the entire test with how frequently it is answered correctly by students who performed poorly • The IDI relates students' performance on the total test to their responses on the individual item


Item-Discrimination Indices (IDI) • A positive IDI means more of the students who got 'As' on the midterm answered the item correctly than students who got 'Bs' • A negative IDI means more of the students who got 'Bs' on the midterm answered the item correctly than students who got 'As' • If the two proportions are equal, the item is a nondiscriminator


Item-Discrimination Indices (IDI)
HOW DO YOU CALCULATE IDI?
1. Organize the papers sequentially from high grades to low grades
2. Divide them into two equal groups
3. Calculate the p value for each group (high scorers = Ph; low scorers = Pl)
4. IDI = Ph – Pl (see the sketch below)
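A minimal sketch of these four steps, assuming each student record holds a total test score and a 0/1 score on the item of interest; all data below are made up for illustration:

```python
# Hypothetical records: (total_test_score, item_correct) for ten students.
students = [(95, 1), (88, 1), (82, 1), (77, 0), (70, 1),
            (64, 0), (59, 1), (51, 0), (45, 0), (40, 0)]

# 1. Order papers from high to low total score.
ordered = sorted(students, key=lambda s: s[0], reverse=True)

# 2. Divide into two equal groups.
half = len(ordered) // 2
high, low = ordered[:half], ordered[half:]

# 3. Calculate the p value for each group.
p_high = sum(item for _, item in high) / len(high)   # Ph
p_low = sum(item for _, item in low) / len(low)      # Pl

# 4. IDI = Ph - Pl.
idi = p_high - p_low
print(round(idi, 2))  # 0.6 for these made-up numbers
```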


Item-Discrimination Indices (IDI)
IDI IS CALCULATED, NOW WHAT?
Discrimination index / Item evaluation:
• .40 and above: very good question
• .30 – .39: good, but could be improved
• .20 – .29: marginal, needs improvement
• .19 and below: poor; discard or overhaul the question
BUT
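As a quick illustration, the evaluation guide above could be encoded as a simple lookup; the function name and cutoffs mirror the table rather than any official scheme:

```python
def evaluate_item(idi):
    """Map an item-discrimination index to the evaluation guide above."""
    if idi >= 0.40:
        return "very good question"
    if idi >= 0.30:
        return "good, but could be improved"
    if idi >= 0.20:
        return "marginal, needs improvement"
    return "poor; discard or overhaul the question"

print(evaluate_item(0.36))  # good, but could be improved
```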


Distractor Analyses (DA) • For items with a low IDI, DA helps you look deeper before changes are made • DA examines the answer options of multiple-choice questions (items) • Look at both the p value and the D (discrimination) value for each multiple-choice item


Distractor Analyses (DA) • A distractor that draws many responses from the students who performed well on the test (such as option D in the illustrative item) needs to be revised • A distractor that did not entice any students (such as option C, offering Hong Kong as the capital of Canada) is too implausible and should be replaced
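A minimal sketch of a distractor analysis for one multiple-choice item, tallying how often high- and low-scoring students chose each option; the option choices and the keyed answer below are hypothetical:

```python
from collections import Counter

# Hypothetical option choices on a single item; 'B' is the keyed (correct) answer.
high_group = ["B", "B", "D", "B", "D", "B", "B", "D"]   # higher total scores
low_group  = ["A", "B", "A", "D", "A", "B", "A", "D"]   # lower total scores

high_counts = Counter(high_group)
low_counts = Counter(low_group)

for option in ["A", "B", "C", "D"]:
    print(option, "high:", high_counts[option], "low:", low_counts[option])

# In this made-up data: A is chosen mostly by low scorers (working as intended),
# C is chosen by nobody (too implausible, like Hong Kong as Canada's capital),
# and D draws several high scorers, which flags it for revision.
```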




Item Analysis for Criterion-Referenced Measurement

• A criterion-referenced test is used to identify an individual's status with respect to an established standard of performance (Popham, 1969). • There are two general item-analysis approaches for criterion-referenced measurement. • The first approach involves administering the test to the same group of students prior to and following instruction. • A disadvantage of this approach is that the teacher must wait for instruction to be completed before securing the item-analysis data. • Another problem is that the pretest may be reactive, in the sense that its administration sensitizes students to certain items, so students' posttest performance is actually a function of the instruction plus the pretest's administration.


• Using the strategy of testing the same group of students prior to and after instruction, we can employ an item-discrimination index calculated as follows:

Dppd = Ppost – Ppre

• where Ppost = proportion of students answering the item correctly on the posttest, and Ppre = proportion of students answering the item correctly on the pretest.
• Dppd stands for discrimination based on the pretest-posttest difference.
• Values of Dppd can range from -1.00 to +1.00, with high positive values indicating an item is sensitive to instruction.


• Example: 41% of students answered item 27 correctly on the pretest and 84% answered it correctly on the posttest, so item 27's Dppd would be .84 – .41 = .43. • A high positive value indicates the item is sensitive to the instructional program you've provided to your students. • Items with a low or negative Dppd would be earmarked for further analysis because such items are not behaving the way one would expect them to behave if instruction were effective. (If many items fail to reflect large posttest-minus-pretest differences, it is possible the instruction being provided was not very good.)
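A minimal sketch of the pretest-posttest calculation, using the proportions from the example above:

```python
def d_ppd(p_post, p_pre):
    """Discrimination based on the pretest-posttest difference (-1.00 to +1.00)."""
    return p_post - p_pre

# Item 27 from the example: 84% correct on the posttest, 41% on the pretest.
print(round(d_ppd(0.84, 0.41), 2))  # 0.43 -> item appears sensitive to instruction
```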


• The second approach to item analysis for criterion-referenced measurement is to locate two different groups of students, one of which has already been instructed and one of which has not. • By comparing the performance on items of instructed and uninstructed students, you can pick up some useful clues regarding item quality. • This approach has the advantage of avoiding the delay associated with pretesting and posttesting the same group of students and also avoiding the possibility of a reactive pretest. • Its downside is that you have to rely on human judgement in the selection of the “instructed” and “uninstructed” groups.


• If you use two groups, an instructed group and an uninstructed group, one of the more straightforward item-discrimination indices is Duigd (discrimination based on uninstructed versus instructed group differences). The calculation for this index is as follows:

Duigd = Pi – Pu

• where Pi = proportion of instructed students answering an item correctly, and Pu = proportion of uninstructed students answering an item correctly.
• Values of Duigd can range from -1.00 to +1.00.
• Example: If an instructed group of students scored 91% correct on a particular item, while the same item was answered correctly by only 55% of an uninstructed group, then Duigd would be .91 – .55 = .36.
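And the corresponding sketch for the instructed-versus-uninstructed comparison, using the proportions from the example above:

```python
def d_uigd(p_instructed, p_uninstructed):
    """Discrimination based on uninstructed vs. instructed group differences."""
    return p_instructed - p_uninstructed

# 91% of instructed students vs. 55% of uninstructed students answered correctly.
print(round(d_uigd(0.91, 0.55), 2))  # 0.36
```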


What do classroom teachers REALLY need to know about improving their assessments? • Teacher-made tests can be improved as a consequence of judgemental and/or empirical improvement procedures. • Judgemental approaches work well with either selected-response or constructed-response kinds of test items. • Empirical item improvements have been used chiefly with selected-response tests, and hence are more readily employed with such test items.


• Most of the more widely used indices of item quality, such as discrimination indices, are intended for use with test items in norm-referenced assessment approaches, and may have less applicability to your tests if you employ criterion-referenced measurement strategies. • You must understand that because educators have far less experience in using (and improving) performance assessments and portfolio assessments, there isn't really a delightful set of improvement procedures available for those assessment strategies, other than good, solid judgement.


The presentation has come to an end...please prepare yourselves for a fun and unfortunately short quiz...

