Connection | 2012 Issue 2
Technology That Works. People Who Care.
Published by Applied Measurement Professionals, Inc. © 2012
How Many is Enough? Determining the Ideal Number of Candidates Needed for Reliable Item Statistics
Dr. Steve Nettles, AMP Sr. Vice President, Psychometrics
Lily Chuang, MS, AMP Research Associate
It is one of the most common questions a psychometrician gets asked: "How many candidates do we really need to get good stats on our exam?" While many programs are lucky enough to have thousands of people sitting for their examination each year, the reality is that far more programs face limited candidate volumes. Determining the ideal number of candidates needed to generate "reliable" item statistics is therefore a common concern, as reliable statistics are a cornerstone of instant scoring, a major benefit to computer-based testing (CBT) candidates. In the past, we have been hesitant to give a firm answer to this question because few studies have been conducted to find the "magical number." However, recent research by AMP staff may help provide small-volume programs guidance on how many candidates it takes to get meaningful and reliable data.
Introduction

Many certification programs with small numbers of candidates use classical item statistics to evaluate item performance. The two most commonly used statistics are item difficulty (the p-value, or the proportion of candidates answering an item correctly) and item discrimination (rpb, the point-biserial correlation between candidates' total test scores and whether they answered the item correctly). If the correlation is sufficiently high (e.g., greater than 0.20), the item is said to discriminate between high- and low-scoring candidates, a desirable outcome. Although item performance information can be calculated from any number of candidates, with small numbers of candidates a single outlier can cause a large deviation in a statistic's value. So when item performance data are used to make decisions about an item, the question becomes how much we should rely on the item statistics and how confident we can be in them. We used data from a large credentialing group to conduct a research study that simulated calculating item statistics for various sizes of small candidate groups and compared the observed item performance with data from the larger group of candidates.
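For illustration, here is a minimal sketch of how these two statistics can be computed from a scored response matrix. The NumPy code and its names are our own illustration, not part of the study.

```python
import numpy as np

def item_statistics(responses):
    """Classical item statistics from a scored response matrix.

    responses: array of shape (n_candidates, n_items); 1 = correct, 0 = incorrect.
    Returns (p_values, rpb): item difficulty and discrimination for each item.
    """
    responses = np.asarray(responses, dtype=float)
    total_scores = responses.sum(axis=1)    # each candidate's total test score
    p_values = responses.mean(axis=0)       # proportion answering each item correctly
    # rpb: correlation between the 0/1 item score and the total test score.
    # An item every candidate answers the same way has no variance, so its
    # correlation is undefined; np.corrcoef returns nan for it.
    rpb = np.array([np.corrcoef(responses[:, i], total_scores)[0, 1]
                    for i in range(responses.shape[1])])
    return p_values, rpb
```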
Methodology

The current study used data from an examination form with 140 scored items. Item statistics, p-value and rpb, were calculated for each item using a population of 1,525 candidates. These population values became our "gold standard" for comparison purposes.

To simulate small groups of candidates, we began by randomly selecting 10 candidates from the population of 1,525. Responses from those 10 candidates were used to calculate difficulty and discrimination indices for each of the 140 items. We ran 100 iterations of randomly selecting 10 candidates and calculating the item statistics, and for each iteration we computed the absolute value of the difference between each sample item statistic and the corresponding population value. This procedure was repeated for groups of
15, 20, 25, 50, 75, 100, 125 and 150 candidates, to simulate certification programs with various candidate volumes. Our criterion was to determine how many candidates were required to achieve an average difference of 0.05 or less across the 100 iterations.
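The simulation itself can be sketched in a few lines, reusing the item_statistics function above. This is our own illustration of the procedure as described; we assume sampling without replacement within each iteration, which the article does not specify.

```python
import numpy as np

def simulate_sampling_error(responses, group_sizes, n_iterations=100, seed=0):
    """Average absolute difference between small-group and population statistics.

    For each group size, randomly draw that many candidates n_iterations times,
    recompute p-value and rpb, and average the absolute differences from the
    population ("gold standard") values across items and iterations.
    """
    responses = np.asarray(responses, dtype=float)
    rng = np.random.default_rng(seed)
    pop_p, pop_rpb = item_statistics(responses)   # population values, all candidates
    results = {}
    for n in group_sizes:
        p_diffs, rpb_diffs = [], []
        for _ in range(n_iterations):
            rows = rng.choice(len(responses), size=n, replace=False)
            p, rpb = item_statistics(responses[rows])
            p_diffs.append(np.mean(np.abs(p - pop_p)))
            # nanmean skips items with no variance in the small group,
            # where rpb is undefined.
            rpb_diffs.append(np.nanmean(np.abs(rpb - pop_rpb)))
        results[n] = (np.mean(p_diffs), np.mean(rpb_diffs))
    return results

# e.g., simulate_sampling_error(responses, [10, 15, 20, 25, 50, 75, 100, 125, 150])
```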
Findings – Item Difficulty (p-value)

As one might expect, item statistics become more "stable" as the sample size grows; that is, they show smaller differences from the item statistics calculated for the total population. Figure 1 shows that, for the group of 10 candidates, the differences ranged from a low of 0.03 to a high of 0.15, with an average of 0.11. The average difference drops to our criterion of 0.05 when 50 or more candidates are included in the analysis, and it levels off at 0.03 when 100 or more candidates are used to calculate the p-value.

"50 candidates are typically adequate to generate a stable p-value."

[Figure 1. Differences of p-values generated from a reduced number of candidates and from the population. Minimum, mean and maximum differences are plotted against the number of candidates (10 to 150).]

Findings – Item Discrimination (rpb)

Since the range of item discrimination (rpb) values is larger than the range of item difficulty (p-value) values, the rpb differences between the reduced candidate groups and the population were expected to be greater than the p-value differences; that is, rpb tends to be less stable with smaller groups of candidates. Figure 2 demonstrates a pattern similar to Figure 1. For the group of 10 candidates, the differences ranged from a low of 0.16 to a high of 0.35, with an average of 0.25. The average difference starts to level off with 50 candidates (mean difference 0.11), but even with 150 candidates it is still 0.06.

"A larger group of candidates (150 or more) is necessary to generate a stable rpb."

[Figure 2. Differences of rpb values generated from a reduced number of candidates and from the population. Minimum, mean and maximum differences are plotted against the number of candidates (10 to 150).]

Additional Interesting Observation on P-Value

The differences on p-value were negatively correlated with the p-value calculated from the population, regardless of sample size. In other words, difficult items (low p-value) tend to show more variability in their difficulty estimates no matter what candidate volume an exam program has. The correlation coefficient was approximately -0.77 for every reduced sample size group, and it was statistically significant.
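This observation could be reproduced by tracking the mean absolute p-value difference for each item separately (rather than averaging across items, as the sketch above does) and correlating those per-item errors with the population p-values. A minimal, illustrative sketch:

```python
import numpy as np

def difficulty_error_correlation(pop_p, per_item_p_diffs):
    """Correlate population item difficulty with per-item estimation error.

    pop_p:            population p-value for each item
    per_item_p_diffs: mean absolute p-value difference per item across the
                      iterations at a given group size
    A strongly negative coefficient means harder items (lower p-value) are
    estimated less stably.
    """
    return np.corrcoef(pop_p, per_item_p_diffs)[0, 1]
```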
Discussion

Even if an examination form is carefully constructed, item statistics are not particularly stable with only 10-25 candidates. A difficult item is likely to look difficult (having a low p-value), but little can be concluded from the discrimination index. When more precise estimates are needed in a decision-making process, at least 50 candidates are suggested for item difficulty statistics and 150 or more for item discrimination. With smaller candidate groups one can still use the item statistics, but it is important to acknowledge the amount of error when interpreting them. Over the years, AMP has been instrumental in helping both large and small credentialing organizations develop and maintain quality testing programs. We will use insights from this study to further support smaller programs, which have the same need to ensure a reliable certification tool.
Empower Your Chief Staff Officer
Dede Gish-Panjada, MBA, AMP Sr. Vice President, Management Services

In the last eConnect newsletter, we began a series of articles exploring the challenges associations face and must consider for the future of their organizations. "Change" is a scary word to most associations: there is a constant struggle to remain true to the traditions and goals of the organization yet meet the demands of an ever-changing society. In their book "Race for Relevance: 5 Radical Changes for Associations," Harrison Coerver and Mary Byers, CAE, identify five changes associations must make to ensure survival. The changes challenge the way we operate, plan, and view our associations. Embracing them will take courage, but ignoring them could be catastrophic.

5 Radical Changes Associations Must Consider to Remain Relevant
1. Adopt a 5-member competency-based Board of Directors
2. Empower your CSO and focus on new staff skill sets
3. Rigorously define your member market
4. Rationalize programs and services
5. Build a robust technology framework

Boards often inject themselves into management decisions and opportunities and reject their true governing role, creating a muddled management structure in which decision making and accountability become unclear. Why is this so common? Board members typically have more experience managing than governing. Managing is independent decision making on short-term, immediate issues, while governing requires consensus on bigger-picture issues. Not all managers have been trained in the art of reaching consensus, and in most cases it is easier to "manage" than to govern. To implement this change, staff and executive leadership need to be allowed to perform their appropriate roles in the organization with little or no micromanagement from the Board. The important thing to note about this radical change of empowering your CSO and staff is that roles and responsibilities remain the same as in the traditional structure; it simply becomes more important that those roles are acknowledged and adhered to. The Board governs: it sets broad policy, goals and objectives, ensures adequate resources, retains the CSO and guides the organization in the best interests of those it serves. The CSO runs the association to meet the Board-established objectives and is responsible for deciding what is to be done to accomplish the Board-identified goals, how it is to be done and who will do it. Why is empowering your CSO and focusing on new staff skill sets necessary? It increases the speed and efficiency of decision making, capitalizing on human potential. With more efficient decision making, volunteer leaders and staff have more time
to explore each other's strengths, which encourages effective working partnerships to form naturally. It encourages honest and straightforward communication between staff and volunteers. Trust builds more quickly, bringing about a new level of teamwork, and dissension is identified more readily. There is no room for lack of commitment or avoidance of accountability; honest discussion and spirited debate invite clear consensus. Implementing this radical change in structure creates a committed, passionate staff driven to accomplish the association's mission. This structure also requires a more skilled staff or association management company (AMC). What top-notch association executive or AMC wouldn't want to work with such a Board? Talented staff and competent Boards encourage the best in each other.
The Trends Forcing This Change

There is evidence that organizations are moving toward empowered staffs, but it is taking far too long. The trends driving the need for staff empowerment include:

Increased Time Pressure – Time pressures on volunteer leaders significantly limit their ability to contribute to associations, and what time they do have is compromised by personal and professional distractions and interruptions. The association or certification board of the future should be run not by "part-time" volunteers but by a small, competency-based board and an empowered, skill-enhanced staff.

Increased Organizational Complexity – Associations have become complex organizations with an expanded scope of programs, services and activities. Information systems are sophisticated, communication vehicles are multifaceted, organizational relationships have expanded, and financial and legal structures are more complex. All of these factors create the need for increased management competency and require delegating responsibilities previously held by volunteers to staff professionals.

Redefinition of Roles – Volunteers have a great variety of skills, but they are typically neither association professionals nor full-time executives. Small, competency-based boards should unleash staff potential while making the best and highest use of the volunteer "resource." Similarly, association executives (an empowered CSO and staff) understand their roles and the importance of optimizing association resources.

So, what are the risks in not making this change? Boards hanging onto old roles may pay a costly price: poor decision making, delayed action, sub-optimization of staff and missed opportunities.
The Next Steps

Implement the following steps to support an empowered staff model and encourage improved candor in the volunteer-staff relationship:
1. Institute a three-year strategic planning cycle.
2. Conduct an annual performance appraisal of the CSO.
3. Complete an annual Board self-evaluation process.
4. Develop performance evaluations and feedback surveys for distribution following each Board meeting.
In the Next eConnect Newsletter
Radical Change #3: Define and Understand the Member Market of the Future
On the Road

CLEAR Mid-Year Business Meeting, 01/10/2013 – 01/12/2013, Savannah, GA
ABC Annual Conference, 01/22/2013 – 01/25/2013, San Diego, CA
2013 FARB Forum, 01/25/2013 – 01/27/2013, San Diego, CA
ATP Innovations in Testing Conference, 02/03/2013 – 02/06/2013, Fort Lauderdale, FL
We look forward to seeing you at one of these upcoming On the Road events!
Congratulations to Dr. Steve Nettles

AMP is honored to announce that Dr. Steve Nettles, AMP Senior Vice President of Psychometrics, has been awarded the Council on Licensure, Enforcement and Regulation (CLEAR) Service Award for Lifetime Achievement. The award was established to recognize an individual who has made an outstanding contribution of service and a significant commitment to CLEAR. Attending and presenting at CLEAR events since 1988, Dr. Nettles has served on several committees, often in a leadership role. Along with his committee contributions, Dr. Nettles has demonstrated his commitment to the regulatory community through frequent presentations at CLEAR conferences, including sessions on ethics in testing, alternative exam formats, exam security, trends in testing and other important issues in the licensure examination arena. CLEAR President Bruce Matthews presented Dr. Nettles with the prestigious award on September 7, 2012 at the CLEAR Annual Conference in San Francisco. "Dr. Nettles has an impressive resume in the measurement profession," President Matthews commented. "We are fortunate that he has shared his knowledge and expertise with CLEAR, its volunteer committees and membership over the past 24 years. CLEAR is very pleased to honor him for his exceptional leadership, dedication, vision and creativity."
Happy Holidays from AMP to you! In observance of the holidays, the AMP offices, test sites and candidate call center will be closed November 22-24, December 24-26 and December 31-January 1.
Stay Connected
Visit www.goAMP.com and join our mailing list to receive the eConnect newsletter or sign up for RSS feeds for news and press releases.
For more information about any of our products or services, please contact the AMP Marketing department at 913.895.4600 or visit our website at www.goAMP.com.