Clinical Rehabilitation http://cre.sagepub.com/
The Barthel Index: an ordinal score or interval level measure? Alan Tennant, Joanna ML Geddes and M Anne Chamberlain Clin Rehabil 1996 10: 301 DOI: 10.1177/026921559601000407 The online version of this article can be found at: http://cre.sagepub.com/content/10/4/301
Published by: http://www.sagepublications.com
Downloaded from cre.sagepub.com at Sheffield Hallam University on October 18, 2010
The Barthel Index: an ordinal score or interval level measure?
Alan Tennant, Joanna ML Geddes and M Anne Chamberlain Rheumatology and Rehabilitation Research Unit, University of Leeds, Leeds
The Barthel Index is one of the most widely used activities of daily living (ADL) measures in stroke rehabilitation and there has been some debate recently about whether or not the Index is an ordinal score or an interval level measure. An audit of 192 consecutive patients undergoing inpatient rehabilitation following stroke has provided the opportunity to examine this question with recently developed mathematical techniques based on the work of George Rasch. Rasch models define the criteria which data must follow to produce an interval level measure. It thus becomes possible to test the data derived from the audit against the Rasch model. Calibration of the 10 items in the Index shows considerable differences in the degree of difficulty (weight), and these differences are not compensated for by the current scoring. Thus adding together the items produces a scale whose intervals vary considerably, particularly between intervals at the lower or upper ends of the scale, and those at the centre. This can give rise to considerable differences between the change score based on the Rasch transformation (taking into account item difficulty) and the change score based on raw scores. These findings confirm the ordinal nature of the Barthel Index. Further questions are raised about the unidimensionality of the Index, and the context in which it should be used.
Address for correspondence: Alan Tennant, Rheumatology and Rehabilitation Research Unit, University of Leeds, 36 Clarendon Road, Leeds LS2 9NZ, UK.
Introduction
For many years the Barthel Index1 has been the mainstay of measuring functional ability in rehabilitation. Developed to indicate the extent of nursing care required by patients in institutions, it is essentially a measure of dependency in self-care. It has nevertheless been used as a ubiquitous measure of activities of daily living (ADL), both in the management of individual patients (for example to determine when to discharge home) and, at the aggregate level, to show the efficacy of various rehabilitation programmes.2-6 Recently Wade argued that it was
‘probably the most widely used and best standard of ADL’.7 The Barthel Index has 10 items and, in the original version used in this study, each item is scored 0 (unable to perform the task) and then a variable range of points, scoring 5, 10 or 15 to reflect independence or the intervals on the way to independence (Table 1). Sometimes these scores are simplified by dividing by 5,8 to give an overall score out of 20. The items are weighted so that, for example, in the original version, independence on chair/bed transfer scores 15, whereas independence in bathing scores only 5 points.
Table 1 Items and scoring for the Barthel Index
a Score only if ambulation = 0.
Such scores, whether 5, 10 or 15, or 1, 2 or 3, are characteristic of many ADL scales,9 and are ordinal in their level of measurement. The characteristics of ordinal scales are such that increasing scores reflect a trend, for example in the level of independence. Intervals between the points are, however, not equal in value; they simply indicate the underlying trend. Thus an increase in independence on a single item from 0 to 5 (or 0 to 1) is not necessarily of the same magnitude as a change from 10 to 15 (or 2 to 3). Such inequality in the intervals within items on ordinal scales is usually complicated by the lack of equality in the degree of difficulty between items. This is not to say that the items do not all contribute to a single underlying construct (i.e. unidimensionality), but rather that identical scores on different items cannot be considered equal. Yet despite this, most ADL scales add together item scores, as though a score of 5 on one item was the same as a score of 5 on another. Even scales which are based on a set of yes/no items face this problem. Adding such scores together implies equality in item difficulty. The Barthel Index implicitly acknowledges differences in its own items in that some are given a greater weight (score for achieving independence) than others.
There has been some criticism of the Barthel Index of late,10-12 one aspect of which has centred on whether the Index is an ordinal score or an interval level measure. This is important for it determines the way in which the Barthel can be used. For example, can the Index (say on admission) be used as an independent predictor for length of stay in a multiple regression model? Such criticism has led to a robust defence13 of the Index, in which it has been suggested that the distance between a score of 10 and 20 is approximately the same as the distance between a score of 80 and 90. From this it should follow that the scaling is approximately evenly spaced and consistent with requirements for interval level measurement; ‘thus all statistical functions carried out will be valid’.13 In other words, this argument suggests that the Index does indeed have the property of an interval level measure.
However, clinical experience suggests that intervals at opposite ends of the scale are not equivalent. For example, some would consider that achieving an improvement in score from 70 to 90 is a lot harder than, say, an improvement from 10 to 30. Wright and Stone14 tell us that such ‘boundary effects’ (i.e. where, for example, it is increasingly difficult to accrue further points as one approaches the limit of the scale) cause any fixed differences in points to vary in meaning over the score range of the test.
Some attempt has been made to improve the sensitivity of the Barthel Index.15 So, for example, while individual item scores for achieving independence remain fixed, the intervals between dependence and independence have been increased to reflect clinical experience; rather than [0, 5, 10], intervals are increased to [0, 2, 5, 8, 10]. However, improving sensitivity to change does not necessarily convert an ordinal score into an interval level measure and may not address the difference in degree of difficulty of the different items in a scale. Achieving independence in one task, resulting in a gain of 10 points, may imply a greater change in ability than an increase of 10 points on another item. Thus we return to the question, are the intervals on the Barthel Index equal?
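The distinction drawn above between ordinal and interval scoring can be illustrated with a small sketch. It assumes two hypothetical groups of item scores and a hypothetical order-preserving recoding of the steps 0, 5 and 10; neither comes from the audit data or from any published version of the Index. Because an ordinal scale fixes only the order of the steps, such a recoding is as defensible as the original, and the sketch suggests that a comparison of means can reverse under it while a comparison of medians does not.

```python
from statistics import mean, median

# Hypothetical item scores for two small groups (not data from the audit).
group_a = [5, 5, 5]
group_b = [0, 0, 10]

# Hypothetical recoding that keeps the order of the steps (0 < 5 < 10)
# but changes the size of the intervals between them.
RECODE = {0: 0, 5: 2, 10: 10}


def recode(scores):
    """Apply the order-preserving recoding to a list of item scores."""
    return [RECODE[s] for s in scores]


# Means depend on the interval sizes, so their ordering can flip.
means_original = (mean(group_a), mean(group_b))
means_recoded = (mean(recode(group_a)), mean(recode(group_b)))

# Medians depend only on order, so their ordering is unchanged.
medians_original = (median(group_a), median(group_b))
medians_recoded = (median(recode(group_a)), median(recode(group_b)))

print("means   original:", means_original, "recoded:", means_recoded)
print("medians original:", medians_original, "recoded:", medians_recoded)
```

In this constructed case group A has the higher mean under the original coding and the lower mean after recoding, whereas its median stays higher under both. This is the sense in which statistics that rely on interval sizes are unsafe for ordinal scores.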
Design and setting
Patients undergoing inpatient rehabilitation at the National Demonstration Centre for Rehabilitation in Leeds have Barthel scores recorded at two-weekly intervals. This practice has been established for five years and forms part of a comprehensive audit of rehabilitation. During this time 192 consecutive patients undergoing rehabilitation following stroke were routinely assessed, most recently with the modified index proposed by Shah et al.15 For the purposes of this article, the data are presented as scores of 0, 5, 10, 15. Both the original and modified versions have a range from 0-100. The 192 patients had a mean age on admission of 57 years (95% CI ± 1.5) and mean stay of 67 days (95% CI ± 5.0). Mean Barthel score on admission was 54.4 (median 50.0) and upon discharge was 72.8 (median 80.0). Thus the overall mean change in Barthel score was 18.4 (median 30.0).
Rasch analysis
George Rasch was a Danish mathematician whose work in the 1950s has been described as the point ‘at which psychometrics moved from being purely descriptive to become a science of objective measurement’.16 Much use has been made of the Rasch models in the field of educational testing,16,17 and more recently in the field of rehabilitation.18,19 The Rasch model defines the criteria which data must follow to qualify for making (interval level) measures.20 If the data fit the model it is possible to see how the raw score compares with the transformed measure arising from the model, and it is this comparison which will indicate whether the raw score is at the interval level.
The principle behind the approach is that it is possible to determine the probability that a person with a particular level of ability will ‘pass’ (or become independent at) a certain item. Both ‘person ability’ and ‘item difficulty’ can be addressed in this way. The type of distribution that one looks for on item difficulty is akin to the Guttman scaling21 principle, which has occasionally been adopted for hierarchical ADL scales,22 including the Barthel Index itself.23 In these scales achieving independence on one difficult item means that independence must have been achieved on all easier items. Clinical experience, however, shows that there are always patients who cannot do some specific tasks when their overall level of ability would suggest otherwise. This is perhaps why a recent application of Guttman scaling to the Barthel Index found that the scale failed to meet the necessary (rigid) criteria for scalability.23 Probability, the basis of the Rasch model, offers a more elegant approach, and allows for unexpected results. It can be viewed as an ‘imperfect Guttman scale’ where ‘the probability of responding increases gradually with more of the trait, rather than jumping from a probability of 0 to 100 percent’.24
Linacre at the University of Chicago has described the process of working out person ability and item difficulty by using the analogy of an archery competition in Sherwood Forest.25 Using this example, imagine that Robin Hood, Little John and Will Scarlet are shooting at three trees. Robin hits the first tree 11 out of 12 times, Little John 8 out of 12, Will Scarlet 4 out of 12. From this some idea about the relative ability of the three archers can be seen by computing their probability of success. Then Robin, who hit the first tree 11 out of 12 times, only scores 8 out of 12 on a second tree which is further away, and 4 out of 12 on a third tree even further away. In this way the different degrees of difficulty of the targets can be perceived. This is essentially what the Rasch analysis is doing - calibrating the ability of patients to achieve independence on a given item, and calibrating the degree of difficulty of the items.
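The archery analogy can be worked through numerically. The sketch below uses the hit counts given in the text and converts them to log-odds (logits); it then shows the simplest, dichotomous (pass/fail) form of the Rasch model, in which the probability of success depends only on the difference between ability and difficulty. The function names are illustrative, the dichotomous form is a simplification (Barthel items have more than two steps), and this is not the joint estimation procedure that Rasch software carries out.

```python
import math


def log_odds(successes: int, attempts: int) -> float:
    """Natural log of the odds of success, i.e. the value in logits."""
    failures = attempts - successes
    if successes <= 0 or failures <= 0:
        # All hits or all misses give no finite log-odds.
        raise ValueError("log-odds are undefined for extreme scores")
    return math.log(successes / failures)


def rasch_probability(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: P(success) from ability minus difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hits out of 12 shots at the first tree (figures from the text).
archers = {"Robin Hood": 11, "Little John": 8, "Will Scarlet": 4}
archer_logits = {name: log_odds(hits, 12) for name, hits in archers.items()}

# Robin's hits out of 12 at each of the three trees (figures from the text).
robin_by_tree = {"tree 1": 11, "tree 2": 8, "tree 3": 4}
tree_logits = {tree: log_odds(hits, 12) for tree, hits in robin_by_tree.items()}

for name, value in archer_logits.items():
    print(f"{name}: {value:+.2f} logits at tree 1")
for tree, value in tree_logits.items():
    print(f"Robin at {tree}: {value:+.2f} logits")
```

If the first tree is taken as the zero point of difficulty, feeding Robin's log-odds back through `rasch_probability` recovers his observed success rate of 11 in 12, which is the sense in which logits and probabilities are two views of the same quantity.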
In practice a computer algorithm26 looks at the scale items, and the range of person abilities, and through an iterative procedure seeks to combine the two by their difference. Ideally, this difference should govern the probability of what is supposed to happen when a person of a given ability
uses that ability against a given task.14 According to Wright and Stone14 there is no alternative mathematical formulation that allows estimation of person measures and item calibration to be independent of one another. This is important for we would not want, for example, a temperature measure to be dependent on the particular thermometer used, or on whether the patient is a child or adult. Results of the Rasch transformation are reported in ‘logits’, which are the natural log-odds for succeeding on items of the kind chosen to define ‘zero’ (the mid-point) on the scale.
The key aspect of using Rasch analysis is that the data must ‘fit’ the mathematical model. Various statistics are provided for this purpose, but an important one is the standardized information weighted-fit statistic (INFIT), which is distributed as a t-value. This determines whether or not items belong to a single underlying construct, that is, whether or not the scale is unidimensional. Items which have values outside the range ± 2.0 (5% significance) need to be examined for misfit. The recent demise of Cronbach’s alpha as a measure of unidimensionality emphasizes the importance of this alternative approach.27
Results
Model fit
The Rasch analysis reported an INFIT with mean of -0.4 and a standard deviation of 2.4 for the 10 items comprising the Barthel Index. This indicates a fair fit of the data to the model, but suggests that there may be some lack of fit. One item, bladder control, had a standardized INFIT value of +3.0. This would indicate that there is not always a strong association between bladder control and the ‘physical dependency’ construct determined by the Barthel Index. If a new index were being developed with these items it would be worth considering omitting bladder control, as it weakens the unidimensional construct. However, it is important to remember that no data fit a mathematical model perfectly, and that under the normal distribution some items at the margins of the distribution would be expected (i.e. 5 in every 100).
Another fit statistic, the adjusted test standard deviation, was reported as 3.09, some 13 times greater than the root-mean-square calibration of 0.23, indicating satisfactory separation of items along the underlying construct. This is also important for it shows that the items are measuring different points on the underlying construct of dependency - 13 strata in this case. A lower separation factor might indicate redundant items measuring the same point on the construct, or the possible lack of discrimination for the underlying construct.
Item calibration
Figure 1 shows the item calibration for the 10 items of the Barthel Index for the 192 patients under study on discharge from the rehabilitation ward. Items are identified along the side (y-axis) and the logit scale along the bottom (x-axis). The logit scale runs from the easiest item (that is, the one with the greatest probability of achieving independence), which in this group of patients is independence in bowel control, to the most difficult, independence in bathing. Achieving independence in personal hygiene is the item of average difficulty for this group of patients, reflected by its score of zero. Although not applicable to the current analysis, it has recently been suggested that wherever higher communication skills are being assessed (for example with the Functional Independence Measure Scale [FIMS]),28 discharge calibration is more reliable, as shy patients may appear less functional on admission than their true underlying ability.29
Items such as ambulation, and particularly stairs and bathing, are shown to be much more difficult than other items, and bladder and bowel control, and feeding, are easier than other items; that is, independence is likely to be achieved (if not already present on admission) much sooner than on other items. The order of item calibration derived from this dataset is remarkably similar to that identified in the attempt to create a hierarchical index using Guttman scaling.23 This would be expected, as the basic approach is the same in that frequency of steps achieved on each item is calculated and used as the basis for a hierarchical ordering.
However, the Rasch approach then uses a logistic function of the odds ratio (log-odds) to determine its person and item weights. Unfortunately the original weighting of items (the score for independence is shown on Figure 1) in no way reflects their degree of difficulty (the Rasch-derived weighting) for this group of patients.
Figure 1 Calibration of Barthel items on discharge
Using the raw score and the Rasch transformation
What happens when items of different difficulty are added together? Figure 2 shows the change in Barthel score for patients during their stay on the rehabilitation ward, set against the change in the logit measure. The latter has been set so that its range is similar to that of the raw score change and, in each case, a negative score reflects a deterioration. It is clear that a single change score on the Barthel Index masks a broad range of change on the logit measure derived from the Rasch transformation. Most patients have improved during their rehabilitation programme, but, for example, those who had improved by just 15 points on the raw score show a range of change on the logit measure (recalibrated to the same range as the raw score change) from 5 to 25.
Figure 2 Comparison between changes in Barthel raw score and Rasch measure
How can this come about? When a five-point reduction in one item matches a five-point increase in another, the resultant change score is zero. What though if those items represent considerable differences in degree of difficulty, not compensated for by the ordinal weighting? Here the Rasch transformation shows the change in ability level when the raw score fails to do so. There is a clinical significance to this discrepancy: rehabilitation staff see differences between the (lack of) change implied by the raw score, and their own perception of the patient’s change in ability. While improving the sensitivity of the Index by expanding the range of steps on any item may help, it will not overcome the problem as long as there is a discrepancy between the item weighting and the true difficulty of the task. Thus the raw score may mask an underlying decrease or increase in ability, and this can reduce acceptability of the instrument for rehabilitation staff.
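One contribution to the spread described above is that the relationship between a raw score and its logit measure is not a straight line. The following sketch illustrates this under stated assumptions: ten dichotomous (pass/fail) items with hypothetical difficulties, which are not the calibrations shown in Figure 1, and a raw score defined as the number of items passed rather than the weighted 0-100 total. For each raw score it finds, by bisection, the ability at which the expected score under the Rasch model equals that raw score, then compares the logit gain for the same two-point raw gain in the middle and at the upper end of the scale.

```python
import math

# Hypothetical item difficulties in logits (NOT the values in Figure 1).
ITEM_DIFFICULTIES = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0]


def expected_score(ability, difficulties):
    """Expected number of items passed under the dichotomous Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties)


def ability_for_score(raw, difficulties, lo=-10.0, hi=10.0):
    """Ability (logits) whose expected score equals the raw score."""
    if not 0 < raw < len(difficulties):
        # Zero and perfect scores have no finite measure.
        raise ValueError("raw score must lie strictly between the extremes")
    for _ in range(100):  # bisection on a monotonically increasing function
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Logit measure for every non-extreme raw score (1 to 9 items passed).
measures = {r: ability_for_score(r, ITEM_DIFFICULTIES) for r in range(1, 10)}

# The same raw gain of two items, taken at two places on the scale.
gain_middle = measures[6] - measures[4]
gain_upper = measures[9] - measures[7]

for raw, logit in measures.items():
    print(f"raw {raw}: {logit:+.2f} logits")
print(f"gain 4->6: {gain_middle:.2f} logits, gain 7->9: {gain_upper:.2f} logits")
```

With these assumed difficulties the two-item gain near the top of the scale corresponds to a larger logit gain than the same raw gain in the middle, which is the pattern of unequal intervals that the comparison in Figure 2 reflects.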
Discussion
The Barthel Index is used extensively in rehabilitation as a measure of functional outcome. It is commonly used in our own rehabilitation facility to indicate when a patient is ready to go home; it has been shown elsewhere to predict functioning at home,30 and the ability to live independently.31 Criticism about using ordinal scales as interval level measures needs to be addressed directly and Rasch analysis provides one way in which this task can be approached. By doing so evidence is found to suggest that the items which comprise the Index represent different degrees of difficulty that are not compensated for by the original weighting. The Rasch analysis shows that a five-point change in score at the upper end of the scale has a logit distance three times greater than a similar change in the middle of the scale. This indicates that the change in ability at the upper level of the scale is three times greater, unit for unit, than an apparently similar change in the middle of the scale. Thus the Barthel Index is an ordinal score, and should not be used as an interval level measure.
These results do not imply that the Barthel should be discarded, but it must be used as an ordinal scale and appropriate nonparametric statistics applied. If necessary it can be transformed (e.g. through Rasch analysis) into an interval level measure and parametric statistics applied.
However, two other important implications arise from these findings. First, there is the question of unidimensionality. If items are added together to produce a single score then it needs to be demonstrated that they do belong to a single underlying trait or construct. Bladder control appears to lie outside the construct expressed by other items in the scale. Similar lack of fit for incontinence items has been observed with the Functional Independence Measure,32 and it may not be coincidence that these items are impairments, while the rest of the items in the scale are disabilities.33 Further work needs to be done to examine the dimensionality of the scale.
We have also shown that the original weighting gives fewer points for achieving independence on the hardest items. While the existing weighting system may be fully compatible with indicating the need for nursing care, which is in itself an important input in the rehabilitation process, if there is any association between item difficulty and therapy input, then it must be considered whether the current weighting will give rise to change scores which reflect that input. In other words, is the Barthel Index a valid instrument for measuring the efficacy of a rehabilitation programme?
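The recommendation above to apply nonparametric statistics to untransformed scores can be illustrated with one such method. The sketch below is an exact two-sided sign test on paired admission and discharge scores; it uses only the direction of each patient's change, which an ordinal scale does support. The choice of the sign test is an assumption made for illustration (the text does not name a particular test), and the scores are hypothetical, not data from the audit.

```python
from math import comb


def sign_test(before, after):
    """Exact two-sided sign test on paired ordinal scores.

    Returns (n_improved, n_deteriorated, p_value). Tied pairs are dropped,
    and only the direction of each change is used, never its size.
    """
    improved = sum(a > b for b, a in zip(before, after))
    deteriorated = sum(a < b for b, a in zip(before, after))
    n = improved + deteriorated
    if n == 0:
        return improved, deteriorated, 1.0
    k = min(improved, deteriorated)
    # Probability of k or fewer changes in the rarer direction when
    # improvement and deterioration are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return improved, deteriorated, min(1.0, 2 * tail)


# Hypothetical Barthel raw scores for eight patients (illustration only).
admission = [40, 55, 60, 30, 75, 50, 65, 45]
discharge = [60, 70, 60, 55, 70, 80, 85, 65]

result = sign_test(admission, discharge)
print("improved, deteriorated, p-value:", result)
```

Because the test ignores the size of each change, it makes no assumption that a 15-point gain means the same thing at different points on the scale.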
Conclusion
The Rasch analysis confirms that the intervals of the overall score are not equal. These results suggest that the raw score should be treated with caution, and, without transformation, should only be subjected to nonparametric statistics. Furthermore, although the Rasch model is robust in its tolerance of deviation, there is an indication that unidimensionality is compromised by the ‘bladder’ item. It is also important to remember that these results apply only to stroke patients. Fit of Barthel Index data to the Rasch model, examination of unidimensionality and resulting item calibration should be investigated for other diagnoses, and may not be the same as presented here. Finally, the lack of relationship between the current weighting and item difficulty, as determined by the Rasch model, suggests the need for caution in application. This brings to mind the strictures of Silverstein and colleagues32: ‘The question of validity should not be "Is an instrument valid or not?" It is more properly phrased, "How valid is it for a given purpose?"’
References
1 Mahoney FI, Barthel DW. Functional evaluation: the Barthel Index. Md State Med J 1965; 14: 61-65.
2 Granger CV, Albrecht GL, Hamilton BB. Outcome of comprehensive rehabilitation: measurement by the PULSES Profile and the Barthel Index. Arch Phys Med Rehabil 1979; 60: 145-51.
3 Granger CV, Dewis LS, Peters NC, Sherwood CC, Barrett JC. Stroke rehabilitation: analysis of repeated Barthel Index measures. Arch Phys Med Rehabil 1979; 60: 14-17.
4 Wade DT, Skilbeck CE, Langton-Hewer R, Wood VA. Therapy after stroke: amounts, determinants and effects. Int Rehabil Med 1984; 6: 105-10.
5 Shah S, Cooper B, Maas F. The Barthel Index and ADL evaluation in stroke rehabilitation in Australia, Japan, the UK and the USA. Aust J Occup Ther 1992; 39: 5-13.
6 Geddes JML, Chamberlain MA. Outcome of stroke rehabilitation - observing current practice: a prerequisite for targets and standards. Clin Rehabil 1992; 6: 253-60.
7 Wade DT. Measurement in neurological rehabilitation. Oxford: Oxford University Press, 1992.
8 Collin C, Wade DT, Davies S, Horne V. The Barthel ADL Index; a reliability study. Int Disabil Stud 1988; 10: 61-63.
9 Law M, Letts L. A critical review of scales of ADL. Am J Occup Ther 1989; 43: 522-28.
10 Murdock C. A critical evaluation of the Barthel Index, part 1. Br J Occup Ther 1992; 55: 109-11.
11 Murdock C. A critical evaluation of the Barthel Index, part 2. Br J Occup Ther 1992; 55: 153-56.
12 Bowling A. Measuring health. Milton Keynes: Open University Press, 1991.
13 Shah S, Cooper B. Commentary on ‘A critical evaluation of the Barthel Index’. Br J Occup Ther 1993; 56: 70-72.
14 Wright BD, Stone MH. Best test design. Chicago, IL: MESA Press, 1979.
15 Shah S, Vanclay F, Cooper B. Improving the sensitivity of the Barthel Index for stroke rehabilitation. J Clin Epidemiol 1989; 42: 703-709.
16 Rasch G. Probabilistic models for some intelligence and attainment tests. Chicago, IL: University of Chicago Press, 1980.
17 Boone WJ. Using item calibration to improve teacher education. Rasch Measurement Trans 1992; 5: 180-81.
18 McArthur DL, Cohen MJ, Schandler SL. Rasch analysis of functional assessment scales: an example using pain behaviours. Arch Phys Med Rehabil 1991; 72: 296-304.
19 Granger CV, Hamilton BB, Linacre JM, Heinemann AW, Wright BD. Performance profiles of the Functional Independence Measure. Am J Phys Med Rehabil 1993; 72: 84-89.
20 Wright B. IRT in the 1990s: Which models work best? Rasch Measurement Trans 1992; 6: 196-200.
21 Guttman L. The basis of scalogram analysis. In: Stouffer SA, Osborne F eds. Measurement and prediction. New York: Wiley, 1950.
22 Nouri FM, Lincoln NB. An extended activities of daily living scale for stroke patients. Clin Rehabil 1987; 1: 301-305.
23 Barer DH, Murphy JJ. Scaling the Barthel: a 10-point hierarchical version of the activities of daily living index for use with stroke patients. Clin Rehabil 1993; 7: 271-77.
24 Streiner DL, Norman GR. Health measurement scales. Oxford: Oxford University Press, 1994.
25 Linacre JM. Log-odds in Sherwood Forest. Rasch Measurement Trans 1991; 5: 162-63.
26 Wright BD, Linacre JM. A user’s guide to BIGSTEPS. Chicago, IL: MESA Press, 1992.
27 Cortina JM. What is coefficient alpha? An examination of theory and applications. J Appl Psychol 1993; 78: 98-104.
28 Granger CV, Hamilton BB, Sherwin FS. Guide for the use of the uniform data set for medical rehabilitation. New York: Uniform Data System for Medical Rehabilitation Project Office, Buffalo General Hospital, 1986.
29 Magalhaes L, Velozo C, Pan A-W, Weeks D. Medical multidimensionality. Rasch Measurement Trans 1993; 7: 265-66.
30 Wade DT, Leigh-Smith J, Hewer RL. Social activities after stroke: measurement and natural history using the Frenchay Activities Index. Int Disabil Stud 1985; 7: 176-81.
31 DeJong G, Branch LG. Predicting the stroke patient’s ability to live independently. Stroke 1982; 13: 648-55.
32 Silverstein B, Fisher WP, Kilgore KM, Harley JP, Harvey RF. Applying psychometric criteria to functional assessment in medical rehabilitation: II. Defining interval measures. Arch Phys Med Rehabil 1992; 73: 507-18.
33 World Health Organization. The International Classification of Impairments, Disabilities and Handicaps. Geneva: WHO, 1980.