Educational Evaluation and Policy Analysis, Spring 1999, Vol. 21, No. 1, pp. 29-45
Measuring Instructional Practice: Can Policymakers Trust Survey Data?

Daniel P. Mayer
Mathematica Policy Research, Inc.

Policymakers' new interest in reforming teaching has created a demand for accurate data on instructional practice. Most such data come from teacher surveys, although the accuracy of these surveys in assessing practice is virtually unknown. This exploratory study examines the reliability and validity of using a survey to gauge the percentage of time that algebra teachers use practices that are consistent with professional standards for teaching mathematics. A 13-item composite measuring the frequency with which teachers use these practices had a test-retest reliability of .69 based on two waves of survey data collected four months apart. In addition, there was a .85 correlation between a composite based on survey data and a parallel composite based on classroom observations. Both sets of results suggest that the composite is quite reliable, and the second set of results suggests that it has some validity. This composite fails, however, to capture the quality with which teachers engage in reform practices. In addition, the results indicate that teachers do not reliably report how much time they spend using one practice or another. This suggests that the trend of presenting individual indicators, rather than composite indicators, in state and national reports may be misguided. More research should be done to improve the reliability and validity of both.
Policymakers are trying to reform education by reforming teaching practices (Blank & Pechman, 1995). To monitor the impact of these unprecedented efforts, the country needs accurate instructional practice data, which have traditionally been in short supply and of questionable quality (see, e.g., Burstein et al., 1995). The push for the routine collection of instructional practice data only began in the late 1980s (see, e.g., Murnane & Raizen, 1988; OERI, 1988; Porter, 1991; Shavelson, McDonnell, Oakes, Carey, & Picus, 1987). In 1997, the National Center for Education Statistics (NCES) added an "instructional methods" category to its list of annually released vital statistics (U.S. Department of Education, 1997b), and other national and state reports are just beginning to present similar data (Council of Chief State School Officers, 1997; U.S. Department of Education, 1997a; Weiss, Matti, & Smith, 1994). The NCES data tell us the frequency with which teachers lecture, work with students in small groups, and work with students individually. As data of this sort become more
widely available, their quality needs to be assessed before we use them to pass judgment on the state of instructional reform. Historically, education reforms have tinkered at the edges of the educational process. "In its two and a quarter centuries, the United States has never [until now] had explicit education content... goals" (Marshall, Fuhrman, & O'Day, 1994, p. 12). Even the extensive reform efforts of the 1970s and 1980s remained aloof from curriculum and teaching practices. During those decades, policymakers tried to improve schooling by adjusting resource allocations (e.g., striving for racial balance and financial equity) and by setting outcome goals (e.g., setting minimum course requirements and implementing minimum competency tests). The perceived failures of these policies arguably have led to the country's current enthusiasm for educational standards aimed at influencing teaching practice and the need for high-quality teacher practices data. Much of the data are self-reported by teachers (e.g., Council of Chief State School Officers, 1997;
U.S. Department of Education, 1992; Weiss et al., 1994), even though Burstein et al. (1995) suggest that the validity of the process questions used on the national teacher surveys might be limited because none of the national survey data collected from teachers have been validated to determine whether they measure what is actually occurring in classrooms.

Despite major advances in the design of background process measures, studies have generally developed only a few new items and then "borrowed" others from earlier studies. Little effort has been made to validate these measures by comparing the information they generate with that obtained through alternative measures and data collection procedures. (p. 8)

Why should the public be skeptical about the quality of these data? Burstein et al. (1995) argue that

all surveys are limited in their ability to portray a valid picture of the schooling process.... [S]ome aspects of curricular practice simply cannot be measured without actually going into the classroom and observing the interactions between teachers and students. These interactions include discourse practices that evidence the extent of students' participation and their role in the learning process, the specific uses of small-group work, the relative emphasis placed on different topics within a lesson, and the coherence of teachers' presentations. (p. 7)

Surely, this perceived limitation of surveys, combined with policymakers' and researchers' historical emphasis on input-output studies, helps explain why much of what the country currently knows about the instructional process comes from in-depth studies done in only a handful of classrooms. The problem with these in-depth studies is that their generalizability to other classrooms is unknown. As reform initiatives increasingly focus on classroom processes, demand for impact analyses increases, and the generalizability limitations of in-depth studies become more and more problematic. In turn, the appeal of surveys grows because they are a cost-effective way to include large numbers of classrooms in studies. Alternative study models that straddle these two approaches for gathering teacher practice data are being tried. The Third International Mathematics and Science Study (TIMSS) supplements teacher surveys with a "video survey" of teachers in classrooms. The video survey, like classroom observations, promises objectivity and specificity but has
the added advantage of being available for wider and more systematic scrutiny. The TIMSS approach does not, however, surmount the primary hurdle associated with conducting classroom observations, which is cost. Conducting video surveys in a nationally representative sample of classrooms of different grade levels and subject areas undoubtedly would be cost-prohibitive. Consequently, teacher self-reports of the sort collected in large national studies such as the Schools and Staffing Survey and the National Survey of Science and Mathematics Education (U.S. Department of Education, 1992; Weiss et al., 1994) remain the most viable means for obtaining information about the status of instructional practice in the United States. Although reliability and validity are the fundamental components used in assessing the quality of measurement instruments (Carmines & Zeller, 1979), the research assessing the validity and reliability of self-reported teacher data is limited in quantity and focus. The reliability of an instrument indicates whether its use on repeated trials will yield the same results; knowing that an instrument is reliable, however, does not mean that it is valid. This study explored these issues by seeking to answer the question: Do teacher surveys offer an accurate portrait of what teachers do in their classrooms?1 The study focused on the use of the National Council of Teachers of Mathematics' (NCTM) Professional Standards for Teaching Mathematics (National Council of Teachers of Mathematics, 1991) in algebra classrooms.2 Examining the accuracy of surveys within this context afforded a test of Burstein et al.'s (1995) suggestion that surveys used in reform settings may well be invalid. NCTM was one of the earliest and most important players in the development of curriculum and teaching standards (National Council of Teachers of Mathematics, 1989). Most states have recently created or revised mathematics curriculum frameworks with explicit recommendations regarding teaching practices that are heavily influenced by the NCTM standards (Blank & Pechman, 1995). The ideas presented in these standards undergird not only many of the state frameworks, but also other prominent science, mathematics, and technology education reform movements throughout the United States and other developed countries (Black & Atkin, 1996). In addition to their prominence in the curriculum-reform movement, the NCTM standards are significant because their successful
implementation could provide future economic rewards for students. Employers today require workers with higher mathematics skills than in the past, and some of the most important conceptual skills emphasized by the NCTM approach (e.g., the ability to solve problems, to make conjectures, and to communicate both verbally and in writing) are increasingly valued in the workforce (Murnane & Levy, 1996). The NCTM approach argues that optimal learning of mathematics requires that teachers place less emphasis on memorization of facts and mastery of routine skills and greater weight on application, reasoning, and conceptual understanding.

For students to understand what they learn, they must enact for themselves verbs that permeate the mathematics curriculum: "examine," "represent," "transform," "solve," "apply," "prove," "communicate." This happens most readily when students work in groups, engage in discussion, make presentations, and in other ways take charge of their own learning. (National Council of Teachers of Mathematics, 1989, pp. 58-59)

In assessing the reliability and validity of teaching practice measures, this study explores how the reliability of measuring the degree of use of practices consistent with the NCTM methods is affected by moving from relying on individual survey questions to relying on a set of questions. It is widely acknowledged that measurement quality increases if additional items measuring the same underlying characteristic are combined into one indicator of that characteristic (Light, Singer, & Willett, 1990). Furthermore, previous research on the subject has concluded that "an item-by-item" description of teacher practices "fails to produce a coherent picture of instruction" (Burstein et al., 1995, p. 36). A fundamental concern of this study was whether individual survey items like those used by NCES are reliable and valid measures of classroom practice and how combining them might improve their overall quality.
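A standard psychometric illustration of this point (a general result, not an analysis drawn from the studies cited here) is the Spearman-Brown formula: an equally weighted composite of k parallel items, each with reliability ρ, has reliability kρ / [1 + (k − 1)ρ]. Thirteen parallel items with an individual reliability of only .30, for example, would yield a composite reliability of roughly .85, which is why a multi-item indicator can be far more dependable than any single survey question.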
Prior Research

Few studies have examined the quality of self-reported data on teaching practices, although these data are widely circulated (Council of Chief State School Officers, 1997; U.S. Department of Education, 1997a; Weiss et al., 1994) and used to assess the impact of reforms (Carpenter, Fennema, Peterson, Chiang, & Loef, 1989; Cobb et al., 1991; Knapp & Associates, 1995; Simon & Schifter, 1993).
To date, Smithson and Porter (1994) and Burstein et al. (1995) have completed the only systematic investigations of the accuracy of using surveys to measure various aspects of the curriculum.3 The Smithson and Porter study was designed to look at how the mathematics and science curriculum was delivered in secondary schools. The bulk of the classroom practice data for this study came from daily logs, which teachers kept for a full year. Sixty-two teachers in 18 schools in six states completed, on average, 165 logs describing their curricula and teaching approaches. Teacher interviews, classroom observations, and teacher questionnaires were also used to examine the curriculum. Burstein et al. designed their study to investigate the instructional content, goals, and strategies used in secondary mathematics classrooms. The study included 70 mathematics teachers from nine schools in two states. Over 5 weeks, the researchers collected assignments and exams given by the teachers. During this same period, teachers completed daily logs in which they reported what they taught and how they taught it. Teachers were also asked to complete two surveys, one given at the beginning of the 5-week period and the other at the end of the semester. The surveys asked teachers to describe (prospectively on the first survey and retrospectively on the second) the instructional practices they used and the topics they covered. The textbooks used in the class were also reviewed by the researchers. These two studies provide useful information about teacher logs, surveys, and instructional artifacts (e.g., assignments, exams, textbooks) that can be used to measure many aspects of the curriculum, not just teaching practices. Yet as explained below, the designs of these studies limited the researchers' ability to directly test the reliability and validity of using surveys to measure instructional practice.

Reliability
The reliability of an instrument is based on that instrument's "ability" to elicit the same response each time the instrument is administered, regardless of whether it is a survey or a multiple-choice exam. And knowing whether an instrument is reliable is essential given that "reliability is a necessary precondition for validity. An unreliable measure cannot be valid" (Light et al., 1990, p. 172, italics added).
Burstein et al. re-administered a survey to the same teachers twice, once at the beginning of their study and again at the end of the semester. (Smithson and Porter did not re-administer their survey.) The Burstein survey questions concerned instructional activities, content, goals, and teacher background. They found that 60% of the responses were exactly the same, and 90% were within one response category. For some items, however, "a large proportion of responses changed" (1995, p. 16). They speculated that this might be because "teachers have more precise information at the end [of the year]" (p. 16). While they specified that the responses that changed concern instructional practice, the authors did not document all of the items associated with a changed response or what percentage of the total number of instructional practice items they represented. In addition, there is no way to verify whether the reason for the change was, as the authors speculate, that teachers could be more precise about their estimates at the end of the year or that the survey instrument is unreliable. Both Smithson and Porter and Burstein et al. also provide data on the degree of overlap between teacher logs and teacher surveys. While these data do not provide a direct assessment of the reliability of their surveys, they do cast some light on the issue. The data from these two studies cannot tell us whether the absence of overlap, or limited overlap, is because of a problem with the logs, the survey, or both. As a result, the reliability of both instruments remains questionable, and the ability to compare data across instruments thus becomes problematic. Nevertheless, it makes sense to take a look at their survey/log results because if there is a high degree of overlap, it suggests the instruments are reliable. Smithson and Porter (1994) found that the correlation for 6 of the 11 instructional activities is less than .43 and less than .60 for 9 of the 11 activities (see Table 1).4 The authors believe these correlations are encouraging, especially given that their survey and logs were not as congruent as they could have been. The questionnaire data pertained to only half of the school year, whereas the log data were for the full year. In addition, some teachers provided retrospective data on the survey, whereas others provided prospective data. Finally, the scale used to provide estimates of the time spent on instructional approaches was not the same for the logs and surveys. Each of these factors may have contributed to lowering the correlations.
TABLE 1
Correlations Between Teacher Practices Reported on Surveys With Those Reported on Logs

Lab or field work                               0.65
Whole-class discussion                          0.63
Complete written exercises/take a test          0.53
Discuss/discovery lesson                        0.52
Students working independently                  0.47
Students working in pairs/teams/small groups    0.42
Lecture                                         0.41
Listen/take notes                               0.40
Recitation/drill                                0.39
Present/demonstrate                             0.25
Write report/paper                              0.21

Note. Adapted from Smithson and Porter (1994).

Burstein et al. (1995) found that for 9 of the 13 items, the agreement between the end-of-the-term surveys and the logs was less than 51%, and for all 13, it was 60% or less (see Table 2). The authors conclude that this rate of direct agreement is "quite low" (p. 40). This lack of agreement might be a result of differences in the period of data collection: The logs were designed to collect data covering 25 days of instruction, whereas the surveys collected data pertaining to the whole semester. Another reason for the lack of agreement might be that, to compare the survey to the log entries, the authors had to convert the continuous data from the logs into the categorical responses on the survey. Burstein et al., however, note that if the answers within one survey response category are considered, the rate of agreement changes dramatically (see Table 2). For example: "Students working in small groups" goes from 50% agreement to 100% agreement; "lecture" goes from 57% to 97%; "teacher-led discussions" goes from 35% to 87%. But given that there are only four categories ("almost every day," "once or twice a week," "once or twice a month," and "never"), it is not clear that this approach is statistically or substantively meaningful. Statistically, given that there are only four categories, the odds that a person picks the same response are one in four, even if there is no association between their responses. These odds more than double when the within-one-category rule is applied. A person who selects the first or fourth category the first time will have at least a 50% chance of being within one category of that response the second time. A person who selects either of the two
TABLE 2
Consistency Between the Reported Frequency of Instructional Activities on Teacher Logs and Teacher Surveys

Activity                                                  Direct agreement (%)   Within one survey response category (%)
Administer a test                                         60                     83
Have students work exercises at the board                 59                     92
Lecture                                                   57                     97
Use manipulatives to demonstrate a concept                53                     84
Have students work with other students in small groups    50                     100
Demonstrate exercise at the board                         50                     85
Have students work on computer                            49                     87
Have students work with manipulatives                     48                     87
Have students work with calculator                        47                     91
Have students work individually on written assignments    43                     83
Correct or review homework in class                       38                     85
Have students work on next day's homework in class        35                     62
Have teacher-led, whole-group discussions                 35                     87

Note. Adapted from Burstein, L., McDonnell, L. M., Van Winkle, J., Ormseth, T., Mirocha, J., and Guitton, G. (1995).
middle categories the first time has a 75% chance of being within one category the second time. Under the assumption of no association between responses, 62.5% of the responses should be within one category if the distribution of responses were evenly spread out among the categories. Despite these statistical arguments, being within one survey response category does not seem to be very meaningful in terms of substance because the categories represent such different amounts of time. Should "once or twice a week" and "once or twice a month" be considered the same? Or "once or twice a month" and "never"? Thus, both substantively and statistically, these "improved" agreement rates do not offer convincing evidence that the survey is reliable. In sum, the surveys and logs do not appear to overlap enough to prove with any degree of confidence that the surveys are reliable. But because we do not know why these statistics of agreement are so mediocre, we cannot say they are unreliable either.
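These chance-agreement figures follow from simple counting. The sketch below is illustrative only; it assumes two independent responses spread evenly across the four categories and reproduces the 25%, 50%, 75%, and 62.5% values cited above.

from itertools import product

categories = [1, 2, 3, 4]  # "almost every day" ... "never"

# Chance of exact agreement when the two responses are independent and uniform
exact = sum(a == b for a, b in product(categories, repeat=2)) / 16
# -> 0.25: one chance in four of picking the same category by luck alone

# Chance of landing within one category of a given first response
within_one_given_first = {
    first: sum(abs(first - second) <= 1 for second in categories) / 4
    for first in categories
}
# -> {1: 0.5, 2: 0.75, 3: 0.75, 4: 0.5}: 50% for the end categories,
#    75% for the two middle categories

# Overall chance of being within one category when responses are unrelated
within_one = sum(abs(a - b) <= 1 for a, b in product(categories, repeat=2)) / 16
# -> 0.625, the 62.5% figure cited in the text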
able" or "socially desirable" way; and (3) teachers might unknowingly provide misleading responses to the survey questions. Research suggests that teachers sometimes truly believe they are embracing pedagogical reforms, but in practice, their teaching comes nowhere near the vision of the reformers (Cohen, 1990). It would be inappropriate to use Burstein et al. and Smithson and Porter's measures of consistency between teacher logs and teacher surveys to assess validity. As Burstein et al. note, "Because logs were completed by teachers, they do not constitute an external source for validating the surveys" (p. 17). In essence, because teachers could provide, on both instruments, an equally inaccurate portrait of what they are doing in their classrooms, the instruments cannot be used as a check against each other. Classroom observation data provide one potential external standard against which to assess the validity of teacher self-reports. Smithson and Porter collected classroom observation as part of their study, and they compared these observation data with the teacher logs. But they did not compare the observations with the surveys. Burstein et al. did not collect classroom observation data, but they used what they call classroom artifacts (i.e., textbooks, exams, and assignments) to assess the validity of some items on their surveys. While this approach is useful for looking at some dimensions of the curriculum, they find that instructional practice "is the dimension of the curriculum least amenable to validation through writ-
written artifacts" (p. 39). Textbooks and assignments do not reveal, for example, anything about the amount of class time spent engaged in particular activities (e.g., group work, lecture, or class discussion). Burstein et al. did compare the artifacts with teacher responses on the survey concerning the characteristics of their exams and homework assignments. They conclude: "To the extent that we were able to validate the survey data on teachers' instructional strategies, we found that those data report accurately the instructional strategies used most often by teachers" (p. 45). Burstein et al. note, however, that their validity findings may well be driven by the fact that only a handful of teachers in their sample embraced teaching activities consistent with the mathematics reform movement. "Although the picture of teaching that can be drawn from survey data is quite general, it is probably valid because both the survey and the artifact data clearly show that there is little variation in teachers' instructional strategies" (p. 45, italics added). They make this qualification because they found that the survey items having to do with the teaching methods endorsed by NCTM were the least valid. In other words, the lowest levels of consistency existed between these questions and the artifact data. Although Burstein et al.'s validity findings are encouraging, they would be seriously limited if, as the authors imply might be the case, the survey items are not valid in places where teachers have embraced mathematics reform. Indeed, as policymakers and educators push standards-based reforms, variation in mathematics teaching styles likely will increase. And if a primary reason for validating survey instruments is to ensure that they can provide us with critical information about the progress of reforms (e.g., the degree of implementation or the impact on student learning), then Burstein et al.'s concerns must be taken seriously. Given the unprecedented interest in modifying instructional practice, researchers, educators, and policymakers need more information on whether surveys can accurately measure instructional practice. The study described in this article attempts to address this need.

Methods

Overview

Research for the study was conducted in two parts: one designed to test data reliability and one
to test data validity. The reliability study examined whether teachers' answers to questions regarding their teaching style were consistent over time. Identical survey questions were administered to the same group of teachers on two separate occasions. The validity study compared the survey responses of a group of teachers with observed classroom practice.

Study Site

The target population was in the Elm school district (a pseudonym for the actual district), which was chosen because the probability that teachers there were implementing practices consistent with the NCTM standards was thought to be high. The Elm district is in one of the country's largest school systems. The district rings a major city and has over 100,000 students. Seventy percent of the student population is Black, 20% White, 4.6% Hispanic, and 4.4% Asian. Thirty-seven percent of the students receive free or reduced-price school meals. In 1989, just after the release of the first NCTM standards document (National Council of Teachers of Mathematics, 1989), the Elm school committee recommended that the NCTM standards be implemented in all mathematics courses. Incentives for teachers to adopt the standards came from the state's testing program and the professional training offered by the College Board's EQUITY 2000 project. The state in which Elm resides is one of the only states in the country using an "authentic" assessment program. The program has been in place for several years and in many respects is aligned with the standards. In 1991, the College Board's EQUITY 2000 project committed over $2 million to conduct five annual summer professional development institutes for all Elm algebra teachers. The institutes focused on teaching the teachers to use the standards in their algebra classrooms (Choike, 1993), and the average teacher attended at least two institutes.

Teaching Style

In the study survey, 34 questions were used to gauge the amount of time each teacher used practices consistent with the NCTM teaching standards in his or her classroom. The survey asked teachers to describe the practices used in a typical Algebra 1 class over the course of the entire year. Table 3 lists the 17 teacher practices measured by these questions. Four of them represent traditional activities (e.g., lecturing and working from textbooks);
the remaining 13 represent NCTM-type activities (e.g., working in small groups and discussing different ways to solve problems).

TABLE 3
Teacher Practice Variables Used to Assess the Percentage of Classroom Time Teachers Devote to Activities Consistent with the NCTM Standards

Traditional approaches. Students...
  Listen to lectures
  Work from a textbook
  Take computational tests
  Practice computational skills

NCTM approaches. Students...
  Use calculators
  Work in small groups
  Use manipulatives
  Make conjectures
  Engage in teacher-led discussion
  Engage in student-led discussion
  Work on group investigations
  Write about problems
  Solve problems with more than one correct answer
  Work on individual projects
  Orally explain problems
  Use computers
  Discuss different ways to solve problems

The descriptions of teacher practices were culled from national and statewide studies designed to gauge the level of implementation of an NCTM-type teaching approach (Center for Research on the Context of Teaching, 1994; Porter, Kirst, Osthoff, Smithson, & Schneider, 1993; Weiss et al., 1994). These descriptions were used because they had been thoughtfully developed for projects with substantial resources, they are in vogue, and they are being used to shed light on the current state of teaching reforms. The questions and the response options are almost identical to questions used by NCES in their Teacher Follow-Up Survey (U.S. Department of Education, 1992). The survey was constructed to avoid making the teachers think they should say they are using the NCTM approach. In fact, the survey never mentions the NCTM standards and steers clear of any implication that it is part of an EQUITY 2000 evaluation effort. The survey cover letter was written by the director of the Elm district's research and evaluation office. It states that the office is conducting "an external agency study" in order to better
derstand "the teachers' opinions and practices." The teaching practice questions are given no special emphasis and appear in the middle of the survey. Calculating the average weekly amount of time teachers spent using each teaching approach in volved three steps. First, the six-point Likert scales used in the survey (and presented in Table 4) were converted into days per year for frequency and minutes per class period for duration. Second, fre quency was multiplied by duration to provide an estimate of the number of minutes spent on each teaching approach over the course of the school year. Third, an estimate of the amount of time spent using each teaching approach weekly was arrived at by dividing the total minutes per year by the av erage number of weeks in a school year. While evaluating the amount of time teachers spend using each of the 17 activities is informa tive, creating one indicator (or more) made up of a group of items may be even more so. Conceptu ally, looking at the items separately is somewhat misleading because the 17 activities are not mutu ally exclusive. Teachers could require students to be engaged in activities that include both NCTM and traditional components (e.g., students use cal culators when they work on textbook problems). By creating an indicator of the percentage of time that teachers spend using the 13 NCTM-type ap proaches (relative to all 17 approaches), the pre ferred pedagogical style of the teachers could be identified. A statistical argument for combining variables can also be made: When items representing the same construct (such as NCTM instruction) are combined, the indicator becomes more reliable (Light et al., 1990). Thus, for both substantive and statistical reasons, both the reliability of the indi vidual indicators and the reliability of a composite measure were investigated. An equally weighted composite variable was cre ated that represents the percentage of available teaching time that each teacher spends using the 13 practices that are consistent with the NCTM stan dards. The composite measure of the 13 variables has a very high internal reliability rating (α = .85). There may be other ways to composite these vari ables. For example, an argument could be made for two composites—one representing cognitive skills (e.g., making conjectures, solving problems, explaining answers, and discussing ways in which the problems were solved) and the other represent ing instructional skills (e.g., using calculators, work-
TABLE 4
Teaching Style Scale Conversions

How often?a (original scale = conversion)
  Never = 0 days
  A few times a year = 5 days
  Once or twice a month = 14 days (1.5 x 9.1)
  Once or twice a week = 55 days (1.5 x 36.8)
  Nearly every day = 129 days (3.5 x 36.8)
  Daily = 184 days

For how many minutes?b (original scale = conversion)
  None = 0 minutes
  A few minutes in the period = 5 minutes
  Less than half = 15 minutes
  About half = 25 minutes
  More than half = 37.5 minutes
  Almost all of the period = 50 minutes

a Assumes 184 days, 36.8 weeks, 9.1 months in a school year. b Assumes a 50-minute period.
Sample and Analytic Approach

The target population for this study includes all Algebra 1 teachers in the district. All teachers who completed a survey and answered the questions pertaining to teaching practices were eligible to be in the reliability and validity study (n = 124). The response rate for the survey was 80%, and the teachers who completed a survey are not statistically different (in years of teaching experience, highest degree attained, gender, or ethnicity) from those who did not. Table 5 presents some demographic information for the study samples. The reliability study consisted of a stratified random subset of 19 teachers. Observed changes in teacher responses are thought to reflect lack of consistency rather than actual change in practice because no Elm, EQUITY 2000, or NCTM professional development activities occurred during the four months between the first and second
administrations of the survey.5 All teachers in the analytic sample were sorted by their initial self-reported NCTM score into four quartiles. Then 5 teachers were randomly selected from each quartile. Nineteen of 20 teachers completed the follow-up survey, providing a response rate of 95%. The validity study used a stratified random sample of nine teachers. The full sample of teachers was stratified by the teachers' initial self-reported level of NCTM implementation into three groups: teachers whose NCTM score was in the bottom quartile, the middle two quartiles, and the top quartile. Three teachers were selected from each stratum and observed for three class periods. These teachers were selected by an independent researcher to ensure that the observations would not be colored by knowing the teachers' self-reported NCTM scores. The teachers' survey responses were withheld until all observations were carried out and the collected data were coded. Because a teacher might emphasize a certain type of pedagogical approach when teaching a specific part of the curriculum, the three observations of each teacher were spread out over a period of weeks to allow for observations of the teachers teaching different material.
TABLE 5
Demographics of Study Samples

Variable                                 Full sample (N = 124)   Reliability study (N = 19)   Validity study (N = 9)
Percent of time using NCTM approaches    68% (20%)               69% (20%)                    58% (25%)
Black                                    33%                     42%                          44%
White                                    67%                     58%                          66%
Female                                   52%                     68%                          44%
Male                                     48%                     32%                          66%
Years' teaching                          13.2 (10.3)             12.8 (10)                    12.1 (10.1)

Note. Values are M (SD) where applicable.
During the observation, the teaching activities used in each lesson were described in detail. Special attention was paid to the content of the lesson and to the dialogue and interactions among students and teachers. The time spent on each distinct activity was noted, as were the types of pedagogical approaches used throughout the lessons. Two analytic approaches were used to interpret these observations. One, "the time estimation approach," is quantitative and consists of assigning each teacher an "observed" NCTM score. The observed score is then correlated with the "self-reported" NCTM score in order to gauge how well the observed and self-reported time estimates match. The second approach, "the descriptive approach," consists of juxtaposing in-depth descriptions of actual lessons with the findings from both the teachers' survey responses and the time-estimation analysis. This approach seeks to reveal whether there are limitations in using either the self-reported or observed time estimates.
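The sketch below illustrates the kind of computation involved in the time-estimation approach: correlating self-reported with observed composite scores and running a paired t test. The nine scores shown are hypothetical placeholders (the study's teacher-level values are not reproduced in the article), and the SciPy routines are simply a convenient way to carry out the correlation and paired t test reported under Findings.

import numpy as np
from scipy.stats import pearsonr, ttest_rel

# Hypothetical composite scores (% of time on NCTM-type practices);
# the actual teacher-level values are not published in the article.
self_reported = np.array([16, 25, 40, 55, 60, 68, 73, 78, 80], dtype=float)
observed      = np.array([14, 18, 30, 42, 47, 55, 73, 62, 66], dtype=float)

# How well do the two measures move together? (the article reports r = .85)
r, r_pvalue = pearsonr(self_reported, observed)

# Do teachers systematically report more NCTM time than is observed?
# (the article reports a gap of about 10 percentage points, p = .06)
t, t_pvalue = ttest_rel(self_reported, observed)
mean_gap = (self_reported - observed).mean()

print(f"r = {r:.2f} (p = {r_pvalue:.3f}); mean self-report minus observed = {mean_gap:.1f}")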
Findings

Reliability

The reliability of the NCTM composite is encouraging. Teachers reported using NCTM-type practices 69% of the time on the first survey and 70% of the time on the second survey. The correlation between these two NCTM composites was .69 (p = .0013), and a paired t test illustrates that the means did not differ significantly from one period to the next (p = .45). This indicates that the NCTM composite is highly reliable. The picture is less encouraging when we look at the reliability of individual instructional practices. Table 6 presents two sets of means and standard deviations for each instructional practice indicator, one from each wave of the survey. If the indicators are ranked by minutes per week, the overall pattern in the ranking is quite similar. For example, the top five reported teaching practices (using calculators, working in small groups, working from textbooks, teacher-led discussion, and practicing computational skills) and the bottom five reported teaching practices (solving problems that have more than one correct answer, writing about problems, working on group investigations, student-led discussion, and working on individual projects) are the same for both survey waves, although the order within these bands differs a little. In addition, for 10 of the 17 practices, the mean amount of time using each practice is quite similar in each wave of the survey. Less encouraging is that the average amount of time these practices are used does shift quite a bit between survey waves.
TABLE 6
The Number of Minutes Per Week That Teachers Reported Using Various Instructional Practices on the First and Second Survey (N = 19)

Item                                        First survey M (SD)   Second survey M (SD)
Use calculators                             94.0 (67.2)           87.3 (70.5)
Work in small groups                        77.0 (66.0)           74.8 (47.5)
Work from a textbook                        65.5 (46.2)           51.1 (49.8)
Teacher-led discussions                     55.2 (51.6)           54.5 (44.5)
Practice computational skills               52.5 (40.7)           69.7 (54.3)
Listen to lectures                          42.9 (36.8)           34.9 (32.4)
Discuss different ways to solve problems    40.5 (41.4)           44.9 (46.5)
Orally explain problems                     37.5 (40.6)           37.1 (36.7)
Conjectures                                 33.0 (40.7)           37.8 (48.1)
Use manipulatives                           32.8 (47.8)           32.4 (48.4)
Take computational tests                    25.9 (43.2)           27.0 (29.1)
Use computers                               15.0 (19.3)           28.4 (43.3)
More than one correct answer                13.8 (29.5)           8.8 (12.8)
Write about problems                        12.5 (17.4)           15.3 (19.4)
Group investigations                        6.0 (6.9)             9.1 (16.8)
Student-led discussions                     5.5 (12.1)            13.6 (18.5)
Work on individual projects                 5.0 (5.7)             5.1 (8.4)
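The between-wave shifts discussed in the next paragraph can be checked directly against the Table 6 means; the short sketch below does nothing more than compute the percent change between waves for three of the practices.

# Percent change in mean reported minutes per week between survey waves,
# using the Table 6 means (wave 1 -> wave 2).
table6_means = {
    "use computers": (15.0, 28.4),
    "practice computational skills": (52.5, 69.7),
    "work from a textbook": (65.5, 51.1),
}

for practice, (wave1, wave2) in table6_means.items():
    change = 100 * (wave2 - wave1) / wave1
    print(f"{practice}: {change:+.0f}%")
# -> roughly +89%, +33%, and -22%, the shifts cited in the text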
For example, there is a 22% to 89% difference between the average amounts of time teachers reported spending on 7 of the 17 practices (working from a textbook, practicing computational skills, listening to lectures, solving problems with more than one correct answer, writing about problems, working on group investigations, and engaging in student-led discussion) across the survey waves. On average, teachers reported spending 89% more time using computers, 33% more time practicing computational skills, and 22% less time using textbooks the second time they completed the survey. A correlation analysis reveals more about the consistency of the teachers' responses between waves. Table 7 shows that only 3 of the 17 teacher-practice items measuring time are correlated at .60 or above (a moderate or somewhat problematic level of reliability) and 9 of the items are correlated at .30 or below (a low and very problematic level of reliability). Do these weak correlations indicate that the survey does not provide a reliable measure of how teachers divide their time among various individual instructional approaches, or are the weak correlations an artifact of limitations inherent in the data? The following explanations for the weak correlations are worth considering: (a) They are due to the categorical nature of the data; (b) there is a problem inherent in one of the two variables used to
construct the measure of time (time = frequency x duration; see the "Methods" section above); (c) some of the response options were hard to distinguish between (e.g., "nearly every day," "daily," and "once or twice a week" might be so closely clustered that teachers could reasonably select one the first time they completed the survey and another the next time); or (d) the teachers' estimates differed from one wave to the next because the surveys were given four months apart. The first explanation, that the categorical nature of the data limits the possibility of finding strong correlations, appears to be weakened by the fact that each indicator of time has 36 plausible combinations and 26 plausible values (see Table 4). When a variable is measurable at only 11 levels, then it is safe to say that "little information is lost relative to when more continuous measurement is possible" (Nunnally, 1978, p. 123). Thus, the time indicator used in this analysis is akin to using continuous data. The possibility of a problem inherent in one of the two variables used to construct the measure of time was investigated by examining the frequency and duration-of-use questions separately using the original response option metric. As Table 7 shows, the overall picture does not change much. For frequency, some items appear more reliable (e.g., work on group investigations, orally explain problems), but others appear less reliable (e.g., discuss different ways to solve problems, write about problems). The same is true for duration.
TABLE 7
Correlation Between Answers on the First and Second Surveys (N = 19)

Item                                                Time     Duration   Frequency
Use calculators                                     0.66**   0.76***    0.77***
Write about problems                                0.65**   0.35       0.36
Make conjectures                                    0.60**   0.51*      0.78***
Discuss different ways to solve problems            0.48*    0.54*      0.25
Practice computational skills                       0.48*    0.61*      0.39
Use manipulatives                                   0.44~    0.76***    0.39
Take computational tests                            0.35     0.39       0.16
Engage in teacher-led discussions                   0.31     0.36       0.33
Solve problems with more than one correct answer    0.30     0.44       0.55*
Orally explain problems                             0.28     0.24       0.47*
Work from a textbook                                0.25     0.02       0.45~
Work on group investigations                        0.24     0.29       0.47*
Work on individual projects                         0.16     0.25       0.20
Listen to lectures                                  0.12     0.34       0.22
Work in small groups                                0.07     0.42~      0.26
Engage in student-led discussions                   -0.06    0.35       0.18
Use computers                                       -0.09    0.29       0.54*

~p < .10. *p < .05. **p < .01. ***p < .001.
The tight clustering of response options as a possible problem was explored by collapsing response categories (e.g., "nearly every day" and "daily" were coded to indicate the same response). Even after the categories were collapsed in various ways, the correlations did not improve. Thus, the findings do not appear to be driven by poorly discriminating response options. The fourth possible explanation for the weak correlations could be that the two waves of the survey were administered four months apart. Because teachers were asked to describe their practices used in a typical Algebra 1 class over the course of the entire year, they might change their estimates from one wave to the next simply because they are spending more or less time using each of the practices than they originally anticipated. However, given that the average teacher in the sample has been teaching for 13 years, it is hard to imagine that their initial estimates would be so far off as to fully explain the weak correlations presented in Table 7. In sum, none of the four explanations adequately reveals why the self-reports of individual teaching practices are unreliable, and thus the weak correlations do not appear to be an artifact of limitations inherent in the data. Instead, the weak correlations seem to indicate that the survey does not provide a reliable measure of how teachers divide their time among various individual instructional approaches. However, the NCTM composite measure does provide a consistent estimate of the amount of time teachers use a variety of NCTM-type practices.

The Validity Study

The time-estimation approach. The coding scheme for the observations consisted of deriving estimates of the amount of time teachers spent using each of the 17 teacher practices listed on the survey (see Table 3, above) and then calculating the percentage of time spent using teaching approaches consistent with the NCTM standards for each lesson and averaging it across the three visits. Observing the teachers only three times could not provide a complete picture of how much time the teachers spend using each of the 17 approaches because teachers use some of the practices infrequently. Thus, teaching approaches that are only used occasionally may or may not get observed. For this reason, the observations cannot accurately gauge the use of each of the 17 individual items.
Given this and the fact that individual items were already shown to be unreliable (and therefore, by definition, invalid), only the results pertaining to the validity of the NCTM composite are presented here. Figure 1 illustrates that the observed and self-reported composites are strongly correlated (r = .85, p = .004). Low self-reports are paired with low classroom observation scores, and high self-reports are paired with high classroom observation scores. The figure also shows a pattern of systematic inflating on the self-reports: The self-reported scores of seven of the nine teachers are above the line of equality. In other words, seven of nine teachers claimed they spent more time using the NCTM-type approaches than they used during the observations. A paired t test suggests that the means significantly differed (p = .06) by about 10 percentage points. At the very least, these findings suggest that the relative rankings of the self-reported scores are relatively correct. The data from this study do not prove that the observations underestimated, or the teachers' survey responses overestimated, the amount of time spent using the NCTM-type approaches. In sum, the survey appears to have a great deal of validity as an instrument for distinguishing the relative, if not the exact, amount of time teachers spend using NCTM approaches in their classrooms.
FIGURE 1. Percentage of time nine teachers spent using NCTM-type practices according to self-reports and classroom observations. [Scatterplot; horizontal axis: Classroom Observations (%).]
In designing the study, I investigated how many classroom observations would be sufficient to get an accurate picture of a given teacher's teaching style. I found that no solid evidence exists on this question. Conversations with several researchers who conduct classroom observations led me to believe that three visits were "intuitively" about right. Surprisingly, my initial classroom visits were the most closely aligned with the self-reported scores. The correlation between my first observed NCTM score and the reported score was .84. The observations from my second and third visits were correlated at .69 and .59, respectively. Why did the correlation between observed and reported scores go down with each visit? Was this because I got lucky on the first day and saw a very representative lesson? Or was it that the teachers came to realize what I was looking for and adjusted their teaching styles accordingly? There is some evidence to support the latter hypothesis because the teachers with the lowest reported NCTM scores used the NCTM-type approaches more in the second and third observations. Despite these fluctuations, further analysis of the amount of variation in the observations revealed that there was much more variation among teachers than among observations for a particular teacher. In fact, 74% of the overall variation in the observed scores existed between teachers, and only 26% existed within classrooms.

The descriptive approach. The time-estimation approach provides an aggregate look at whether the self-reported scores are similar to observed scores. The descriptive approach provides a detailed portrait of the classrooms to clarify what the NCTM measure is, or is not, telling us. Using the approach provides some insight into the strengths and weaknesses of the NCTM measure. A strength of the measure is illustrated by comparing the observed and reported teaching approaches used by two teachers, Mr. Hill and Ms. Jackson; a weakness in the measure is revealed by comparing that of two other teachers, Ms. Weiss and Mr. Hemphill, who have qualitatively different interpretations of the NCTM standards.

MR. HILL AND MS. JACKSON. Mr. Hill reported on his survey that he spent 16% of his overall class time implementing instructional approaches consistent with the standards, and the observations indicated that he spent about 14%. This places him among the least enthusiastic NCTM users in the study. His class of 9th- and 10th-grade students was rowdy; they had been grouped as part
of a statewide initiative to target students at risk of dropping out of high school. They were all African American. None of them had yet passed the state's basic skills test, thus raising the question of why they were in an algebra class. Mr. Hill is a barrel-chested, 6'5" African American in his mid-50s. He became a teacher after retiring from the army a few years before. Teaching, he says, is the hardest thing he has done since fighting in Vietnam. During the three observation classes, his students were engaged in seat work over 62% of the time and listened to lectures 21% of the time. The students were almost never asked to share their work with the class or to participate in any way other than working by themselves at their desks. During my visits to other classrooms, the most common opportunity for group interactions occurred when students reviewed their homework or during the warm-up activities at the beginning of class periods. The other teachers generally called students to the board or asked them to explain their answers to the class. Homework and warm-ups were not assigned or reviewed during the three lessons observed in Mr. Hill's classroom. Mr. Hill gave two lectures during the observations: one on finding the slope of the line, the other on using the Pythagorean theorem. They were both dry and mechanical. Students were rarely called on, and if they were, Mr. Hill usually asked for little or even answered his own questions for the students. For example, after explaining the Pythagorean theorem, he presented the students with a right triangle with the lengths labeled on two of the three sides. He then pointed to the theorem on the board and asked, "What do we need to do for Question 1?" Without pausing, he answered, "Solve for C." He continued, "Now you need to find the value for 11² + 15² to get the answer. Who can use their calculator to find 11²?" After both lectures, before asking students to work on their own, he provided examples in this manner. After a couple of examples, students were requested to crank through traditional-looking worksheets containing a series of equations and nothing more. In each of the three lessons, the students were not asked to share their work at any point during the period. While the students sat with their worksheets in front of them, Mr. Hill usually sat at his desk, occasionally asking them to be quiet and get to work. From time to time, he would circulate and answer individual questions.
Ms. Jackson's students and teaching approach differ dramatically from Mr. Hill's. She is an African-American woman in her 50s who has been teaching for 19 years and is deeply engaged in her work. She is the mathematics coordinator for her school and has clearly won the respect of the school leadership. The principal of the school told me I was "in for a real treat" as she escorted me to Ms. Jackson's classroom. Her algebra class consisted of eighth graders, almost all of whom were African American. Her observed and self-reported NCTM scores were both 73%. Each lesson observed was tightly constructed, and each used a variety of pedagogical approaches. Her most common (over one third of the time) approach was to ask students to orally explain how they or other students solved particular problems. Over one eighth of the time, students worked in groups; another eighth of the time, they were engaged in teacher-led, whole-group discussion. Ms. Jackson also spent significant amounts of class time asking students to make conjectures about problems, and she asked students to write about how they solved problems. Traditional teaching approaches were evident, although the lectures and seat work were kept to a bare minimum. When students engaged in oral discussions about the work, sometimes they needed to explain what they had done, but often they needed to interpret other people's work for the class. In one lesson, a student named Sarah came to the board to write out how she factored x(x + 5). After Sarah wrote out her answer to the problem (x² + 5x), Ms. Jackson asked her to explain the steps she took. When she was finished, Ms. Jackson asked the class, "How many have the same answer?" Quite a few hands darted up. "Who can put it in their own words? Explain to the class how and why Sarah answered this the way she did." After a student did this, she asked for another volunteer. Jeff was chosen and asked to come up and illustrate a different method of factoring than the one used by Sarah (she used the "horizontal" method; Jeff illustrated the "vertical" method). During another lesson, students were taught how to find the function of given lines and used graphing calculators to identify equations that would approximate drawn lines. For the first part of the class, the teacher asked the students to place some equations in "slope intercept form" and then to identify the slopes and intercepts of those equations. Then she drew three lines on a transparency
without providing any numerical information about the lines (one had a steep negative slope, another had a steep positive slope, and the third had a shallow positive slope). The students were then asked to work in groups to find the function they thought best fit the images. She said, "You will have to use trial and error to figure this out." For the next 20 minutes, she circulated the classroom to answer questions and assess the progress of the groups. Students meanwhile plugged different equations into their calculators, graphed them, and then checked to see if they approximated what was on the overhead. The differences between Mr. Hill's and Ms. Jackson's teaching approaches point out a strength of the survey instrument: It can delineate a traditional from a nontraditional teacher. Mr. Hill and Ms. Jackson had dramatically different self-reported scores, and the observed scores corroborated this. The differences in their pedagogic styles are striking. However, not all of the observations attested to the survey's ability to draw clear distinctions among teaching styles. A comparison of the observations of Ms. Weiss's and Mr. Hemphill's classrooms highlights that the study survey fails to capture some important information.

MS. WEISS AND MR. HEMPHILL. Both of these teachers reported using the NCTM methods more than three quarters of the time, but the survey failed to reveal that they implemented the standards in a qualitatively different manner. Focusing on one specific pedagogical technique that both teachers used frequently helps illustrate this point. Education constructivists are in favor of teaching in a way that allows for a public display of the student's thinking processes. One teacher practice that allows for this is to ask students to "orally explain how to solve a problem." Constructivists would advocate using this pedagogical approach for several reasons:

• It forces students to show the teacher how they think about a particular math problem.
• If students have misperceptions, it allows the teacher to correct their thinking.
• Students who are watching the exchange between the teacher and the students can evaluate whether their own perceptions of how to solve the problem are correct.

Ms. Weiss and Mr. Hemphill both frequently used this
she seemed satisfied asking her students a generic set of questions, and she rarely, if ever, diligently probed their thinking. Each time I visited her class, she seemed exhausted and exasperated by her students. She is White, 34 years old, and responsible for instructing a class made up predominantly of Hispanic students. The class environment could have created her exhaustion: There were 30 ninth graders jammed into a room that was too small in one of the most economically disadvantaged schools in the district. Maintaining control would be a challenge for anyone, but Ms. Weiss seemed defeated before the class sessions even started. In fact, she may have felt defeated before the school year began. Although she had been teaching for only six years, she had the air of a burnt-out veteran. Over the summer, she had requested a transfer to this disadvantaged high school so that she could cut down on her commuting time.

When asking students to "orally explain how to solve a problem," she did not attempt to probe whether the students really understood the mathematics. A typical exchange sounded like this: Ms. Weiss asked, "What answers did you get for Question 5 (-27X + -18 =)?" Students simultaneously blurted out answers, but she ignored the cacophony and simply wrote the correct answer on the board. "How many students got this correct?" Almost no hands went up. Many students were clearly confused. Next, she asked for a volunteer to "explain" his or her answer. At first blush, this seemed promising. Ms. Weiss chose Carlos. Would she examine Carlos's thought process? Would she help the class understand Carlos's thinking and in turn teach the class what was right or wrong in it? Ms. Weiss's questions never went this deep. She practically ignored Carlos's response and simply asked the class, "Does anyone disagree with Carlos or have a question about how he did it?" The class sat in silence for a moment before Ms. Weiss moved on to the next problem.

Mr. Hemphill's approach differed dramatically. This 24-year-old, African-American, first-year teacher approached his classes with excitement and energy. His teaching persona was friendly but serious and unrelenting. He commenced each class by quickly handing back graded assignments and collecting the previous night's homework. He noted who owed him assignments, and he reminded the class of important upcoming events. Mr. Hemphill typically reviewed problems by
calling on students individually and asking them to explain their answers. He insisted that students articulate each step of the problem, and he listened carefully to the explanation. Often, he interjected: "Please illustrate that more clearly," or "Why did you do that?" or "How else could you have done that?" When he was satisfied with the student's answer, he would commonly turn to another student and ask him or her to explain how he or she had arrived at the answer to the same problem. On a couple of occasions, he even called on a third student and asked that student to note the differences, if any, in the approaches used by the first two.

Comparing Mr. Hill to Ms. Jackson illustrates that the NCTM indicator appears to do a fairly good job of distinguishing teachers at opposite ends of the NCTM scale (i.e., a very-low-NCTM-scoring teacher from a very-high-NCTM-scoring teacher). However, the comparison between Ms. Weiss and Mr. Hemphill demonstrates that not all high-scoring teachers are alike. The observations of these teachers corroborate research finding that teachers can perceive themselves as practicing a reformed pedagogical approach (e.g., Ms. Weiss) when in actuality their teaching may not look all that different from traditional methods (Cohen, 1990). This finding suggests that more work needs to be done to devise surveys that better distinguish between teachers who perfunctorily use reform practices and those who use them effectively.

Discussion

Few studies have examined the quality of self-reported data on teaching practices, although data of this sort are widely circulated (Council of Chief State School Officers, 1997; U.S. Department of Education, 1997a; Weiss et al., 1994) and used to assess the impact of reforms (Carpenter et al., 1989; Cobb et al., 1991; Knapp & Associates, 1995; Simon & Schifter, 1993). Although the findings reported here are relevant in light of this context, the study data come from one school district and reflect only algebra instruction. Generalizing to other locations and subjects should therefore be done with caution. In addition, this study should be considered exploratory given its small sample size. Keeping these caveats in mind, we can view the study findings as a mixture of encouraging and discouraging information on the accuracy of using self-reported survey data to measure instructional
practice. On the positive side, a composite of classroom practice indicators created from self-reported survey data provides a fairly accurate picture of the amount of time teachers use practices that are consistent with the NCTM standards. The 13-item composite had a test-retest reliability of .69 based on two waves of survey data collected four months apart. In addition, there was a .85 correlation between a composite based on survey data and a parallel composite based on classroom observations. Both sets of results suggest that the composite is quite reliable, and the second set of results suggests that it has some validity. The composite can distinguish between teachers who do and do not use the types of practices advanced by NCTM. In addition, teachers in the Elm district varied considerably in the extent to which they reported using practices consistent with the NCTM standards. Therefore, the concern that surveys might not be valid when there is true variation in teaching style (Burstein et al., 1995) appears unjustified.

On the negative side, the classroom observations revealed that the survey did not adequately capture the quality of the interaction between teacher and student. The example of Ms. Weiss and Mr. Hemphill clearly shows that teachers can use NCTM-type techniques in dramatically different ways even when they report using them for the same amount of time. Ms. Weiss's style did not reflect the standards. The observations of these teachers corroborate research that found that teachers can, like Ms. Weiss, perceive themselves as practicing in a "reformed" way when they are actually following traditional methods (Cohen, 1990). So the question is: Is the failure to capture quality simply a limitation inherent in surveys (as Burstein et al., 1995, and Cohen, 1990, suggest), or can survey questions be created to unmask differences in quality? Many educational researchers have long doubted that surveys could accurately capture the complex dynamics between teacher and student. (For a history of the research community's attempts to study teaching, see Shulman, 1986.)

Another discouraging finding from this study is that we cannot rely on individual survey questions to assess the amount of time the Elm school district teachers use specific practices in their algebra classrooms, because the teachers do not report this time in a consistent manner. Thus, the portrait of specific practices drawn by the survey is unreliable and therefore invalid.

The finding that individual indicators of limited reliability can be grouped into a highly reliable indicator reaffirms the principle that when multiple items measure the same underlying characteristic (e.g., the NCTM instructional approach) and are grouped together, the reliability of the construct will always be greater than the reliability of the individual items (Light et al., 1990).
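One standard way to formalize this principle is the Spearman-Brown relationship; the sketch below is offered only as an illustration under the strong assumption of roughly parallel items and is not computed from the study's data:

$$\rho_{\text{composite}} = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}},$$

where $k$ is the number of items and $\bar{r}$ is their average intercorrelation. For instance, items that individually intercorrelate at only about .30 would, under this assumption, yield a 13-item composite reliability of roughly $13(.30)/[1 + 12(.30)] \approx .85$. Real survey items are not strictly parallel, so actual gains are smaller, but the direction of the effect is the same: aggregation raises reliability.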
Combining items also makes sense because a single item cannot "provide a coherent picture of instruction" (Burstein et al., 1995, p. 36). Other composites, such as academic aptitude test scores and the consumer price index (CPI), provide a good analogy. Both of these measures consist of several indicators rather than one in order to ensure reliability. Aptitude tests always consist of multiple questions that measure an underlying characteristic, such as mathematics ability. Likewise, the CPI, which tracks inflation, is created by monitoring the cost of a "basket" of goods that consumers might purchase in a given month. Tracking the cost of only one product, such as canned soup, would not provide an accurate or informative picture of inflation.

Despite the statistical and substantive appeal of composites, educators, researchers, policymakers, and the public want to know how much time teachers use particular practices, such as lecturing, group work, and solving problems from textbooks. Therefore, several state and national reports provide survey data on this topic (Council of Chief State School Officers, 1997; U.S. Department of Education, 1997a, 1997b; Weiss et al., 1994). But the accuracy of the portraits of American instructional practice drawn in these reports is questionable. For example, how valid is NCES's report that 73.2% of the nation's teachers used manipulatives or models at least once a week in 1994 and 1995? This study found that in the Elm district, the correlation between teachers' responses across the two survey waves to a question concerning manipulatives was only .44. This pattern of low reliability held true for most of the practice items examined in this study, and it also emerged in the survey/log comparisons conducted by Burstein et al. (1995) and Smithson and Porter (1994). If the results from these studies can be generalized, then the quality of the indicators of instructional methods presented in national and state reports must be questioned. These reports would probably be more reliable and valid if the information they presented were based on composites representing different teaching styles rather than on individual indicators. As education policy focuses more and more on instructional practice, this type of data will
play an increasingly important role in helping to assess whether the policies are having their intended effect. And the accuracy of these assessments will depend on whether accurate data are used in a meaningful way.

This is not to suggest that surveys should not be used to evaluate teaching practices. The nation cannot continue to rely on in-depth studies of a small number of classrooms, because such studies are of little value for assessing state and national instructional trends. In addition, in-depth studies based on a nationally representative sample of classrooms would be cost-prohibitive. The TIMSS video survey approach, while innovative and exciting, is not a viable alternative to surveys either; it, too, would be too costly to use with a nationally representative sample. So, although surveys remain the most cost-effective means for obtaining national estimates, more research is necessary to determine whether surveys can be designed to provide accurate data on instructional practice. NCES is in the process of conducting such research, but the results have not yet been made public.

Future research should explore the feasibility of devising survey questions that enhance the reliability and validity of teaching practice measures. First, ways to improve the reliability of individual items should be explored. In other words, how can the questions or response options be changed to make the results more meaningful? Second, questions that reveal the quality of instruction need to be developed. For example, the question "How much time do your students typically spend engaged in small-group activities?" could be followed with a question like "When students are engaged in small-group activities, which of the following things do they do?" The list of activities would need to be developed, but the point should be clear: These surveys can be made more useful if research focuses on the issues of survey reliability and validity.

Notes

Research for this study was supported by the College Board. I would like to acknowledge Richard Elmore, Suzanne Graham, Lorraine McDonnell, Richard Murnane, Judy Singer, John Smithson, and John Supovitz for their helpful comments on earlier drafts of this article. Opinions and conclusions in this article are those of the author and do not necessarily reflect the views of the supporting agency. An earlier version of this article was presented at the Association for Public Policy Analysis and Management's annual conference in New York,
October 1998.

1. There are several types of validity, but this article focuses on criterion validity because of its relevance to the policy debate. Criterion validity is often used to ask how accurately less costly measures (surveys, as opposed to classroom observations) predict the criterion of interest (i.e., teaching practice) (Light et al., 1990).

2. The National Council of Teachers of Mathematics has published three separate standards documents, but this study's focus is on the Professional Standards because this document is the NCTM's most comprehensive standards document regarding pedagogy (National Council of Teachers of Mathematics, 1989, 1991, 1995). References to the other documents are made where appropriate.

3. The Beginning Teacher Evaluation Study is another study that looked closely at the validity and reliability of using various instruments to gauge classroom practice. This study is not reviewed here because it did not examine the accuracy of surveys. Instead, it examined the accuracy of teacher logs, classroom observation scales, and videotape (see Calfee & Calfee, 1976; Elias & Others, 1976; Lambert & Hartsough, 1977).

4. These 11 instructional indicators come from two categories in the Smithson and Porter study ("modes of instruction" and "student activity"). The categories were combined because the student activity indicators (e.g., taking notes, writing reports, lab work, and discovery lessons) could plausibly be considered instructional indicators and they overlap with many of the instructional indicators used in this study (see the "Methods" section).

5. Of course, it is possible that some of the teachers did change their teaching practice during this time. If this happened, it would make the measure appear less reliable than it really is.

References

Black, P. J., & Atkin, J. M. (Eds.). (1996). Changing the subject: Innovations in science, mathematics, and technology education. New York: Routledge.

Blank, R. K., & Pechman, E. M. (1995). State curriculum frameworks in mathematics and science: How are they changing across the states? Washington, DC: Council of Chief State School Officers.

Burstein, L., McDonnell, L. M., Van Winkle, J., Ormseth, T., Mirocha, J., & Guitton, G. (1995). Validating national curriculum indicators. Santa Monica, CA: RAND.

Calfee, R., & Calfee, K. H. (1976). Beginning Teacher Evaluation Study: Phase II, 1973-74, Final report: Vol. 3.2. Reading and mathematics observation system: Description and analysis of time expenditures. Princeton, NJ: Educational Testing Service.

Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Newbury Park, CA: Sage.
Carpenter, T. P., Fennema, E., Peterson, P., Chiang, C. P., & Loef, M. (1989). Using knowledge of children's mathematics thinking in classroom teaching: An experimental study. American Educational Research Journal, 26(4), 499-531.

Center for Research on the Context of Teaching. (1994). Survey of elementary mathematics education in California. Stanford, CA: Author.

Choike, J. R. (1993). Math as a lever for reform: Excerpts from the Equity 2000 national committee report. New York: College Board.

Cobb, P., Wood, T., Yackel, E., Wheatley, G., Trigatti, B., & Perlwitz, M. (1991). Assessment of a problem-centered second-grade mathematics project. Journal for Research in Mathematics Education, 22(1), 3-29.

Cohen, D. K. (1990). A revolution in one classroom: The case of Mrs. Oublier. Educational Evaluation and Policy Analysis, 14, 327-345.

Council of Chief State School Officers. (1997). How is science taught in Massachusetts schools? Results from a five-state study of instructional practices in science. Washington, DC: Author.

Elias, P. J., & Others. (1976). Beginning Teacher Evaluation Study: Phase II, 1973-74, Final report: Vol. 5.5. The reports of teachers about their mathematics and reading instruction. Princeton, NJ: Educational Testing Service.

Knapp, M. S., & Associates. (1995). Teaching for meaning in high-poverty classrooms. New York: Teachers College Press.

Lambert, N. M., & Hartsough, C. S. (1977). Beginning Teacher Evaluation Study: Phase II, 1973-74, Final report: Vol. 3.1. APPLE observation variables and their relationship to reading and mathematics achievement. Princeton, NJ: Educational Testing Service.

Light, R. J., Singer, J. D., & Willett, J. B. (1990). By design: Planning research on higher education. Cambridge, MA: Harvard University Press.

Marshall, S., Fuhrman, S., & O'Day, J. (1994). National curriculum standards: Are they desirable and feasible? In R. Elmore & S. Fuhrman (Eds.), The governance of curriculum: 1994 yearbook of the Association for Supervision and Curriculum Development (pp. 12-30). Alexandria, VA: ASCD.

Murnane, R. J., & Levy, F. (1996). Teaching the new basic skills: Principles for educating children to thrive in a changing economy. New York: Free Press.

Murnane, R. J., & Raizen, S. A. (1988). Improving indicators of the quality of science and mathematics education in grades K-12. Washington, DC: National Academy Press.

National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.

National Council of Teachers of Mathematics. (1991). Professional standards for teaching mathematics. Reston, VA: Author.

National Council of Teachers of Mathematics. (1995).
Assessment standards for school mathematics. Reston, VA: Author.

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

OERI, State Accountability Study Group. (1988). Creating responsible and responsive accountability systems. Washington, DC: U.S. Department of Education.

Porter, A. C. (1991). Creating a system of school process indicators. Educational Evaluation and Policy Analysis, 13(1), 13-29.

Porter, A. C., Kirst, M. W., Osthoff, E. J., Smithson, J. L., & Schneider, S. A. (1993). Reform up close: An analysis of high school mathematics and science classrooms. Madison: Wisconsin Center for Education Research.

Shavelson, R. J., McDonnell, L. M., Oakes, J., Carey, N., & Picus, L. (1987). Indicator systems for monitoring mathematics and science education. Santa Monica, CA: RAND.

Shulman, L. S. (1986). Paradigms and research programs in the study of teaching: A contemporary perspective. In M. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 3-36). New York: Macmillan.

Simon, M. A., & Schifter, D. (1993). Toward a constructivist perspective: The impact of a mathematics teacher inservice program on students. Educational Studies in Mathematics, 25, 331-340.

Smithson, J. L., & Porter, A. C. (1994). Measuring classroom practice: Lessons learned from the efforts to describe the enacted curriculum—The Reform Up Close Study. Madison, WI: Consortium for Policy Research in Education.

U.S. Department of Education, NCES. (1992). SASS and TFS questionnaires, 1990-1991. Washington, DC: U.S. Government Printing Office.

U.S. Department of Education, NCES. (1997a). America's teachers: Profile of a profession, 1993-94 (NCES 97-460). Washington, DC: U.S. Government Printing Office.

U.S. Department of Education, NCES. (1997b). The condition of education 1997 (NCES 97-388). Washington, DC: U.S. Government Printing Office.

Weiss, I. R., Matti, M. C., & Smith, P. S. (1994). Report of the 1993 national survey of science and mathematics education. Chapel Hill, NC: Horizon Research, Inc.
Author

DANIEL MAYER is a researcher at Mathematica Policy Research, Inc., 600 Maryland Avenue, SW, Suite 550, Washington, DC 20024-2512; e-mail: dmayer@mathematica-mpr.com. He specializes in education policy and evaluation.

Manuscript received February 17, 1998
Revision received October 19, 1998
Accepted October 27, 1998