DANISH TECHNOLOGICAL INSTITUTE
Adult Skills Assessment in Europe
Feasibility Study

Jens Henrik Haahr
Martin Eggert Hansen

Policy and Business Analysis

Final Report
November 2006
Table of Contents

1. Conclusions 3
1.1. Self-Reports vs. Direct Testing 3
1.2. Computer Based Assessments 5
1.3. The Relevance of International Qualifications Frameworks 8
2. The Background and Context of the Study 11
2.1. The Lisbon Agenda and Adult Skills Assessment 11
2.2. Recent International Initiatives to Improve Knowledge on Adult Skills 14
2.3. European Commission Feasibility Study 15
3. Self-Reporting vs. Direct Testing in Adult Skills Assessment 17
3.1. Tests in Skills Assessment 17
3.2. Self-Reporting in Skills Assessment 30
3.3. Comparing Tests and Self-Reporting in Skills Assessment 40
4. Computer Based Assessment 55
4.1. Definitions and Developments 55
4.2. Experiences from Computer Based and Computer Adaptive Testing 65
4.3. CBA in Adult Skills Assessment: Challenges and Possibilities 72
5. The Relevance of International Qualifications Frameworks 76
5.1. The Need for Setting Independent Assessment Standards 76
5.2. The Steps and Challenges of Setting a Standard: Case DIALANG 77
5.3. ISCED – The International Standard Classification of Education 84
5.4. The European Qualifications Framework 86
5.5. The Relevance of ISCED and EQF as Frameworks of Reference 89
5.6. Example: Using the EQF in Assessing ICT Competences 93
Literature 97
Appendix 1: Example Questionnaire for Self-Reported Job-Related Skills (O*NET) 106
Appendix 2: The Eight Reference Levels of the European Qualification Framework 125
1. Conclusions

Against the background of the Lisbon Agenda and previous efforts aimed at improving the level of information on adult skills in the EU, the European Commission has asked Danish Technological Institute to carry out a feasibility study. The study examines the possible use of specific methodologies and techniques for the assessment of the skills of the adult population, as well as the feasibility of using international qualifications frameworks or other similar international classifications as reference levels for the assessment of the skill levels of adults within the EU.

This report presents the results of the analysis as regards its three themes: self-reports vs. direct testing in skills assessment, computer based assessments in adult skills assessment, and the relevance of using international qualifications frameworks and other classifications for the assessment of the level of skills of adults.

The report has been prepared by Senior Consultant Dr. Jens Henrik Haahr and Senior Consultant Martin Eggert Hansen, with the assistance of Consultant Søren T. Jakobsen. Comments and suggestions have been provided by Professor Francis Green, University of Kent at Canterbury, Head of Division Dr. Rolf van der Velden, University of Maastricht, and Assistant Director Chris Whetton, UK National Foundation for Educational Research.

1.1. Self-Reports vs. Direct Testing

Different Advantages and Disadvantages of Tests and Self-reports
The analysis suggests that tests and self-reports have different advantages and disadvantages in adult skills assessment. The strengths of self-reports and self-assessments concern the range of skills that can potentially be measured as well as the comparatively low costs associated with the development and implementation of data collection. The validity of data from self-reports can be strengthened by using external anchors for response scales and by formulating items so as to minimize social desirability effects. While the question of the validity of test results will remain an issue, the quality of the data generated nevertheless remains the strength of testing, provided that testing makes use of state-of-the-art approaches. The final decision as regards the choice of assessment methodology hinges on the objectives and priorities of adult skills assessments:

State-of-the-Art Testing is the Superior Methodology – but there are Clear Limits
If the highest possible standards of validity and reliability of data are to be achieved, direct testing is clearly the superior methodology, provided that sufficient resources are devoted to the development, testing and implementation of state-of-the-art tests, or alternatively that well-tried testing methods and items are utilised.
However, at present, testing can only be applied to a rather narrow range of skills (literacy, numeracy, ICT skills, problem solving skills and foreign language skills). In other domains, considerable development work will be required, and in several domains it will probably not be possible to develop valid and reliable tests at all. Moreover, there are some very significant practical challenges that must be addressed if direct testing is to be administered in as many as five broad skills domains simultaneously (cf. DTI et al. 2004).

Self-Assessments are Correlated with Test Results, but the Correlation is Far from Perfect
Direct comparisons of test results and self-reports suggest that self-reported skills and tested skills are correlated, but that this correlation is far from perfect. In particular, there is a tendency for test-takers located at the lower skills levels to overestimate their skills when self-reporting.

In the above-mentioned direct comparisons between test results and self-reports, the reviewed self-report questions do not live up to best methodological practice in the field. Methodologically, self-reports using the job-requirements approach and ex-ante expert anchoring or inter-subject anchoring of scales are superior to self-assessments, where a respondent is asked to assess him or herself. State-of-the-art methods in self-reports may improve the correlation between self-reported skills and test results for specific skills domains, but this is highly unlikely to change the conclusion that, methodologically, state-of-the-art skills tests in the domains where testing is possible are the superior option.

In light of the fact that there is considerable focus on obtaining more information on the level, distribution and development of skills among low-skilled groups (Commission 2006), and in light of the fact that self-reports tend to over-estimate the skills of low-skilled groups, it cannot be recommended to make use of self-reported skills as a stand-alone method in adult skills assessment.

Feasible, but not Unproblematic, to Use Self-reports on Job Requirements
For which purposes could it be feasible to make use of self-reports or self-assessments in adult skills assessment? It is feasible to develop and implement a form of skills assessment using the so-called job-requirements approach so that it covers a number of different skills domains and all the EU member states. In the job-requirements approach, respondents are asked not to assess the level of their own skills, but to describe different aspects of their skills use in their current (or recent previous) job. Development and implementation costs will be considerable, however, and the results must be presented and interpreted cautiously.

The approach will yield information on the types of skills required in different types of jobs across the EU and on the distribution of skills requirements, just as information can be provided on the relationship between the types and levels of skills requirements on the one hand and other variables such as wages and education on the other. However, it will be problematic to infer from skill requirements in jobs to the skill levels of respondents, as there may be over- or underutilization of skills in the labour market. This problem is likely to be less important for some skills than for others. Moreover, the approach cannot yield information on groups that have been outside the labour market for long periods.

A possibility is to use the job-requirements approach in conjunction with direct testing, as a way of broadening the range of skills on which information can be obtained.

Self-reports in Large-scale Skills Assessment Surveys Should be Improved
To the extent that self-reports which do not make use of the job-requirements approach are utilised in large-scale surveys to assess adult skills, it is recommended that state-of-the-art methodological practice be used. For instance, in connection with the second round of the European Adult Education Survey, scheduled for 2011, it could be examined whether the extended use of anchors and vignettes in the self-reporting of skills could lead to more valid and reliable results. With a view to methodological improvements in the 2011 round of data collection, pilot studies could be launched to analyse the effects of using vignettes and anchors. To make possible a systematic comparison of results, it would be beneficial if such pilot studies included parallel data collection using tests, traditional self-reports and self-reports using anchors and vignettes.

Self-Assessments Should at Most be Used as a Supplementary Source of Information
Evidently, self-assessments, where respondents are – as opposed to what is the case for self-reports – asked directly to assess their own level of skills, are methodologically sounder if response scales make use of ex-ante expert anchoring or, alternatively, inter-subject anchoring. However, the social desirability effects of self-assessment items will remain considerable, and making use of this method as a stand-alone approach is not recommended. Self-assessments may be useful in connection with direct tests, as they can provide information on respondents' self-perception, which may be an element in shedding light on, for instance, the motivation of respondents for engaging in learning activities. In this connection, it may be a better solution to measure motivation directly, however.

1.2. Computer Based Assessments

Computer Based Assessment, Computer Adaptive Testing, Adaptive Testing Using IRT
The term "computer based assessment" covers several different methodologies and approaches. It is useful to distinguish between:

1) computer assisted/mediated assessment, which refers to any application of computers within the assessment process;
2) Computer Based Assessment (CBA), in which assessment is built around the use of a computer and the use of the computer is intrinsic to the test;
3) Computer Adaptive Testing (CAT), i.e. computer based assessments where the test items presented to the respondent are a function of previous responses; and
4) Computer Adaptive Testing using Item Response Theory (IRT).

In order to reap the full benefits of adaptive testing, computer adaptive testing is most frequently developed on the basis of IRT, but IRT is not a necessary precondition for adaptive tests.
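To make the distinction between a fixed-form computer based test and an IRT-based adaptive test more concrete, the sketch below shows a minimal adaptive loop built on the two-parameter logistic (2PL) model. It is purely illustrative: the item bank, the maximum-information selection rule and the simple step-wise ability update are assumptions made for this example and do not describe any particular operational system discussed in this report.

```python
import math

# Hypothetical 2PL item bank: (discrimination a, difficulty b); values are assumed.
ITEM_BANK = [(1.2, -1.5), (0.9, -0.5), (1.5, 0.0), (1.1, 0.8), (1.3, 1.6)]

def p_correct(theta, a, b):
    """2PL probability that a respondent with ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of an item at ability theta (largest when b is close to theta)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, administered):
    """Adaptive step: choose the unused item that is most informative at the current estimate."""
    candidates = [i for i in range(len(ITEM_BANK)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *ITEM_BANK[i]))

def update_theta(theta, correct, step=0.5):
    """Crude ability update for illustration; operational CATs use ML or Bayesian estimation."""
    return theta + step if correct else theta - step

# Example run with made-up responses: start at average ability and adapt item by item.
theta, administered = 0.0, []
for observed in [True, False, True]:
    i = next_item(theta, administered)
    administered.append(i)
    theta = update_theta(theta, observed)
print(round(theta, 2))
```

In a non-adaptive CBA, by contrast, every respondent would simply receive the same fixed sequence of items; the adaptive selection step is what allows shorter tests with comparable measurement precision.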
Feasible to Make Use of CBA and CAT in European Adult Skills Assessment
In our assessment, it is feasible to make use of computer based assessments and computer adaptive testing for all of the five broad categories of skills for which standardised and reliable assessments are currently available or feasible (literacy, numeracy, problem solving, ICT skills and foreign language competences). Computerized tests in these domains entail methodological as well as practical advantages. It is our assessment that these advantages outweigh the disadvantages of CBA and CAT. Using advanced scenario-based techniques, computer based and computer adaptive tests may in the long term enable the testing of some skills that are currently not testable, for instance certain social skills. However, at present we do not find evidence that CBA has enabled the assessment of politically interesting skills such as entrepreneurial skills or social skills such as team-working skills.

Advantages of Computer-based Assessment Irrespective of Adaptivity
Computer based assessments without adaptivity entail advantages in their own right. Any meaningful test of ICT skills presupposes that the test is computer based. In addition, CBA makes possible a wider range of test items, test scores can be calculated more rapidly, and immediate feedback may be given to the test-taker. There is also evidence of motivational benefits of CBA, irrespective of whether the test is adaptive or not, as the computer is seen as the question setter or expert, distancing the test from previous learning experiences.

Significant Additional Benefits of Adaptive Tests
Adding adaptivity to testing will add further advantages. Measurement precision is potentially higher in adaptive testing, as the difficulty of test items is adjusted to the ability of the test taker. Computer adaptive tests can be shorter than paper-and-pencil tests or non-adaptive computer based assessments, as test takers are given very few test items that are significantly too easy or too difficult for them. Perhaps most importantly, there are considerable motivational benefits of adaptive testing, both because tests are shorter and because the adapted difficulty of test items tends to reassure test persons and give them confidence in their own ability rather than induce anxiety and a sense of defeat. This benefit is potentially very significant for individuals located at the lower end of the skills distribution and is likely to affect response rates positively in comparison to both comparable paper-and-pencil versions and non-adaptive computer based versions.

It is possible to develop adaptive tests without using Item Response Theory as a theoretical starting point. This option does not make it possible to reap the full benefits of adaptivity, however, and will raise the issue of the comparability of test results that have not used identical test items.

Limited Experiences with CBA and CAT in Connection with Adult Skills Surveys
Even if the use of CBA and CAT is feasible in connection with adult skills surveys, a number of risks and challenges must also be recognized. Whereas computer based assessment and computer adaptive testing are rapidly growing approaches to the assessment of skills and competences, we have identified only one large-scale skills assessment survey, the British 2003 Skills for Life Survey, which has made use of these technologies for collecting information about the levels and distribution of skills in an entire population.

Moreover, only two computer based assessment systems, both in the field of language testing, have been developed with a transnational, multilingual perspective from the onset: the DIALANG project and the BULATS system. Of these two, only DIALANG approaches a language coverage that would be necessary for a truly Europe-wide skills assessment survey. However, neither DIALANG nor BULATS was developed with a view to large-scale skills assessment data collection in the form of surveys. The OECD-based adult skills assessment initiative, PIAAC, is planning to include an ICT skills test module which must be computer based, but this initiative is still in the planning stage.

A European Computer Based Skills Assessment Would Break New Ground
This means that any European computer based or computer adaptive skills assessment would be breaking new ground. The EU would place itself at the forefront of the development, which may be positive in various respects. However, it also means that the EU would have to bear the costs and the risks connected with experimental development work.

- And Will be Very Resource-demanding
Costs in developing and implementing computer based skills assessment and computer adaptive testing on an international basis are likely to be high in terms of time and resources. The most resource-demanding process will be the development of item response theory based computer adaptive testing. The process involved in the construction of DIALANG is illustrative in this connection. This process involved the development of a computer adaptive testing system for 14 languages, but only for one skill category, i.e. language skills. The required efforts will be multiplied if additional skills categories such as literacy, numeracy, problem solving and ICT skills are also to be tested. A possible shortcut could be to make use of existing computer based tests such as those developed by the US Educational Testing Service and adapt these tests to international purposes. However, methodological obstacles in international adaptation may add to costs, and taking ETS tests as a starting point may also be politically problematic.

- But Once Developed, Running Costs Will Not be Significantly Larger than for Paper Tests
On the positive side, however, the high development costs are likely to result in a system which can be used in the future with relatively modest running costs. Adaptive tests will probably require some maintenance of test item pools, but apart from this, the costs of administering computer adaptive tests are most likely not significantly higher than those of administering paper-and-pencil tests.
1.3. The Relevance of International Qualifications Frameworks

Standards Important for Interpretation of Results
In international skills assessments, an independent standard is essential. Facing a multilingual audience with different cultural, educational and occupational backgrounds, it is an important and challenging task to establish a stable standard for test results obtained in different countries and target populations. In isolation, test results and assessment scores are of limited informational value. Thus, to be useful in a developmental perspective, at the individual as well as the societal level, results of tests and assessments must be related to a standard against which the results can be compared or measured.

ISCED and EQF – Formal Qualifications vs. Broader Competences
ISCED (the International Standard Classification of Education) and EQF (the European Qualification Framework) are two different approaches to setting standards for the categorisation of competences and qualifications. The ISCED framework, which can be characterised as the older and more traditional approach, focuses on formal educational activities designed to meet learning needs. Consequently, the framework excludes various forms of learning that are not organized, and the basic unit and analytical focus of the ISCED framework is the single educational programme, especially its scope (e.g. field) and level. In contrast, the proposed EQF represents a more modern approach linked to the context of lifelong learning. The EQF defines learning as taking place in formal as well as informal settings. The analytical unit and focus of the framework is the learning outcome, which defines the competences of an individual at eight different reference levels.

EQF More Relevant as a Framework of Reference for Skills Assessments than ISCED
The EQF is more relevant as a framework for skills assessment than the ISCED framework. The analytical focus of ISCED is the qualities (level and field) of the educational programme. Consequently, using ISCED as a framework for expressing or measuring the skill level of an individual in an assessment is questionable: it can be argued that there is no close and universal relationship between the programmes in which a participant is or has been enrolled and actual educational achievement in terms of levels of skills and competences. By contrast, the analytical unit of the EQF is the learning outcome, which defines what the individual is expected to know, understand and/or be able to do after learning at a given level. As such, the EQF framework is more suitable for developing descriptors for test items as well as "can do" statements for self-assessments.

Specific Potential Advantages in Using EQF in Connection with Adult Skills Assessment
There are specific potential benefits of making use of the European Qualifications Framework in relation to adult skills assessment and the use of information gathered from international adult skills assessments:

First, if adult skills assessment test scores can be presented and interpreted in terms of EQF levels, the informational value of the results is increased. The results are then related to an external standard where the level of the assessed skill can be described in absolute terms. External standards for the interpretation of test scores can be developed independently of the EQF, of course. However, making use of the levels of a well-known existing framework would increase the immediate intelligibility of results.

Second, in a future situation where the principles of the EQF have been implemented in various countries' educational systems, meaning that descriptions of specific learning outcomes at different levels have been developed for different competences in relation to different educational programmes, competence profiles of populations and segments of populations, as described by data from adult skills assessments, can be translated directly into the need for specific educational programmes and/or other forms of training measures.

Third, assuming a fully implemented EQF in a given domain of competences, the formal educational background of a given person could be translated into an EQF level of competences. In this situation, differences between the predicted competence, as suggested by the formal educational background of respondents, and the level of competence found via direct skills assessments can potentially point to the significance of informal learning or, alternatively, to deficiencies in the learning outcomes of the formal education system.

EQF as a Framework for Assessment Requires Comprehensive Development Work
These potential benefits having been mentioned, it is also quite clear that the application of the EQF as a standard against which the results of international adult skills assessments can be measured and utilised will require a very considerable amount of development work.

In principle, the eight reference levels of the EQF may function as a global scale against which an individual's level of performance in any given field of skills can be measured and interpreted. The scale measures how advanced the individual is at using his or her knowledge and skills – and how advanced the individual's personal and professional competences are. However, for these eight levels to become operationally relevant, considerable efforts are needed, both as regards the "horizontal" and the "vertical" aspects of the EQF.

"Vertically", the application of the eight EQF reference levels as a standard for skills assessment will require the development and testing of descriptors that consistently define what separates one level from another. To ensure inter-personal and intra-personal consistency as to what performance is expected at each level, this will require a careful test procedure where items are sorted into levels by judges who are very familiar with the EQF.

"Horizontally", the definition and categorization of different skill types needs further specification, especially as regards the description and categorisation of personal and professional competences. How, for instance, is "professional and vocational competence" precisely to be distinguished from "skills" at each level?

Moreover, if the full potential of the EQF is to be realised in terms of relating adult skills assessment results directly to the learning outcomes of the educational system, educational systems need to apply the framework consistently. Among other things, this entails that precise descriptors in terms of learning outcomes at different levels must be developed for relevant skills and in relation to relevant educational programmes. Adult skills assessments must then be developed for the competences in question, taking these descriptors as a starting point. The work required in these connections is very considerable, and it should be carefully considered whether the resources required outweigh the potential benefits that could follow.

- and Not all Categories of Competences Covered by EQF are Testable
Moreover, it is clearly unrealistic to expect that operational descriptors and relevant tests can be developed for all the categories of competences that are covered in principle by the European Qualification Framework. For a number of social and context-dependent skills, it will probably not be possible to develop valid and reliable tests in the foreseeable future. The relevance of the EQF's categories and levels with respect to "personal and professional competence" for large-scale adult skills test purposes is therefore probably rather limited. At present, only tests of problem solving ability come close to covering the category "professional and vocational competence", and existing problem solving tests focus on "generic" problem solving skills rather than on problem solving abilities in relation to vocational contexts and tasks, as does the EQF. Self-reports using the job-requirements approach will be better suited to provide some information on the "personal and professional competences" of the EQF, with the methodological limits and challenges entailed by this approach.
2. The Background and Context of the Study

This section presents the background and context of the study. The objectives of the feasibility study must be understood in the context of the Lisbon agenda and the related efforts to reform and improve education and training in Europe. They also follow directly from recent efforts aiming to improve the level of knowledge and information on adult skills in the EU.

2.1. The Lisbon Agenda and Adult Skills Assessment

The Lisbon Strategy, adopted by the European Council in 2000, placed new emphasis on knowledge, education and training. The European Council set itself a new strategic goal for the upcoming decade: to become "the most competitive and dynamic knowledge-based economy in the world capable of sustainable economic growth with more and better jobs and greater social cohesion" (European Council 2000). This, in turn, requires a strategy which, among other things, supports the transition to a knowledge-based economy through:

• Adapting Europe's education and training systems to the demands of the knowledge society, making them "a world quality reference by 2010" (European Council 2002)
• Ensuring "a substantial annual increase in per capita investment in human resources"
• Defining in a European framework those new basic skills that are to be provided through lifelong learning: IT skills, foreign languages, technological culture, entrepreneurship and social skills.
Competitiveness, Inclusion and Skills Assessment
The Lisbon strategy has two main goals. The first is to enhance economic competitiveness through improvements in human capital. Skills, knowledge and competencies are increasingly seen as crucial prerequisites for productivity and competitiveness. The European economies are increasingly confronted with a dual challenge: they face global competition not only from developing countries with cheap and plentiful labour, but also from the highly productive, high-tech economies of North America and the Far East.

The second is to promote social inclusion. In the view of the Lisbon strategy, competitiveness should be achieved "with more and better jobs and greater social cohesion", and not at the cost of greater inequality or social marginalisation. A dynamic and competitive economy should benefit all, and the entire European population must be involved in and benefit from reform and development. "The knowledge society" is a society of not only full employment but "all-employment". It is "an information society for all" – one in which "every citizen must be equipped with the skills needed to live and work", where "info-exclusion" and illiteracy must be prevented, and where special attention is given to the disabled (European Council 2000).
Within this context, the assessment of adult skills has come to be seen as both politically and economically relevant. Adult skills assessments may help decision makers in government and business, at European and Member State levels, to take stock of their human resources and to adjust their "human capital" investment accordingly. Assessments may also enable policy makers to gauge the returns on the Lisbon strategy's hoped-for increases in human resource investment and to take any necessary corrective steps. This stock-taking is especially important with respect to the new skill requirements that are believed to be in greater demand in the "knowledge society". In addition to basic literacy and numeracy, the economy may require ICT-related skills as well as entrepreneurial, self-management and learning skills, to name a few.

Adult skills assessments also potentially open up possibilities for strengthening the accountability of the education and training sector by providing one indicator of whether education and training institutions deliver on their promises. This would include not only compulsory schooling, but also opportunities for learning over the lifetime.

Promise of International Comparison
In relation to both of these ambitions, international adult assessment initiatives hold particular promise. Systematic international comparisons have proved effective in stirring interest, stimulating debate, and affecting political decisions and priorities. They have also been effective in promoting mutual learning and the exchange of good practices. Not least, international comparisons appear to have been rather powerful tools for initiating reforms within the education sector.

'Education and Training 2010'
'Education and Training 2010' is a set of EU activities which have been set in motion to help achieve the strategic objectives of education and training system reform. In 2001, the Council adopted a set of three overall and thirteen associated concrete objectives to support the Lisbon goal. A number of these objectives are relevant in an adult skills assessment context: increasing numeracy and literacy, maintaining the ability to learn, improving ICT skills, developing the spirit of enterprise, and improving foreign language learning. The objective of 'making the best use of resources' is also highly relevant when considering adult skills assessment measures.

In 2002, a work programme was developed to realise these objectives. Subsequently, twelve different working groups, comprised of stakeholders and experts, have been working on one or more objectives of the work programme, for example by supporting the implementation of the objectives for education and training systems at national level through exchanges of good practices, study visits, and peer reviews.

One of the twelve groups, the Standing Group on Indicators and Benchmarks, has focussed on developing indicators to monitor progress on the work programme's specific objectives. In July 2003, the Standing Group presented a list of indicators to support the implementation of the work programme, and suggested the development of several new indicators, including indicators for language competencies and learning-to-learn skills, ICT skills and indicators on social cohesion and active citizenship (Standing Group on Indicators 2003). Another standing group has focussed on basic skills. In November 2003, the basic skills working group presented a report which contained proposals for definitions of essential basic skills in eight domains (Working Group 2003).

The so-called 'Copenhagen process' is a third set of activities which is relevant in the context of adult skills assessment. With the Copenhagen Declaration, the EU Ministers for Vocational Education and Training (VET) formulated a set of objectives for cooperation in VET, within the broader framework of the Lisbon Strategy and the 'Education and Training 2010' Work Programme. Among other things, the Copenhagen Declaration calls for common principles for the validation of non-formal and informal learning to help ensure greater compatibility between approaches in different countries and at different levels (cf. Lisbon-to-Maastricht Consortium 2004).

Interim Evaluation and Progress Reports on "Education and Training 2010"
The Commission presented an interim evaluation of the implementation of the Education and Training 2010 programme and, with the Council, developed a joint report for the European Council (Commission 2003a) in the spring of 2004. The interim evaluation highlighted that too many young people fail to acquire key competencies, too few adults participate in further learning, and a language proficiency indicator had not yet been developed. A progress report from the Commission, published in March 2005 (Commission 2005a), confirms many of the findings of the interim evaluation. It also emphasised the need for the further development of valid statistical indicators of progress towards the Lisbon objectives.

Life-Long Learning Indicators Inadequate
Thus, although the promotion of lifelong learning is one of the key elements in the Lisbon Strategy, the structural indicators which are intended to assess progress towards achieving comprehensive lifelong learning are inadequate, as they emphasise learning activities rather than outcomes. Although preparatory work to develop quality indicators of lifelong learning has been carried out (Commission 2002), the only indicator which specifically addresses the question of lifelong learning concerns the share of the adult population aged 25 to 64 who state that they received education or training in the four weeks preceding a survey. Other types of statistical information on lifelong learning, whether in relation to the Lisbon strategy or more generally, focus on educational attainment levels (e.g. Bainbridge et al. 2004) or on public spending on training and education.

Limited Information on Returns on Investments
Moreover, even though the Member States differ significantly in their level of public investment in education and training and in the scope of continuing education activities, little is known about whether these differences in spending and activities are related to differences in skills levels and characteristics. In 2000, public expenditure on education in Denmark, for example, amounted to almost 8.5 per cent of GDP. In Germany and Ireland, the figures were 4.5 and 4.4 per cent, respectively. Yet there is very little information available to support a conclusion that the workforce in Denmark is much better equipped in terms of skills than the workforces in Germany and Ireland. Data on youth education attainment levels are not very informative in this respect (the level in Ireland is higher than in Denmark), and not much can be said about the strengths and weaknesses of formal education systems.
2.2. Recent International Initiatives to Improve Knowledge on Adult Skills

A Strategy for the Direct Assessment of Adult Skills
Against this background, the project 'Defining a Strategy for the Direct Assessment of Adult Skills', funded by the Leonardo da Vinci Programme, addresses the possibilities and challenges that must be confronted in pursuing a European adult skills assessment initiative. The overall objective of such an initiative is to provide better and more valid information on the competences of adults in the EU Member States and on progress towards the promotion of life-long learning.

Proposed Strategy: Combination of Tests and Self-Reports
The study "Developing a Strategy for the Direct Assessment of Adult Skills" concluded that the development and implementation of adult skills assessment is feasible and realistic, even if it is connected with a number of limitations and risks (Danish Technological Institute et al. 2004). A number of skills were identified for which standardised, reliable assessments are available or feasible: literacy skills, numeracy skills, problem solving / analytical skills, foreign language competency, job- or work-related "generic skills" and ICT skills. On the other hand, it was concluded that it is not presently feasible to directly assess some skills of interest to policy makers, for definitional, methodological, or perhaps political reasons. These include entrepreneurial skills, some social skills, and learning-to-learn skills or the ability to learn.

Furthermore, the study emphasised that there are strengths and weaknesses related to both survey-based self-assessments and direct testing. As for testing, tests are restricted to a narrow range of skills, and developing valid, reliable and comparable tests for additional skills will be costly and time-consuming. Self-assessments, in which individuals are asked to say "how good" or "how effective" they are at certain activities, are subject to self-esteem bias, may be unreliable and are difficult to validate. Self-reports linked to jobs, where individuals are asked directly about the work activities and requirements of the job and asked, for example, "how important" a particular skill is in the job, can provide valid and reliable estimates of skill within the context of the job. However, this method delivers measures of the skills actually used in jobs, rather than direct measures of the skills that individuals possess. Further, this method cannot be applied to the economically inactive.

Against this background, the study concluded that there are several arguments for integrating different methods into a single assessment.

Eurostat's Adult Education Survey (AES)
The second EU initiative in the area of adult skills assessment is Eurostat's Adult Education Survey, which is currently in its pilot phase and is planned for implementation by the EU's member states in the period 2005-2007. With respect to skills assessment, the AES concentrates on assessment of the adult population (25 to 64 years) in two domains: ICT skills and language skills. These skills are assessed on the basis of self-reporting. The use of and familiarity with ICT is considered the best proxy for ICT skills. Language skills are measured using questions concerning the use of foreign languages and the frequency and context of foreign language use. Optional questions may include questions on the self-assessed level of language skills (Eurostat 2005).
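As an illustration of how such proxy-based self-reporting can be turned into a skills indicator, the sketch below scores a handful of hypothetical ICT-use items into a coarse index. The item names, frequency scale, cut-offs and level labels are invented for this example and are not taken from the actual AES questionnaire.

```python
# Hypothetical self-report items on ICT use (1 = never ... 4 = daily); not the AES instrument.
ICT_USE_ITEMS = ["email", "word_processing", "spreadsheets", "online_search"]

def ict_proxy_level(responses):
    """Map self-reported frequency of ICT use to a coarse proxy skill level.

    responses: dict mapping item name -> frequency code 1-4.
    Returns 'low', 'medium' or 'high'; the cut-offs are illustrative assumptions only.
    """
    total = sum(responses.get(item, 1) for item in ICT_USE_ITEMS)
    if total <= 6:
        return "low"
    if total <= 11:
        return "medium"
    return "high"

# Example respondent: frequent email and search use, little spreadsheet use.
example = {"email": 4, "word_processing": 3, "spreadsheets": 1, "online_search": 4}
print(ict_proxy_level(example))  # -> 'high' under these illustrative cut-offs
```

The point of the sketch is simply that such an index measures reported use, not demonstrated ability, which is why the report treats proxy-based self-reporting and direct testing as methodologically distinct options.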
The PIAAC-Initiative
In parallel to these efforts, the preparation of an adult skills assessment has also been launched in the framework of the OECD. Following up on the ALLS and IALS surveys, which assessed prose literacy, document literacy and numeracy, and in the case of ALLS also problem solving and reasoning, the PIAAC initiative aims to develop a new round of surveys, bringing new adult skills into the assessment. The strategy proposed by the OECD secretariat in September 2005 is to focus on ICT skills and document literacy skills in a 2009 survey round as regards direct testing of skills, and possibly (contingent on a positive evaluation of instrument validity and reliability in an international context) to supplement this with an indirect assessment of skills, based on a battery of questions utilising the job-requirements approach and/or self-reporting. In later rounds of surveys, direct tests are to be developed for additional skills (OECD 2005). Governments have not yet given their final approval of the proposed strategy, and it is uncertain how many countries will choose to participate in PIAAC. The European Commission, DG Education and Culture, participates in the preparatory work of PIAAC.

2.3. European Commission Feasibility Study

Against the background described above, the European Commission wishes to move ahead with the preparation of an adult skills assessment at the European level, which may or may not be implemented in the framework of a geographically more far-reaching PIAAC initiative. The Commission has thus decided to carry out a feasibility study, which is to examine the possible use of specific methodologies and techniques for the assessment of skills of the adult population as well as the feasibility of using international qualifications frameworks or other similar international classifications as reference levels for the assessment of the level of skills of adults within the EU. Specifically, the feasibility study will aim at the following objectives:

a) Computer based assessment
• The establishment of an analytical framework based on the analysis of examples of computer based assessment of skills, including adaptive testing and the open-source concept, in order to identify benefits and disadvantages of computer based testing, the accuracy of skills measurement, cultural bias, etc.
• The feasibility of testing the whole scope of skills recognized at the EU level as politically relevant should be explored.
• Analysis of selected national and international surveys and data on skills collected using computer based assessment.

b) Self-reporting versus direct testing
• Specific comparison of two key methodologies, self-reported assessment and direct testing, used for the assessment of skills, especially for adult skills assessment. Focus here will mainly be on two skills areas covered by the Eurostat Adult Education Survey: ICT skills and language competences.

c) International qualifications frameworks and other classifications
• Assessment of the feasibility of using international qualifications frameworks and other classifications, such as the International Standard Classification of Education (ISCED) or the European Qualifications Framework (EQF) under preparation by the European Commission, for the assessment of the level of skills of adults.

The Commission has asked Danish Technological Institute to carry out the feasibility study. The following chapters present the preliminary results of the analysis. Chapter 3 addresses the question of self-reports vs. direct testing. Chapter 4 and Chapter 5 turn to two questions which are relevant in connection with the direct testing of skills: computer based assessments and the relevance of international qualifications frameworks in connection with adult skills assessments.
3. Self-Reporting vs. Direct Testing in Adult Skills Assessment

The theme of this sub-analysis is a comparison of two key methodologies: direct testing and self-reported assessment. The advantages and disadvantages of each approach will be presented and discussed, specifically in relation to adult skills assessment. There is a particular focus on two skills areas which are covered by the Eurostat Adult Education Survey: ICT skills and language competences.

A comparison of self-reporting and direct testing should include an assessment of the strengths and weaknesses of each approach. The analysis focuses first on testing in skills assessment. It considers the nature of tests, issues of contention in skills testing, and reviews the methodological and practical experiences with testing in adult skills assessments. Similarly, the following section focuses on self-reporting and self-assessment in adult skills assessment. The section following this provides information that directly allows us to compare testing and self-reporting in skills assessment.

3.1. Tests in Skills Assessment

The concept of "testing" has several possible meanings. It is, however, closely related to the concept of "measurement". In educational testing, tests are thus commonly defined as the measurement of student knowledge, abilities and aptitudes. Skills assessment via tests therefore concerns the measurement of skills using tests. Measurement has, in turn, been defined as "the assignment of numerals to objects or events according to some rule" (Stevens 1946). In the testing of individuals, measurement follows the presentation of a task, or stimulus, to the test taker. Measurement concerns the response of the individual to that task.

Testing proceeds from the assumption that phenomena that are not directly observable can also be measured. The "literacy", "knowledge", "beliefs", etc., of an individual are not directly observable by the researcher. However, it is posited that inferences can be made from test scores to the underlying, but unobservable, phenomena. This will be the case where test variables are valid, i.e. measure the underlying dimension that they are intended to measure, and where tests are applied consistently, resulting in reliable data.

3.1.1. The Evolution of Testing

Testing of the skills of individuals has a long history. Originally, the focus was exclusively on the skills of individuals in different contexts, the overall purpose being to measure the skills of individuals as a background for decisions – by the individual him- or herself or by, e.g., educators or employers – pertaining to this individual. It is only from the 1980s onwards that skills testing with a view to policy development and accountability, where entire populations are the object of investigation, has become widespread.
The present generation of large-scale international skills tests was preceded by the development of practices and methods in educational testing and the evolution of psychometric methodologies.

Practices of Educational Testing
Individual educational testing dates back to the 19th century. In the early years of the twentieth century, the new discipline of educational psychology was established and found a home in the new colleges of education and teacher training institutions, and most psychologists of education became engaged in the development of educational testing. Educational testing is now used widely: as entry exams, as an educational tool to identify student strengths and weaknesses, to place students at the right levels, and to monitor student progress (Kubiszyn and Borich 2005). Increasingly, educational testing is also being used for accountability purposes (Ravitch 2002).

Psychometric Testing
These developments went hand in hand with – and were indeed to a large extent preconditioned by – the evolution of psychometric testing, the field of study concerned with the theory and technique of psychological measurement, including the measurement of knowledge, abilities, attitudes, and personality traits. Much of the early theoretical and applied work in psychometrics was undertaken in an attempt to measure intelligence (e.g. Binet 1903). More recently, psychometric theory has been applied in the measurement of personality, attitudes and beliefs, academic achievement, and in health-related fields. Measurement of these unobservable phenomena is fraught with difficulties, and much of the research and accumulated art in this discipline has been developed in an attempt to properly define and quantify such phenomena.

Ability Testing, Aptitude Testing and Personality Testing
Psychometric testing is now frequently described as falling into three main categories: ability testing, aptitude testing, and personality testing. Ability tests measure a person's potential, for instance to learn the skills needed for a new job or to cope with the demands of a training course, and should not be confused with attainment tests, which specifically assess what people have learnt, e.g. mathematical ability or typing skills. Evidently, what people have learned does depend on their ability in that domain in the first place, so the scores on the two types of test are conceptually linked.

There is no widely accepted definition of the difference between ability and aptitude. However, it is commonly agreed that to some extent the two terms refer to the same thing: aptitude referring to specific ability, and ability referring to general aptitude. Aptitude tests tend to be job-related and have names that include job titles, such as the Programmers Aptitude Series (SHL). Ability tests, on the other hand, are designed to measure the abilities or mental processes that underlie aptitude and are named after them, e.g. Spatial Ability.
Personality tests concern tests which seek to measure – for a range of different purposes – the enduring characteristics of the person that are significant for interpersonal behaviour (Goodstein and Lanyon 1975).

3.1.2. Large-Scale Skills Tests

The development and implementation of tests of aptitudes in large-scale surveys for policy development and accountability purposes has a much shorter history. However, such large-scale tests have been made possible by methodological and technological developments in relation to testing along a number of dimensions.

Focussing on international surveys which allow international comparison, the TIMSS (Trends in International Mathematics and Science Study) survey was first implemented in 1995 and focussed on the mathematics and science aptitudes of students in the 4th and the 8th grade in a number of countries. Additional rounds of test development were carried out, and new rounds of data collection were implemented in 1999 (8th grade students only) and in 2003 (both 4th grade and 8th grade students). The PIRLS (Progress in International Reading Literacy Study) was carried out in 2001. It focussed on reading literacy and involved 4th grade students in 20 countries. In the framework of the OECD, the PISA study focuses on 15-year-old students and measures aptitudes in mathematics, science and reading. Development work for the first survey started in the mid-1990s, and a first round of data collection was carried out in 2000. The second round was carried out in 2003 and a third round of data collection is being implemented in 2006.

Skills assessment using tests among adults has also been implemented from the mid-1980s onwards. In the United States, the National Center for Education Statistics (NCES) has conducted assessments of U.S. adult literacy since 1985, and a national assessment of adult literacy was implemented in 1992 and again in 2003. In France, the IVQ was a response to dissatisfaction with the international IALS survey (see below). Two pilot tests were implemented in December 2000 and April 2002. Full-scale data collection was carried out in November 2000 among a representative sample of adults aged 16-65 in 10 French regions.

Internationally, the IALS (International Adult Literacy Survey) was the first attempt to implement a large-scale adult skills assessment based on tests on an international basis (OECD 2005a). IALS has been administered in three waves of data collection: 1994 (9 participating countries), 1996 (5 participating countries or regions) and 1998 (9 participating countries or regions). The LAMP project (Literacy Assessment and Monitoring Program) is an extension of IALS at the lower levels of literacy, to be used primarily in developing countries. The survey is still in the development phase. As of mid-2006, El Salvador, Kenya, Mongolia, Morocco, Niger and the Palestinian Autonomous Territories are participating in LAMP's pilot phase. The participants have finalised all of the testing instruments and are now translating and adapting them for local administration.1

The Adult Life Skills and Literacy Survey (ALLS) is the successor to the IALS surveys. It is patterned on IALS and involves the administration of direct performance tests to a representative sample of adults aged 16-65. As with IALS, it focuses on literacy and numeracy. In addition, a problem solving module was also included in the test battery. The pilot survey, involving data collection in 7 countries, was conducted in late 2003, and the first results were published in 2005 (OECD and Statistics Canada 2005).

3.1.3. The Development Process of Large-scale Skills Assessment Tests

During the past 15 years, considerable experience has been gained in the development and implementation of large-scale skills assessment tests. Based on these experiences, this section provides a summary of the elements that typically go into the development process.

1. Theory-related definitions of theoretical categories, tasks and variables
Theory-based test development means that the definition of theoretical and operational variables takes place in accordance with an overall theory.

Advantages of theory-based test development

There are several advantages if tests can be developed on the basis of a validated theory with predictive force. First, a valid theory will – by definition – include relevant theoretical and operational definitions of variables, just as it will identify relations between variables.2 These definitions and relations constitute a very useful starting point for the identification of skills domains where testing is relevant and for the development of operational variables and test items.

Second, skills tests that are based on validated theory hold particular promise in terms of interpretability and comparability. For instance, if a theory of speaking defines different proficiency levels, which have been demonstrated to be relevant for the ability of test-takers to perform certain tasks, test results can then be associated with difficulty levels and sound judgements can be made about different levels of performance. Validated theory may thus define common reference points against which relative performance can be judged in absolute terms. In the absence of any common reference points, the usefulness of test results for policy purposes is limited, as it is impossible to know whether a high, low, or average score is good or bad.

Conceptual frameworks as an alternative starting point

Unfortunately, theories with predictive force are not very common in the social sciences. For this reason, the development of skills tests is frequently based not on coherent and empirically validated theories but rather on "theoretical conceptual frameworks", the empirical relevance of which is to some extent an unresolved matter.

1 See http://www.uis.unesco.org/ev.php?URL_ID=6409&URL_DO=DO_TOPIC&URL_SECTION=201.
2 This argument presupposes that 'theory' is defined as a causal model with predictive force. There are other definitions of the concept, of course. These are not immediately relevant in the present context, however.
The theoretical framework of the ALLS survey (Murray et al. 2005), for instance, included as one starting point the work of DeSeCo ("Definition and Selection of Competencies", cf. Trier 2003), an OECD project which attempted to define "key competencies for a successful life and a well-functioning society". However, the skills definitions in DeSeCo were to a high extent the result of a political process rather than a stringent research process, and the definitions remain broad and contestable. ALLS therefore reviewed existing theory and approaches to measurement in 7 domains in order to "construct frameworks for assessment that rendered explicit the factors that underlie the relative difficulty of tasks in each domain" (Murray et al. 2005: 19) and thus arrive at a starting point for the development of test items in each domain.

While the conceptual framework in each domain aimed at improved measurement, a number of other potential benefits derive from developing theoretical conceptual frameworks: among other things, a framework provides a common language and a vehicle for discussing the definition of the skill area, and an analysis of the kinds of knowledge and skills associated with successful performance provides a basis – even if incomplete – for establishing standards or levels of proficiency (Kirsch 2005: 92).

Figure 1 illustrates the elements and stages in the development of the conceptual framework for literacy measurement in NALS (the US National Adult Literacy Survey), IALS and ALLS. A similar framework was also used in PISA in connection with the reading literacy measure and in several other connections.

Figure 1. Overall Contents of the Literacy Framework in IALS and ALLS
1. Defining literacy
2. Organizing the domain
3. Task characteristics
4. Identifying and operationalizing variables
5. Validating variables
6. Building interpretive scheme
Source: Kirsch (2005: 93)
The Significance of Reasoning and Arguments in Overall Framework Development
It should be noted that the contents of the framework are to a high extent based on reasoning and arguments. Various definitions of literacy are discussed, and the final definition is the result of the work of a committee comprised of a group of nationally recognized scholars, practitioners and administrators.3 As for the organization of the domain, this concerns the number of dimensions or scales that shall be used for reporting literacy. As some believe that reading is not a single, one-dimensional skill, literacy is not necessarily best represented by a single scale or a single score along that scale. In the case of IALS and ALLS, a compromise was reached among the various organizing concepts that was felt to reflect a number of salient notions from the literature, and three scales were hypothesized: a prose literacy scale, a document literacy scale, and a quantitative literacy scale.
Identification of Tasks and Operational Variables that Affect the Literacy Process
The identification of task characteristics is the identification of variables that are likely to affect the literacy process, and that should be taken into consideration when a) selecting materials to be used in the assessment and b) determining task difficulty levels. Three categories of variables were identified: context/content, referring to the particular context or purpose of the literacy process; material/texts, referring to different types of texts; and processes/strategies, referring to the goal or purpose the readers are asked to assume while they are reading and interacting with texts. As for context/content, for example, it was argued that materials selected for inclusion should represent a broad range of contexts and contents and should be universally relevant, for which reason the variable was operationalized into six categories: home and family, health and safety, community and citizenship, consumer economics, work, and leisure and recreation.
Validation of Variables: Identification of Variables that Affect Difficulty
This stage in the development process consisted of a procedure for validating the set of variables that had been identified as likely to affect task performance. Thus, out of the total number of variables identified in connection with the three broad categories (context/content, material/texts and processes/strategies) as likely to affect the literacy process, a limited number of variables with high expected explanatory power for task difficulty were selected. On the basis of a model of prose and document processing in reading, which grew out of earlier exploratory work (Fisher 1981; Guthrie 1988; Kirsch and Guthrie 1984b), several crucial steps in the reading process were first identified. The model depicts the reading process as a goal-oriented process, where readers search for requested information. Based on previous research (Mosenthal and Kirsch 1991), which had analysed the explanatory power of the model, three variables were then identified as being among the best predictors of task difficulty for prose and document literacy: a) type of requested information, b) type of match and c) plausibility of distractors.4
The Scoring of Tasks
The variables that were identified as the best predictors of task difficulty were used to score each test task.
3 “Literacy is using printed and written information to function in society, to achieve one’s goals, and to develop one’s knowledge and potential.”
4 Two other variables were constructed for the quantitative literacy scales: type of calculation and operation specificity.
For instance, one task refers the reader to a page from a bicycle owner’s manual to determine how to ensure the seat is in the proper position. The variable Type of Match (TOM) was scored 3, because the reader needed to identify and state in writing two conditions that needed to be met, rather than simply make a synonymous match, in which case the score would have been 1. Type of Information (TOI) was scored 3 out of 5, as the requested information was a manner, goal, purpose, condition, or predicative adjective, not for instance a person, animal, place, or thing, in which case the score would have been 1. Finally, the variable Plausibility of Distractor (POD) received a score of 2, as distractors in the text corresponded literally to or were synonymous with the requested information, but did not appear in the same paragraph.
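To make the coding scheme concrete, the scores just described for the bicycle-manual task can be written down as a simple record. This is purely an illustrative sketch: the dictionary layout and value ranges are hypothetical, and only the three variable names and the example scores come from the text above; it is not an official IALS/ALL data format.

```python
# Illustrative only: one coded test task, using the process variables described above.
# The layout is hypothetical; only the variable names and scores follow the text.
bicycle_manual_task = {
    "task_id": "prose_bicycle_seat",
    "type_of_match": 3,               # TOM: two conditions must be identified and stated in writing
    "type_of_information": 3,         # TOI: a manner/goal/condition, scored 3 out of 5
    "plausibility_of_distractor": 2,  # POD: distractors present, but not in the same paragraph
}

# Coded features of this kind are what later analyses relate to empirically observed task difficulty.
```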
2. Development of Background Questionnaire
A background questionnaire is developed, containing questions on relevant background variables for the person who is to take part in the test, as well as self-assessments of proficiency in the skills domains that are to be tested. The development of the questionnaire involves repeated rounds of expert consultations and pilot testing.
3. Pilot Testing and Refinement of Frameworks
The scoring of an initial set of test tasks is usually followed by small-scale pilot tests to confirm key theoretical and measurement assumptions. Networks of experts carry out large-scale skills assessment development and translation of assessment items in each skill domain. Full-scale piloting of assessment instruments is carried out in all the participating countries, and items are selected for inclusion in the final assessment.
4. Establishing Scales Using Item Response Theory
Item response theory (IRT) scaling procedures are frequently used to establish scales for a set of tasks with an ordering of difficulty that is essentially the same for everyone. The use of IRT scaling procedures means that the fit of each test item to the underlying statistical model is empirically tested rather than assumed (OECD 2000: 88, 93-94; cf. also section 4.1 below). The application of item response theory presupposes a calibration study: the items are given to a sufficient number of test persons whose responses are used to estimate the item parameters. Establishing scales using IRT thus involves several steps. First, the difficulty of tasks is ranked on the scale according to how well respondents actually perform them. Next, individuals are assigned scores according to how well they do on a number of tasks of varying difficulty. The scale point assigned to each task is the point at which individuals with that proficiency score have a given probability of responding correctly. IALS and ALL used an 80 percent probability of correct response. Thus, individuals estimated to have a particular scale score perform tasks at that point on the scale with an 80 per cent probability of a correct response. It also means they have a greater than 80 percent chance of performing tasks that are lower on the scale. It does not mean, however, that individuals at a given level can never succeed at tasks with higher difficulty values.
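As a minimal sketch of the scaling logic just described – assuming a simple one-parameter (Rasch-type) logistic response function with a hypothetical slope and item difficulty, not the actual IALS/ALL estimation machinery – the code below finds the scale point at which a person has an 80 per cent probability of answering a given item correctly (the "RP80" point used to place tasks on the reporting scale).

```python
import math

def p_correct(theta, difficulty, scale=50.0):
    """Probability of a correct response under a simple logistic (Rasch-type) model.
    theta: person proficiency on the reporting scale (e.g. 0-500)
    difficulty: item difficulty on the same scale (hypothetical value here)
    scale: slope parameter controlling how quickly probability changes
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty) / scale))

def rp_point(difficulty, rp=0.80, scale=50.0):
    """Scale point at which the probability of a correct response equals rp.
    Solves rp = 1 / (1 + exp(-(theta - difficulty) / scale)) for theta."""
    return difficulty + scale * math.log(rp / (1.0 - rp))

# Hypothetical item with difficulty 250 on a 0-500 scale
item_difficulty = 250.0
print(round(rp_point(item_difficulty, rp=0.80)))  # RP80 point, about 319
print(round(rp_point(item_difficulty, rp=0.65)))  # a lower standard (RP65) places the same item lower on the scale
```

The second call already hints at the issue discussed later in this chapter: the placement of tasks, and hence of respondents, depends on the response-probability standard chosen.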
5. Building Interpretative Schemes
Once scores are placed along each of the scales using a criterion of, for instance, 80 percent, it is possible to see to what extent interactions among task characteristics capture the placement of tasks along the scales. Analyses of the task characteristics reveal
the information-processing skills needed to perform the task, and the order of these skills. To capture the order, the scale is divided into levels reflecting the empirically determined progression of information-processing skills and strategies. The levels are selected not as a result of any statistical property of the scales, but rather as a result of the shifts in skills and strategies required to succeed at various levels. Thus, in the North American Literacy Surveys, when researchers coded each literacy task in terms of the different process variables mentioned earlier, they noted that the values for these variables tended to “shift” at various places along each of the literacy scales. These places seemed to occur at roughly 50-point intervals, beginning around 225 on each scale (cf. Kirsch et al. 1998). While most of the tasks at the lower end of the scales had code values of 1 on each of the process variables, tasks with scores around 275 were more likely to have code values of 2, etc. Based on these findings, researchers defined five levels of proficiency: level 1 with the score range 1-225, level 2 with the score range 226-275, level 3 with the score range 276-325, level 4 with the score range 326-375 and level 5 with the score range 376-500. These levels were also used in IALS and ALL. For instance, Prose Level 2 (score range 226-275) was described as follows in IALS (OECD 2000: 95): “Tasks at this level generally require the reader to locate one or more pieces of information in the text, but several “distractors” may be present, or low-level inferences may be required. Tasks at this level also begin to ask readers to integrate two or more pieces of information, or to compare and contrast information”.
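Purely for illustration, the five proficiency levels just described can be expressed as a lookup on the 0-500 reporting scale. The boundary values are those reported in the text; the helper function itself is a hypothetical convenience, not part of any survey toolkit.

```python
import bisect

# Upper boundaries of levels 1-4 on the 0-500 reporting scale (level 5 runs up to 500)
LEVEL_BOUNDARIES = [225, 275, 325, 375]

def proficiency_level(score):
    """Map a scale score (0-500) to the IALS/ALL proficiency levels 1-5."""
    if not 0 <= score <= 500:
        raise ValueError("score must be on the 0-500 scale")
    return bisect.bisect_left(LEVEL_BOUNDARIES, score) + 1

print(proficiency_level(225))  # level 1 (score range 1-225)
print(proficiency_level(250))  # level 2 (score range 226-275)
```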
6. Development of implementation procedures and implementation of data collection
High-quality data collection procedures are crucial in large-scale skills assessment, especially in an international comparative assessment project. In IALS, several measures to ensure the reliability of data were imposed. Survey administration guidelines specified that each country should work with a reputable data collection agency or firm, preferably one with its own professional, experienced interviewers. Rules were established concerning supervision, quality checks, etc. Precautions against non-response bias were taken and specified in the administration guidelines. Countries were required to capture and process data files using procedures that ensured logical consistency and acceptable levels of error, and to map national datasets into highly structured, standardised record layouts. Statistics Canada ran various checks on range, consistency, etc. Test scorers received extensive training; re-score reliability was calculated, with re-scoring of one country's tests done by another participating country. In-depth analyses were conducted to assess data quality, and cases which presented problems for international comparability were noted.
3.1.4. Issues of Contention
There is a general presumption in favour of testing in skills assessment as the best method of assessing individuals’ skills, because tests are “objective”. They do not suffer from the potential biases that can come from dependent parties, especially the individual being assessed.
Tests therefore start with a presumption of superior reliability over other methods of assessment. This claim to the high ground of objectivity underpins the history of testing in the field of international skills assessment from IALS onwards, and accounts for the considerable impact of the limited number of such assessments that have hitherto taken place. There remain, however, a number of issues surrounding the use of tests:
1. Questions of Validity
Lack of Consensus on the Concept of Literacy
The “objectivity” of tests can very much be questioned. As noted above, the definition of literacy in the ALLS resulted from a compromise among the various organizing concepts that was felt to reflect a number of salient notions from the literature, and three scales were hypothesized. Attempts to define literacy (Venezky, Wagner, and Ciliberti 1990) and seventy-five years of literacy test development (Sticht and Armstrong 1994) reveal that there are many different ways to approach adult literacy assessment. The fact that assessors of adult literacy over the decades have constructed various representations of adult literacy, and that different representations of adult literacy have produced discrepant findings, as we shall return to, may lead to the conclusion that “the direct measurement of literacy levels in the labour force of industrialized countries” (Benton and Noyelle 1992: 11) is not possible, because “literacy” per se does not exist, in the labour force or anywhere else for that matter, as something to be “directly measured”. Instead, different representations of literacy may be created based on different ideas (theories) of what literacy is and why it should be represented in one way rather than another (Sticht 2005). Measuring a certain idea of literacy, and thereby contributing to the production of a certain knowledge about this idea, is far removed from a notion of an objective measurement of “real literacy”. The same argument can be made in relation to other types of skills.
Construct Validity
Construct validity concerns the relation between the intended variable (the construct, for instance “document literacy”) and the proxy variable (the indicator, for instance Type of Match as used in ALL and IALS) that is actually measured. Kolstad et al. (1998) present evidence which questions the construct validity of the theory underlying the IALS and ALL literacy performance tests. The scale of difficulty was developed on the basis of respondents having an 80 percent probability of getting items correct, both to validate the analysis of what made items more or less difficult and to determine at which points the 0 to 500 point measurement scales should be divided into the five levels of difficulty. But when lower response probabilities were used to test the robustness of the three factors for predicting the difficulty of test items, the factors contributing to the difficulty of items changed. The factors that determined the difficulty of the IALS items changed not as a function of a change in some specified aspect of literacy, but rather as a function of the standard of proficiency adopted for being considered proficient in certain tasks.
IALS and ALL do not measure knowledge of a vocabulary or cultural nature, both of which have been demonstrated to increase with age (Sticht, Hofstetter and Hofstetter 1996; Hofstetter, Sticht and Hofstetter 1999). Instead, they emphasize “search and locate” types of tasks that introduce unknown and possibly irrelevant test variance due to the overloading of working memory. At the same time, it is well established that working memory becomes increasingly less efficient with advanced age (Bernstein et al. 1988; Meyer, Marsiske and Willis 1993). The construct validity of the performance assessments for IALS and ALL is therefore questionable for older adults (Sticht 2005). The surveys may produce serious underestimations of the breadth of materials that older adults can read and comprehend using their more extensive, and in some cases specialized, knowledge base, and of the tasks they can perform given sufficient time to study materials, without the pressure for efficiency typical of test-taking situations. The latter are of questionable “real world” ecological validity in the lives of most adults over the age of 25 who are not in school.
Table 1. Correlations between the Four Assessment Domains, PISA 2003. All Participating Countries and OECD Countries
                               Reading          Science          Problem-solving
                               r      SE        r      SE        r      SE
Mathematics
  OECD countries               0.77   0.003     0.82   0.002     0.89   0.001
  All participating countries  0.77   0.002     0.82   0.002     0.89   0.001
Reading
  OECD countries                                0.83   0.002     0.82   0.002
  All participating countries                   0.82   0.001     0.82   0.002
Science
  OECD countries                                                 0.79   0.002
  All participating countries                                    0.78   0.002
Source: OECD 2005b. Note: Latent correlations. These correlations are unbiased estimates of the true correlation between the underlying latent variables. As such, they are not attenuated by the unreliability of the measures, and will generally be higher than the typical product-moment correlations that have not been disattenuated for unreliability.
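The note above refers to correlations that have been disattenuated for unreliability. For reference, the sketch below applies the standard Spearman correction for attenuation; this is the generic textbook formula, not necessarily the exact latent-variable procedure used by the OECD, and the observed correlation and reliability values in the example are hypothetical.

```python
import math

def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman correction for attenuation: estimate the correlation between the
    underlying (latent) variables from the observed correlation and the
    reliabilities of the two measures."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Hypothetical example: an observed correlation of 0.72 between two test scores
# with reliabilities of 0.85 and 0.90 corresponds to a latent correlation of about 0.82.
print(round(disattenuate(0.72, 0.85, 0.90), 2))
```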
Additional evidence exists to question the construct validity of the three IALS and ALL literacy scales (prose, document, quantitative) as distinct scales. Independent analyses found that the IALS scales correlated at around +0.90 (Reder 1998; Rock 1998), sharing some 80 percent of their variance, yet there is no theoretical statement about cognition that might account for this large overlap among the three literacy scales. These findings seem to argue against the construct validity of the three scales as “distinct”, as posited by the theory underlying the IALS. It seems likely that much of the commonality among the three scales results from the fact that all three contain overlapping knowledge and language components (words, syntactical rules). The same finding applies to PISA, where high correlations have been found between student test scores across the four skills domains tested in PISA 2003 (reading, mathematics, science and problem solving).
The correlation is high, for example, between mathematics and reading (r = 0.77) and between mathematics and science (r = 0.82).
Finally, the use of “real world” tasks, as they are applied in IALS, ALL and other skills tests, has been criticised (Venezky 1992). Such tests use tasks that engage complex information-processing activities with unknown mixtures of various types of knowledge and processes. For this reason it is not clear what they assess or what their instructional implications are. Without knowing what specific knowledge or skills are being assessed in “real world” tasks, it is not clear to what extent test performance reflects literacy ability or some other abilities, such as problem solving, reasoning, management of test-taking anxiety, interpersonal skills, or some complex, interactive combination of all these or others.
Construct Validity Will Remain an Issue in Skills Testing
More generally, the question of construct validity will necessarily remain an issue in connection with tests. Construct validity cannot be measured, only evaluated. Constructs are by definition not directly observable. Evaluations of construct validity can therefore take the form of identifying co-variations between observable indicators, which suggest that a common factor underlies these variations. This common factor is not necessarily identical to the construct, however, as there is no logical necessity linking the construct and the proxy variables. To take an example, the operationalization of theoretical variables in the IALS and ALL literacy performance tests, as mentioned above, included the selection of material from six different adult context/content categories. While there are good arguments for choosing these categories, it is also clear that they presuppose and define a certain kind of “modern life”, consisting of home and family, work and leisure, consumption, health and safety concerns and a certain involvement in society. Instead of claiming that ALL and IALS measure literacy, it may therefore be argued – perhaps more convincingly – that they measure “literacy in certain modern life situations”, as tasks refer to texts that presuppose a certain life situation. In a similar vein, the “quantitative literacy” tests in the IALS survey were criticised as being too far removed from an adequate test of numeracy, because they required reading skills in order to complete the tests.
The Question of Standards: How Good is Good Enough?
In NALS, IALS and ALL, developers set a criterion of having an 80 percent probability of getting the average item at a given literacy level correct in order to be considered proficient at that level of skill. However, the 80 percent response probability level was essentially arbitrary (Kolstad 1996). Kolstad notes that a 0.65 response probability standard is used by the National Assessment for Educational Progress (NAEP) for children in the public school system in the United States. Using NALS data, he then recalculated the percentage of adults who would be in literacy Level 1 if a standard of 65 percent probability of a correct response were used for adults. In this case the percentage of adults assigned to Level 1 on the prose scale fell from around 20 per cent to 13 per cent. In NALS, the 80 per cent response probability level resulted in almost half of the US population being placed at the two lowest levels of literacy proficiency. In the massive public debate that followed, the most frequently posed questions to NALS test developers
were whether “the literacy skills of America's adults are adequate to ensure individual opportunities for all adults, to increase worker productivity, or to strengthen America's competitiveness around the world?” The NALS report replied that, “[b]ecause it is impossible to say precisely what literacy skills are essential for individuals to succeed in this or any other society, the results of the National Adult Literacy Survey provide no firm answers to such questions” (Kirsch et al. 1993: xviii). In PISA 2000 and 2003 it was acknowledged that the decision on which standard a test taker should live up to in order to qualify for being placed at a certain “level” is essentially arbitrary (OECD 2005c: 253-255). Nevertheless, an approach had to be determined for defining performance levels and for associating test takers with those levels. The decision was that the exact average percentage of correct answers on a test composed of items at a level could vary slightly among the different domains, but would always be at least 50 per cent at the bottom of the level. In other words: to be considered proficient at a level of skill, test takers must have at least a 50 per cent probability of getting the average test item at the level correct.5
2. Reliability Issues
Reliability is the consistency of a set of measurements or of a measuring instrument. In connection with IALS, ALL and PISA, implementation issues with direct consequences for the reliability of data were acknowledged to be extremely important. There are a number of examples of countries whose data have been withdrawn due to shortcomings in the sampling procedure, too low response rates or other deficiencies in data collection. In PISA 2003, the data from England were withdrawn due to too low a response rate among the sampled schools. OECD (2000) identified several factors that will help reduce variability in multi-country surveys of the type analysed here: “The presence of clear and realistic standards, consortia of skilled and experienced institutions, sufficient budgets to fulfil the complex statistical and operational demands imposed in such a study, well developed quality assurance procedures to minimise deviation from specification and to identify problems at a stage where they can be dealt with, and finally, and perhaps most importantly, a willingness on the part of participating countries to adhere to agreed standards and guidelines”.
3. Issues of Costs and Timing
Finally, issues of costs and timing need to be taken into consideration in assessing the advantages and disadvantages of direct testing. Developing tests in whose validity and reliability there can be a high degree of confidence is a costly and time-consuming process, and developing tests which can claim validity and reliability on an international basis redoubles these costs. If the degree of reliability and validity of the tests is thought to vary significantly across countries (for example, through differential data collection mechanisms leading to greatly varying response rates), the international comparability of the skills measures, and hence the overall assessment strategy, is compromised.
5 A precise mathematical formula defines the placement of test takers at the different levels.
3.1.5. The Limits to Testing
Methodological Limits
On the basis of an evaluation of existing tests and a review of efforts towards test development for a wide range of skills, DTI et al. (2004) concluded that testing must be restricted to a rather narrow range of skills. These include literacy and numeracy. They might also include foreign language skills, problem solving and ICT skills. The development of tests for other skills, such as communication skills and entrepreneurial skills, would be costly and unlikely to pass the criteria of validity, reliability and transferability. As for entrepreneurial skills, some private tests are currently available, but they take the form of personality tests of limited policy relevance.6 Research-based tests of entrepreneurship, by contrast, focus on testing knowledge of concepts relevant for entrepreneurship (e.g. Kourilsky and Esfandiari 1997).
Political Risks
There are also political risks that must be considered in connection with large-scale skills testing. Even though the word “skill” is widely used, there is no consensus as to the precise meaning of the term. Skill as a label, which carries with it the prospect of labour market rewards, has historically been a contested concept among employers and employees, and sometimes between men and women, in the definition of jobs (Stasz 2001; Payne 1999). Research indicates that many skills associated with work productivity are context dependent (cf. Bjørnåvold 2000; Stasz 2001). Communication skills, for example, can differ according to audience, purpose, style and mode. Some jobs involve communicating with external audiences or the public (e.g. flight attendants, sales persons), while others involve primarily internal communication with colleagues (Stasz 1997). Moreover, many skills of interest to policy makers are tied to groups or organisations rather than to individuals. Problem-solving skills, for example, can apply to individuals and to teams or groups. The problem-solving capacity of a team rests not on the problem-solving capacities of each participating individual but on the capacities of the group or on the distribution of different types of skills within the group (Stasz 1997). Research also indicates that a group works better if a number of different skills profiles are represented within it (Belbin 2003). While it is a methodologically sound approach to seek to measure only what is measurable, the approach is risky. Many relevant skills and competencies may be overlooked simply because they are not measurable. New international skills assessment initiatives may, if successfully implemented, stir considerable interest. They may therefore also fix attention and the political agenda exclusively on the skills included in the initiative.
6 See e.g. http://www.davincimethod.com/entrepreneur-test/ or http://www.humanmetrics.com/#SBEP
Ethical Limits
A final point concerns the rights and responsibilities of the individual or the family versus public authorities in developing skills. The proliferation of broader notions of skills and competencies, and the predominant belief that “soft” skills will be ever more important in the modern economy, raises new questions. The desire to measure and compare European populations’ “problem-solving competencies”, “critical thinking competencies”, “physical/health competencies”, “communication competencies”, “aesthetic competencies”7 or other “soft” skills may add a new dimension to the way in which governments and international organisations become involved in personal life. It is not a question of governments or agencies directly interfering with specific individuals’ lives based on their performance on an assessment; such problems can be prevented through confidentiality procedures. Rather, the issue is that of new skill domains being pulled into the sphere of public policy and political intervention via measurement and international comparison. In developing a strategy for European adult skills assessment, it is important to consider what governments should be permitted to measure. This is also a question of political feasibility, since skills assessment on a European scale may be controversial.
3.2. Self-Reporting in Skills Assessment
The alternative and potential supplement to direct skills testing is to survey nationally representative samples of individuals. One approach that has occasionally been used to assess skills through surveys is self-assessment, i.e. asking individuals to say “how good” or “how effective” they are at certain activities, or “how often” they use certain skills. An alternative survey-based method of skills assessment is self-report on individuals’ jobs. The idea is to question respondents directly about their work activities and the requirements of their jobs.
Different approaches may be used to distinguish between tests and self-assessments. One way is to distinguish between the types of stimuli presented to the objects of investigation (test takers or survey respondents). Whereas tests in skills assessment seek to measure skills by presenting test takers with a performative task and then measuring the response, skills assessment using self-assessment asks respondents to perform a self-evaluation, i.e. an evaluative task. In practice, however, this distinction quickly becomes less clear-cut. Survey questions in skills assessments also present respondents with a task – namely the survey question – and these tasks can easily include performative elements, for instance recalling the frequency of use of a particular skill during a period of time.
7 All examples of skills/competencies that are frequently mentioned as key competencies in the country contribution reports of the OECD’s DeSeCo initiative for selecting and identifying key competencies for a “successful life and a well-functioning society”, cf. Salganik and Stephens (2003); Trier (2003).
A more precise distinction between tests and self-assessments in skills assessment is therefore whether or not an external evaluation is carried out of the respondent’s solution of tasks.
• In tests, the response of the test-taker is subject to an external evaluation. Often – but not always – there will be an agreed and known correct response to each task presented to the test-taker, and test scores can then be calculated on the basis of the ratio of correct to incorrect responses.
• In self-assessments, there is no external evaluation of the test-taker’s response. The correct answer to the question is known only by the respondent.
The use of self-assessment and self-reporting in skills assessment follows the development and refinement of survey methodologies and technologies. A survey may, in turn, be defined as “an inquiry which involves the generation of data across a sample of cases, and the statistical analysis of the results”.8
3.2.1. The Development of Survey-Based Assessments of Self-Reported Skills
As is the case for the development of testing, the education sector has been important for the increasing use of self-assessment tools during the 20th century. Individual-level self-assessments with a formative (i.e. development-oriented) purpose have been included as an important learning tool in educational settings for several decades. More recently, a range of detailed questionnaires and surveys on skills and job requirements has been developed by O*NET, the US Occupational Information Network, and the data have been organised in a comprehensive database, driven by the objective of providing occupational advice to individuals, human resource personnel and educators/trainers (cf. Peterson et al. 1995).
Assessment of skills in large-scale surveys relying on self-reports with a view to informing policy making is a much more recent phenomenon. With the increasing focus on the knowledge component in modern economies during the 1990s, captured by the concept of the “knowledge economy”, came a growing interest in obtaining information about the skills levels of entire populations. As described above, the development and implementation of direct tests of skills is a time-consuming, expensive and methodologically challenging task, and to date it has only been possible to develop tests of relatively high quality for a limited range of skills types. For these reasons, attempts were made to identify other approaches to skills assessment that were quicker and less expensive to develop and implement. In the mid-1990s, the United Kingdom started development work for a skills survey based on the “job requirement approach”, which will be described in more detail later. The survey was implemented in 1997 and again in 2001 (Ashton et al. 1999; Felstead et al. 2001); a third survey is currently in the field.
8 Adapted from March (1982), cf. also Olsen (1998).
Furthermore, the Skills for Life Survey, commissioned by the Department for Education and Skills and carried out between June 2002 and May 2003, was designed to produce a profile of adults’ basic skills in England, setting them in the broad levels defined by the national standards (DfES 2003). It contained both a self-assessment module and a test module. In Denmark, a survey based on respondents’ self-reported skills, “The Danish National Competence Account”, was carried out in 2003-2004 (Undervisningsministeriet 2002, 2005). This work was inspired by the conceptual work of the OECD’s DeSeCo initiative.
Among recent initiatives, Eurostat’s Adult Education Survey should be mentioned. This survey is currently being implemented in the EU’s member states. With respect to skills assessment, the AES concentrates on the adult population (25 to 64 years) in two domains: ICT skills and language skills. These skills are assessed on the basis of self-reporting. The use of and familiarity with ICT is considered the best proxy for ICT skills. Language skills are measured using questions concerning the use of foreign languages and the frequency and context of foreign language use. Optional questions may include questions on the self-assessed level of language skills (Eurostat 2005).
3.2.2. Issues of Validity and Reliability in Self-Reporting of Skills
The greatest disadvantage of self-assessment as a method of obtaining data is the risk of measurement errors. One type of measurement error in surveys is termed differential item functioning (DIF), referring to the interpersonal incomparability of responses to survey questions. In addition, there are validity questions in connection with the different scales that may be used to measure skills via self-reports or self-assessments. Allen and van der Velden (2005) distinguish between measurement errors resulting from a more or less ‘intentional’ manipulation of answers by respondents, and unintentional discrepancies between the real and reported values.
Unintentional Errors in Responses to Survey Questions
Unintentional measurement errors arise when the answers given by respondents in good faith do not correspond to the ‘real’ value on the variable in question. There are a number of reasons why unintentional measurement errors may occur (cf. Allen and van der Velden 2005).9
Respondents’ Memory and Comprehension
Unintentional measurement errors may be the result of respondents’ problems with comprehension or memory. When survey questions pose cognitive tasks to respondents, their working memory, episodic memory and semantic memory are all essential (Olsen 1998: 37-39).
9 Section 3.2.2 and section 3.2.3 draw extensively on the insights of this paper.
Where survey questions confront respondents with cognitive tasks that they cannot solve due to memory limitations, the reaction may be cognitive indolence, where the respondent delivers answers at random (e.g. Krosnick and Alwin 1987), or satisficing, a strategy whereby the respondent provides answers only to satisfy the interviewer – the result being identical to the effects of cognitive indolence (e.g. Krosnick 1991). Olsen (1998) demonstrated how survey results were significantly affected by the demands that items put on the working memory of respondents.
Problems with comprehension may be especially important as regards self-assessment of skills. To be able to reflect on the skills one possesses and the skills requirements of one’s job, a certain level of (meta-)cognitive skills is required. This may imply that self-assessment is more difficult to administer among lower educated groups (Allen and van der Velden 2005: 12).
The Precision and Clarity of Survey Questions
The precision and lack of ambiguity of survey questions is a significant factor for the validity of survey data. A survey question imposes upon the respondent certain semantic tasks, and some semantic tasks are easier to solve than others. In linguistic theory, it is common to distinguish between the intension and the extension of words (Føllesdal and Walløe 2001). The intension is what an individual understands when they understand a word. The extension – or reference – is the existing object to which the word refers. Not all words have both an intension and an extension: “sausage” has both, but “freedom” has only an intension. Since intension and extension are not identical, many words do not have a precise meaning. On the contrary, the meaning of words and sentences is often vague and diffuse or lacks a clear reference, and words may have both denotations (the main meaning) and connotations (alternative meanings).
The precision and clarity of survey questions is therefore a permanent challenge in surveys. Ambiguity and the possibility of different interpretations may entail that concepts and sentences are not understood in the same way by respondents and by the researcher, and ambiguous formulations may also lead to different understandings and interpretations among respondents (cf. Ward et al. 2002). Olsen (1998) demonstrated the effects of the “semantic fields” of words (the scope of interpretation for individual words) and the “empty spaces” of sentences (the scope of interpretation for questions or sentences) on survey results. In other words, survey results are affected significantly by the precision of individual words, as well as by the precision and unambiguity of entire sentences.
The concept of “skills” is a concept with a wide scope of interpretation, and indeed its meaning appears to be developing over time, cf. section 4.3 below. To use linguistic terminology, the concept of “skills” has only an intension, not an extension. For this reason, imprecision and ambiguity are – at least potentially – a significant source of errors in the self-assessment or self-reporting of possessed or required skills. Dykema and Schaeffer (2000) demonstrated that clarity and complexity are good predictors of measurement errors.
The Problem of External Anchors
Ideally, all respondents in a survey should share the same understanding of the measurement scale in connection with a specific question.
For some types of survey questions, this is the case. Questions such as “what was your income before tax last month?” or “how many hours did you work last week?” build on a natural and well-known numerical scale for registering responses. In common with many questions that seek to measure attitudes or register respondents’ assessments, questions concerning skills have no natural numerical scale against which responses can be registered. A self-assessment scale on skills ranging from “very poor skills” to “very good skills” is, for instance, highly imprecise: what specific skills correspond to “very poor”, “poor” or “good” skills may differ considerably from respondent to respondent. It is in this sense that one can speak of differential item functioning, referring to the phenomenon that perceptions of response scales differ from respondent to respondent (King et al. 2004). Thus, as a result of the ambiguous nature of anchor points, different groups of respondents are likely to use their own frames of reference when answering questions (Ward et al. 2002: 69-70). This may in turn entail systematic overestimation or underestimation of skills by different groups, whose reference groups have respectively a lower or higher level than the population at large. As formulated by Allen and van der Velden (2005), in the case of skills, the frame of reference which respondents apply in interpreting scales “is likely to be strongly biased by the respondent’s own educational background or occupational affiliation. This implies that differences between occupational groups or fields of study are probably biased towards the mean, making it difficult if not impossible to assess the overall skill level or to compare different groups”.
Intentional Errors in Responses to Survey Questions
Respondents may also intentionally respond in ways that do not reflect true values. A frequently reported reason is the social desirability of certain answers (Allen and van der Velden 2005: 12; Victorin, Haag-Gronlund, and Skerfving 1998): respondents may shape their responses to appear more ‘well-functioning’ or ‘normal’. In the case of skills, respondents may find it embarrassing to report very low or very high levels, out of a desire to be ‘normal’, or they may report more extreme values than is true, if they wish to boast (in which case reported levels would be exaggerated in comparison to true levels) or if they wish to appear modest (in which case reported levels would be too low in comparison to true levels). Respondents may also wish to appear consistent. Olsen (1998: 103-105), for instance, reported significant consistency effects, where responses to questions were adapted to earlier questions regarding the same theme.
Different Scales in Skills Assessment
Several different concepts have been used to measure skills. Murray (2003: 141) lists five different dimensions of skills use, some of which were used in connection with the development of the ALLS background questionnaire:
• incidence (whether or not one uses the skill);
• frequency (the number of times the skill is used in a given period);
• range (the range of social situations or contexts within which the skill is put to use);
• complexity (the level of mental complexity involved); and
• criticality (how relevant successful application of the skill is to achieving desired or desirable social, economic, or cultural outcomes).
An “importance” scale is used in connection with the UK Skills Surveys, but the international pilot test of the job requirements approach of these surveys may make use of a “frequency” scale. The choice of key measurement concept is important, as the different concepts measure different dimensions of skills and skills use. Depending on the specific circumstances, a specific measurement concept may thus give rise to validity problems. If the researcher seeks to obtain knowledge about levels of skills complexity possessed, validity problems are likely to occur if the applied measurement concept is that of frequency or importance. For instance, being a cashier in a supermarket may involve the very frequent use of numeracy skills, and numeracy skills are likely to be very important in this job. However, the level of numeracy skills required is most likely not very high.10 Moreover, the concept of importance (a relatively unclear concept) is inherently more complicated than the concepts of level or frequency: the importance of a specific skill is likely to depend on the number of different skills which are required in the job function concerned.
3.2.3. Potential Solutions to Problems of Self-Reporting
The possible solutions to problems of the self-reporting of skills are to a large extent also possible solutions to general measurement problems in connection with questionnaire-based surveys in the social sciences. The literature on this problem has focused on developing ways of writing more concrete, objective, and standardized survey questions and developing methods to reduce incomparability. One of the more recent areas of development has been efforts to identify common anchors that can be used to attach the answers of different individuals to the same standard scale.
Reducing Unclarity and Ambiguity and Assisting the Memory of Respondents
Olsen (1998) and Dykema and Schaeffer (2000) argue, and to some extent also demonstrate, that improvements in the precision and clarity of survey questions and the introduction of remedies to alleviate the demands on respondents’ memory can reduce the effects of ambiguous and unclear survey questions. Dykema and Schaeffer (2000) conclude that retrieval of information is less accurate when the requested information is emotionally neutral, complex, and indistinct from other information. According to Allen and van der Velden (2005: 13), “this suggests that measurement errors can be reduced by formulating items that are clear and unambiguous, that are clearly distinguishable from other items, and that elicit an emotional response from graduates.
10 The example has been provided by Rolf van der Velden.
In the case of skills, the challenge is to formulate items that have a clear and uniform meaning to all graduates, to avoid items that are composites of several underlying dimensions, to choose items that are conceptually distinct from other skills, and to formulate the items in such a way as to tap into the feelings graduates have about their own (lack of) abilities. It is, however, doubtful to what extent the latter suggestion can be implemented”.
Olsen (1998) provides a number of examples of how “probes”, “landmarks”, “cues” and other “aided recall” methods, and other ways of relieving working memory, affected the distribution of responses considerably. Some of these methods are also relevant in connection with self-reported skills assessments, in so far as they can contribute to the precision and clarity of survey questions on skills.
Adjusting Responses or Framing Questions so as to Minimize “Demand Effects”
A second category of possible solutions to measurement problems in self-reporting is to adjust responses or, alternatively, to frame questions and the way they are posed so as to minimize the social desirability of certain responses. The objective in this category of solutions is to stimulate in respondents the attitude that all possible answers can also be regarded as ‘normal’ answers.11 First, confidentiality should be assured. Second, it is preferable if self-assessments are carried out without the presence of an interviewer. There is some evidence that the use of computer-based assessments – where the computer poses as the neutral interviewer and the interviewer acts more as an assistant who is “on the same side” as the respondent – reduces the effect of the presence of an interviewer (DfES 2003: 239). Third, instead of asking respondents to assess for instance how “good”, “strong” or “advanced” they are as regards specific skills or tasks, more neutral scales can be used. Similarly, a more neutral context for the self-reporting may be established than asking respondents to evaluate themselves, or their abilities, as persons. Instead, the respondent can be asked to assess the use or sufficiency of skills in a specific context, for instance the respondent’s job. This approach has to some extent been the method applied in the UK Skills Surveys of 1997 and 2001, which applied a “job requirements approach”. It is also the approach used in connection with the O*NET database, the US Occupational Information Network.12 In the UK Skills Surveys respondents were asked ‘in your job, how important is [a particular job activity]’. The response scale offered was: ‘essential’, ‘very important’, ‘fairly important’, ‘not very important’ and ‘not at all important or does not apply’. Examples of the activities included working with a team of people, working out the causes of problems or faults, making speeches or presentations and planning the activities of others (Felstead et al. 2001). However, even if it may be helpful, this method does not eliminate the problem of potential demand effects: individuals might talk up their jobs to boost their self-esteem.
11 Different measures of social desirability have been developed. These can, at least in theory, be used to statistically correct the data afterwards, using factor analysis or covariance analysis. However, according to Allen and van der Velden (2005: 17) and Richter and Johnson (2001), there is general agreement that it is nearly impossible to eliminate the effect entirely.
12 http://www.onetcenter.org
It is held that they are less likely to do so when reporting their activities than when reporting their competencies in the performance of these activities. At the same time, there are specific drawbacks to the “job requirements” approach:
1) As an indication of the skills of individuals, the method gives us only approximate measures. Jobholders may have too many or too few skills for the job. If too many – and they have relevant under-utilised skills – the jobholders become dissatisfied, and the statistics would underestimate the stock of skills in the working population. If too few, the employer would become dissatisfied with the employee’s performance.
2) The job analysis method cannot be applied to the economically inactive population (who – by definition – do not have jobs). There may be difficulties applying it even to those who are unemployed, though it is possible that an adapted method could be applied to those recently unemployed. Thus, no direct inferences could be made about the skills of the whole adult population from the job analysis approach to measurement.
Potential Solutions to the Anchoring Problem
The earliest and still the most common anchors involve giving the endpoints of (or all) the survey response categories concrete labels – “strongly disagree”, “hawk”, etc. This undoubtedly helps, but is often insufficient (King et al. 2004). An early and still used alternative is the “self-anchoring scale”, where researchers ask respondents to identify the top- and bottommost extreme examples they can think of (e.g. the name of the respondent’s most liberal friend and most conservative friend) and then to place themselves on the scale with endpoints defined by their own self-defined anchors (Cantril 1965). This approach is still used but, depending as it does on external statistics, it often lowers reliability, and it will not eliminate DIF if respondents possess different levels of knowledge about examples at the extreme values of the variable in question.
Allen and van der Velden (2005) distinguish between “ex ante expert anchoring” and “inter-subject anchoring by vignette” to describe two alternative approaches.
Ex Ante Expert Anchoring
Ex ante expert anchoring involves the a priori development by experts of an answer scale with a clear and uniform meaning for all respondents. Ideally, the extreme points on the scale, as well as the mid-point, should correspond to something that all respondents know and assign the same meaning or interpretation to. One ex ante expert anchoring method uses occupational titles as anchors in the scale. Expert ratings are used to locate characteristic examples of occupations at appropriate points over the full range of the scale. Respondents are then requested to position their own skill level with respect to the listed occupations. The method is attractive in theory, but it is based on some assumptions that may not necessarily hold (Allen and van der Velden 2005: 13):
• Anchor occupations are assumed to be clear to all respondents.
• Experts have to be consistent in their rating of occupations.
• Anchors must be transitive: starting with the lowest level, each subsequent anchor in the scale must correspond to a more difficult level.
• It is assumed that an occupation can be regarded as a good proxy of a skill level for a particular skill.
• The assessment is presumed to involve two steps: 1) respondents form an image of the skill level associated with each occupation; 2) they score their own level on that skill relative to these occupations.
The second method of ex ante expert anchoring uses short descriptions of skill levels themselves as anchors, thus avoiding the use of occupational titles. An example from the questionnaire used in the O*NET surveys is provided in Figure 2 (see Appendix 1 for the full questionnaire). Several of the above-mentioned assumptions also pertain to this method. Anchor points are assumed to be clear to respondents, and there must be expert agreement on them. Answers are also assumed to be transitive. It must be regarded as a methodological advantage, however, that the scale directly describes skill levels rather than occupations, which presuppose that respondents are able to form valid and consistent images of the skills requirements related to each occupation.
Figure 2. Example of ex ante anchoring using descriptions of skill levels
“What level of SOCIAL PERCEPTIVENESS is needed to perform your current job?” (response scale from 1 to 7, with anchor descriptions at increasing skill levels)
• Notice that customers are angry because they have been waiting too long
• Be aware of how a coworker’s promotion will affect a work group
• Counsel depressive patients during a crisis period
Source: O*NET
Inter-Subject Anchoring using Vignettes
An alternative way of obtaining anchors is “inter-subject anchoring using vignettes” (King et al. 2004). In this approach, before rating themselves, respondents are asked to rate imaginary persons described in vignettes, i.e. small descriptive illustrations. These vignettes are the same for all respondents. For this reason, the ratings of the vignettes can be used to correct statistically for differences in the scales applied by respondents when rating themselves. The method is attractive in principle, but it is also very resource demanding. If applied to skills assessments, respondents would have to answer a number of questions in relation to vignettes (hypothetical situations) for each skill.13 However, King et al. (2004) have shown that resources can be saved by administering the vignettes to only a random sample (or sub-sample) from the same population as the self-assessments.
13 The model was originally developed in the context of WHO’s World Health Survey.
For example, researchers could include the vignettes only on the pre-test survey; alternatively, for each self-assessment on the main survey they could add, say, one additional item composed of four vignettes asked of one-quarter of the respondents each. In this parametric model the thresholds for the different categories are determined by a set of explanatory variables. These explanatory variables are used in turn to estimate the thresholds for those respondents for whom only self-assessments are available. The same method can also be used post hoc to recalibrate the scales, using a different sample. A simplified illustration of the underlying rescaling idea is sketched after the list below. The parametric model rests on two assumptions:
• Vignette equivalence: the vignettes must all describe different points on one underlying dimension.
• Response consistency: respondents must use the response categories in the same way for the self-assessment and for the assessment of the vignettes.
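As a purely illustrative sketch of the rescaling idea – a simplified version of the nonparametric variant described by King et al. (2004), not the parametric model itself – the code below recodes a respondent's self-assessment relative to that respondent's own ratings of a set of vignettes ordered from lowest to highest. All function names and values are hypothetical.

```python
def recode_against_vignettes(self_rating, vignette_ratings):
    """Recode a self-assessment relative to the respondent's own vignette ratings.

    vignette_ratings must be ordered from the lowest to the highest vignette.
    The result counts how many vignettes the respondent rates at or below their
    own self-rating, a relative position that is comparable across respondents
    even if they interpret the raw response scale differently (DIF).
    """
    return sum(1 for v in vignette_ratings if self_rating >= v)

# Two hypothetical respondents give the same raw self-rating of 3 on a 1-5 scale,
# but rate the same three vignettes differently because they use the scale differently.
print(recode_against_vignettes(3, [2, 3, 4]))  # 2: self-rating at or above two of the vignettes
print(recode_against_vignettes(3, [1, 1, 2]))  # 3: same raw rating, but a higher relative position
```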
3.2.4. The Limits to Self-Reporting
Methodological Limits
The sections above have highlighted a number of methodological problems and challenges in connection with self-reported skills as a basis for adult skills assessment. A number of potential solutions have also been discussed. However, it is fair to say that these solutions can only go some of the way towards solving the problems mentioned. Most importantly, self-reported skills will to a certain extent be characterised by measurement errors, since no efficient method has been devised to remove intentional or unintentional errors in response behaviour entirely.
Political Risks and Ethical Limits
For various reasons, test results appear politically more controversial than the results of questionnaire-based surveys. For this reason, the political risks of implementing large-scale self-report-based skills assessments in the EU can be presumed to be smaller than in connection with tests. The political risk in connection with testing consists to a high extent of the possibility that political attention will be fixed on the relatively few types of skills that can be measured via tests. The ability of surveys to “fix” political attention is most likely smaller than is the case for tests. Moreover, a broader range of skills can be measured using self-reports, for which reason intense political attention directed towards the measured skills can be presumed to be less harmful than if political attention is fixed only on a few testable skills. The political risks in connection with skills assessments based on self-reports or self-assessments are more likely to emerge from a questioning of the reliability and/or validity of survey results. The ethical concerns in connection with self-reported skills assessments can be presumed to be similar to those associated with testing.
3.3. Comparing Tests and Self-Reporting in Skills Assessment
This section juxtaposes and compares testing and self-reporting in adult skills assessment. First, we present some empirical evidence on results derived from testing and self-reporting respectively. This enables us to gauge the extent of the differences in the types of results that the two approaches generate. Empirical comparisons focus mainly on comparisons between tests and self-assessments: little relevant data has been identified which allows a direct comparison of tests and self-reports using the job-requirements approach.14 Second, we present some experience and evidence as regards testing and self-assessment in connection with two specific types of skills: language skills and ICT skills. Third, we provide an overview assessment of the various methodological and practical advantages and disadvantages of the two approaches.
3.3.1. Evidence on the Relation between Self-Assessed and Test Results
What is the relation between respondents’ test scores and their self-assessments as regards the same skill or competence? In principle, it is possible to shed light on this question by comparing the proportions of test-takers at different proficiency levels with the proportion of test-takers that have placed themselves at the same proficiency levels via self-assessments in connection with the same test. Another possibility is to analyse correlations between test scores and self-assessments.
Standards or Levels May Distort Interpretation of Results
Standards or Levels May Distort Interpretation of Results
As for the first possibility, the results of such an analysis will necessarily be influenced by the problem of the definition of levels or standards in tests, as described in a previous section. An example from IALS is illustrative. As mentioned, the measurement range of scores for each scale was 0 to 500 in IALS. For each scale, five levels of literacy were defined, increasing from the lowest level, Level 1 (scores from 0 to 225), through Levels 2 (226-275), 3 (276-325) and 4 (326-375) to Level 5 (376-500), the highest level of literacy. IALS also included a set of scales which asked adults to self-assess how well their reading, writing and numeracy skills met the demands for such skills in their daily lives and at work. The measurement scale for each of these literacy and numeracy skills consisted of five categories: No Opinion, Poor, Moderate, Good and Excellent.
The use of these two different types of measurement methods resulted in one of the surprising findings from IALS (Sticht 2005). The number of adults thought to be "at risk" for various factors such as low employment, dependency upon welfare, poor health care, lack of civic participation and so forth, due to low literacy in each nation, was much higher when the performance scales were used than when the self-assessment scale was used. For instance, in Canada, on the document scale, 18.2 per cent of adults were assigned to Level 1, the lowest level of literacy, based on their performance task results (OECD 1995: 57). This suggested that some 3.3 million of Canada's 18.5 million adults aged 16 through 65 were "at risk" because of low literacy. However, on the self-assessment scale of how well they read in daily life, only around 5 per cent of Canadian adults, fewer than 1 million, rated their skills in reading for daily life or at work as "poor". Of the 3.3 million adults in Level 1, the lowest level of literacy on the document scale, 21.9 per cent thought they had excellent reading skills, 26.5 per cent thought they had good reading skills, and 23.9 per cent thought they had moderate reading skills (OECD 1995: 192). Similar discrepancies between the IALS performance tests and the self-assessed reading abilities were found for other nations, for the two other literacy scales and for self-assessments of writing and numeracy skills. What might be the explanation?
• It is easy to conclude that the self-assessments were not valid and that respondents overestimated their own literacy skills. This possibility cannot be ruled out.
• A further and highly relevant possibility is that the self-assessment questions measured people's skills in relation to the demands for such skills in their daily lives and at work, i.e. they did not attempt to measure the actual level of skills, but rather the actual level of skills compared to the level required in the job and daily life of the respondent, which is something quite different.
• Thirdly, it is possible that the requirements defined in NALS and IALS for an individual to be placed at a certain proficiency level were problematic. As mentioned, NALS and IALS defined a criterion that respondents should have an 80 per cent probability of getting the average item at a given literacy level correct to be considered proficient at that level of skill. In other large-scale skills assessments, the criterion has been between 50 and 65 per cent.
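The consequence of the response-probability criterion mentioned in the last point can be illustrated with a simple one-parameter (Rasch) calculation; this is a deliberately simplified sketch and not the scaling model actually used in NALS or IALS:

```python
from math import log

# Under a Rasch model, the probability of a correct answer is
# 1 / (1 + exp(-(theta - b))), so the ability needed to answer an item of
# difficulty b correctly with probability p is  theta = b + ln(p / (1 - p)).
b = 0.0  # difficulty of the 'average' item at some level (hypothetical value)
for p in (0.50, 0.65, 0.80):
    theta_needed = b + log(p / (1 - p))
    print(f"RP{int(p * 100)} criterion: ability needed = {theta_needed:+.2f} logits")
```

With an 80 per cent criterion the bar lies roughly 1.4 logits above the item difficulty, against about 0.6 logits for a 65 per cent criterion and 0 for a 50 per cent criterion, so a stricter convention mechanically places more respondents below a given level than the conventions used in other assessments.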
Statistical Relations between Test Scores and Self-Assessment Scores
Various skills tests have also included self-assessments of skills. Tables 2, 3 and 4 below contain the results of a comparison of self-assessment results with test results in the 2003 UK Skills for Life Survey (DfES 2003). Respondents were asked to assess how good they felt they were at reading, writing and working with numbers.15 These questions were asked before the respondents attempted the literacy and numeracy tests, so the test experience did not affect their responses.
15 The question for reading: "How good are you at reading English when you need to in daily life? For example: reading newspapers and magazines or instructions for medicine or recipes?" The question for writing: "How good are you at writing in English when you need to in daily life? For example: writing letters or notes or filling in official forms?" The question for numeracy: "And how good are you at working with numbers when you need to in everyday life? For example working out your wages or benefits, or checking bills and statements?"
UK Skills for Life Survey
The tests of the UK Skills for Life Survey were developed so as to directly reflect the National Qualifications Framework, which defines a number of entry levels and levels. Some test items were taken over from existing tests aimed at Level 1 and Level 2. For Entry Level 2 and below, new test items were developed.

Table 2: Reading Self-Assessment by Literacy, UK Skills for Life Survey 2003 (column percentages)
|                     | Entry level 1 or below (n=266) | Entry level 2 (n=176) | Entry level 3 (n=880) | Level 1 (n=3138) | Level 2 or above (n=3413) | Total (n=7874) |
| Very good           | 20  | 26 | 50 | 70  | 83  | 71  |
| Fairly good         | 34  | 55 | 42 | 28  | 16  | 25  |
| Below average       | 16  | 12 | 6  | 2   | 1   | 3   |
| Poor                | 14  | 6  | 1  | *   | *   | 1   |
| Cannot Read English | 16  | 0  | 0  | *   | *   | 1   |
| Total (%)           | 100 | 99 | 99 | 100 | 100 | 101 |
Table 3: Writing Self-Assessment by Literacy, UK Skills for Life Survey 2003 (column percentages)
|               | Entry level 1 or below (n=266) | Entry level 2 (n=176) | Entry level 3 (n=880) | Level 1 (n=3138) | Level 2 or above (n=3413) | Total (n=7874) |
| Very good     | 16  | 18 | 37  | 55  | 73  | 59  |
| Fairly good   | 30  | 50 | 46  | 40  | 25  | 34  |
| Below average | 17  | 20 | 12  | 4   | 2   | 5   |
| Poor          | 20  | 11 | 5   | 1   | 1   | 2   |
| Cannot Write  | 17  | 0  | 0   | *   | *   | 1   |
| Total (%)     | 100 | 99 | 100 | 100 | 101 | 101 |
Table 4: Numeric Self-Assessment by Numeracy Level, UK Skills for Life Survey 2003 (column percentages)
|               | Entry level 1 or below (n=457) | Entry level 2 (n=1370) | Entry level 3 (n=2071) | Level 1 (n=2209) | Level 2 or above (n=1934) | Total (n=8040) |
| Very good     | 15 | 28  | 40 | 56  | 73  | 49 |
| Fairly good   | 52 | 54  | 51 | 41  | 26  | 43 |
| Below average | 19 | 13  | 6  | 2   | 1   | 5  |
| Poor          | 13 | 5   | 2  | 1   | *   | 2  |
| Total (%)     | 99 | 100 | 99 | 100 | 100 | 99 |
As it appears from the tables, the test results and the self-assessment results are correlated. The share of respondents assessing themselves to be very good at reading, writing or working with numbers rises from the lowest level (Entry Level 1 or below) to the highest level (Level 2 or above). Correspondingly, the share of respondents declaring that they have below average or poor skills declines when moving from respondents at Entry Level 1 or below to Level 2 or above. However, the correlation is far from perfect. In particular, there are large discrepancies between self-assessed competencies and test results at the lower end of the test scale. 54 per cent of the respondents placed at the lowest level by the test results assess themselves to be very good or fairly good at reading. For writing and numeracy skills, the corresponding figures are 46 per cent and 67 per cent.
The discrepancy may in part result from the fact that the self-assessment questions do not seek to measure skills levels against an external yardstick but against a subjective yardstick of "the needs in daily life". So in effect, the test and the self-assessment items do not measure the same construct. The discrepancy may also in part be due to the measurement errors mentioned in section 3.2.2 above. The social desirability of certain responses as well as self-esteem factors are likely to have been important, and can explain the fact that the discrepancy is much larger at the lower ends of the skills levels than at the higher levels. The questions were not given a form that would minimise demand effects. In addition, the anchors in the self-assessment items are inadequate, which most likely has contributed to measurement errors. Finally, a part of the discrepancy may result from validity problems in the test items that were used in connection with the survey.16 Since the discrepancy between test results and self-assessments is much larger at the lowest proficiency levels, this argument presupposes that there were particular validity problems with test items for the lower levels.
Data from PISA
PISA 2000 and 2003 also contained a number of self-assessment items. Thus, in PISA 2003, students' "self-concept" was an index variable constructed out of five different self-assessment questions concerning students' relationship with mathematics. The relation between the value of the index variable, which can range from -1 to +1, and mathematics performance scores is displayed in Table 6. As can be seen, there is a positive relation between self-concept and performance: the stronger the self-confidence of the student, the higher the performance score. However, the index variable still explains only a little more than 10 per cent of the overall variance in student performance scores on average. Moreover, both the average student score on the self-concept variable and the explanatory power of the variable vary considerably between countries. The self-assessment scores of Japanese and Korean students are, for instance, considerably lower than for other countries. Yet the students from these two countries were among the best test performers. One possible methodological explanation is anchoring problems, as students from East Asia may adopt a different scale than students from other countries and cultures, cf. King et al. (2004).17
16 However, the predictive validity of test data from the 1996 NCDS survey, which is to a large extent based on the same test methodology as the BCS70 2004 survey, is good: low test scores in literacy and numeracy are correlated with formal qualifications, employment status and several other relevant outcome variables (Bynner and Parsons 1997).
17 Differences in student self-assessments may also reflect real differences in educational systems: the Japanese and Korean school systems are extremely competitive and test-based compared to Western systems (DTI 2005: 133-134; Kim et al. 2004). "High pressure" school systems may have detrimental effects on self-confidence, even if learning results, as measured by tests, are good.
Table 6: Self-concept and Performance in Mathematics, PISA 2003
| Country        | Mean score (index) | Change in the score per unit of the index | Explained variance |
| United States  | 0,25  | 35,1 | 14,6 |
| Denmark        | 0,24  | 46,5 | 27,6 |
| Germany        | 0,15  | 22,7 | 7,1  |
| Sweden         | 0,13  | 47,0 | 24,4 |
| Greece         | 0,11  | 42,6 | 16,6 |
| Austria        | 0,07  | 25,7 | 8,9  |
| Iceland        | 0,03  | 39,7 | 26,4 |
| Poland         | 0,03  | 46,0 | 21,6 |
| Turkey         | 0,02  | 34,8 | 11,0 |
| Finland        | 0,01  | 45,5 | 33,0 |
| Italy          | 0,00  | 25,3 | 7,1  |
| Netherlands    | 0,00  | 22,2 | 6,1  |
| OECD average   | 0,00  | 32,4 | 10,8 |
| EU average     | -0,01 | 33,8 | 12,2 |
| Belgium        | -0,03 | 23,3 | 4,8  |
| Ireland        | -0,03 | 34,4 | 14,1 |
| Slovakia       | -0,05 | 44,5 | 16,1 |
| Czech Republic | -0,09 | 39,8 | 15,8 |
| Latvia         | -0,11 | 44,6 | 16,7 |
| Hungary        | -0,15 | 28,4 | 6,6  |
| France         | -0,17 | 28,3 | 10,3 |
| Norway         | -0,18 | 46,6 | 31,6 |
| Portugal       | -0,18 | 36,8 | 15,4 |
| Spain          | -0,19 | 31,9 | 13,2 |
| HK China       | -0,26 | 38,4 | 12,1 |
| Korea          | -0,35 | 47,3 | 21,4 |
| Japan          | -0,53 | 21,2 | 4,1  |
Source: OECD PISA 2003 data set.
1970 British Cohort Study
The combination of tests and self-assessments has also been put to use in the 1970 British Cohort Study (BCS70), which follows all people born in a particular week in 1970. The 2004 survey included tests of literacy and numeracy as well as self-assessments of these domains (Parsons and Bynner 2005). Test items were developed so as to reflect the different levels of the National Qualifications Framework. The results are illustrated in Figures 3 and 4, where the share of respondents reporting difficulties is the share of respondents answering "yes" to at least one of the questions concerning reading, writing and numeracy, respectively.18 Again, there is a clear correlation between the test results, according to which respondents have been placed on different levels, and the self-assessments: the share of respondents who report difficulties in reading and writing decreases as the test results increase.
Figure 3. Share of Respondents Reporting Difficulties and Literacy Levels (percentage reporting difficulties in reading and writing, by literacy level: Entry level 2, Entry level 3, Level 1, Level 2 and Total). Source: UK 1970 Cohort Study, 2004 Survey, reported in Bynner and Parsons (2005: 56).
However, although the two are correlated, large proportions of those whose test performance is very poor do not acknowledge any difficulty. In reading, only 26 per cent of respondents at the lowest test level (Entry Level 2) acknowledge any difficulties; in writing the same figure was 49 per cent. As for numeracy skills, only 29 per cent of men who had performed at the test level Entry Level 2 acknowledged any difficulties. For women, the corresponding figure was 52 per cent. Similarly, though to a lesser extent, some of those who acknowledge a difficulty have average or better scores in the tests. While it cannot entirely be ruled out that validity problems in connection with test items and the definitions of requirements for placing respondents at the different levels play a role for these results, a more likely explanation is measurement errors following from demand effects in relation to the self-assessment questions. This would account for the fact that discrepancies between test scores and self-assessments are much higher at the lower end of the test scale than at the higher end.
18 The reading self-assessment questions in the 2004 survey were: 1) Can you read and understand what is written in a magazine or newspaper? [IF YES] Can you read this easily, or is it with difficulty? 2) If you have to, can you usually read and understand any paperwork or forms you would have to deal with in a job? [IF YES] Can you read this easily, or is it with difficulty? 3) If you have to, can you read aloud to a child from a children's storybook? [IF YES] Can you read this easily, or is it with difficulty? The writing self-assessment questions were: 1) If you need to, can you write a letter to a friend to thank them for a gift or invite them to visit? [IF YES] Can you read this easily, or is it with difficulty? 2) When you try to write something, what is it you find difficult? Do you find it difficult to … a) spell words correctly … b) make your handwriting easy to read [yes / no] … c) put down in words what you want to say (never try to write anything) [yes / no]. The self-assessment questions for maths, numbers and arithmetic were: 1) When you buy things in shops with a five or ten pound note, can you usually tell if you are given the right change? [IF YES] Can you read this easily, or is it with difficulty? 2) What is it you find difficult with numbers and simple arithmetic? Do you find it difficult to … a) recognise and understand numbers when you see them, b) add up, c) take away, d) multiply, e) divide. The reporting scale was yes/no in all cases. All questions were of a simple 'yes' or 'no' format (Parsons and Bynner 2005: 22-25).
Figure 4. Share of Respondents Reporting Difficulties and Numeracy Levels (percentage reporting difficulties, for men and women, by numeracy level: Entry level 2, Entry level 3, Level 1, Level 2). Source: UK 1970 Cohort Study, 2004 Survey, reported in Bynner and Parsons (2005: 56).
Based on these and a number of earlier results from the 1970 British Cohort Study and the National Child Development Study (NCDS), which follows a cohort of people born in 1958, Bynner and Parsons (2005) conclude that self-appraisal is seemingly not necessarily grounded in objective evidence of performance but has more to do with self-understanding and identity.
3.3.2. Language Skills: Experiences with Testing and Self-Reports
Data from the DIALANG project sheds light on the relation between test results and self-assessments in the domain of language skills. DIALANG is a web-based diagnostic assessment system intended for language learners who want to obtain diagnostic information about their language proficiency. Foreign language diagnosis can be carried out in 14 different European languages. The system is aimed at adults who want to know their level of language proficiency and who want to get feedback on the strengths and weaknesses of their proficiency. It is not, therefore, developed for the purpose of carrying out large-scale skills assessments of entire populations for policy development purposes. Yet, since the system contains both a self-assessment module and a number of test modules, statistics from its database are highly relevant. In DIALANG the self-assessment module is used – in connection with a placement test module – to assign users to tests at one of three levels of difficulty. The users' self-assessment results are also compared with their eventual test scores, in order to give users feedback on the correspondence between the two (Alderson 2005: 97-98).
Tables 7, 8 and 9 display the correlations between overall self-assessed foreign language ability, based on self-assessment statements, and test scores. The data is from the pilot study of the system, and concerns the self-assessed proficiency and test scores in English as a foreign language. Overall self-assessments were based on 6 overall self-assessment statements for 5 different language skills. Detailed self-assessments were based on 35 statements for Reading, 33 statements for Writing and 43 statements for Listening (Alderson 2005: 98-99).19

Table 7. Correlation between Overall Self-Assessment and Calibrated Ability on Skill Tests
|            | Reading | Writing | Listening | Grammar | Vocabulary |
| Overall SA | 0,544   | 0,580   | 0,474     | 0,522   | 0,554      |
| N          | 718     | 735     | 606       | 1084    | 975        |

Table 8. Correlation between Detailed Self-Assessment IRT 'Scores' and Calibrated Skills Test Scores
|                                | Reading | Writing | Listening | Grammar                              | Vocabulary                             |
| Detailed SA parameter by skill | 0,487   | 0,550   | 0,495     | Reading SA: 0,504; Writing SA: 0,439 | Listening SA: 0,548; Writing SA: 0,595 |

Table 9. Correlations between Self-Assessment and Calibrated Ability on Skill Tests
| Mother tongue | Reading | Writing | Listening |
| Danish        | 0,605   | 0,565   | 0,461     |
| N             | 37      | 42      | 36        |
| Dutch         | 0,375   | 0,338   | 0,29      |
| N             | 83      | 104     | 59        |
| Finnish       | 0,652   | 0,635   | 0,434     |
| N             | 128     | 121     | 98        |
| French        | 0,606   | 0,681   | 0,646     |
| N             | 56      | 60      | 44        |
| German        | 0,432   | 0,532   | 0,305     |
| N             | 149     | 151     | 148       |
| Icelandic     | 0,568   | 0,505   | NS        |
| N             | 17      | 33      | 21        |
| Italian       | NS      | 0,836   | 0,822     |
| N             | 8       | 10      | 9         |
| Norwegian     | 0,497   | 0,58    | NS        |
| N             | 34      | 36      | 31        |
| Portuguese    | 0,801   | NS      | 0,776     |
| N             | 15      | 11      | 14        |
| Spanish       | 0,468   | 0,617   | 0,622     |
| N             | 61      | 41      | 34        |
Source: Alderson (2005). The correlations are between self-assessment scores that are recoded to Common European Framework of Reference for Languages (CEFR) levels and calibrated ability on skill tests.

19 For details, see Alderson (2005: 106-109, 111-113).
Correlations between overall self-assessment and test scores are generally close to 0,5, irrespective of the specific language skill, and irrespective of whether the overall self-assessment statements or the detailed statements are used. Table 9 reports the results broken down by the mother tongue of the user. There are significant differences in correlations, with the correlation between self-assessment results and test scores being highest for Portuguese users (correlation coefficients around 0,8) and lowest for Dutch users (correlations around 0,35). This evidence illustrates that test scores and self-assessment scores are positively correlated, also in the domain of foreign language skills. However, the data also highlight that this correlation is far from perfect. The evidence is compatible with the thesis that respondents tend to overestimate their own skills in self-assessments.
3.3.3. ICT-Skills: Experiences with Testing and Self-Reports
The Status of Test Development in Relation to ICT Skills
The testing of ICT skills has been high on the agenda in recent years, both in relation to student assessments, such as PISA, and in relation to adult skills assessment. Since 2003, development work has been carried out to construct a testing framework, and pilot studies have been implemented to assess the feasibility of testing ICT skills in an international context (Lennon et al. 2003). The US Educational Testing Service (ETS) has played a central role in this work. The feasibility study for the PISA ICT literacy assessment included the development of an overall definition of ICT literacy20 as well as the identification of the context areas and processes that are seen as critical components of ICT literacy. The panel behind the study recommended the use of simulated applications and environments rather than existing software. This decision was driven by a desire to ensure that all students had a consistent set of tools and that no one would be disadvantaged by a lack of prior knowledge about how a particular application functions. Subsequently, a number of test items were developed and tested in three different national settings (the United States, Australia and Japan). The study concluded that while many questions remain to be studied, an ICT literacy test is both feasible and worth doing. In January 2005, ETS launched the first version of its ICT literacy test. A revised version was released in early 2006. The tests are intended for educational purposes, in line with other types of educational testing and assessments.21 In addition, the ETS ICT literacy test is currently foreseen to become a component in the OECD adult skills assessment initiative PIAAC.
20 "ICT literacy is the interest, attitude, and ability of individuals to appropriately use digital technology and communication tools to access, manage, integrate, and evaluate information, construct new knowledge, and communicate with others in order to participate effectively in society" (Lennon et al. 2003: 8).
21 For details, see http://www.ets.org where an online demo of the ICT literacy test is available.
Testing ICT Skills in the UK Skills for Life Survey
Since the development of ICT skills tests for large-scale survey purposes is work in progress, limited data is available which can shed light on the statistical relations between self-assessments or self-reports of ICT skills and direct tests of ICT skills. However, in addition to tests for literacy and numeracy, the UK Skills for Life Survey included an ICT test module.
The ICT skills interview comprised two separate assessments. The first test had a similar format to the literacy and numeracy tests in the survey: respondents read questions from the screen and were given a choice of four answers. This test – termed the 'Awareness Assessment' – assessed general awareness of information and communications technology and its associated terminology. The second test was very different. The interviewer handed the laptop computer to the respondent, who then attempted up to 22 practical Windows-based tasks. As mentioned previously, the UK Skills for Life Survey was developed to directly reflect the Levels in the UK National Qualifications Framework. All 22 Windows tasks were set at Level 1, with the assumption that respondents who carried out 11 or more tasks correctly would be classified at Level 1, and anybody completing fewer tasks would be classified at 'Entry level or below'. This test was referred to as the 'Practical Assessment'. Any respondent who claimed to have never used a computer before (15 per cent of the sample) was excused this test (DfES 2003: 145-146). Item Response Theory was not employed in determining the difficulty of test items.
Statistical Relations between Test Scores and Self-Assessment Scores
The UK Skills for Life Survey also contained a self-assessment module for ICT skills.22 Table 10 illustrates the statistical relation between responses to the self-assessment question and the results of the ICT test. As it appears, there is a rather strong statistical relation between test results and self-assessment of ICT skills. Among those who assess their own ICT skills to be very good, 89 per cent are placed at Level 2 or above in the Awareness test, and 91 per cent are placed at Level 1 or above in the Practical test. Among those who assess their own ICT skills to be poor, 15 per cent are placed at Level 2 or above in the Awareness test and just 7 per cent at Level 1 or above in the Practical test. Still, the correlation is not perfect. 11 per cent of respondents who think their ICT skills are very good were placed at Level 1 or below in the Awareness test, and 54 per cent of the respondents who thought they had poor ICT skills actually test at Level 1 or above in the Awareness test. 9 per cent of the respondents who declared themselves to have very good ICT skills tested at Entry level or below in the Practical test. The results of the ICT skills tests must be interpreted cautiously, however. The testing methodology involves multiple choice questions, which has certain methodological drawbacks. Importantly, response rates in the ICT test module were very low, at only 31 per cent of eligible respondents.
22 The question concerning ICT skills: "How good are you at using computers? For example: word processing, using the internet and sending emails?" The response categories were very good, fairly good, below average and poor (DfES 2003: 263).
Table 10. ICT Skills – Tests vs. Self-Assessed Results, UK Skills for Life Survey (column percentages)
|                                                        | Very good (n=629) | Fairly good (n=1946) | Below average (n=764) | Poor (n=379) | Total (n=4464) |
| Awareness test                                         |    |    |    |    |    |
|   Entry level or below                                 | 3  | 8  | 29 | 46 | 25 |
|   Level 1                                              | 8  | 26 | 38 | 39 | 25 |
|   Level 2 or above                                     | 89 | 66 | 33 | 15 | 50 |
| Practical test                                         |    |    |    |    |    |
|   Entry level or below                                 | 9  | 36 | 76 | 93 | 53 |
|   Level 1 or above                                     | 91 | 64 | 24 | 7  | 47 |
| Level 2 or above Awareness & Level 1 or above Practical | 85 | 53 | 15 | 5  | 39 |
Source: DfES (2003: 153)
3.3.4. Advantages and Disadvantages of Self-Reporting and Testing
There are advantages and disadvantages of both testing and self-reports in adult skills assessment, as it appears from the results in the sections above. Table 11 contains an overview interpretation of the results.

Table 11. An Overview Assessment of Relative Strengths and Weaknesses of Tests and Self-Reports in Adult Skills Assessments
|                              | Tests | Self-Reports |
| Validity                     | +     |              |
| Reliability                  | +     | 0            |
| Skills range                 |       | ++           |
| Development costs            |       | +            |
| Implementation costs         |       | 0            |
| Implementation requirements  |       | +            |
| Political and ethical risks  |       | 0            |
++ Very strong   + Strong   0 Neither strong nor weak   - Weak   -- Very weak

The table should be seen as a heuristic device only, as it is not possible, based on existing information, to present advantages and disadvantages in precise quantitative terms. Moreover, there are evidently both good and bad tests as well as good and bad practices of assessing adult skills using self-reports. The assessments in the table are based on the practices that qualify as "good practice" in connection with testing and self-reports, respectively.
A number of the issues underlying the judgements in Table 11 concern development and implementation (development costs, implementation costs, implementation requirements and political and ethical risks). Other issues specifically concern validity and reliability and the range of skills that can (potentially) be assessed using one or the other approach.
Validity and Reliability
The question of validity and reliability is crucial. If an instrument does not measure what it is intended to measure, then all other issues are basically irrelevant. It may be very easy to administer a questionnaire-based self-assessment of a particular skill domain, but if the resulting data is highly invalid, these advantages become irrelevant. The validity of both tests and self-reports can be questioned, and has indeed been so from various perspectives. Tests are not "objective" if by this concept it is meant that their results reflect the "real" skills of test-takers. The process of moving from theoretical definitions and concepts to operational test items necessarily involves an element of judgement and assessment, for which reason the construct validity of tests can always be contested.
• However, the validity of skills tests is potentially higher than the validity of assessments using self-reports. Self-reports can be refined by using state-of-the-art anchoring methods and by making full use of research on survey methodology, for instance as regards the precision and wording of questions, but self-reports can never escape the fundamental fact that only the respondent knows whether the reported response is correct or not, and it does not seem possible to eliminate entirely social desirability effects in self-report items.
• Moreover, on the basis of the tests and self-reports/self-assessments that we have reviewed in connection with the present study, the best tests – such as those developed in connection with IALS, ALLS and PISA – seem to yield results that are more valid and reliable than the reviewed self-reports/self-assessments. This finding is, however, affected by the fact that the self-reporting items that we have reviewed do not live up to methodological best practice in the field, just as some self-assessment items measure self-assessed skills relative to requirements rather than in relation to an external anchor or yardstick.
• Third, the statistical relation between self-reported/self-assessed skills and test results is clearly there, but it is far from perfect, and there is a clear tendency for respondents at the lower end of the skills range to rate their own skill levels higher than what is suggested by test results. Self-assessments therefore do not seem well suited to generating valid information on skills among segments of the population with low skills levels. Self-reports using anchors and items that minimise social desirability effects, such as the job-requirements approach, may be better suited to this end, but can measure only skills among the employed part of the population, and are subject to other validity problems, cf. below.
• Fourth, it is likely that the question of intercultural validity is more serious for self-reports or self-assessments than for well-developed and well-administered tests. Anchoring by vignette is, however, a possible, even if partial, solution to intercultural validity problems, but will add significantly to development costs as the method needs to be tested in connection with skills assessments.
Skills Range
The range of skills that can potentially be assessed using self-reports is considerably broader than what is presently possible using direct tests. Applying the job-requirements approach, the British Skills Surveys, as mentioned, collected data on a range of generic workplace skills. Moreover, self-reports using the job-requirements approach are relevant for shedding light on the distribution of skills, and there may be a policy interest in knowledge about the distribution of skills across the population, in certain segments of the labour market, etc., in addition to knowledge about the absolute level of skills. The job-requirements approach has other disadvantages, however, as the skills that are required in the job of the respondent do not necessarily reflect the real skill level of the individual, and as only the skills of the working population can be measured.
Development and Implementation Costs
Testing is more costly than self-assessments and self-reports. The development of new test items with a high degree of validity requires considerable development work. New test items must be pilot tested and calibrated, and the implementation costs of administering large-scale skills assessment tests are also higher than is the case for self-reports. Adult skills assessments using tests must necessarily be administered as household surveys, with personal visits by interviewers, and more time is required for interviewing and testing than is the case with self-reports. The costs of developing and implementing large-scale tests multiply if tests must be translated and validated in several languages. Self-assessments and self-reports are considerably less expensive by comparison. Self-reports are, however, very sensitive to the correct translation of concepts and anchors, which will add to development and implementation costs.
Implementation Requirements
It is a key experience from previous large-scale international tests that clear and concise implementation procedures must be in place, that the participating states must accept and adhere to them, and that adherence must, as far as possible, be checked and validated. These requirements are not quite so demanding for self-reports or self-assessments, first of all because the data collection set-up is less complex for questionnaire-based surveys than for direct tests.
The Final Choice: Depends on Needs and Purposes
The final decision as regards the choice of assessment methodology hinges on the objectives and priorities of adult skills assessments.
Testing is the Superior Methodology
If the highest possible standards of validity and reliability of data are to be achieved, direct testing is undoubtedly the superior methodology – provided that sufficient resources are devoted to the development, testing and implementation of state-of-the-art tests, or alternatively that well-tried testing methods and items are utilised.
However, at present, testing can only be applied to a rather narrow range of skills (literacy, numeracy, ICT skills and some foreign language skills). In other domains, considerable development work will be required, and in several domains it will not be possible to develop valid and reliable tests at all. Moreover, there are some rather formidable practical obstacles that must be addressed if direct testing is to be administered in as many as four broad skills domains simultaneously (cf. DTI et al. 2004).
The Potential Use of Self-Reports and Self-Assessments
For which purposes could it be feasible to make use of self-reports or self-assessments in adult skills assessment? There are several possibilities.
• Self-reports using the job-requirements approach and external anchoring methods. Methodologically, self-reports using the job-requirements approach and ex-ante expert anchoring or inter-subject anchoring of scales are superior to self-assessments, where a respondent is asked to assess him or herself. It is feasible to develop and implement this approach so that it covers a number of different skills domains and covers all the EU member states. Development and implementation costs will be considerable, however, and the results must be presented and interpreted cautiously. The approach will yield information on the types of skills required in different types of jobs across the EU, just as information can be provided on the relationship between the types and levels of skills on the one hand and other variables such as wages and education on the other. However, it will be problematic to infer from skill requirements in jobs to the skill levels of respondents, as there may be over- or underutilization of skills in the labour market. Over- or underutilization of skills will be more important for some skills than others, and it would be possible and useful to develop indicators of skills mismatch in each domain in order to facilitate drawing inferences about personal competences from the job-requirements data (a purely hypothetical sketch of such an indicator follows this list). Another drawback is that the job-requirements approach does not yield information on groups that have been outside the labour market for long periods. For the section of the population in work, it would be possible to combine direct testing with elements of the job-requirements approach, and thereby broaden the range of skills assessed. To do that, a considerably reduced questionnaire would have to be developed and tested, drawing on O*NET and on the already shorter British skills surveys, with due regard to the issue of scale anchoring and the need to obtain international comparability. The OECD's PIAAC initiative includes plans for early development of an instrument to be used in conjunction with international skills tests.
• Self-assessments. Self-assessments, where respondents are asked directly to assess their own level of skills, are methodologically sounder if response scales make use of ex-ante expert anchoring or alternatively inter-subject anchoring. However, the social desirability effects of self-assessment items will remain considerable, and it cannot be recommended to make use of this method as a stand-alone approach. Self-assessments can, however, be useful in connection with direct tests, as they can provide information on respondents' self-perception. The latter has been found to be related to the motivation of respondents for engaging in learning activities, which is also useful information to gather for policy makers. If motivation becomes the objective, though, it would be better to collect information about motivation directly, with specially designed questions on these subjects.
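A purely hypothetical sketch of the kind of skills-mismatch indicator mentioned in the first bullet above might compare, domain by domain, the skill level a job requires (from job-requirements items) with the level the respondent demonstrates or reports; the function and scale below are invented for illustration only:

```python
def skills_mismatch(required_level: int, assessed_level: int) -> str:
    """Classify the match in one skill domain on a common ordinal scale
    (hypothetical): positive gaps indicate under-utilised skills, negative
    gaps indicate that the job requires more than the person can supply."""
    gap = assessed_level - required_level
    if gap > 0:
        return f"under-utilisation (+{gap})"
    if gap < 0:
        return f"skills gap ({gap})"
    return "matched"

print(skills_mismatch(required_level=2, assessed_level=4))   # under-utilisation (+2)
print(skills_mismatch(required_level=4, assessed_level=3))   # skills gap (-1)
```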
4. Computer Based Assessment
The theme of this section is the benefits and disadvantages of computer based assessment of skills and the feasibility of testing, using computer based assessment, the whole scope of skills recognised at the EU level as politically relevant.
4.1. Definitions and Developments
Various terms are used to describe the use of a computer for assessment purposes. These include: 1) Computer-Assisted Assessment or Computer-Aided Assessment (CAA), 2) e-assessment, 3) Computer-Mediated Assessment (CMA), 4) Computer-Based Assessment (CBA), 5) Computer Adaptive Testing (CAT) and 6) online assessment.
What is Computer Based Assessment?
Although these terms are commonly used interchangeably, they have distinct meanings.
Computer-Assisted/Mediated Assessment refers to any application of computers within the assessment process; the role of the computer may be extrinsic or intrinsic. It is, therefore, a synonym for e-assessment, which also describes a wide range of computer-related activities. Within this definition the computer often plays no part in the actual assessment of responses but merely facilitates the capture and transfer of responses between candidate and human assessor.
Computer-Based Assessment refers to assessment which is built around the use of a computer; the use of a computer is always intrinsic to this type of assessment. This can relate to the assessment of IT practical skills or, more commonly, the on-screen presentation of knowledge tests. The defining factor is that the computer marks or assesses the responses provided by candidates. Computer-based assessment may or may not be adaptive. In adaptive assessments, the task presented to the respondent is a function of previous responses. In this case, the most precise term is Computer Adaptive Testing (CAT). In order to reap the full benefits of the approach, CAT is most frequently developed on the basis of item response theory (IRT), but IRT is not a necessary precondition for adaptive tests.
Online assessment refers to assessment activity which requires the use of the internet. In reality, few high-stakes assessment sessions are actually conducted online in real time, but the transfer of data prior to and after the assessment session is conducted via the internet. There are many examples of practice and diagnostic tests being run in real time over the internet.
Figure 4.1 illustrates the relations between the different concepts and approaches. As the figure illustrates, Computer Adaptive Testing using Item Response Theory is a subset of Computer Adaptive Testing. Some, but not all, Computer Based Assessments are adaptive. Computer Based Assessment, in turn, constitutes a subset of CAA/CMA or e-assessment. Finally, all of these approaches may be administered online – making use of the internet – or in stand-alone environments.
In addition, Computer Assisted Telephone Interviewing (CATI) and Computer Assisted Personal Interviewing (CAPI) should be mentioned. CATI is a telephone surveying technique in which the interviewer follows a script provided by a software application. The software is able to customise the flow of the questionnaire based on the answers provided, as well as information already known about the participant. CAPI is similar to Computer Assisted Telephone Interviewing, except that the interview takes place in person instead of over the telephone. Neither CATI nor CAPI should be considered computer based assessment, however.
Thus, the following sections concentrate on computer based assessment, as defined above. The focus is on the benefits and disadvantages of skills assessment in which the use of computers is intrinsic to the assessment process. Moreover, particular attention is paid to adaptive computer based assessment, as the potential benefits of computer adaptive testing are significantly larger than is the case for computer based assessments using linear tests.
Figure 4.1. Types of Assessment Making Use of Computers (nested categories, from broadest to narrowest: Computer Assisted Assessment / Computer Mediated Assessment / e-assessment; Computer Based Assessment (CBA); Computer Adaptive Testing (CAT); CAT using IRT)
The Evolution of Computer Based Assessment
Computer based testing was initially developed in the education sector, with a view to realising the potentials of new technology for testing efficiency and objectivity. The need for more objectivity in language testing, for instance, gradually led pedagogues to the use of computers as precise measurement tools. The first language tests using computers were called computer assisted testing. Computers were used as word processors equipped with a dictionary and/or a thesaurus, and students were able to use sources of references via computer during their writing test. Computers were also used for fast computation of grades, assisting the testers in their calculation (Meunier 2002: 24; Larson and Madsen 1985).
Then, paper-and-pencil tests were digitalized and turned into electronic tests. Such tests, however, showed few real differences from conventional tests, except that they were administered through a non-standard medium. Yet computer-based delivery of tests has several advantages. For example, it allows for testing on demand, i.e. whenever and wherever an examinee is ready to take the test. Also, the power of modern PCs, as well as their ability to control multiple media, can be used to create innovative item formats and more realistic testing environments (Clausen-May 2005).
The Evolution of Computer Adaptive Testing
The idea of adapting the selection of items to the examinee is certainly not new. The idea of adaptive testing is as old as the practice of oral examinations. Good oral examiners have always known how to tailor their questions to their impression of the examinee's knowledge level. In the Binet-Simon (1905) intelligence test, the items were classified according to mental age, and the examiner was instructed to infer the mental age of the examinee from the earlier responses to the items and to adapt the selection of the subsequent items to his or her estimate until the correct age could be identified with sufficient certainty.
The more recent development of adaptive testing in connection with the use of computers – starting in the mid-1980s – takes the use of computers in testing a big step further than as a simple tool for delivering traditional paper-and-pencil tests and computing scores. The function of an adaptive test is to present test items to an examinee according to the correctness of his or her previous responses. If a student answers an item correctly, a more difficult item is presented; conversely, if an item is answered incorrectly, an easier item is given. In short, the test "adapts" to the examinee's level of ability. The computer's role is to evaluate the student's response, select an appropriate succeeding item and display it on the screen. The computer also notifies the examinee of the end of the test and of his or her level of performance (Larson 1989: 278).
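As a toy illustration of this adaptive logic (not a description of any operational CAT engine: the item bank, step sizes and the simple up/down ability update below are all invented, whereas real systems typically use maximum-likelihood or Bayesian estimation), a Rasch-based sketch might look as follows:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
item_difficulties = np.linspace(-3.0, 3.0, 61)   # a calibrated bank of 61 items
true_ability = 0.7                               # the ability the test tries to recover

def p_correct(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

estimate, step, administered = 0.0, 1.0, set()
for _ in range(20):
    # present the unused item whose difficulty is closest to the current estimate
    remaining = [i for i in range(len(item_difficulties)) if i not in administered]
    item = min(remaining, key=lambda i: abs(item_difficulties[i] - estimate))
    administered.add(item)
    correct = rng.random() < p_correct(true_ability, item_difficulties[item])
    estimate += step if correct else -step   # harder item after a success, easier after a failure
    step = max(step * 0.7, 0.1)              # successive adjustments shrink as the test converges

print(f"final ability estimate: {estimate:.2f} (true value {true_ability})")
```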
Computer Adaptive Testing Using Item Response Theory
The development of item response theory (IRT) in the 1960s has provided a sound psychometric footing for CAT. The key characteristics of item response theory are best understood by comparing IRT with classical test theory (CTT). The ultimate aim of both classical test theory and item response theory is to test individuals. Hence, their primary interest is focused on establishing the position of the individual along some latent dimension, often called ability.
Classical Test Theory
The usual approach taken to measure ability is to develop a test consisting of a number of test items. Each of these items measures some facet of the particular ability of interest. When the item response is determined to be correct, the examinee receives a score of one; an incorrect answer receives a score of zero. The examinee's raw test score is the sum of the scores received on the items in the test. Classical Test Theory utilises two main statistics – Facility and Discrimination. Facility is essentially a measure of the difficulty of an item, arrived at by dividing the mean mark obtained by a sample of candidates by the maximum mark available. As a whole, a test should aim to have an overall facility of around 0.5; however, it is acceptable for individual items to have higher or lower facility (ranging from 0.2 to 0.8). Discrimination measures how performance on one item correlates with performance on the test as a whole. There should always be some correlation between item and test performance, and it is expected that discrimination will fall in a range between 0.2 and 1.0 (a worked example of both statistics is given below).
The main problem with Classical Test Theory is that the conclusions drawn depend very much on the sample used to collect the information. There is an inter-dependence of item and candidate. Thus, the overall facility of a test depends on the performance of the sample who has undertaken the test, and whether a test item is categorised as difficult or not quite so difficult similarly depends on the performance of the sample of test-takers in relation to this particular test item.
Item Response Theory
In contrast, the most distinctive feature of IRT is that it adopts explicit models for the probability of each possible response to a test – so its alternative name, probabilistic test theory, may be the more precise one. IRT derives the probability of each response as a function of the latent trait (the "ability") and some item parameters. The same model is then used to obtain the likelihood of ability as a function of the actually observed responses and, again, the item parameters. The ability value that has the highest likelihood becomes the ability estimate. For all this to work, the IRT model has to be (more or less) true, which among other things means that the latent trait or ability being measured must have only one dimension (the assumption of unidimensionality), and the item parameters must be known. Any attempt at testing is therefore preceded by a calibration study: the items are given to a sufficient number of test persons whose responses are used to estimate the item parameters. When the model is appropriate and the estimates of the item parameters are reasonably accurate, IRT promises that the testing will have certain attractive properties.
• Most importantly, we can ask different examinees different items, and yet obtain comparable estimates of ability. As a result, tests can be tailored to the needs of the individual while still providing objective measurement (Partchev 2004; Baker 2001).
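As a small worked example of the two classical test theory statistics described above (the response matrix is invented; real calculations would of course use a much larger sample), facility and discrimination can be computed as follows:

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect (hypothetical data).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
])

# Facility: mean mark divided by the maximum mark (the maximum is 1 per item here).
facility = responses.mean(axis=0)

# Discrimination: correlation between each item and the total test score.
total_scores = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, item], total_scores)[0, 1]
    for item in range(responses.shape[1])
])

print("facility:      ", facility.round(2))
print("discrimination:", discrimination.round(2))
```

The same matrix collected from a weaker or stronger sample of candidates would yield different facility values, which is exactly the sample dependence noted above.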
The Item Response Function
IRT can be used to create a unique plot for each test item (the item response function). The function is a plot of the probability that the item will be answered correctly against ability. In the figure below, the x-axis represents student ability and the y-axis represents the probability of a correct response to one test item. The s-shaped curve, then, shows the probabilities of a correct response for students at different ability (theta) levels.
Figure 4.2. The Item Response Function in Item Response Theory
The shape of the function reflects the influence of three factors (formalised in the expression below):
• Increasing the difficulty of an item causes the curve to shift right, as candidates need to be more able to have the same chance of passing.
• Increasing the discrimination of an item causes the gradient of the curve to increase: candidates below a given ability are less likely to answer correctly, whilst candidates above a given ability are more likely to answer correctly.
• Increasing the chance (guessing) parameter raises the baseline of the curve.23
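One standard way of formalising these three factors is the three-parameter logistic (3PL) item response function, reproduced here purely for illustration (it is not necessarily the exact model used in any of the programmes discussed in this report). Here $b$ denotes the item's difficulty, $a$ its discrimination, $c$ the chance (guessing) parameter and $\theta$ the examinee's ability:

```latex
P(\text{correct} \mid \theta) \;=\; c \;+\; (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}
```

Setting $c = 0$ and $a = 1$ gives the one-parameter (Rasch) model referred to in footnote 23.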
Creating and Expanding Item Banks with IRT
Using IRT models thus allows items to be characterised and ranked by their difficulty, and this can be exploited when generating item banks of equivalent questions. These linked or coordinated questions, which all serve to define a given variable, e.g. reading comprehension, provide a pool of items from which alternate test forms can be generated without compromising accuracy. Using an anchoring procedure, computer software makes it possible to calibrate a new set of items to the difficulty level of those in the bank. This is done by including in the new test a small number of selected items (about 10 to 20) from the initial, calibrated test administered to a different group of examinees. While a few of the new items might not fit and have to be discarded, the potential for expanding the item bank is virtually unlimited. It may even become feasible to share items among institutions.
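A minimal numerical sketch of one simple variant of such an anchoring procedure (mean/mean linking of item difficulties under a one-parameter model, with invented numbers) looks as follows:

```python
import numpy as np

# Difficulties of the anchor items as stored in the calibrated bank,
# and as re-estimated in the calibration run of the new test form (invented values).
anchors_in_bank = np.array([-0.8, -0.2, 0.1, 0.6, 1.2])
anchors_in_new_form = np.array([-0.5, 0.1, 0.4, 0.9, 1.5])

# Provisional difficulties of the genuinely new items, on the new form's scale.
new_items = np.array([-1.0, -0.3, 0.2, 0.8, 1.7])

# Mean/mean linking: shift the new form's scale so that the anchor items have
# the same average difficulty as they already have in the bank.
shift = anchors_in_bank.mean() - anchors_in_new_form.mean()
new_items_on_bank_scale = new_items + shift

print("scale shift:", round(shift, 2))
print("new items on the bank's scale:", new_items_on_bank_scale.round(2))
```

After the shift, the new items can be added to the bank and used interchangeably with the existing ones; more elaborate linking methods also adjust for differences in discrimination.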
Principles of Adaptivity at the Item Level
The item sequencing of computer adaptive tests is based on a continuum of difficulty, and items are calibrated according to their difficulty index. When the difficulty continuum has been established, the item bank is set, and the test is administered according to pre-established indexes of increment and decrement (Meunier 2002: 26). Operationally, this means that the following principles generally apply to adaptive testing.24
23 There are three different probabilistic models that can be applied to test item banks (cf. Meunier 2002: 25): the one-parameter model (Rasch 1960) is based exclusively on a difficulty continuum, i.e. on the ranking of test items according to their difficulty. The two-parameter model considers not only difficulty but also discrimination. The three-parameter model takes into account difficulty, discrimination and sources of measurement error, whether standard or random, such as guessing. Two- and three-parameter models need to rely on large population samples (2,000 or more) for an accurate estimation of the parameters, whereas the one-parameter model needs only 100 to 200 subjects.
24 For a simulation of a Computer Adaptive Test that can be tried online, see http://edres.org/scripts/cat/startcat.htm.
1) Each time a correct answer is obtained, the theta estimate (estimate of ability) is increased. Similarly, an incorrect answer leads to a decrease in estimated theta.
2) The differences between successive theta estimates decrease as the test proceeds, indicating that the test is converging on the examinee's theta level.
3) The standard error around the estimate of theta (ability) tends to decrease as the test proceeds, since additional item responses generally improve the estimation of theta.
4) As the test progresses, the examinee tends to alternate between correct and incorrect answers. This is the result of the convergence process that underlies CAT. The result, typically, is that each examinee will answer a set of questions on which he or she obtains 50% correct, even though each examinee will likely receive a set of questions of differing difficulty. In a sense, this characteristic of a CAT tends to equalize the "psychological environment" of the test across examinees of different trait levels. By contrast, in a conventional test the examinee who is high on the trait will answer most items correctly and the low-trait examinee will answer most of the items incorrectly.
4.1.1. The Growth of Computer Adaptive Testing
Whereas the key elements of Item Response Theory were developed during the 1960s (Birnbaum 1968), the further development and fine-tuning of the psychometric techniques needed to implement computer adaptive testing took several decades. The first computers were slow and did not allow for ability estimation in real time. For this reason early research was directed at finding approximations or alternative formats that could be implemented in a traditional paper-and-pencil environment. Examples include the "two-stage testing format" (Cronbach and Gleser 1965), Bayesian item selection with an approximation to the posterior distribution of the ability parameter (Owen 1969), the "up-and-down method of item selection" (Lord 1970), the "Robbins-Monro algorithm" (Lord 1971a), the "flexilevel test" (Lord 1971b), the "stradaptive test" (Weiss 1973) and "pyramidal adaptive testing" (Larkin and Weiss 1975).
Computer Adaptive Testing in the United States
With the advent of more powerful computers, application of CAT using Item Response Theory in large-scale, high-stakes testing programs became feasible (van der Linden and Glas 2000). The US Department of Defense, with its Armed Services Vocational Aptitude Battery (ASVAB), was a pioneer in this field. After a developmental phase, which began in 1979, the first CAT version of the ASVAB became operational in the mid-1980s.25 The ASVAB contains nine sections: General Science; Arithmetic Reasoning; Word Knowledge; Paragraph Comprehension; Mathematics Knowledge; Electronics Information; Auto & Shop; Mechanical Comprehension; and Assembling Objects.
25 An account of the development of the CAT-ASVAB is given in Sands, Waters and McBride (1997).
However, the migration from paper-and-pencil testing to CAT truly began when the US National Council of State Boards of Nursing launched a CAT version of its licensing exam (NCLEX/CAT), which was followed by a CAT version of the Graduate Record Examination (GRE), the test used as an entry exam for graduate (post-bachelor) university studies (cf. Mills and Steffen 2000). The GRE General Test measures critical thinking, analytical writing, verbal reasoning and quantitative reasoning skills that have been acquired over a long period of time and that are not related to any specific field of study. The GRE Subject Tests gauge undergraduate achievement in eight specific fields of study: Biochemistry, Cell and Molecular Biology; Biology; Chemistry; Computer Science; Literature in English; Mathematics; Physics; and Psychology. Ever since, many other large-scale testing programs have followed, for instance:
• Achievement tests, such as a) the Measures of Academic Progress (MAP) tests, a range of computer adaptive tests developed by the Northwestern Evaluation Association for the testing of school children's reading, mathematics and language usage,26 and b) MAPP (Measure of Academic Proficiency and Progress), developed by the Educational Testing Service (ETS).27 MAPP, which replaces the earlier Academic Profile test, is a measure of college-level reading, mathematics, writing and critical thinking in the context of the humanities, social sciences and natural sciences. There are a number of other examples.
• Placement and admission tests, such as the Test of English as a Foreign Language (TOEFL) and the GMAT (Graduate Management Admission Test). Other prominent and relevant examples: together with the US College Board, the Educational Testing Service has developed ACCUPLACER Online, a battery of placement tests for incoming college students delivered via the internet as adaptive tests. ACCUPLACER covers aspects of English literacy, of mathematics and of English as a second language, and the tests are designed to be used for low- or medium-stakes purposes.28 The Center for Applied Linguistics (CAL), a private non-profit organization, has developed and administers a computer adaptive test of English oral proficiency and literacy called Bestplus.29
• Special purpose tests. A large number of computer based tests are available for specific, often job-related, purposes, provided by private institutions. Some of these are computer adaptive.30
Today, the majority of large-scale testing programmes in the United States have been computerized, and an increasing number are computer adaptive.
– and Australia
Principles of adaptive testing have also been used in school tests in Victoria, Australia, under the auspices of the Victorian Curriculum and Assessment Authority (VCAA), which offers a range of tests both for formative and for summative purposes.
26 http://www.nwea.org/assessments.
27 The online version of MAPP has not yet been developed into an adaptive test.
28 http://www.collegeboard.com/highered/apr/accu/accu.html.
29 http://www.cal.org/bestplus.
30 For an overview, see http://www.assess.com/panmain.htm.
The VCAA's online assessment system provides a web application that teachers can use to download assessment resources on demand, as well as to deliver the Achievement Improvement Monitor (AIM) Online Statewide Tests from the VCAA for students at their school to sit. The VCAA Assessment Online system consists of: 1) On Demand Testing, 2) AIM Online and 3) VCE Online.31
• The "On Demand" Testing System has been developed to enable teachers to assess the strengths and weaknesses of student achievement through access to calibrated tests prepared by the VCAA. This program is available for students up to the 10th grade.
• The "AIM Online" assessment system is used by schools that participate in the AIM Online Statewide Testing program to download and deliver adaptive tests and writing tasks to students at their schools. At this stage, AIM Online is being used for the AIM Statewide tests at school years 3, 5 and 7. Adaptive tests are available for English and Mathematics. The tests are adaptive at the level of groups of items: student answers to an initial group of five short-answer items determine the level of difficulty of the next group, and so on throughout the test (see the sketch after this list). Each set of items conforms to a set pattern. For example, at higher CSF 5 Maths (this level refers to the Australian Curriculum and Standards Framework, to which the tests refer), each group of five questions has one Number, one Chance and Data, one Space, one Measurement and one Algebra question. All similar groups of English items have two questions on Writing conventions and three Reading comprehension questions (VCAA 2004).
• The VCE Online function of the Assessment Online system is still under development, but it will be used to deliver online assessment resources for the VCE programme for Years 11 and 12 in the future (VCE is the main secondary completion certificate in Victoria).
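As a minimal Python sketch of the group-level ("testlet") adaptivity used in AIM Online, the routine below adjusts the difficulty band after each group of five items. The five-item group structure and the three difficulty bands follow the description above; the specific thresholds for moving up or down a band, and the placeholder item groups, are illustrative assumptions rather than VCAA specifications.

# Group-level ("testlet") adaptive routing of the kind described for AIM Online.
# The thresholds for moving up or down a band are illustrative assumptions.

def next_band(current_band, correct_in_group, n_bands=3, up_at=4, down_at=1):
    """Return the difficulty band (0 = easiest) for the next group of five items."""
    if correct_in_group >= up_at:
        return min(current_band + 1, n_bands - 1)
    if correct_in_group <= down_at:
        return max(current_band - 1, 0)
    return current_band

def run_groups(item_groups, answer_fn, n_rounds=6, start_band=1):
    """item_groups[band][r] is the r-th five-item group at that band;
    answer_fn(item) returns True if the test taker answers correctly."""
    band, trace = start_band, []
    for r in range(n_rounds):
        group = item_groups[band][r]
        correct = sum(answer_fn(item) for item in group)
        trace.append((band, correct))
        band = next_band(band, correct)
    return trace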
Europe: Rapid Developments in Computer Based Language Testing
In the EU, the use of computerized adaptive testing is not quite so common in connection with large-scale testing programmes. However, adaptive approaches are increasingly being used in connection with language tests and are being implemented for educational purposes in more and more settings.

BULATS – The Business Language Testing Service
BULATS (the Business Language Testing Service) is a language assessment service specifically for the use of companies and organisations.32 It is a collaborative venture between University of Cambridge ESOL Examinations and three leading European language institutes: the Alliance Française, the Goethe-Institut and the Universidad de Salamanca. An adaptive computer based test is available, with test scores reported in relation to the Common European Framework of Reference for Languages. The computer based version of the BULATS test allows testing of listening and reading, and of knowledge of grammar and vocabulary.33 Writing and speaking are not (yet) covered.
31 http://www.aimonline.vic.edu.au. Demonstration tests are available at http://www.aimonline.vic.edu.au/visitor.asp.
32 http://www.bulats.org.
33 A demonstration version can be accessed at http://www.bulats.org/tests/computer_test.php.
DIALANG
The DIALANG project is a second prominent example of computer based language testing (Alderson 2005). It is an assessment system intended for language learners who want to obtain diagnostic information about their language proficiency. The system is thus aimed at adults who want to know their level of language proficiency and who want to get feedback on the strengths and weaknesses of their proficiency. DIALANG also provides learners with advice about how to improve their language skills. The system does not issue certificates.
DIALANG is presently delivered via the Internet free of charge, although a special software application must be downloaded and installed before users can use the system.34 The system can manage diagnostic tests in 14 different European languages, and has tests in five aspects of language and language use – Reading, Listening, (indirect) Writing, Grammar and Vocabulary – for all 14 languages.
Version 1 of DIALANG is partly computer adaptive, in the sense that it is "test level adaptive": upon entering the system, users are presented with a Vocabulary Size Placement Test (VSPT), consisting of a list of 75 words, of which 25 are pseudo-words. The user receives one point for each word correctly identified as real or pseudo. Following the VSPT, users are offered the opportunity of self-assessing their ability in the language and skills that are to be tested. The results of the VSPT and the self-assessment are combined to decide which level of test the users will be given. At present, there are three levels of test difficulty – 'easy', 'medium' or 'difficult'. Version 2 of DIALANG will be adaptive at the item level for those languages for which sufficient responses have been gathered, and for which a suitably large bank of calibrated items is available.

34 http://www.dialang.org.
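The placement step just described can be illustrated with a brief sketch: the VSPT score (one point per word correctly judged real or pseudo, out of 75) is combined with a self-assessment to choose between the three test levels. The cut-off scores and the simple averaging rule below are illustrative assumptions, not DIALANG's actual combination rules.

def vspt_score(responses, is_real_word):
    """responses: {word: True if judged real}; is_real_word: {word: True if real}.
    One point per word correctly identified as real or pseudo (maximum 75)."""
    return sum(judged == is_real_word[word] for word, judged in responses.items())

def placement_level(vspt, self_assessed_band):
    """vspt: 0-75; self_assessed_band: 0 = low, 1 = mid, 2 = high.
    Returns 'easy', 'medium' or 'difficult'. Cut-offs and averaging are assumed."""
    vspt_band = 0 if vspt < 35 else (1 if vspt < 60 else 2)
    combined = (vspt_band + self_assessed_band + 1) // 2  # simple averaging rule (assumed)
    return ("easy", "medium", "difficult")[combined]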
TestDAF
Specifically in Germany, the test of German as a foreign language (TestDAF, modelled on TOEFL) was launched in 2001, and a computer based version of the test is now available. However, due to lack of resources, the test is generally still administered as a paper-and-pencil test.

Development of CBA for Educational Purposes in Europe
Outside the realm of language testing, Cambridge Assessment (formerly the University of Cambridge Local Examinations Syndicate, UCLES) has developed a number of tests that are used for educational placement purposes, for instance the BioMedical Admissions Test (BMAT) and the Thinking Skills Assessment test (TSA). These tests are not yet administered as computer based tests, however. "Achieve" is the title of a formative test for primary and secondary education, which is delivered online. The UK National Foundation for Educational Research (NFER) has developed adaptive versions of the UK mental mathematics test and several other tests.
In 2005, the UK Qualifications and Curriculum Authority piloted the key stage 3 (KS3 – 11 to 14 year olds) on-screen test for information and communication technology (QCA 2005). The test in several ways resembles the ICT test which has been developed by the US Educational Testing Service. The test in itself is not adaptive. However, prior to the test, the teacher of each pupil assigns the pupil to one of two levels. In addition to assessing the correctness of responses, the test also gathers information about how each test-taker approached each task (process information).
The marking of the test is computerized. From 2008, participation in the ICT test is to be statutory.
The initiative "World Class Arena" is meant to help identify and challenge able students.35 It was devised by the British government Department for Education and Skills (DfES), and World Class Arena items have been trialled by teachers and students in the UK, Australia, New Zealand and the US. The tests are aimed at 9 and 13 year-olds. Tests include both a paper based and a computer based component. The computer based components have been devised to test problem solving and mathematics.
In the private realm, the Netherlands based company CITO has played a significant role in the development and implementation of computer based tests in relation to the Dutch education system, and increasingly also in relation to the education systems of other countries.36 The company also offers computer based tests of Dutch as a second language, as well as of several other languages. Finally, computer based tests can be tailor-made for the testing of competences, where the competence requirements of specific job functions make up the starting point for the development of tests.

35 http://www.worldclassarena.org.
36 http://www.cito.com.

"Skills for Life": An Example of a Partly Computer Adaptive Large-scale Skills Assessment Survey
In addition, in June 2002 – May 2003, computer adaptive testing was, for the first time in Europe, put to use in connection with a large-scale skills assessment survey. The British Skills for Life Survey, which was mentioned in Chapter 3 of the report, was administered on laptops, and its test module was – partly – adaptive (IRT calibration of test items was not used, and adaptivity was not at the level of individual items).
In the literacy test, the difficulty of the items administered to test takers was adjusted after a number of items (DfES 2003: 221). The test begins with a number of progressively more difficult screening questions, beginning at a low level of difficulty (Entry Level 2 in the UK Qualifications Framework). Two further banks of questions then follow, each providing opportunities to sift and refine the final judgement of a respondent's ability. A respondent's aggregate performance on each item bank determined how they were routed at each stage: if a respondent answered more than 70% of the questions in a layer correctly, she/he went up a level; if she/he answered between 41% and 69% correctly, she/he stayed at the same level; and if the respondent answered 40% or less correctly, she/he was routed down one level (see the sketch below).
In the numeracy test, respondents were presented with items in seven groups or 'steps' (DfES 2003: 233-238; 243). Each of these seven steps targets different aspects of numeracy. In the first step, all respondents met the same four items, two at Entry Level 1 (the lowest level in the UK National Qualifications Framework) and one each at Entry Levels 2 and 3. These were deliberately chosen so as to present familiar and straightforward tasks to all respondents. Based on their performance, respondents were then directed to one of three overlapping groups of five items, forming Step 2, with items ranging from Entry Level 1 to Level 2 in the UK Qualifications Framework. Depending on their performance on these, the algorithm takes respondents to two items of
an appropriate level in Step 3; these range from two at Entry Level 1 to two at the top level – Level 2. Again depending on their performance on these, the algorithm takes respondents to two appropriate items in Step 4. This is repeated up to Step 7, so that each respondent encounters 19 numeracy items in all, drawn from 48 numeracy items altogether.
The ICT skills awareness test (which tested knowledge about ICT) also employed a routing algorithm to channel respondents through different groups of items in response to individual performance on each section of the assessment (DfES 2003: 248-249). A diagnostic section of test items was followed by a "level determination" section and a "level confirmation" section. The test items presented to the test-taker in each section depended on his/her performance in the previous section. The practical ICT skills assessment did not employ routing algorithms and adaptivity in the use of test items.
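The layered routing rule used in the literacy test can be made concrete in a short sketch. The thresholds (more than 70% correct: up one level; 41-69%: stay; 40% or less: down one level) and the Entry Level 2 starting point are taken from the description above; the level labels and the placeholder item banks are simplifications and do not reproduce the survey's actual item content.

# Layered routing as described for the Skills for Life literacy test.
# Thresholds (>70% up, 41-69% stay, <=40% down) follow the text; the banks are placeholders.

LEVELS = ["Entry 1", "Entry 2", "Entry 3", "Level 1", "Level 2"]

def route(level_index, proportion_correct):
    if proportion_correct > 0.70:
        return min(level_index + 1, len(LEVELS) - 1)
    if proportion_correct <= 0.40:
        return max(level_index - 1, 0)
    return level_index

def administer(banks, answer_fn, start_level="Entry 2"):
    """banks: list of item banks (screening bank first); answer_fn(item, level)
    returns True for a correct answer. Returns the final level label."""
    idx = LEVELS.index(start_level)
    for bank in banks:
        correct = sum(answer_fn(item, LEVELS[idx]) for item in bank)
        idx = route(idx, correct / len(bank))
    return LEVELS[idx]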
Research in Connection with CBA Software Platforms
Research and development work on computer based assessment, including adaptive testing, is going on in a number of places across Europe. The TAO project (Testing Assisté par Ordinateur), situated at the University of Luxembourg, can be mentioned. This project seeks to develop a software platform that can enable the development of computer based tests in an open-source environment.37 The rationale for the project argues that future evaluation needs will imply collaboration among a large number of stakeholders situated at different institutional levels and with very different needs for assessment tools. For this reason, the TAO framework has the ambition to provide "a modular and versatile framework for collaborative distributed test development and delivery with the potential to be extended and adapted to virtually every evaluation purpose that could be handled by the means of computer-based assessment". In order to reach this goal, the testing problem is being broken down into its constitutive parts (item development and management, test development and management, subject management, group management, test delivery and results management) and implemented in the form of multiple specialized modules connected in a peer-to-peer network. As the TAO architecture should fit many different use cases, there is no central node in the system (the whole network of TAO nodes is an open system with no global consistency), and it is assumed that there cannot be an a priori model for each data domain that commands consensus among all the actors.

37 http://www.tao.lu.

4.2. Experiences from Computer Based and Computer Adaptive Testing
With the introduction of computer based assessment in ever more contexts, there is now a broad base of experience that is relevant for assessing the potential of CBA in connection with adult skills assessment. The methodological, practical and operational experiences with computer based assessment, and in particular computer adaptive testing, can be structured under the general headings of advantages and disadvantages or challenges.

Advantages of Computer Based Assessments
Compared to ordinary paper-and-pencil tests, there are a number of potential advantages of computer based assessments, and there are even more potential advantages in relation to adaptive computer based assessments (cf. Educational Testing Services 1994; van der Linden and Glas 2000: vii-xii; Meunier 2002; DfES 2003; Warburton and Conole 2003; Ridgway, McCusker and Pead 2004).

Practical and Other Benefits for Test Takers
There are a number of practical benefits for test takers in connection with Computer Based Assessment and/or Computer Adaptive Testing:
1) Convenient scheduling may be possible. CBA and CAT make it possible for students or other types of test takers to schedule tests at their own convenience. Depending on the specific contents and conditions of the tests, high stakes assessments may, however, require that tests are carried out under some kind of supervision. Many high stakes tests have to be taken in dedicated testing suites, so that booking a session and travel are necessary.
2) Comfortable settings possible. CBAs/CATs can be taken in a more comfortable setting and with fewer people around than in large-scale paper-and-pencil administrations. Since the administration of tests is via computers and/or the internet, test administrators do not necessarily have to gather test takers in large groups in order to be able to manage the process. The negative aspect of this advantage is that CBAs and CATs require computers, which are often a scarce resource. The ability of paper tests to be administered to very large numbers with few proctors is one of the advantages of that approach.
3) Digital processing of the test data and reporting of scores is faster. Contemporary computer based or adaptive testing applications frequently include the possibility of an immediate feedback or test score report to the test taker, sometimes in connection with other forms of feedback and advice. The DIALANG language diagnosis system provides elaborate feedback, which both places the user in relation to the levels of the Common European Framework of Reference for Languages and comments upon differences or similarities between the test scores and the self-assessment of the user. Once users have finished the test, they see a screen which offers them a variety of feedback: four different types of results, and two sets of advice (Alderson 2005: 34-35).38
38 The feedback options are: "Your level", which informs users of their test results in terms of the six levels of the Common European Framework of Reference for Languages; "Check your answers", which provides the user with a graphic presentation of which answers they got right and which they answered incorrectly (red being incorrect and green being correct); "Placement Test", which informs the user about the score on the vocabulary size placement test which determined the level of difficulty of the remainder of the test; "Self-assessment feedback", which takes users to a comparison of their test score (reported in terms of CEFR levels) with their self-assessed CEFR level; "About self-assessment", which provides a number of reasons why self-assessment and test results may not match, both in general terms and in detail under a number of more specific hyperlinks; and "Advice", which presents the user with information on some of the differences between the level the user was assigned to according to the test results and the next CEFR level above and below.
4) Shorter tests. Computer adaptive tests are shorter than paper-and-pencil versions of the same test – which is not necessarily the case for computer based assessment in general. Since computer adaptive tests adapt to the test taker's level, questions that are well above or below the test taker's ability level will not be presented. As such, a CAT can be administered in a shorter period of time while still yielding precise information about the test taker's level (Meunier 2002: 27). In the US GRE General Test, the computer adaptive version contains only about half the number of items that appear on the traditional form of the assessment (Mills and Steffen 2000: 77). In connection with computer adaptive language testing, Madsen (1991) reported that over 80% of the students required fewer than 50% of the reading items normally administered on the paper-and-pencil test (Madsen 1991: 250).
5) Motivational benefits. Computer based tests tend to create a more positive attitude toward testing. There is evidence that users feel more in control; interfaces are often judged to be friendly; and the wider range of possible test contents in computer based assessments sometimes leads test-takers to associate tests with learning environments or with recreational activities, such as games (Richardson et al. 2002; Ripley 2004). As mentioned, computer adaptive tests are shorter, and the test items are rarely much too easy or much too difficult. For this reason, test takers are less likely to feel bored and frustrated in the test-taking situation (Meunier 2002: 27). However, as mentioned earlier, adaptive tests frequently operate with a success rate of 50% for test takers – but humans normally prefer higher success rates (around 80%) to stay motivated. There may thus be a trade-off between motivation, according to which test takers should be successful 80% of the time, and psychometric testing efficiency, according to which a 50% success rate is the best option.39
The experiences with computer based adaptivity in the British "Skills for Life" survey were generally very positive. In particular, it was noted that there were considerable advantages connected with adaptivity, as it enabled learners to relax and perform at or close to their true potential, and that adaptive testing therefore reassured the test persons and gave them confidence in their own ability (DfES 2003: 227, 239). Among other things, this meant that the great majority of the respondents in the first round of tests (concerning literacy and numeracy) were also willing to take part in the second test round (concerning ICT skills) – regardless of their performance in the first round. This was the case for 89% of the respondents who took part in the first test round, and 80% of those classified at Entry Level 3 or below in the literacy test (DfES 2003: 216; 228).
The UK Skills for Life Survey also reports that there are distinct motivational advantages in the use of computer based assessment, regardless of the adaptivity of the test.

39 Observation provided by Chris Whetton.
Typically, the laptop computer was seen as a neutral question-setter, with the interviewer being viewed as 'on the same side' as the respondent, rather than as a question-setter or expert (DfES 2003: 239). Partly because respondents in this survey were not required to operate the laptop themselves, the fact that many respondents had no personal experience of working with computers in this way was not seen as a barrier. Indeed, "the modern image portrayed by the use of the computer is welcomed and appears to raise the status of the whole activity, distancing it from previous learning experiences" (DfES 2003: 239).
6) Self-pacing. Computer based and computer adaptive tests are most frequently not limited in time, although some programmes take into consideration the time spent by test takers per test item as part of the test result. Even so, test takers are generally not directly pressured by time, and as such, time-related anxiety does not affect test results (Meunier 2002: 27).
In the current generation of CBA and CAT programmes, the above-mentioned advantages have to a high extent been realized and are generally appreciated by test takers: when offered the choice between a paper-and-pencil test and a CBA or CAT version of the same test, most examinees typically choose the CBA or CAT version (van der Linden and Glas 2000: viii).

Methodological Advantages of Computer Based Assessment
In addition to these practical benefits for test-takers, there are a number of potential methodological benefits as well.
7) Wider ranges of questions and test content can be put to use. For instance, computer based assessments offer response options other than simple multiple-choice solutions, such as "drag-and-drop" answering modes, and interactivity allows whole new types of test items. Through the inclusion of computer graphics, sound, animated images and video clips, computer based assessment has the potential to offer varied, innovative assignments. These possibilities extend the boundaries of paper-based assessments (Bull and McKenna 2004: 7). As for problem solving, Means et al. (2000) assert that the best way to assess this domain is to trigger and/or make use of situations that demand that test-takers employ the required skills. Computer based assessment of problem solving tasks seems to be a suitable solution to meet this objective. Moreover, this method permits the analysis of a set of steps and procedures that offer an insight into performance, as we shall return to below (Nunes et al. 2003: 375).
The computer based ICT skills test which has recently been developed by the US Educational Testing Service makes use of a scenario-based approach, in which test takers are to solve different tasks in scenarios constructed in an artificial ICT environment. Computer based assessments in other words offer more possibilities for creating relevant test environments. CBA therefore in principle holds the promise of improved construct validity in comparison to traditional paper-and-pencil tests.
8) New possibilities for measuring process information. Computer based assessments open up the possibility of obtaining information about other variables than just the test-taker's answer to the test item. In connection with feasibility studies of large-scale assessment of ICT skills, it was noted that pilot tests had made it possible to capture process information – such as the sequence of actions in relation to complex tasks, the specific methods used to solve tasks, or the use of time for different types of tasks – rather than simple right/wrong responses. This informed the understanding of performance and proficiencies (Lennon et al. 2003: 59-60). However, it was also noted that despite positive initial results, issues related to the capture and scoring of process-related behaviours still required additional work.
9) Measurement precision. Adaptive testing adjusts the difficulty of test items to the ability of the test-taker, as measured by answers to previous items. Provided that the test item bank contains a sufficient number of test items at all difficulty levels, this entails greater measurement precision. For instance, the situation can be avoided where it is not possible to discriminate between the abilities of two different test takers because both have answered all test items correctly. Adaptive tests will increase the difficulty level of test items until the proportion of incorrect answers increases (see the sketch at the end of this list). At least in the area of language testing, the superiority of CBA to conventional tests has been established in terms of high reliability and validity indexes (e.g. Madsen 1991).
10) Improved test security in adaptive computer based tests. Since the ability level of every test taker is different, and since every test taker is given an individualized test, no information that would directly help other test takers can be distributed. However, there is a risk that certain test items are overexposed in adaptive tests, since they are particularly good at discriminating between test takers' levels of ability. These overexposed items may become known among test takers, cf. below.
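As flagged under point 9, the gain in precision comes from always presenting items whose difficulty lies close to the current ability estimate. The sketch below implements this selection principle under a one-parameter (Rasch) IRT model. The fixed-step ability update is an illustrative stand-in for the maximum-likelihood or Bayesian estimation used in operational CATs, and the item bank and parameter values are assumptions, not drawn from any of the programmes discussed here.

import math

def p_correct(ability, difficulty):
    """Rasch (one-parameter) model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def pick_item(item_bank, ability, used):
    """Choose the unused item whose difficulty is closest to the current ability
    estimate, i.e. the most informative item under the Rasch model."""
    return min((i for i in item_bank if i not in used),
               key=lambda i: abs(item_bank[i] - ability))

def adaptive_test(item_bank, answer_fn, n_items=10, step=0.5):
    """item_bank: {item_id: difficulty in logits}; answer_fn(item_id) -> bool.
    The fixed-step ability update stands in for proper IRT estimation and
    assumes the bank holds at least n_items items."""
    ability, used = 0.0, set()
    for _ in range(n_items):
        item = pick_item(item_bank, ability, used)
        used.add(item)
        ability += step if answer_fn(item) else -step
    return ability

# Simulating a test taker with a true ability of 1.0 logits:
# import random
# bank = {k: d for k, d in enumerate([-2, -1.2, -0.5, 0.0, 0.4, 0.8, 1.1, 1.5, 2.0, 2.6])}
# estimate = adaptive_test(bank, lambda i: random.random() < p_correct(1.0, bank[i]))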
Challenges and Difficulties in Connection with CBA and CAT
However, the growth in the use of CBA and CAT has also given rise to a number of new questions, difficulties and challenges. Below we list some important concerns and challenges.40
1) Item security and item exposure. In high-stakes testing programmes, item security quickly became a problem when CBA and CAT were introduced. The capability of examinees to memorize test items, as well as their tendency to share them with future examinees (collusion), appeared to be much higher than anticipated. As a consequence, the need arose for effective methods to control item exposure (in light of the tendency for some test items to appear relatively frequently in adaptive tests) as well as to detect items that have been compromised (van der Linden and Glas 2000: viii-ix; Mills and Steffen 2000: 78).
40 For an elaboration of some of the issues, and discussion of further issues, see the website CAT Central, http://www.psych.umn.edu/psylabs/catcentral.
In the US GRE tests, it has been attempted – with significant success – to counter these problems by creating separate, but partly overlapping, test item pools. One conclusion drawn from the experiences of the GRE is that rotating pools of test items effectively reduces the impact of collusion to very small levels. Moreover, in the GRE the maintenance of the quality of test item pools has been given considerable attention. Arbitrary rules have been developed regarding the frequency of item pool development. These rules stipulate that recently used items are not eligible for inclusion in a new pool, that items exceeding certain usage rates in the past cannot be included in new pools, and that items exceeding a maximum usage level are retired from use in the operational programme (Mills and Steffen 2000: 79-83; Way, Steffen and Anderson 1998).
The question of item security and item exposure is probably not a big problem in connection with low-stakes tests such as large-scale adult skills surveys, but the problem cannot be ignored entirely. A European adult skills assessment survey may, for instance, generate public interest, and test items may be made public in the media.
2) Open ended questions cannot be handled efficiently. Today's software cannot efficiently handle the systematic scoring of all possible language or graphical production in open ended questions. Even though the field is developing rapidly, CBA is limited to test formats such as multiple choice questions and scenario-based tasks, and, in particular with respect to language testing, cloze activities and jumbled sentences (Meunier 2002: 28). In CATs that are adaptive at the level of individual test items by way of using Item Response Theory, computational procedures for ability estimation and item selection must be carried out in real time. This means that there are still large limitations on the permissible responses of the test-takers, as responses need automatic scoring. Most adaptive testing algorithms assume that the responses to items are dichotomously scored, correct or incorrect (Eggen 2004: 4).
A number of approaches have been taken to the problem of automatic scoring. One is based on analysis of the surface features of the response, such as the number of characters entered, the number of sentences, sentence length and the like (Cohen, Ben-Simon and Hovav 2003). A second approach is to define a range of acceptable responses (e.g. Sukkarieh, Pulman and Raikes 2003). However, while these approaches can reduce the resources required for human scoring of answers, they are not so reliable and valid that they make automatic scoring of open ended questions possible.41

41 However, the US Educational Testing Service is introducing automatic scoring of essay writing in the MAPP test, using e-rater® (an automated essay scoring engine). Similarly, the essay writing test in ACCUPLACER Online, called WriterPlace Plus Electronic, scores students' responses electronically using IntelliMetric, an artificial intelligence based writing sample scoring tool. On IntelliMetric, see http://www.vantage.com/demosite/demo.html.

3) The scoring of incomplete computer adaptive tests. An important operational question is how to handle incomplete CATs. Which score is to be awarded if the respondent has answered, for instance, only 70% of the test items? There are several
possible options, but none of them is without methodological difficulties. In the GRE CAT, an initial decision was made that a score would be awarded if 80% of the test items in a section had been answered. However, this policy led to significant changes in test behaviour, and other options had to be considered (Mills and Steffen 2000: 87-95). The scoring of incomplete CATs is probably a larger problem in connection with high-stakes tests than in connection with low-stakes tests such as adult skills surveys.
4) Construct validity in connection with tests of multidimensional skills. As mentioned, IRT presupposes the unidimensionality of the latent trait or ability that is being measured. This raises the question of how feasible it is to test multidimensional constructs using this approach. The question has been particularly hotly debated in connection with language testing, where it has been argued that language competence is the result of more than one construct, e.g. grammatical competence, textual competence, illocutionary competence, and sociolinguistic competence. However, in recent years a consensus appears to be emerging according to which language skills are seen to have five different dimensions (reading, writing, listening, grammar and vocabulary) that must be tested separately. This has allowed the use of Item Response Theory in connection with language testing (cf. Alderson 2005). The methodological assumption of unidimensionality in Item Response Theory does, however, remain highly relevant in connection with the possible testing of other types of skills that may have multiple dimensions.42

42 It should be mentioned that there are statistical tests available to assess the dimensionality of item response data.

5) Resources. Test design for CBA and CAT requires considerable resources. Efficient use of CBA requires that item authors and content specialists are given training in writing questions for computerized tests, as previous experience from paper-and-pencil tests may hold back their imagination. In order to develop a computer adaptive test based on item response analysis, a large sample of test items first needs to be calibrated; then time is needed to sequence items based on a continuum of difficulty; and finally a pilot test is required prior to administering the final version of a computer adaptive test. Administering the final test may require the continuous maintenance and rotation of the pool of test items, depending on how high the stakes of the test in question are.
"Adaptive style" testing can be developed without the use of Item Response Theory and item response analysis, as is exemplified by the British Skills for Life Survey (DfES 2003). Some of the advantages of adaptive testing can be reaped in this manner at a lower cost than the development and implementation of state-of-the-art adaptive testing using IRT. However, this also means that the methodological advantages of Item Response Theory cannot be realized, comparability of test results using different test items being one of the most important.
Online multiple choice tests can be cheap to administer and score. However, if the potential of ICT for skills assessments is to be utilized – for instance by making use of interactivity in test items, and by making tests adaptive – test development is a very comprehensive task (cf. Ridgway et al. 2004).
6) Implementation. There are frequently technical difficulties in connection with mass testing on many PCs or servers, using internet-based or other forms of networked solutions. Ensuring that the tests run and that data is collected is not problematic when the designers can control the hardware and its configuration. Where this is not possible, there may be many sites where the test does not run.43 These technical difficulties are, however, expected to be overcome rather soon.

43 Observation by Chris Whetton, UK National Foundation for Educational Research.

7) Testable skills domains. Far from all possible skill domains can be assessed using computer based and/or computer adaptive tests. Compared to traditional paper-and-pencil tests, the fact that open ended answers are not possible defines one limit. On the other hand, the interactivity of the computer opens up new possibilities. Still, far from all skills can be tested in a standardized and individualised test set-up. We shall return to this point in the next section.

4.3. CBA in Adult Skills Assessment: Challenges and Possibilities
This section analyses the implications of the theoretical and empirical findings presented in the sections above. How relevant and feasible are computer based assessment and computer adaptive testing in relation to adult skills assessment on a European scale?

Limited Experiences with CBA and CAT in Connection with Adult Skills Surveys
Whereas computer based assessment and computer adaptive testing are rapidly growing approaches to the assessment of skills and competences, we have identified only one large-scale skills assessment survey that has made use of these technologies for collecting information about the levels and distribution of skills in an entire population:
• The UK Skills for Life Survey is the only example of a computer based and (partly) adaptive test which has been administered to a large sample in order to collect information on the level and distribution of skills in a population.
Other assessment programmes are in the planning stage; they are used for educational achievement or placement purposes or vocational placement purposes; or they are developed and used for individual formative or summative purposes:
• The OECD PIAAC adult skills assessment initiative is currently foreseen to include a computer based test of ICT skills, based on the ETS ICT skills test. However, this initiative is still only in the planning phase.
• There are a number of examples of computer based assessments that are administered in relation to the assessment of student achievement or student placement in the educational system. These tests currently cover English, Mathematics, Science, problem solving, critical thinking, analytical writing, verbal reasoning and a number of more specific educational domains.
• There are general aptitude tests that have been computerized and rendered computer adaptive, for instance the US Armed Services Vocational Aptitude Battery.
• Computer based tests that are taken for individual purposes, outside the realm of the education system, can test foreign language proficiency in a number of languages, as well as specific job related skills.
• The foreign language tests DIALANG and BULATS are the only examples of computer based skills tests that have been developed with a transnational perspective from the outset.
The Relevance of CBAs and CATs for the Full Range of Politically Relevant Skills
Is it possible to test the full range of politically relevant skills among adults using computer based assessment?

A range of skills on the European agenda
There is no firm European agenda for adult skills assessment. A range of skills is being discussed, with specific attention being paid to language skills, digital skills (e-skills/ICT skills), and literacy. Skills such as entrepreneurial skills and broader "soft" skills such as social skills, learning-to-learn abilities and broad personal competencies are also seen as important (DTI et al. 2004: 25-28).

No consensus on the definition of the concept of skills
Moreover, it must be taken into account that even though the word "skill" is widely used, there is no consensus as to the precise meaning of the term. Skill as a label, which carries with it the prospect of labour market rewards, has historically been a contested concept among employers and employees, and sometimes between men and women, in the definition of jobs. "Skill" as a concept has also evolved in public discourse, starting from a fairly narrow definition used to refer to quite specific high-level educational qualifications and analytical capacities, or to "hard" technical abilities or vocational competencies associated with particular occupations. Over the past decade or more, the term has increasingly been used to refer to a broader range of qualities that individuals possess and that are thought to be germane to workplace productivity. Skills may refer to technical, job specific competencies, but also to personal characteristics, attitudes, "generic" competencies (e.g. problem-solving, teamwork, communication) and even "aesthetic skills" (Stasz 2001; Payne 1999).

Standardised, reliable assessments not available or feasible for all skills
A recent study, which created a framework for a European adult skills assessment strategy, concluded that standardised, reliable assessments are available or feasible only for some of the whole range of skills that are considered politically relevant (DTI et al. 2004: 9-10):
• Literacy skills (prose, document)
• Numeracy skills (quantitative literacy)
• Problem solving/analytical skills
• Foreign language competency
• Job- or work-related "generic skills"
• Information and communication technology (ICT) skills
Of these, there were some limitations in direct assessment of the latter two categories – generic skills and ICT skills – but indirect assessment of these skills linked to work context is possible. Since the finalization of the 2004 report, the development of tests in the field of ICT skills has progressed.
The 2004 study also concluded that, in the context of a European-wide adult assessment initiative, it is not presently feasible to directly assess some skills of interest to policy makers, for definitional, methodological, and to some extent also political reasons. These include:
• Entrepreneurial skills
• Some social skills
• Learning-to-learn skills or the ability to learn.
CBA and CAT Can be Applied to Testable Skills – But Will Require Many Resources
To what extent does the evolution of computer based and computer adaptive testing affect these conclusions? Several points are important in this connection:
• For all of the six categories of skills where the DTI (2004) study concluded that skills assessment is feasible, barring the job-related "generic skills", computer based and computer adaptive assessments currently exist.
• Computerized tests entail methodological as well as practical advantages. Indeed, any meaningful test of ICT skills presupposes that the test is computer based.
• Using advanced scenario-based techniques, computer based and computer adaptive tests may in the longer term make possible the testing of some skills that are currently not testable, for instance certain social skills. However, at present we do not find evidence that computer based assessment makes possible the assessment of politically interesting skills such as entrepreneurial skills or social skills such as teamworking skills.
• Developing computer based skills assessment and computer adaptive testing on an international basis will be very demanding in terms of time and resources. The most resource demanding process will be the development of computer adaptive testing. The process involved in the construction of the DIALANG system is highly illustrative (Alderson 2005). This process involved the development of a CAT system for 14 languages, but only for one skill category, namely language skills. The required efforts will be multiplied if additional skills categories such as literacy, numeracy, problem solving and ICT skills are also to be assessed. A possible shortcut could be to make use of existing computer based tests such as those developed by the US Educational Testing Service and adapt these tests to international purposes. However, methodological obstacles in international adaptation may add to costs, and taking ETS tests as the starting point may also be politically problematic.
• On the positive side, however, the high development costs are likely to result in a system which can be used in the future with relatively modest running costs involved. Adaptive tests will probably require some maintenance of test item pools (although the question of test item security and item exposure is first of all relevant in connection with high-stakes tests), but apart from this the costs of administering computer adaptive tests are most likely not significantly higher than those of administering paper-and-pencil tests.44
• There are considerable challenges involved in the implementation stage of a large-scale computer based skills assessment, one important challenge being to limit the time required for participation in the test. However, these challenges must also be confronted in tests that are not computer based, cf. DTI et al. (2004: 17-18). Indeed, as computer adaptive tests are shorter than non-adaptive tests, the timing problem will not be bigger in the computer adaptive version of tests.
44 BTL Group (www.btl.com), the company which was responsible for the transfer of the 2003 UK Skills for Life survey tests to a computer platform, observes that most additional costs related to computer based assessments lie in the development of test contents, including the calibration study which is needed for developing adaptive tests, and in the necessary investment in hardware and software. Due to improved software applications, the transfer of test contents to a computer platform is nowadays a relatively simple process. As regards running costs, computer based assessments are more expensive than traditional paper-and-pencil based tests in requiring the maintenance of hardware, possibly fees for software licenses, and the provision of some technical support. However, to some extent these extra costs are outweighed by savings in the processing of test data. In general, the "break-even" point between on-paper and on-screen testing can vary greatly depending on implementation and development approaches. Interviews with Nevilly Percy and John Winkely of BTL Group, September 2006.
5. The Relevance of International Qualifications Frameworks
In international skills assessments, an independent standard is essential. Facing a multilingual audience with different cultural, educational and occupational backgrounds, it is an important and also a challenging task to establish a stable standard for test results obtained in different countries and target populations. This chapter will analyse the possibilities and challenges relating to setting standards developed on the basis of common frameworks of reference. The following are the main questions of the analysis:
• What are the main steps and challenges in the process of setting an independent standard based on a common framework of reference?
• What are the main methodological challenges of setting standards?
• Is it feasible to develop independent standards for tests based on self-assessment?
• Do ISCED and the EQF constitute relevant frameworks of reference for setting independent standards?
The analysis will draw on examples of existing tests and self-assessments which have developed standards founded on common frameworks of reference.

5.1. The Need for Setting Independent Assessment Standards
In isolation, test results and assessment scores are of limited informational value. To be useful in a developmental perspective, at the individual as well as the societal level, the results must be related to a standard against which they can be compared or measured. Basically, there are two types of standards (Messick 1995):
• Content standards, which specify the kind of things a learner should know and be able to do. The Common European Framework of Reference for Languages (CEFR), for instance, offers a generic descriptive scheme of different skills and sub-skills against which one can map and develop content standards for new tests.
• Performance standards, which define the level of competence a learner should attain at key stages of developing expertise in the knowledge and skills specified by the content standards. Here the CEFR common reference levels offer a generic standard against which one can map the performance standards of existing tests, and from which one can develop performance standards for new tests.
Setting a performance standard means relating a learner's performance to some known and accepted standard of ability or performance. The standard defines the type of skill or sub-skill derived from the framework, and the level at which the skill/ability is mastered. The crucial question is: at what point can a person's performance on this test be considered to reflect mastery of the ability being tested, at the level required?
Norm Referencing
An immediate possibility would be the use of so-called norm referencing. This method simply compares a learner's score with the scores of a larger population who have already taken the test. One can then decide that if the test taker has reached a score which is higher than that of a target percentage (for instance 60 per cent) of this population – called the norming population – then the test taker can be considered to have been successful. Hence, in norm referencing, norms are set by simple comparison with some norming or target group, regardless of the actual, "absolute" level of achievement. However, the greater the difference between the characteristics of the norming population and the sample taking the test, the harder it will be to interpret the test scores.
Pass Marks
Alternatively, in many settings an arbitrary pass mark is used, for example 50 per cent of the items correctly answered. The problem with this method of setting pass marks is that the standard will vary depending on the difficulty of the test being taken: answering 40 per cent of a set of very difficult items correctly is a better performance than answering 40 per cent of very easy items correctly. Moreover, the difficulty of the test may vary from year to year.
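The contrast between the two approaches can be made concrete with a short sketch that classifies the same raw score once against a norming population and once against a fixed pass mark. The 60 per cent target and the 50 per cent pass mark are the illustrative figures used in the text above; the example scores are invented.

def norm_referenced_pass(score, norming_scores, target_percentile=60):
    """Pass if the score is higher than the target percentage of the norming population."""
    below = sum(s < score for s in norming_scores)
    return 100.0 * below / len(norming_scores) >= target_percentile

def fixed_pass_mark(n_correct, n_items, pass_mark=0.5):
    """Pass if at least the given share of items is answered correctly,
    regardless of how difficult those items happen to be."""
    return n_correct / n_items >= pass_mark

# The same raw score can pass under one standard and fail under the other:
# norm_referenced_pass(24, [18, 20, 22, 25, 27, 30]) -> False (only 50% scored lower)
# fixed_pass_mark(24, 40)                            -> True  (60% of items correct)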
DIALANG as a Key Example
The above-mentioned problems highlight the need to develop some means of interpreting a test score independently of the people who happen to take the test. An example of key relevance in this context is the DIALANG project, which has developed a standard for an international assessment based on the CEFR. Based on an analysis of the processes and challenges confronted in this project, we will discuss the implications and possibilities of using the following frameworks of reference as the foundation of standards for international, computer-assisted assessments:
1. The European Qualifications Framework (EQF)
2. ISCED – The International Standard Classification of Education
We first analyse the process of developing standards on the basis of international frameworks of reference. We then introduce the ISCED and the EQF, and carry out an assessment of their relevance for international adult skills assessments.

5.2. The Steps and Challenges of Setting a Standard: Case DIALANG
This section will analyse the methodological steps and challenges which the process of setting a standard involves. As mentioned earlier, the DIALANG project is an example of key relevance in this connection. In DIALANG, a standard was developed for use in cross-national assessment, based on the CEFR. As described in a previous section, DIALANG is a diagnostic language assessment system available via the internet in 14 European languages. It gives learners a variety of feedback on the strong and weak points of their proficiency and advice for further learning. The diagnosis given in DIALANG is not directly related to specific language courses or curricula. Rather, it builds on the specifications of language proficiency presented in the Common European Framework of Reference for Languages.
The Common European Framework is a descriptive framework of what a user of a language can "do" at six different levels of performance, ranging from "basic" (A1, A2) through "independent" (B1, B2) to "proficient" (C1, C2). Underpinning this global definition of performance is a large number of communicative activities and sub-scales, which are grouped under the following overall categories: 1) Reception (spoken, audiovisual and written), 2) Interaction (spoken, written) and 3) Production (spoken, written). For all these areas, descriptors have been developed at all six levels. DIALANG's Assessment Framework summarizes the relevant content of the CEFR, including the six-point reference scale, communicative tasks and purposes, themes and specific notions, activities, texts and functions.
The main methodological steps and challenges in DIALANG's development of a standard based on the CEFR are displayed in Figure 5.1 below. In the subsequent sections, the contents of each step are described in more detail.

Figure 5.1. Main Steps in the Development of a Standard Based on a Framework
1) Awareness of the Theoretical Assumptions of the Framework
• What is the theory of the skills measured?
2) Complementing for the Limitations of the Framework
• Does the Framework cover all aspects of the competences which are to be assessed?
• Are all sub-skills, functions and topics covered?
3) Explaining the Framework to the Assessment Development Teams
• Development of Assessment Framework
• Development of written instructions for the production of test items
4) Conservative or Innovative Items?
• Do the test items represent a variety of different formats?
• Are the possibilities of computer-based tests sufficiently utilized?
5) Piloting of Items
• Calibrating the items onto a common scale of difficulty
• Does the predicted level of difficulty match the empirical difficulty?
6) Relating Test Items and Resulting Scores to the CEFR levels
• Where is the cut-off point on the scale between one CEFR level and another?
• Choice of standard-setting procedure
Awareness of the Theoretical Assumptions of the Framework
The choice of a given framework to be used as the basis for a standard involves reflection on the theoretical basis and cultural background of the framework. In the case of DIALANG, the American Bachman Framework was a potential candidate (Alderson 2005: 39). However, it was considered very much a product of the USA and of language testing researchers. In contrast, the CEFR was seen as more adequate, as the imperative for the DIALANG project was to use a framework which could be considered truly European. The CEFR was chosen because it had a relevant functional approach to all the 14 DIALANG languages, and because the CEFR was well known and widely accepted in Europe.
The CEFR describes its perspective as "action oriented", because language users are seen primarily as social actors who use language in the course of their social tasks and activities (Huta et al. 2002). The CEFR focuses on describing language use, but the authors make the point that this cannot happen independently of the social context in which the language is used, or of the individual learners/users who take part in the social activities. DIALANG subscribes to this approach even though it can only implement a narrow range of language use situations. This is because the physical assessment context is the computer, and because the assessment procedure involves interaction with texts and tasks but no real-time interaction with other people.

Complementing for the Limitations of the Framework
The development of the DIALANG assessment specifications may illustrate a typical methodological challenge in standard setting: the framework of reference does not offer complete information on, or coverage of, all aspects of the competences which are to be assessed. In the case of DIALANG, the CEFR was very useful for defining the language use situations, functions and topics tested, and DIALANG item writers were provided with checklists to ensure that their items covered these categories thoroughly enough. However, the CEFR could not be used to specify the item writing process or the actual test tasks; instead the project made use of the relevant language testing literature. The CEFR discusses some aspects of difficulty, such as text characteristics and types of response, but these account for only a part of task difficulty.
The specification of the areas or sub-skills that the DIALANG tests were to cover was not based on the CEFR either. Thus the decision, for example, to define reading and listening in terms of "understanding main ideas", "understanding specific details" and "making inferences" came from other sources. Moreover, the CEFR had only very little material that could be directly used as the basis of vocabulary and grammar tests. It was thus necessary for the project to design its own language-specific specifications for vocabulary and structures.
Explaining the CEFR to the Assessment Development Teams
For each of the 14 languages, Assessment Development Teams were set up in the respective countries in order to develop test items in parallel. All team members were experienced test developers. The challenge in this phase was to ensure a consistent interpretation and utilization of the framework across the countries before the drafting of test items commenced. Across Europe, there is a great variety of different traditions in assessment, and testing is not always a well-developed discipline (Alderson 2005: 46). This meant that the initial quality of the items produced varied, and considerable effort was needed by the Project coordinators to inform, train and monitor the work of the Assessment Development Teams. The instruction of the Assessment Development Teams was based on written, operational specifications as well as consultations, including:
• DIALANG's Assessment Framework (DAF), which sets out the theoretical framework for the system
• The DIALANG Assessment Specifications (DAS), which describe how the framework is translated into concrete tasks
• Introductory meetings between the Coordinating Institution (the University of Jyväskylä) and the Assessment Development Teams
This was a lengthy process which entailed thorough and continuing dialogue between the coordinating institution and the Assessment Development Teams: six iterative drafts of the DAS were circulated around the partnership for comment and modification.

Drafting of Items Based on the CEFR
The production of items took place in a stepwise process, which included drafting, review for content, revision and a second review. Item pools for the 14 languages were developed. An item pool is a collection of items that have been reviewed for content but which have not yet been pre-tested and calibrated for difficulty (see below). Test items were developed in five major areas – Reading, Writing, Listening, Grammar and Vocabulary – at all six levels of the CEFR.

Conservative or Innovative Items?
DIALANG initially developed items in a fairly traditional format, namely two versions of multiple-choice and two versions of short-answer questions. This was in part because these were the easiest to implement and the most familiar to the item writing teams. However, the DIALANG Project was well aware that these test methods were fairly conservative and did not correspond to the innovative character of the Project, which uses computer-assisted tests as well as self-assessment and feedback. Nor did they show the potential of a computer based system for interesting innovation. As a separate part of the DIALANG project, a series of "experimental items" was developed, which were not used in the test but were meant to demonstrate to the Project's sponsors and other interested parties what sort of interesting test methods could be developed through computer delivery.
On the DIALANG website a total of 18 different "item types" are presented.45 Some examples are:
• Pictorial Multiple Choice with Sound, which replaces the traditional verbal multiple-choice options with pictures.
• Video Clips in Listening.
• Confidence in Response. Users respond to a test item, after which they have to indicate how sure they are about their answer on a four-point scale. After completing the item and their confidence rating, users receive immediate feedback about the correctness of their response and comments on their confidence. Users who choose an incorrect answer and were confident about the correctness of their choice are told that they have apparently overestimated their skill or knowledge (a sketch of this logic is given after this list).
• Drag and Drop Activity, in which users first listen to instructions telling them how to arrange a set of objects that they can see on a screen. The learners arrange the objects by dragging them with their mouse. This technique can be used for a variety of tasks in which the learners have to understand, on the basis of either spoken or written instructions, how to put together or arrange various gadgets, machines, vehicles, maps, pieces, etc. This activity can also be used to rearrange pieces of text or even audio or video clips.

45 http://www.dialang.org / "New Item Types".
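A minimal sketch of the "Confidence in Response" logic described in the list above: the four-point confidence scale and the feedback on overestimation follow the description, while the exact wording of the messages and the threshold separating "confident" from "unconfident" ratings are illustrative assumptions.

def confidence_feedback(answer_correct, confidence):
    """confidence: 1 (not at all sure) to 4 (very sure) on the four-point scale.
    The message wording and the >= 3 'confident' threshold are assumptions."""
    confident = confidence >= 3
    if answer_correct and confident:
        return "Correct, and your confidence was well placed."
    if answer_correct and not confident:
        return "Correct; you may be underestimating your skill."
    if not answer_correct and confident:
        return "Incorrect; you appear to have overestimated your skill or knowledge."
    return "Incorrect; your doubt about the answer was justified."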
Such innovative item types, coupled with the provision of feedback and tools to aid reflection, self-assessment, confidence rating, etc., will enhance the development of diagnostic tests and procedures that link assessment more closely to learning. The use of confidence rating in particular could be a very relevant item type, which in combination with test questions/tasks could indicate to what extent users over- or underestimate their skills.
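To make the confidence-rating mechanism concrete, the following minimal Python sketch scores a set of responses against a four-point confidence scale and produces the kind of feedback described above. The data, the rescaling of the confidence scale and the feedback thresholds are illustrative assumptions, not DIALANG’s actual scoring rules.

    # Minimal sketch: combining item correctness with a four-point confidence scale
    # (1 = not at all sure ... 4 = very sure). Data and thresholds are illustrative.

    def confidence_feedback(responses):
        """responses: list of (correct: bool, confidence: int 1-4), one per item."""
        n = len(responses)
        accuracy = sum(1 for correct, _ in responses if correct) / n
        # Rescale confidence to 0-1 so it is comparable with accuracy.
        mean_confidence = sum((conf - 1) / 3 for _, conf in responses) / n
        gap = mean_confidence - accuracy
        if gap > 0.15:
            return accuracy, mean_confidence, "You may be overestimating your skill."
        if gap < -0.15:
            return accuracy, mean_confidence, "You may be underestimating your skill."
        return accuracy, mean_confidence, "Your confidence matches your performance."

    # Mostly wrong answers given with high confidence -> overestimation is flagged.
    print(confidence_feedback([(False, 4), (False, 4), (True, 3), (False, 4)]))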
Piloting of Items
The CEFR is not sufficiently explicit about what exactly a test or a test item at a given CEFR level should contain. Consequently, item writers who intended an item for a given level may have hit a different level. Piloting is therefore important in order to assess whether the predicted level of difficulty matches the actual empirical difficulty experienced by the test-takers. Moreover, piloting of items is also necessary in order to identify unforeseen problems in the design, focus or wording of the items. The piloting of a computer-assisted international assessment such as DIALANG presented several challenges, primarily of a logistic and technical character. One major problem in piloting via the Internet was motivating centres to pilot the tests with their learners, which was partly due to a lack of familiarity with test-taking on computer. Having dedicated organizers was a key to the successes achieved in piloting, and the lack of them the reason for many a failure. The technical problems experienced in the 1997 pilot, such as inadequate equipment and poor network connections, are probably of less relevance today, when Internet technology is much more mature.
45 http://www.dialang.org / “New Item Types”.
Piloting included the following main steps:
• The Assessment Teams were asked to select 60 items per skill area for piloting. Items were selected across the CEFR levels, based on the Assessment Teams’ initial pre-pilot judgement of their difficulty.
• The selected items were reviewed for their adherence to the DAF and DAS.
• For each language, 12 pilot booklets were designed, each with 50 items covering the 5 skill areas (each item appeared in two separate booklets to ensure greater representativeness of the pilot population; a small sketch of this booklet design follows the list).
• Each test-taker responded to one booklet, assigned randomly. In addition, each test-taker responded to a set of self-assessment statements.
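The arithmetic behind this design (5 skill areas x 60 items, each item in two of 12 booklets, giving 50 items per booklet) can be illustrated with the following Python sketch. The particular rotation rule used here is an assumption for illustration only; DIALANG’s actual booklet plan may have been constructed differently.

    import random

    # Illustrative booklet design: 300 items, 12 booklets, each item placed in
    # exactly two booklets, so each booklet ends up with exactly 50 items.
    N_BOOKLETS = 12
    SKILLS = ("Reading", "Writing", "Listening", "Grammar", "Vocabulary")
    items = [f"{skill}_{i:02d}" for skill in SKILLS for i in range(60)]  # 300 items

    random.seed(0)
    random.shuffle(items)                    # mix the skill areas across positions

    booklets = {b: [] for b in range(1, N_BOOKLETS + 1)}
    for pos, item in enumerate(items):
        first = pos % N_BOOKLETS                          # 0..11
        second = (first + N_BOOKLETS // 2) % N_BOOKLETS   # a different booklet
        booklets[first + 1].append(item)
        booklets[second + 1].append(item)

    print({b: len(v) for b, v in booklets.items()})       # 50 items in each booklet
    print("Assigned booklet:", random.randint(1, N_BOOKLETS))  # random assignment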
The aim of the pilot was to conduct an initial analysis of items using IRT (Item Response Theory) statistics once 100 candidates per item had been achieved. Using IRT, the results of the pilot were used to calibrate the items onto a common scale of difficulty. The empirically estimated difficulty of the piloted items was compared with the judges’ initial estimation (“Author’s Best Guess”). For most of the items (ranging from 85% to 100% across the six levels A1-C2), the level intended before the pilot matched the level calibrated after it. In sum, to find out whether a given item is at the intended level of difficulty, pre-testing/piloting on suitable samples of test takers is crucial. Even with the best intentions, item writers may well have produced items which, although intended to be at a given level of the CEFR, actually turn out to be a level or more higher or lower. The results of such piloting are a central element in deciding whether an item meets the intended standard.
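The logic of this comparison can be sketched in a few lines of Python. The sketch below uses the logit of the proportion of incorrect answers as a crude stand-in for a proper IRT difficulty estimate, and the cut-offs between CEFR levels on that scale are purely hypothetical; DIALANG used dedicated IRT calibration rather than this shortcut.

    import math

    # Compare each piloted item's empirical difficulty with the level intended by
    # the item writer ("Author's Best Guess"). Data and level cut-offs are invented.
    pilot = {                     # item id: (n correct, n attempts, intended level)
        "read_01": (85, 100, "A1"),
        "read_02": (40, 100, "A2"),
        "read_03": (12, 100, "C1"),
    }
    # Hypothetical upper boundaries on the logit difficulty scale for each level.
    cuts = [("A1", -1.5), ("A2", -0.5), ("B1", 0.5), ("B2", 1.5),
            ("C1", 2.5), ("C2", float("inf"))]

    for item, (n_correct, n, intended) in pilot.items():
        p = n_correct / n
        difficulty = math.log((1 - p) / p)        # higher value = harder item
        empirical = next(level for level, upper in cuts if difficulty <= upper)
        flag = "" if empirical == intended else "  <-- review this item"
        print(f"{item}: difficulty={difficulty:+.2f}, "
              f"intended={intended}, empirical={empirical}{flag}")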
Relating Test Items and Resulting Scores to the CEFR Levels
Pre-testing and calibrating items onto a common scale of difficulty is not enough, since it does not provide guidance as to where on the scale a person can be said to be at A2 rather than A1, or at C1 rather than B2. What is needed is some way of deciding where on the scale of difficulty the cut-off point between one CEFR level and another lies. One way to carry out this standard setting is to use the judgements of experts who are very familiar with the framework of reference and who know what is meant by each level of the framework. Two alternative approaches to standard setting can be distinguished: 1) the person-centred procedure and 2) the item-centred procedure (Alderson 2005: 64-65). The person-centred method includes the following steps (a small illustrative sketch follows the list):
• A set of candidates whose language abilities are known very well by the judges is gathered.
• The judges, who must be very familiar with the CEFR, assess the candidates according to the level descriptors.
• Each candidate thereby receives a “judged” CEFR level.
• Each candidate takes the test, and their performance on the test is compared with the previous judgement about their ability on the CEFR scale.
• By looking at a sufficient number of candidates and comparing their scores with their CEFR levels as judged by the experts, it is possible to estimate the point on the scale of scores at which a person can be said to be at a given CEFR level, rather than the one below or above.
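A minimal Python sketch of the person-centred idea is given below. It places the cut-off between two adjacent levels halfway between the mean test scores of the two judged groups; the candidate data and the simple “midpoint” rule are illustrative assumptions, and other standard-setting rules could of course be used instead.

    from statistics import mean

    # Candidates with expert-judged CEFR levels and their subsequent test scores.
    judged = {                    # candidate: (judged CEFR level, test score)
        "cand1": ("A2", 31), "cand2": ("A2", 35), "cand3": ("A2", 28),
        "cand4": ("B1", 47), "cand5": ("B1", 52), "cand6": ("B1", 44),
    }

    def cut_score(lower, upper):
        """Midpoint between the mean scores of two adjacent judged groups."""
        lo = [score for level, score in judged.values() if level == lower]
        hi = [score for level, score in judged.values() if level == upper]
        return (mean(lo) + mean(hi)) / 2 if lo and hi else None

    print("A2/B1 cut-score:", cut_score("A2", "B1"))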
The item-centred approach looks at items instead of persons (again, a small illustrative sketch follows the list):
• Each judge’s task is to look at each item that the test contains and to decide whether it can be answered correctly by a person at a given level of the CEFR.
• The items are presented in random order, and the procedure is repeated in two rounds.
• This makes it possible to identify incidents of inconsistency, where a judge places the same item at different levels in the two rounds.
• The judgements of the experts are compared with each other statistically. “Outlier” judges, who are out of line with the majority, may be screened out.
• The aggregated level of each item is then placed on a scale.
• The empirically established difficulties of the same items are then compared against the judged CEFR levels, and the point on the scale of difficulty at which items change from being at one level to a higher level can in principle be taken to be the cut-score between the two levels.
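The following Python sketch illustrates the item-centred aggregation: self-inconsistent judgements are dropped, clearly deviating (“outlier”) judgements are screened, and an aggregated level is assigned to each calibrated item. The judgement data, the two-level screening rule and the use of the median are all illustrative assumptions rather than the procedure actually used in DIALANG.

    from statistics import median

    LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
    rank = {level: i for i, level in enumerate(LEVELS)}

    # judgements[item][judge] = (level in round 1, level in round 2)
    judgements = {
        "item1": {"J1": ("A2", "A2"), "J2": ("A2", "A2"),
                  "J3": ("B1", "B1"), "J4": ("C2", "C2")},
        "item2": {"J1": ("B1", "B1"), "J2": ("B1", "B1"),
                  "J3": ("B2", "B1"), "J4": ("B2", "B2")},
    }
    calibrated_difficulty = {"item1": -0.6, "item2": 0.4}   # from the IRT analysis

    def aggregated_level(item):
        # Keep only judgements that were consistent across the two rounds.
        consistent = [rank[r1] for r1, r2 in judgements[item].values() if r1 == r2]
        med = median(consistent)
        # Screen "outlier" judgements more than two levels away from the median.
        kept = [r for r in consistent if abs(r - med) <= 2]
        return LEVELS[round(median(kept))]

    for item in judgements:
        print(item, aggregated_level(item), calibrated_difficulty[item])
    # A cut-score between two adjacent aggregated levels can then be placed on the
    # difficulty scale between the items judged to belong to those two levels.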
Comparing the two procedures, it can be argued that the item-centred approach is more practical than the person-centred one, because it may be very difficult to gather a sample of people spanning the required range of abilities to take the test. This would certainly be the case in international assessments like DIALANG. An important precondition for the item-centred approach is the quality of the judges, especially their close familiarity with the framework of reference. In the DIALANG project, familiarization procedures were therefore conducted in the form of sessions in which the judges discussed their individual sorting of the items.
Self-Assessment Statements Building on CEFR
The Self-assessment Statements were taken almost directly from the CEFR, which contains a large number of can-do statements in each skill area and at each level. The wording was changed from “can do” to “I can” and some statements were simplified for their intended audience. DIALANG also developed sets of “can do” statements for vocabulary and grammar. Here is an example of the way in which a statement from the CEFR was simplified: CEFR: “Can understand short, simple texts on familiar matters of a concrete type which consist of high frequency everyday language or job-related language.” This text was split into two simpler self-assessment statements:
DIALANG: “I can understand short, simple texts written in common everyday language.”
DIALANG: “I can understand short, simple texts related to my job.”
The self-assessment statements underwent a piloting procedure similar to that of the test items. The correlation between their calibrated values of “difficulty” and the original CEFR levels was very high (0,911-0,928). This indicates that the DIALANG self-assessment statements correspond closely to the original CEFR levels. It was also demonstrated that the self-assessment statements were equivalent across different languages (Alderson 2005: 101-102). Although only a modest correlation has been established in DIALANG between the self-assessments and the skill test scores, the results above indicate that it is possible to establish internal consistency and correlation between the levels of the independent framework of reference (the CEFR) and the self-assessed levels.
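The kind of check reported above can be sketched in a few lines of Python: the calibrated “difficulty” of each self-assessment statement is correlated with the numeric rank of the CEFR level it was taken from. The statement data below are invented and a simple Pearson correlation is used; the actual DIALANG analysis may have differed in detail.

    from statistics import correlation   # available from Python 3.10

    cefr_rank = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}
    statements = [                # (CEFR level of origin, calibrated difficulty)
        ("A1", -2.1), ("A2", -1.2), ("B1", -0.3),
        ("B2", 0.5), ("C1", 1.4), ("C2", 2.3),
    ]
    levels = [cefr_rank[level] for level, _ in statements]
    difficulties = [difficulty for _, difficulty in statements]
    # Close to 1 when calibrated difficulties line up with the original levels.
    print(round(correlation(levels, difficulties), 3))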
5.3. ISCED – The International Standard Classification of Education
The International Standard Classification of Education (ISCED) was designed by UNESCO in the early 1970s and approved by the member states to serve as an instrument suitable for assembling, compiling and presenting statistics of education both within individual countries and internationally (United Nations 1997).
ISCED’s Definition of Educational Activities Covered
ISCED does not intend to provide an internationally standardized concept of education, or to reflect its cultural aspects. For any given country, the interplay of cultural traditions, local customs and socio-economic conditions will have resulted in a concept of education that is in many ways unique to that country. However, for the purposes of ISCED, it is necessary to prescribe the scope and coverage of the educational activities to be covered by the classification. Within the framework of ISCED, the term education is thus taken to comprise all deliberate and systematic activities designed to meet learning needs. Whatever the name given to it, education is understood to involve organized and sustained communication designed to bring about learning. Consequently, ISCED’s definition of education excludes communication that is not designed to bring about learning. It also excludes various forms of learning that are not organized. Thus, while all education involves learning, many forms of learning are not regarded as education. For example, incidental or random learning which occurs as a by-product of another event, such as something that crystallizes during the course of a meeting, is excluded because it is not organized, i.e. it does not result from a planned intervention designed to bring about learning.
The basic unit of classification in ISCED remains the educational programme. Educational programmes are defined on the basis of their educational content as an array or sequence of educational activities which are organized to accomplish a predetermined objective or a specified set of educational tasks. Objectives can, for example, be preparation for more advanced study, qualification for an occupation or range of occupations, or simply an increase of knowledge and understanding.
The Levels and Fields of ISCED
Educational programmes are cross-classified by levels and fields of education, each variable being independent. Thus, every educational programme can be classified into one and only one cell in the level-field matrix. The levels are related to the degree of complexity of the content of the programme. The classification of the levels of education is undertaken within an overall taxonomic framework that considers the educational system as a whole. The levels comprise the following seven categories:
• Level 0 - Pre-primary education
• Level 1 - Primary education or first stage of basic education
• Level 2 - Lower secondary or second stage of basic education
• Level 3 - (Upper) secondary education
• Level 4 - Post-secondary non-tertiary education
• Level 5 - First stage of tertiary education
• Level 6 - Second stage of tertiary education
These categories represent broad steps of educational progression, from very elementary to more complex experiences: the more complex the programme, the higher the level of education. Curricula are far too diverse, multi-faceted and complex to permit international, unambiguous determinations that one curriculum for students of a given age or grade belongs to a higher level of education than another. The international curricula standards that would be needed to support such judgements do not as yet exist. The ISCED manual therefore provides several criteria (typical entrance qualification, minimum entrance requirement, minimum age, staff qualification, etc.) which can help point to the level of education into which any given educational programme should be classified. Flexibility is, however, required when applying the criteria to determine the level of education of an educational programme. The fields of education comprise 25 fields, divided into broad groups composed of fields of education having similarities: General programmes; Education; Humanities and Arts; Social Sciences, Business and Law; Science; Engineering and Engineering Trades; Agriculture; Health and Welfare; and Services.
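The cross-classification of programmes into a level-field matrix can be illustrated with a small Python data structure. The programme records and field codes are illustrative assumptions; the real classification relies on ISCED’s detailed criteria (typical entrance qualification, minimum age, staff qualification, etc.).

    from collections import Counter

    ISCED_LEVELS = {0: "Pre-primary", 1: "Primary", 2: "Lower secondary",
                    3: "(Upper) secondary", 4: "Post-secondary non-tertiary",
                    5: "First stage of tertiary", 6: "Second stage of tertiary"}
    BROAD_GROUPS = {"GEN": "General programmes", "HW": "Health and Welfare",
                    "ENG": "Engineering and Engineering Trades"}

    programmes = [
        {"name": "Nursing diploma", "level": 5, "field": "HW"},
        {"name": "Apprenticeship, electrician", "level": 3, "field": "ENG"},
    ]

    # Each programme falls in one and only one cell of the level-field matrix.
    matrix = Counter((p["level"], p["field"]) for p in programmes)
    for (level, field), count in matrix.items():
        print(f"{ISCED_LEVELS[level]} x {BROAD_GROUPS[field]}: {count} programme(s)")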
Experience over the years with the application of ISCED by national authorities and international organizations has shown the need for its updating and revision, taking into account new developments and changes in education and anticipating future trends in the various regions of the world. The present classification, known as ISCED 1997, has been revised in particular as regards the fields of education. One of the innovations is the establishment of broad groups composed of fields of education having similarities. One such example is the broad group Health and Welfare, comprising educational programmes in medicine, medical services, nursing, dental services and social services.
5.4. The European Qualifications Framework
The EQF, which is currently being discussed in the EU member states on the basis of an initial proposal from the Commission (Commission 2005b), aims to be a meta-framework enabling national and sectoral frameworks and systems to relate and communicate with one another. The framework will facilitate the transfer, transparency and recognition of qualifications – understood as learning outcomes assessed and certified by a competent body at national or sectoral level. The core of an EQF is intended to be a set of reference points, defined by learning outcomes, that relate to the qualifications frameworks (national and sectoral) in use across Europe. In the Commission’s initial proposal, 8 common reference levels are proposed, based on research carried out to support the development of a credit transfer system for VET. A principal function of this framework would be to strengthen mutual trust between the different stakeholders involved in lifelong learning and to facilitate the transfer and recognition of the qualifications of individual citizens. This is considered to be a necessary precondition for reducing barriers to learning and for making better use of existing knowledge, skills and wider competence. The EQF is not intended to encompass detailed descriptions of particular qualifications, learning pathways or access conditions. This is intended to be the task of qualifications frameworks at national and/or sectoral level.
The Basic Concepts: Learning, Learning Outcomes and Competences
The overall justification of the EQF is that lifelong learning has become a necessity in a Europe characterised by rapid social, technological and economic change. An ageing population accentuates these challenges, underlining the need for a continuous updating and renewal of knowledge, skills and wider competences. The realisation of lifelong learning is, however, complicated by the lack of communication and co-operation between education and training providers and authorities at different levels. Consequently, the key purpose of an EQF is to support lifelong learning and to make sure that the outcomes of learning are properly valued and used. The EQF can be used as a frame of reference to define the learning outcomes and the competences that the individual acquires in formal as well as informal settings, and what the individual is expected to know, understand and/or be able to do after the learning. The EQF is based on the following definition of learning:
“Learning is a cumulative process where individuals gradually assimilate increasingly complex and abstract entities (concepts, categories, and patterns of behaviour or models) and/or acquire skills and wider competences. This process takes place informally, for example through leisure activities, and in formal learning settings which include the workplace”.
The term learning outcome is defined as: “The set of knowledge, skills and/or competences an individual has acquired and/or is able to demonstrate after completion of a learning process. Learning outcomes are statements of what a learner is expected to know, understand and/or be able to do at the end of a period of learning”. Hence learning outcomes can be formulated for a number of purposes: in relation to individual courses, units, modules and programmes.
The term competence defines the ability of individuals to combine – in a self-directed way, tacitly or explicitly and in a particular context – the different elements of knowledge and skills they possess, including the following components:
• cognitive competence, involving the use of theory and concepts, as well as informal tacit knowledge gained experientially;
• functional competence (skills or know-how), those things that a person should be able to do when functioning in a given area of work, learning or social activity;
• personal competence, involving knowing how to conduct oneself in a specific situation; and
• ethical competence, involving the possession of certain personal and professional values.
The concept is thus used in an integrative manner, as an expression of the ability of individuals to combine – in a self-directed way, tacitly or explicitly and in a particular context – the different elements of knowledge and skills they possess.
Eight Reference Levels
The understanding of competences described above, which highlights the ability of an individual to deal with complexity, unpredictability and change, is reflected in the EQF’s 8 reference levels defining learning outcomes. Each of the 8 levels in the EQF (see Table 1 in the appendix, displaying the level descriptors) includes three types of learning outcomes:
• knowledge;
• skills; and
• wider competences, described as personal and professional outcomes, including: 1) autonomy and responsibility; 2) learning competence; 3) communication and social competence; and 4) professional and vocational competence.
The aspect of self-direction is critical to the concept, as this provides a basis for distinguishing between different levels of competence. Acquiring a certain level of competence can be seen as the ability of an individual to use and combine his or her knowledge, skills and wider competences according to the varying requirements posed by a particular context, a situation or a problem. As to the competence component knowledge, for instance, at level 1 an individual can “recall basic general knowledge”, while at level 8 an individual has a much more advanced use of knowledge: “Use specialised knowledge to critically analyse, evaluate and synthesise new and complex ideas that are at the most advanced frontier of a field”; “Extend or redefine existing knowledge and/or professional practice within a field or at the interface between fields”.
The eight levels with descriptors (Table 1 in the annex) that focus on learning outcomes are intended to be the core of a possible EQF: the levels are the reference points that will be the tools of articulation between different national and sectoral systems. To support users who do not need the full table of learning outcomes, a summary indicator of what each level in the EQF means has been developed (see Table 2 in the annex). The reference levels of learning outcomes are intended to function as a tool to be used by Member States, national authorities, sectoral bodies and training providers to review existing qualifications and programmes and to ensure that they can be understood as learning outcome-based qualifications, thus enabling them to be referenced to a European Qualifications Framework. The 8 levels should not be interpreted as defining the precise set of outcomes for each specific level, because a particular qualification issued at national or sectoral level may very well span more than one EQF level. The reference levels therefore offer the opportunity of a ‘best fit’ match of national and sectoral qualifications to a level.
The Link between the EQF Levels and Formal Educational Systems
The development of supporting and explanatory information relating inputs and systems to the EQF will be the responsibility of each Member State. To support this process, the EQF includes a table of level-related examples and explanatory information about aspects of qualifications systems that are not directly related to learning outcomes, such as programme delivery and the progression in employment and learning that is normally associated with a level of qualification.
As to level 1, the supporting information specifies that “Learning is normally developed during compulsory education […]”, while learning at level 8 “mostly takes place in specialist higher education institutions. […] Learning at this level (8) is mostly independent of formal learning programmes and takes place through self-initiated actions guided by other high level experts.” The diversity of practices across Europe and across sectors makes it impossible to be definitive about such aspects of qualification systems. The supporting information is therefore intended to be taken as generalised and indicative when used in any specific setting.
5.5. The Relevance of ISCED and EQF as Frameworks of Reference
The two frameworks, ISCED and EQF, constitute two different approaches to setting standards for the categorisation of competences and qualifications. The ISCED framework, which can be characterised as the older and “traditional” approach, focuses on formal educational activities designed to meet learning needs. Consequently, the framework excludes various forms of learning that are not organized, and the basic unit and analytical focus of the ISCED framework is the single educational programme, especially its scope (e.g. field) and level. The ISCED standard defines seven levels of education (0-6) and 25 different fields. In contrast, the EQF represents a more modern approach linked to the context of lifelong learning. The EQF defines learning as taking place in formal as well as informal settings. The analytical unit and focus of the framework is the learning outcome, which defines the competences of an individual at 8 different levels of reference.
EQF More Relevant as a Framework of Reference for Skills Assessments than ISCED
The EQF is more relevant as a framework for skills assessment than the ISCED framework.
• First, the analytical focus of the ISCED framework is the qualities (level and field) of the educational programme, which is classified into a cell in the level-field matrix, whereas the analytical focus of the EQF is the individual. Consequently, using ISCED as a framework for expressing or measuring the skill level of an individual in an assessment is questionable. It can be argued that there is no close and universal relationship between the programmes a participant is enrolled in and actual educational achievement. The educational programmes an individual has participated in, or even successfully completed, are at best a first approximation to the skills and competences he or she has actually obtained. It is reasonable to assume that educational activities will result in an increase of skills and competences for an individual, so that the pathway of an individual through the education system can be understood as an ordered increase in educational attainment. However, the underlying educational programmes can often be ordered only to a limited extent: individuals can arrange their educational pathways in many ways. To respond to this, education systems provide multiple branching paths, alternative programme sequences, and ‘second chance’ provisions. There is also an increase in ‘horizontal’ movements through education systems, in which a participant can broaden his or her education with only a partial increase in the ‘level’ of education.
• Second, the analytical unit of the EQF is the learning outcome, which defines what the individual is expected to know, understand and/or be able to do after the learning at a given level. With a view to the required methodological steps in the DIALANG standard setting presented above, the EQF framework is more feasible for developing descriptors for test items as well as “can do” statements for self-assessments.
• Third, the EQF regards the competences of the individual as the total sum of learning the individual has acquired, including informal as well as formal learning activities. Furthermore, the EQF clearly distinguishes between the term “competences” and the term “qualifications”. A qualification is achieved when a competent body decides that an individual’s learning has reached a specified standard of knowledge, skills and wider competences. However, learning and assessment for a qualification can take place through a programme of study as well as through workplace experience. In contrast, the ISCED framework excludes informal learning and unorganised learning activities.
The EQF Represents Potential Benefits as an International Assessment Standard
What are the potential benefits that could be reaped if the EQF were fully implemented? And what are the benefits specifically in relation to adult skills assessments? There are several potential benefits. Overall, if we envisage the full implementation of the European Qualifications Framework on both the side of learning providers and the side of individual learners, new types of information with significant policy relevance will result from adult skills assessments.
General advantages
Consistent use of the EQF as a framework of reference in a particular competence domain would allow the demand side of competences (employers, occupations) and the “supply side” (learning providers) to communicate directly with each other through a common medium. For example, in relation to ICT user skills the framework could be used both within and across EU Member States as a standard to which all educational programmes and employers could relate: using the EQF levels and descriptors, each educational programme would be able to specify the minimum level of proficiency, within the field of ICT user skills, which the students should achieve during the programme. Correspondingly, employers and industry organisations would be enabled to specify minimum levels of proficiency, as regards ICT user skills, for meeting the requirements of a given occupational profile. The framework would thus represent a useful structure for the labour market which could help employers understand more clearly what an educational certificate represents. For example, a framework that helped clarify the “positioning” of the many ICT-related qualifications in terms of their type, scope and level would help employers to be
able to assess more meaningfully the attractiveness of job applications. This would be of particular relevance as regards the labour market for ICT practitioners, in which there is high international mobility. Education and training providers, on the other hand, would have a clear, agreed set of targets for knowledge, skills and competences at which course design and provision could be “aimed”, and in relation to the EQF there would be a clear, agreed set of benchmarks against which all ICT user qualifications could be measured.
Advantages specifically in relation to adult skills assessment
There are specific potential benefits in relation to adult skills assessment and the use of information gathered from international adult skills assessments:
• If adult skills assessment test scores can be presented and interpreted in terms of EQF skill levels, the informational value of the results is increased. The results are then, in fact, related to an external standard, so that the level of the assessed skill can be described in absolute terms (a small illustrative sketch of such a mapping follows this list). External standards for the interpretation of test scores can of course be developed independently of the EQF, witness for instance the levels defined in the OECD PISA studies. However, making use of the levels of a well-known existing framework would increase the immediate intelligibility of results.
• In the case of a fully implemented EQF in a specific skills domain, the competence profiles of populations and segments of populations, as described by data from adult skills assessments, can be translated directly into the need for specific educational programmes and/or other forms of training measures. For instance, if an adult skills assessment reveals that 30% of the population possesses ICT user skills at level 2, and there is a political objective of moving these 30% to level 3, the description of the educational system in terms of learning outcomes would in principle make it easy to identify the sort of educational programmes which should be expanded and mobilised towards the target group, including the level of skills at which educational activities should start and the level at which they should aim to end.
• Assuming a fully implemented EQF in a given domain of competences, the formal educational background of a given person could be translated into an EQF level of competences. This predicted competence can be compared to the competence level revealed by testing in an adult skills assessment. If a higher level is revealed in the tests than would be predicted by the formal education programme which the test taker has gone through, the difference suggests that informal learning has contributed to competences (see also the sketch following this list). The significance of informal learning for learning outcomes could in this way be described and, where relevant, compared across countries.
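The first and third of these uses can be made concrete with a short Python sketch: a test score is reported as an EQF level via cut-scores, and the tested level is compared with the level predicted from formal education, a positive gap suggesting a contribution from informal learning. The score scale, the cut-scores and the education-to-level mapping are all hypothetical and only illustrate the principle.

    import bisect

    # Hypothetical boundaries between EQF levels 1..8 on a 0-100 score scale.
    CUT_SCORES = [20, 35, 50, 65, 78, 88, 95]

    def eqf_level(score):
        """Report a test score as an EQF level (1-8) via the cut-scores."""
        return bisect.bisect_right(CUT_SCORES, score) + 1

    # Hypothetical mapping from highest completed formal education to an EQF level.
    EDUCATION_TO_EQF = {"compulsory": 1, "upper secondary": 3, "bachelor": 6}

    def informal_learning_gap(score, highest_education):
        tested = eqf_level(score)
        predicted = EDUCATION_TO_EQF[highest_education]
        return tested, predicted, tested - predicted   # > 0 suggests informal learning

    print(eqf_level(58))                                 # -> 4
    print(informal_learning_gap(58, "upper secondary"))  # -> (4, 3, 1)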
EQF is a Meta Framework and its Use for Assessment Requires Much Development
It must be emphasised that the EQF is a meta-framework, and its main strength lies in its generic character and its ability to provide a common point of reference for many frameworks and systems. The proposed EQF is envisaged as providing common reference points – in particular for levels of knowledge, skills and wider competences – against which two main types of framework can be “mapped”:
• (Member State) National Qualifications Frameworks, and
• Sector frameworks, systems and qualifications
The EQF is an organising system that enables users to see clearly how qualifications embedded in quite different national and sectoral systems relate to one another. Accordingly, the EQF cannot be understood as an accumulation of existing national and/or sector frameworks; it is neither ‘the sum’ nor a ‘representative average’ of national/sector frameworks. The EQF cannot encompass detailed descriptions of particular qualifications. This is the task of qualifications frameworks at national and/or sector level. The EQF does not carry the functions of detailed equating of specific qualifications to one another, or any of the regulatory, legal, wage bargaining and quality assurance functions that are often deemed necessary at national or sectoral level. Consequently, the use of the EQF within a given field of skills, for example ICT skills, requires the development of a specific framework within this field.
Compared to the EQF, the CEFR constitutes a specific framework within a specific field, foreign language competences, containing quite operational descriptors defining learning outcomes. With a view to all the methodological challenges of the DIALANG project in using the CEFR, it is evident that a comprehensive effort will be needed in order to be able to use the EQF as a frame of reference for skills assessment within a specific field. In principle, the eight reference levels of the EQF may function as a global scale against which an individual’s level of performance in any given field of skills can be measured and interpreted. The scale measures how advanced the individual is at using his or her knowledge and skills – and how advanced the individual’s personal and professional competences are. However, a number of significant methodological challenges must be confronted. These challenges concern “vertical” as well as “horizontal” aspects of the EQF.
“Vertically”, the application of the 8 reference levels as a standard for skills assessment will require the development and testing of descriptors that consistently define what separates one level from another. The development of test items within any given field will require a careful procedure similar to the process in the DIALANG project, where items were sorted into levels by expert judges who were very familiar with the framework of reference. Such procedures are needed to ensure inter-personal and intra-personal consistency as to what performance is expected at level 5 versus level 6, and so on. “Horizontally”, the definition and categorization of different skill types needs further specification, especially as regards the categorisation of personal and professional competences. How is “professional and vocational competence” distinguished from “skills” at each level?
Some Categories of Competences Covered by EQF are Difficult to Test
Moreover, it is unrealistic to expect that operational descriptors and – in particular – relevant tests can be developed for all the categories of competences that are in principle covered by the European Qualifications Framework. As mentioned in several connections earlier in this report, standardized and reliable tests are at present only available for a relatively narrow range of skills. For a number of social and context-dependent skills, it will be difficult to develop valid and reliable tests in the foreseeable future. The relevance of the EQF’s categories and levels with respect to “personal and professional competence” for large-scale adult skills testing purposes is therefore probably relatively limited. At present, only tests of problem solving ability come close to covering the category “professional and vocational competence”, and existing problem solving tests focus on “generic” problem solving skills rather than on problem solving abilities in relation to vocational contexts and tasks, as does the EQF. Self-reports using the job requirements approach will be better suited to provide some information on the “personal and professional competences” of the EQF, with the methodological limits and challenges entailed by this approach.
5.6. Example: Using the EQF in Assessing ICT Competences
The magnitude and implications of the challenges involved in making use of the EQF as a framework of reference in relation to adult skills assessment can be illustrated with an example: the assessment of ICT skills.
Developing Descriptors of Learning Outcomes for ICT Educational Programmes
As described above, the EQF is still in a developmental phase. The realization of its potential benefits – in relation to skills assessment as well as in other respects – requires that educational institutions and authorities within each Member State develop descriptions of the output of ICT-focused educational programmes in terms of learning outcomes. Learning outcome descriptors are statements of what a learner is expected to know, understand and/or be able to demonstrate after completion of a process of learning. For learning outcome descriptions to be relevant for adult skills assessment, they must be accompanied by appropriate assessment criteria, which can be used to judge whether the expected learning outcomes have been achieved.
Clarify the Scope and Purpose of the Specific Framework and the Skills Measured
In itself, the term “ICT skills” is much too broad to define the type of skills that are to be assessed. In order to be useful, the context and purpose of the specific framework must be clarified. As regards ICT skills, an overall distinction can be drawn between “ICT user skills” and “ICT practitioner skills”.
The term “ICT practitioner skills” denotes the capabilities required for researching, developing, designing, managing, producing, consulting, marketing and selling, integrating, installing and administrating, maintaining, supporting and servicing ICT systems (European E-Skills Forum 2004). While ICT practitioner skills concern the requirements of ICT professions, the term “ICT user skills” denotes more general ICT skills that are required in non-ICT professions or in the everyday situations of citizens. “ICT user skills” define the capabilities required for the effective application of ICT systems and devices by the individual, in different situations and for different purposes. ICT users apply systems as tools in support of their own work, which is, in most cases, not work concerning the development of ICT. In contrast, “ICT practitioner skills” define the skills involved in developing, testing and implementing ICT. Hence, ICT practitioner skills and ICT user skills constitute two very different types of skills. Arguably, it would not be meaningful to locate them within the same, single framework in which the lower levels defined the ICT user skills of citizens in general, whereas the upper levels defined the more advanced and “expert level” ICT skills of ICT practitioners. In other words, the concept of ICT skills is – at least – two-dimensional. For the EQF to realise its potential in connection with large-scale adult skills assessments of ICT skills, this has operational consequences:
• It means that the learning outcomes of educational programmes should ideally specify outcomes in terms of both ICT user skills and – where relevant – also ICT practitioner skills.
• Similarly, in developing an assessment framework, a decision must be made as to whether the focus is on ICT user skills, ICT practitioner skills or both, and test items should be developed that pertain to these different dimensions and to the different descriptors at the various levels.
Only if these two requirements are fulfilled will it be possible to relate assessment results directly to intended learning outcomes and hence to identify the required additional learning effort in the education system, and – if there is an interest in this subject – also the extent of informal learning in the field, as suggested by the data.
Develop a Specific Assessment Framework, Defining Generic Activities and Sub-Skills
As mentioned, the EQF is a meta-framework that does not encompass specific skills and competences but should rather be used as a generic framework guiding the development of a specific framework within a given field. According to the principles of the EQF, the specific framework must have a structure involving vertical (level) and horizontal (functional activity) descriptors. In parallel to the Common European Framework of Reference for Languages (CEFR), which defined a generic descriptive scheme of different skills and sub-skills such as reading, writing and speaking, generic ICT sub-skills have to be defined. This is a difficult challenge.
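The general shape of such a specific framework, with vertical levels crossed with horizontal sub-skills, can be sketched as a simple Python data structure. The sub-skills and descriptor texts below are invented for illustration; an agreed European framework of this kind does not yet exist.

    # A specific ICT user skills framework as a (level, sub-skill) -> descriptor map.
    SUB_SKILLS = ["information handling", "communication", "content creation", "safety"]

    framework = {
        (1, "information handling"): "Can carry out a simple, guided search for information.",
        (3, "information handling"): "Can select and adjust search tools to the task at hand.",
        (1, "communication"): "Can send and receive simple messages with a familiar tool.",
        # ... one descriptor per (level, sub-skill) cell, for levels 1-8
    }

    def descriptor(level, sub_skill):
        return framework.get((level, sub_skill), "descriptor still to be developed")

    for sub_skill in SUB_SKILLS:
        print(f"Level 3, {sub_skill}: {descriptor(3, sub_skill)}")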
As regards ICT practitioner skills, a recent report (CEN 2006) emphasizes that horizontal descriptors of ICT practitioner occupational competence profiles are essential for the establishment of a European e-Competence Reference Framework, as only they will provide the specific context required to describe and assess competences. The “development lifecycle” of an ICT application – a software system – could be an example of an attempt to define such generic activities/sub-skills:
1. Clarification of Business Need and relevant ICT capability
2. Specification of System Requirements
3. Production of System Specification
4. Design of System (Architecture)
5. Development of System in accordance with Design (+ possible refinements)
6. Installation/Implementation/Integration/Commissioning of Developed System
7. Operational Management and Maintenance of System
8. Review of System performance and evolving business need
However, no agreement exists on such a single skills/competence framework at the European level. A main explanatory factor is the continuing change in the occupational structure of ICT work, which poses continuing challenges for those attempting to find adequate occupational classifications that span the economy as a whole, in particular at the international level through the International Standard Classification of Occupations (ISCO).
Defining Competence Levels Matching Educational Levels
Using the example of ICT user skills, Table 5.1 illustrates how, in principle, it would be possible to develop learning outcome descriptors in relation to different types and levels of educational programmes that would be directly relevant for adult skills assessments of ICT user skills. The example is fictitious and based on the supporting information of the EQF. As outlined above, EQF descriptors for ICT user skills or ICT practitioner skills will have to be developed for each of the 8 levels. Horizontally, ICT user skills will be further specified into sub-skills. To make the skills assessment levels correspond with the learning outcomes of the educational system, the educational institutions in each Member State must define the minimum levels of proficiency, as regards ICT user skills, which their pupils or students are expected to achieve. Based on the supporting information and guidelines of the EQF, it is the responsibility of each Member State to develop supporting and explanatory information relating inputs and systems to the framework.
Possible Disadvantages of Immature Generic Frameworks
Although agreement on a common European competence framework for ICT practitioner skills or ICT user skills could bring certain benefits, the CEN report (CEN 2006) highlights that there are also accompanying risks in attempting to establish such a framework before the desired stability and consensus is achieved.
These include both a likely loss of support from those committed to existing national frameworks and the potential “fossilising” (and growing loss of relevance) of activity based on such a framework, which will be prone to decay. It is recognized that the widening and renewing of “buy-in” from employers and other stakeholders is essential.

Table 5.1. The Potential Relation between Educational Levels and Settings and Learning Outcome Descriptors (ICT User Skills)

EQF level 1
Educational levels / settings: Compulsory education. Schools, adult education centres, colleges, training centres. Learning contexts are simple and stable.
Learning outcome descriptor (ICT user skills, overall proficiency): Can use ICT to carry out simple tasks.

EQF level 2
Educational levels / settings: Compulsory education. Schools, adult education centres, colleges, training centres. Learning contexts are stable and the focus is the broadening of basic skills.
Learning outcome descriptor: Can use ICT to carry out tasks where action is governed by rules defining routines and strategies.

EQF level 3
Educational levels / settings: Upper secondary education or adult education (including popular adult education and labour market training).
Learning outcome descriptor: Can use a range of specific ICT skills/technologies to carry out tasks and show personal interpretation through selection and adjustment of methods, tools and materials.

EQF level 4
Educational levels / settings: Completion of upper secondary education and some formal learning in post-compulsory education, adult education including labour market training and popular adult education.
Learning outcome descriptor: Can develop strategic approaches to tasks using ICT. Can apply specialist knowledge and expert sources of information.

EQF level 5
Educational levels / settings: Post-secondary learning programme, such as an apprenticeship, together with post-programme experience in a related field.
Learning outcome descriptor: Can use ICT to develop strategic and creative responses in researching solutions to well-defined concrete as well as abstract problems.

EQF level 6
Educational levels / settings: Higher education institutions.
Learning outcome descriptor: Demonstrates mastery of methods and tools in a complex and specialised field and demonstrates innovation in terms of methods using ICT.

EQF level 7
Educational levels / settings: Specialist higher education institutions. Learning is usually associated with independent working with other people at the same level or higher.
Learning outcome descriptor: Can use ICT to create a research-based diagnosis of problems. Can use ICT to integrate knowledge from new or interdisciplinary fields and make judgements with incomplete or limited information.

EQF level 8
Educational levels / settings: Specialist higher education institutions. Learning is mostly independent of formal learning programmes and takes place through self-initiated actions.
Learning outcome descriptor: Can use ICT to research, conceive, design, implement and adapt projects that lead to new knowledge and new procedural solutions.
Literature Alderson, J. C., 2005. Diagnosing Foreign Language Proficiency. The Interface between Learning and Assessment. London: Continuum. Allen, J. and R. v. d. Velden, 2005. “The Role of Self-Assessment in Measuring Skills”. Paper for the Transition in Youth Workshop 8-10 September. Ashton, D., B. Davies, A. Felstead and F. Green, 1999. Work Skills In Britain. Oxford, SKOPE, Oxford and Warwick Universities. Bainbridge, S, Julie Murray, Tim Harrison and Terry Ward, 2004. Learning for employment. Second report on vocational education and training in Europe, CEDEFOP Reference Series; 51. Baker, Frank B., 2001 The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation. Birnbaum, A., 1968. “Some latent trait models and their use in inferring an examinee’s ability”. In F. M. Lord & M. R. Novick (eds.), Statistical theories of mental test scores. Reading: Addison-Wesley. Belbin, Meredith, 2003. Management Teams. Why They Succeed or Fail, London: Butterworth Heinemann. Benton, L. and Noyelle, T. (eds.), 1992. Adult illiteracy and economic performance. Paris: Organisation for Economic Co-operation and Development, Centre for Educational Research and Innovation. Binet, Alfred, 1903. Etude expérimentale de l'intelligence. Paris. Bjørnåvold, J., 2000. Making learning visible. CEDEFOP. Bull, J. and McKenna, C., 2003. Blueprint for Computer-Assisted Assessment. London: Routledge. Bynner J. and S. Parsons, 1997. It Doesn’t Get Any Better. The Impact of Poor Basic Skills on the Lives of 37 Year Olds. London: The Basic Skills Agency. Bynner, J. and S. Parsons, 2005. New Light on Literacy and Numeracy. NRDC, University of London, September. Cantril, Hadley, 1965. The Pattern of Human Concerns. New Brunswick, NJ: Rutgers University Press.
CEN, 2006. “European ICT Skills Meta-Framework - State-of-the-Art review, clarification of the realities, and recommendations for next steps”, European Committee for Standardisation. Cohen, Y. Ben-Simon, A. and Hovav, M., 2003. ”The effect of specific language features on the complexity of systems for automated essay scoring”. Paper presented at the Annual Conference of the International Association for Educational Assessment. Coles M. and T. Oates, 2005. European reference levels for education and training : promoting credit transfer and mutual trust. Thessaloniki: CEDEFOP Commission, 2002. “European Report on Quality Indicators of Lifelong Learning”. Report based on the work of the Working Group on Quality Indicators, Brussels, June. Commission, 2003a. “Education and Training 2010. The Success of the Lisbon Strategy Hinges on Urgent Reforms”, SEC(2003) 1250. Commission, 2005a. “Progress towards the Lisbon Objectives in Education and Training. Commission Staff Working Paper”, Brussels, March. SEC (2005) 419. Commission, 2005b. “Towards a European Qualifications Framework for Lifelong Learning”, Staff Working Document, SEC(2005) 957, Brussels. Commission, 2006. “Conclusions from the meeting of national experts of the Member States of the European Union and Acceding Countries on EU data needs on adult skills”. Brussels. Clausen-May, Tandy, 2005. “Developing Digital – Opportunities and dangers in the development of electronic test questions”, National Foundation for Educational Research, September. Cronbach, L. J. & Gleser, G. C., 1965. Psychological test and personnel decisions (2nd Ed). Urbana: University of Illinois Press. DfES, Department for Education and Skills, 2003. The Skills for Life survey. A national needs and impact survey of literacy, numeracy and ICT skills. DfES Research Report 490. DTI et al. (Danish Technological Institute, RAND Europe and SKOPE Oxford), 2004. Defining a Strategy for the Direct Assessment of Adult Skills. Aarhus. DTI, Danish Technological Institute 2005. Explaining Student Performance. Evidence from the international PISA, TIMSS and PIRLS surveys. Aarhus. Dykema, J. and N.C. Schaeffer, 2000. Development of a Personal Event Schema. CDE Working Paper 99-27, Madison: Center for Demography and Ecology, University of Wisconsin-Madison.
Educational Testing Service, 1994. Computer-based tests: Can they be fair to everyone? Princeton, NJ: Lawrence Erlbaum Associates. Eggen, T. J. H., 2004. Contributions to the Theory and Practice of Computerized Adaptive Testing. Doctoral Thesis, University of Twente. European Council, 2000. Presidency Conclusion, Lisbon. European Council, 2002. Presidency Conclusions, Barcelona. European Committee for Standardization, 2006. “European ICT Skills Meta Framework - State-of-the-Art review, clarification of the realities, and recommendations for next steps”, Brussels European E-Skills Forum, 2004. “e-Skills For Europe: Towards 2010 and beyond”, Synthesis Report, Brussels. Eurostat, 2005. “Task Force Report on Adult Education Survey”, Luxembourg. Felstead, A., D. Gallie and F. Green, 2002. Work Skills In Britain 1986-2001. Nottingham, DfES Publications. Fischer, D. L., 1981. ”Functional literacy tests: A model of question-answering and an analysis of errors”, Reading Research Quarterly, 16, pp. 418-448. Føllesdal, D. and L. Walløe, 2000. Argumentionsasjonsteori, språk og vitenskabsfilosofi. Oslo: Universitetsforlaget. Goodstein, L. D. and Lanyon, R. I., 1975. Adjustment, Behavior and Personality. Addison Wesley. Guthrie, J. T., 1988. Locating information in documents: A computer simulation and cognitive model”. Reading Research Quarterly, 23, pp. 178-199. Hofstetter, R., Sticht, T., and Hofstetter, C., 1999. Knowledge, literacy, and power. Communication Research, 26, 58-80. Huta, Ari, et al., 2002. “DIALANG – a diagnostic language assessment system for adult learners” p. 130-144 in “Common European Framework of Reference for Languages: Learning, Teaching, Assessment –Case studies”, Council of Europe Publishing 2002. Kim et al., 2004. “Trends and Differentials in School Transition in Japan and Korea”, paper prepared for presentation at the conference “Inequality and Stratification: Broadening the Comparative Scope” of Research Committee 28 (Social Stratification and Mobility) of the International Sociological Association, August 7-9, Rio de Janeiro, Barzil, 2004.
Kirsch, I. S. and Guthrie, J. T., 1984. „Adult reading practices for work and leisure”, Adult Education Quarterly, 34 (4), pp. 213-232. Kirsch, I., Jungeblut, A., Jenkins, L., and Kolstad, A., 1993. Adult literacy in America: A first look at the results of the National Adult Literacy Survey. Washington, DC: U. S. Government Printing Office. Kirsch, I. S., Jungeblut, A. and Mosental, P. B., 1998. ”The measurement of adult literacy”, in T. S. Murray, I. S. Kirsch and L. Jenkins (eds.), Adult literacy in OECD countries: Technical Report on the first international adult literacy survey. Washington D.C.: US Department of Education, National Center for Education Statistics. Kirsch, Irwin, 2005. “Prose Literacy, Document Literacy and Quantitative Literacy: Understanding What Was Measured in IALS and ALL” Kolstad, A., 1996. The response probability convention embedded in reporting prose literacy levels from the 1992 National Adult Literacy Survey. Washington, DC: National Center for Education Statistics. Kolstad, A, Cohen, J., Baldi, S., Chan, T., DeFur, E., & Angeles, 1998. The response probability convention used in reporting data from IRT assessment scales: Should NCES adopt a standard?. Washington, DC: National Center for Education Statistics. Kourilsky M.L. and Esfandiari M., 1997. “Entrepreneurship Education and Lower Socioeconomic Black Youth: An Empirical Investigation”. The Urban Review, Volume 29, Number 3, September, pp. 205-215. Krosnick, J. A. and D. F. Alwin, 1987. “An Evaluation of a Cognitive Theory of Response Order Effects in Survey Measurement”, Public Opinion Quarterly, vol. 51, no. 2. Krosnick, J. A., 1991. “Response Strategies for Coping with the Cognitive Demands of Attitude Measures in Surveys”, Applied Cognitive Psychology, vol. 5, no. 2. Kubiszyn, T. and G. Borich 2005. Educational Testing and Measurement: Classroom Application and Practice. John Wiley and Sons. Larkin, K. C. and Weiss, D. J., 1975. An empirical comparison of two-stage and pyramidal adaptive ability testing (Research Report, 751). Minneapolis: Psychometrics Methods Program, Department of Psychology, University of Minnesota. Larson, J. W., and H. S. Madsen, 1985. "Computerized Adaptive Language Testing: Moving Beyond Computer Assisted Testing." CALICO Journal 2, 32-35. Larson, J.W., 1989. “S-CAPE: A Spanish Computerized Adaptive Placement Exam." Modern Technology in Foreign Language Education: Application and Projects, edited by F. Smith. Lincolnwood, IL: National Textbook.
Lennon, M., Kirsch, I. Von Davier, M., Wagner, M. and Yamamoto, K., 2003. Feasibility Study for the PISA ICT Literacy Assessment, ACER, NIER, and ETS. Lisbon-to-Maastricht Consortium, 2004. Achieving the Lisbon Goal: The Contribution of VET, London. Lord, F. M., 1970. Some test theory for tailored testing. In W. H. Holtzman (ed.), Computer assisted instruction, testing, and guidance, pp. 139-183. New York: Harper and Row. Lord, F. M., 1971a. Robbins-Monro procedures for tailored testing. Educational and Psychological Measurement, 31, 231. Lord, F. M., 1971b. The self-scoring flexilevel test. Journal of Educational Measurement, 8, 147151. Madsen, H. S., 1991. “Computer Adaptive Testing of Listening and Reading Comprehension: The Brigham Young University Approach.”, Computer Assisted Language Learning and Testing: Research Issues and Practice, edited by P. Dunkel. New York, NY: Newbury House. Marsh, C., 1982. The Survey Method: The Contribution of Surveys to Sociological Explanation. London. Means, B., Penuel, B. and Quellmalz, E., 2000. Developing Assessments for Tomorrow's Classrooms, in The Secretary's Conference on Educational Technology - Measuring Impacts and Shaping the Future. SRI International. Messick, S.M., 1995. “Standards of validity and the validity of standards in performance assessment” in Educational Measurement: Issues and Practice 14/4. Meunier, Lydie E., 2002. “Computer Adaptive Language Tests Offer A Great Potential for Functional Testing. Yet Why Don’t They?” Calico Journal 11, 4, 23-39. Mills, C. N and M. Steffen, 2000. “The GRE Computer Adaptive Test: Operational Issues”, in Van den Linden and Glass 2000, pp. 75-100. Mosental, P. B. and Kirsch, I. S., 1991. “Towards an explanatory model of document process”, Discourse Processes, 14, pp. 147-180. Murray, T.S., 2003. Reflections on International Competence Assessments, In: D.S. Rychen & L.H. Salganic (eds.) Key Competencies for a Successful Life and a Wellfunctioning Society, Göttingen: Hogrefe & Huber, pp. 135-159. Murray, T. S.; Y. Clearmont and M. Brinkley, (eds.), 2005. International Adult Literacy Survey. Measuring Adult Literacy and Life Skills: New Frameworks for Assessment, Statistics Canada.
Nunes, C. A. A., Nunes M. M. R. and Davis, C., 2003. "Assessing the Inaccessible: Metacognition and Attitudes", Assessment and Education, Volume 10, No. 3, pp. 375388 OECD, 1995. Literacy, economy, and society. Paris: Organization for Economic Cooperation and Development, Centre for Educational Research and Innovation. OECD, 2005. “International Assessment of Adult Skills: Proposed Strategy”, Paris: September, COM/DELSA/EDU(2005)4. OECD and Statistics Canada, 2005a. Learning a Living. First results of the adult literacy and life skills survey. Paris. OECD, 2005b. Learning for Tomorrow’s World. First Results from PISA 2003. Paris. OECD and Statistics Canada, 2000. Literacy in the Information Age: Final Report on the International Adult Literacy Survey database. Paris. Olsen, H., 1998. Tallenes talende tavshed. Måleproblemer i survey undersøgelser. København: Akademisk Forlag. Owen, R. J., 1969. A Bayesian approach to tailored testing (Research Report 6992). Princeton, NJ: Educational Testing Service. Parsons, S. and J. Bynner with V. Foudouli, 2005. Measuring basic skills for longitudinal study. London: National Research and Development Centre for Adult Literacy and Numeracy. Payne, J., 1999. “All things to all people: Changing perceptions of ‘skill’ among Britain’s policy makers since the 1950s and their implications”. SKOPE Research Paper No. 1. ESRC funded Centre on Skills, Knowledge and Organisational Performance, Oxford and Warwick Universities. Partchev, Ivailo, 2004. A visual guide to item response theory. Friedrich-SchillerUniversität Jena. Peterson, N., Mumford, M., Borman, W., Jeanneret, P., & Fleishman, E., 1995. Development of prototype Occupational Information Network (O*NET) content model (Vols. I & II). Salt Lake City, UT: Utah Department of Workforce Services. QCA – Qualifications and Curriculum Authority, 2005. Report on the 2005 KS3 ICT test pilot, London. Rasch, G., 1960. Probalistic Models for Some Intelligence Attainment Tests. Chicago: University of Chicago Press. Ravitch, Diane, 2002 “Testing and Accountability, Historically Considered” in W. M. Evers, and H. J. Walberg, School Accountability, Hoover Press.
102
Reder, S., 1998. Dimensionality and construct validity of the NALS assessment. In M. C. Smith (ed.) Literacy for the twenty-first century. Westport, CT: Praeger. Richardson, M., Baird, J., Ridgway, J. Shorrocks-Taylor, D., and Swan, M., 2002. “Challenging minds? Students’ perceptions of computer-based World Class Tests of problem solving”. Computers and Human Behaviour, vol. 18 (6), pp. 633-649. Richter, L, and P.B. Johnson, 2001. “Current Methods of Assessing Substance Use: A Review of Strengths, Problems, and Developments”. Journal of Drug Issues, 31, 4, 809-832. Ridgeway, J. McCusker, S. & Pead, D., 2004. “Literature Review of E-assessment”. NESTA Futurelab Series, Report no. 10. Rock, D., 1998. Validity generalization of the assessment across countries. In: Murray, T., Kirsch, I. & Jenkins, L. Adult literacy in OECD countries: Technical report on the first International Adult Literacy Survey. Washington, DC: National Center for Education Statistics. Salganik, Laura Hersh and Maria Stephens, 2003. “Competence Priorities in Policy and Practice”, in Rychen, D. S. and L. H. Salganik (eds.), Key Competencies for a Successful Life and a Well-Functioning Society, pp. 13-40. Sands, W. A., Waters, B. K., and McBride, J. R. (eds.), 1997. Computerized adaptive testing: From inquiry to operation. Washington DC: American Psychological Association. Stasz, C., 1997. “Do Employers Need the Skills they Want?: Evidence from technical work”. Journal of Education and Work, 10 (3), 205-223. Stasz, C., 2001. “Assessing skills for work: Two perspectives”. Oxford Economic Papers, 3, 385-405. Stevens, S. S., 1946 “On the theory of scales of measurement”, Science 103. Sticht, T. G. and Armstrong, W. B., 1994. Adult Literacy in the United States: A compendium of quantitative data and interpretive comments. Washington, DC: National Institute for Literacy. Sticht, T. G., Hofstetter, C. R. and Hofstetter, C. H., 1996. Assessing adult literacy by telephone. Journal of Literacy Research, 28, 525-559. Sticht, Thomas G., 2005. “The New International Adult Literacy Survey (IALS): Does it meet the Challenges of Validity to the Old IALS?” Modified version of: Sticht, T. G. 2001. “The International Adult Literacy Survey: How well does it represent the literacy abilities of adults?”, The Canadian Journal for the Study of Adult Education, 15, 19-36.
103
Sukkarieh, J. Z., Pulman, S. G., and Raikes, N., 2003. “Auto-marking: Using computational linguistics to score short, free text responses”. Paper presented to the 29th Annual Conference on the International Association for Educational Assessment. Trier, U. P., 2003. “Twelve Countries Contributing to DeSeCo: A Summary Report. In D.S: Rychen, L.H. Salganik and M.E. McLaughlin (eds.), Selected Contributions to the 2nd DeSeCo Symposium. Neuchatel, Switzerland: Swiss Federal Statistical Office. Undervisningsministeriet, 2002. Projektintroduktion, Det Nationale Kompetenceregnskab. Juni. København. Undervisningsministeriet, 2005. Det Nationale Kompetenceregnskab, hovedrapport. December. København. United Nations, 1997. “International Standard Classification of Education ISCED 1997” Van der Linden and Wim J., 2000 Computerized Adaptive Testing: Theory and Practice. Hingham, MA, USA: Kluwer Academic Publishers. VCAA (Victorian Curriculum and Assessment Authority), 2004. P-10 Supplement, No. 11, Issue 3, March. Venezky, R. L., Wagner, D. A., & Ciliberti, B. S., (eds.), 1990. Toward defining literacy. Newark, NJ: International Reading Association. Venezky, R. L., 1992. Matching literacy testing with social policy: What are the alternatives? Philadelphia, PA: National Center on Adult Literacy. Victorin K., M. Haag Grönlund and S. Skerfving, 1998. Methods for Health Risk Assessment of Chemicals - Are they Relevant for Alcohol? Alcoholism Clinical and Experimental Research, 22 (7) suppl, 270S-276S. Way, W. D., Steffen, M., and Anderson, G. S., 1998. “Developing, maintaining, and renewing the item inventory to support computer-based testing”. Paper presented at the colloquium Computer-Based Testing: Building the Foundation for Future Assessments, Philadelphia, PA. Warburton, B. and G Conole, 2003. “Key findings from recent literature on computeraided assessment”, ALT-C 2003, University of Southampton. Ward, M., L. Gruppen, and G. Regehr, 2002. “Measuring Self-assessment: Current State of the Art”. Advances in Health Sciences Education, 7, 63-80. Weiss, D. J., 1973. The stratified adaptive computerized ability test (Research Report 733). Minneapolis: University of Minnesota, Department of Psychology.
104
Working Group, 2003. “Basic Skills, Entrepreneurship and Foreign Languages�, Progress Report, November. Brussels, European Commission.
105
Appendix 1: Example Questionnaire for Self-Reported Job-Related Skills (O*NET)

Instructions for Making Skills Ratings

These questions are about work-related skills. A skill is the ability to perform a task well. It is usually developed over time through training or experience. A skill can be used to do work in many jobs, or it can be used in learning. You will be asked about a series of different skills and how they relate to your current job, that is, the job you hold now.

Each skill in this questionnaire is named and defined. For example:

Writing
Communicating effectively in writing as appropriate for the needs of the audience.

You are then asked two questions about each skill:

A. How important is the skill to the performance of your current job?
For example: How important is WRITING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
Mark your answer by putting an X through the number that represents your answer. Do not mark on the line between the numbers.
* If you rate the skill as Not Important to the performance of your job, mark the number one [1], then skip over question B and proceed to the next skill.

B. What level of the skill is needed to perform your current job?
To help you understand what we mean by level, we provide you with examples of job-related activities at different levels. For example: What level of WRITING skill is needed to perform your current job?
   Scale: 1 (lowest) to 7 (Highest Level), anchored by example activities such as:
   - Lower level: Take a telephone message
   - Mid level: Write a memo to staff outlining new directives
   - Higher level: Write a novel for publication
Mark your answer by putting an X through the number that represents your answer. Do not mark on the line between the numbers.
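The two-part item format above (an importance rating on a 1-5 scale, a skip rule for "Not Important", and a level rating on a 1-7 scale) can also be captured electronically. The following is a minimal illustrative sketch in Python; it is not part of the O*NET instrument, and the names used (SkillResponse, record_response) are hypothetical.

# Illustrative sketch only: one two-part O*NET-style skill item
# (importance 1-5, level 1-7, skip logic). All names are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkillResponse:
    skill_name: str
    importance: int            # 1 = Not Important ... 5 = Extremely Important
    level: Optional[int]       # 1-7, or None when the level question was skipped


def record_response(skill_name: str, importance: int, level: Optional[int]) -> SkillResponse:
    """Validate a two-part rating as described in the instructions above."""
    if importance not in range(1, 6):
        raise ValueError("Importance must be rated on the 1-5 scale.")
    if importance == 1:
        # 'Not Important': question B is skipped, so no level rating is stored.
        return SkillResponse(skill_name, importance, None)
    if level not in range(1, 8):
        raise ValueError("Level must be rated on the 1-7 scale when importance > 1.")
    return SkillResponse(skill_name, importance, level)


# Example: a respondent rates WRITING as Important (3) at level 4.
print(record_response("Writing", importance=3, level=4))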
1. Reading Comprehension
Understanding written sentences and paragraphs in work-related documents.

A. How important is READING COMPREHENSION to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of READING COMPREHENSION is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Read step-by-step instructions for completing a form
   - Read a memo from management describing new personnel policies
   - Read a scientific journal article describing surgical procedures

2. Active Listening
Giving full attention to what other people are saying, taking time to understand the points being made, asking questions as appropriate, and not interrupting at inappropriate times.

A. How important is ACTIVE LISTENING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of ACTIVE LISTENING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Take a customer's order
   - Answer inquiries regarding credit references
   - Preside as judge in a complex legal disagreement
3. Writing
Communicating effectively in writing as appropriate for the needs of the audience.

A. How important is WRITING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of WRITING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Take a telephone message
   - Write a memo to staff outlining new directives
   - Write a novel for publication

4. Speaking
Talking to others to convey information effectively.

A. How important is SPEAKING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of SPEAKING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Greet tourists and explain tourist attractions
   - Interview applicants to obtain personal and work history
   - Argue a legal case before the Supreme Court
5. Mathematics
Using mathematics to solve problems.

A. How important is MATHEMATICS to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of MATHEMATICS is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Count the amount of change to be given to a customer
   - Calculate the square footage of a new home under construction
   - Develop a mathematical model to simulate and resolve an engineering problem

6. Science
Using scientific rules and methods to solve problems.

A. How important is SCIENCE to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of SCIENCE is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Conduct product tests to ensure safety standards are met, following written instructions
   - Conduct standard tests to determine soil quality
   - Conduct analyses of aerodynamic systems to determine the practicality of an aircraft design
7. Critical Thinking
Using logic and reasoning to identify the strengths and weaknesses of alternative solutions, conclusions, or approaches to problems.

A. How important is CRITICAL THINKING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of CRITICAL THINKING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Determine whether a subordinate has a good excuse for being late
   - Evaluate customer complaints and determine appropriate responses
   - Write a legal brief challenging a federal law

8. Active Learning
Understanding the implications of new information for both current and future problem-solving and decision-making.

A. How important is ACTIVE LEARNING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of ACTIVE LEARNING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Think about the implications of a newspaper article for job opportunities
   - Determine the impact of new menu changes on a restaurant's purchasing requirements
   - Identify the implications of a new scientific theory for product design
9. Learning Strategies
Selecting and using training/instructional methods and procedures appropriate for the situation when learning or teaching new things.

A. How important are LEARNING STRATEGIES to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of LEARNING STRATEGIES is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Learn a different method of completing a task from a coworker
   - Identify an alternative approach that might help trainees who are having difficulties
   - Apply principles of educational psychology to develop new teaching methods

10. Monitoring
Monitoring/assessing performance of yourself, other individuals, or organizations to make improvements or take corrective action.

A. How important is MONITORING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of MONITORING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Proofread and correct a letter
   - Monitor a meeting's progress and revise the agenda to ensure that important topics are discussed
   - Review corporate productivity and develop a plan to increase productivity
11. Social Perceptiveness
Being aware of others' reactions and understanding why they react as they do.

A. How important is SOCIAL PERCEPTIVENESS to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of SOCIAL PERCEPTIVENESS is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Notice that customers are angry because they have been waiting too long
   - Be aware of how a coworker's promotion will affect a work group
   - Counsel depressive patients during a crisis period

12. Coordination
Adjusting actions in relation to others' actions.

A. How important is COORDINATION to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of COORDINATION is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Schedule appointments for a medical clinic
   - Work with others to put a new roof on a house
   - Work as director of a consulting project calling for interaction with multiple subcontractors
13. Persuasion
Persuading others to change their minds or behavior.

A. How important is PERSUASION to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of PERSUASION is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Solicit donations for a charity
   - Convince a supervisor to purchase a new copy machine
   - Change the opinion of the jury in a complex legal case

14. Negotiation
Bringing others together and trying to reconcile differences.

A. How important is NEGOTIATION to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of NEGOTIATION is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Present justification to a manager for altering a work schedule
   - Contract with a wholesaler to sell items at a given cost
   - Work as an ambassador in negotiating a new treaty
15. Instructing
Teaching others how to do something.

A. How important is INSTRUCTING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of INSTRUCTING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Instruct a new employee in the use of a time clock
   - Instruct a coworker in how to operate a software program
   - Demonstrate surgical procedure to interns in a teaching hospital

16. Service Orientation
Actively looking for ways to help people.

A. How important is SERVICE ORIENTATION to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of SERVICE ORIENTATION is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Ask customers if they would like cups of coffee
   - Make flight reservations for customers, using airline reservation system
   - Direct relief agency operations in a disaster area
17. Complex Problem Solving
Identifying complex problems and reviewing related information to develop and evaluate options and implement solutions.

A. How important is COMPLEX PROBLEM SOLVING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of COMPLEX PROBLEM SOLVING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Lay out tools to complete a job
   - Redesign a floor layout to take advantage of new manufacturing techniques
   - Develop and implement a plan to provide emergency relief for a major metropolitan area

18. Operations Analysis
Analyzing needs and product requirements to create a design.

A. How important is OPERATIONS ANALYSIS to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of OPERATIONS ANALYSIS is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Select a photocopy machine for an office
   - Suggest changes in software to make a system more user friendly
   - Identify the control system needed for a new process production plant
19. Technology Design
Generating or adapting equipment and technology to serve user needs.

A. How important is TECHNOLOGY DESIGN to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of TECHNOLOGY DESIGN is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Adjust exercise equipment for use by a customer
   - Redesign the handle on a hand tool for easier gripping
   - Create new technology for producing industrial diamonds

20. Equipment Selection
Determining the kind of tools and equipment needed to do a job.

A. How important is EQUIPMENT SELECTION to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of EQUIPMENT SELECTION is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Select a screwdriver to use in adjusting a vehicle's carburetor
   - Choose a software application to use to complete a work assignment
   - Identify the equipment needed to produce a new product line
21. Installation
Installing equipment, machines, wiring, or programs to meet specifications.

A. How important is INSTALLATION to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of INSTALLATION is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Install a new air filter in an air conditioner
   - Install new switches for a telephone exchange
   - Install a "one of a kind" process production molding machine

22. Programming
Writing computer programs for various purposes.

A. How important is PROGRAMMING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of PROGRAMMING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Write a program in BASIC to sort objects in a database
   - Write a statistical analysis program to analyze demographic data
   - Write expert system programs to analyze ground radar geological data for probable existence of mineral deposits
23. Quality Control Analysis
Conducting tests and inspections of products, services, or processes to evaluate quality or performance.

A. How important is QUALITY CONTROL ANALYSIS to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of QUALITY CONTROL ANALYSIS is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Inspect a draft memorandum for clerical errors
   - Measure new part requisitions for tolerance to specifications
   - Develop procedures to test a prototype of a new computer system

24. Operations Monitoring
Watching gauges, dials, or other indicators to make sure a machine is working properly.

A. How important is OPERATIONS MONITORING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of OPERATIONS MONITORING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Monitor completion times while running a computer program
   - Monitor machine functions on an automated production line
   - Monitor and integrate control feedback in a petrochemical processing facility to maintain production flow
25. Operation and Control
Controlling operations of equipment or systems.

A. How important is OPERATION AND CONTROL to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of OPERATION AND CONTROL is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Adjust the settings on a copy machine to make reduced size photocopies
   - Adjust the speed of assembly line equipment based on the type of product being assembled
   - Control aircraft approach and landing at a large airport during a busy period

26. Equipment Maintenance
Performing routine maintenance on equipment and determining when and what kind of maintenance is needed.

A. How important is EQUIPMENT MAINTENANCE to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of EQUIPMENT MAINTENANCE is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Add oil to an engine as indicated by a gauge or warning light
   - Clean moving parts in production machinery
   - Conduct maintenance checks on an experimental aircraft
27. Troubleshooting
Determining causes of operating errors and deciding what to do about it.

A. How important is TROUBLESHOOTING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of TROUBLESHOOTING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Identify the source of a leak by looking under a machine
   - Identify the circuit causing an electrical system to fail
   - Direct the debugging of control code for a new operating system

28. Repairing
Repairing machines or systems using the needed tools.

A. How important is REPAIRING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of REPAIRING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Tighten a screw to get a door to close properly
   - Replace a faulty hydraulic valve
   - Repair structural damage after an earthquake
29. Systems Analysis
Determining how a system should work and how changes in conditions, operations, and the environment will affect outcomes.

A. How important is SYSTEMS ANALYSIS to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of SYSTEMS ANALYSIS is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Determine how loss of a team member will affect the completion of a job
   - Determine how the introduction of a new piece of equipment will affect production rates
   - Identify how changes in tax laws are likely to affect preferred sites for manufacturing operations in different industries

30. Systems Evaluation
Identifying measures or indicators of system performance and the actions needed to improve or correct performance, relative to the goals of the system.

A. How important is SYSTEMS EVALUATION to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of SYSTEMS EVALUATION is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Determine why a coworker has been overly optimistic about how long it would take to complete a task
   - Identify the major reasons why a client might be unhappy with a product
   - Evaluate the long-term performance problem of a new computer system
31. Judgment and Decision Making
Considering the relative costs and benefits of potential actions to choose the most appropriate one.

A. How important is JUDGMENT AND DECISION MAKING to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of JUDGMENT AND DECISION MAKING is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Decide how scheduling a break will affect work flow
   - Evaluate a loan application for degree of risk
   - Decide whether a manufacturing company should invest in new robotics technology

32. Time Management
Managing one's own time and the time of others.

A. How important is TIME MANAGEMENT to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of TIME MANAGEMENT is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Keep a monthly calendar of appointments
   - Allocate the time of subordinates to projects for the coming week
   - Allocate the time of scientists to multiple research projects
33. Management of Financial Resources
Determining how money will be spent to get the work done, and accounting for these expenditures.

A. How important is MANAGEMENT OF FINANCIAL RESOURCES to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of MANAGEMENT OF FINANCIAL RESOURCES is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Take money from petty cash to buy office supplies and record the amount of the expenditure
   - Prepare and manage a budget for a short-term project
   - Develop and approve yearly budgets for a large corporation and obtain financing as necessary

34. Management of Material Resources
Obtaining and seeing to the appropriate use of equipment, facilities, and materials needed to do certain work.

A. How important is MANAGEMENT OF MATERIAL RESOURCES to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of MANAGEMENT OF MATERIAL RESOURCES is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Rent a meeting room for a management meeting
   - Evaluate an annual uniform service contract for delivery drivers
   - Determine the computer system needs of a large corporation and monitor use of the equipment
35. Management of Personnel Resources
Motivating, developing, and directing people as they work, identifying the best people for the job.

A. How important is MANAGEMENT OF PERSONNEL RESOURCES to the performance of your current job?
   Not Important* (1)   Somewhat Important (2)   Important (3)   Very Important (4)   Extremely Important (5)
   * If you marked Not Important, skip LEVEL below and go on to the next skill.

B. What level of MANAGEMENT OF PERSONNEL RESOURCES is needed to perform your current job? (Scale 1-7, where 7 = Highest Level.)
   Example activities, from lower to higher level:
   - Encourage a coworker who is having difficulty finishing a piece of work
   - Direct the activities of a road repair crew with minimal disruption of traffic flow
   - Plan, implement, and manage recruitment, training, and incentive programs for a high performance company
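Once all 35 skills have been rated, the two ratings per skill can be combined into a simple self-reported skills profile, for instance by retaining the level rating for every skill judged at least "Important". The sketch below illustrates one possible aggregation; it is not prescribed by O*NET, and the threshold and names used are hypothetical.

# Illustrative sketch only: summarise completed two-part ratings
# (skill name, importance 1-5, level 1-7 or None) into a self-reported profile.
from typing import Dict, Iterable, Optional, Tuple

Rating = Tuple[str, int, Optional[int]]  # (skill name, importance 1-5, level 1-7 or None)


def skills_profile(ratings: Iterable[Rating], min_importance: int = 3) -> Dict[str, Optional[int]]:
    """Keep the self-reported level of every skill rated at or above the
    chosen importance threshold (3 = 'Important' on the scale used above)."""
    return {skill: level for skill, importance, level in ratings
            if importance >= min_importance}


# Example with three of the 35 skills:
example = [("Reading Comprehension", 4, 5), ("Programming", 1, None), ("Negotiation", 3, 3)]
print(skills_profile(example))   # {'Reading Comprehension': 5, 'Negotiation': 3}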
Appendix 2: The Eight Reference Levels of the European Qualification Framework

Each reference level is described along three dimensions: Knowledge; Skills; and Personal and professional competence, the latter divided into (i) autonomy and responsibility, (ii) learning competence, and (iii) professional and vocational competence.

Level 1
Knowledge: Recall basic knowledge.
Skills: Use basic skills to carry out simple tasks.
(i) Autonomy and responsibility: Complete work or study tasks under direct supervision and demonstrate personal effectiveness in simple and stable contexts.
(ii) Learning competence: Accept guidance on learning.
(iii) Professional and vocational competence: Demonstrate awareness of procedures for solving problems.

Level 2
Knowledge: Recall and comprehend basic knowledge of a field; the range of knowledge involved is limited to facts and main ideas.
Skills: Use skills and key competences to carry out tasks where action is governed by rules defining routines and strategies.
(i) Autonomy and responsibility: Take limited responsibility for improvement in performance in work or study in simple and stable contexts and within familiar, homogeneous groups.
(ii) Learning competence: Seek guidance on learning.
(iii) Professional and vocational competence: Solve problems using information provided.

Level 3
Knowledge: Apply knowledge of a field that includes processes, techniques, materials, instruments, equipment, terminology and some theoretical ideas.
Skills: Select and apply basic methods, tools and materials. Use a range of field-specific skills to carry out tasks and show personal interpretation through selection and adjustment of methods, tools and materials.
(i) Autonomy and responsibility: Take responsibility for completion of tasks and demonstrate some independence in role in work or study where contexts are generally stable but where some factors change.
(ii) Learning competence: Take responsibility for own learning.
(iii) Professional and vocational competence: Solve problems using well known information sources, taking account of some social issues.

Level 4
Knowledge: Use a wide range of field-specific practical and theoretical knowledge.
Skills: Evaluate different approaches to tasks. Develop strategic approaches to tasks that arise in work or study by applying specialist knowledge and using expert sources of information. Evaluate outcomes in terms of the strategic approach used.
(i) Autonomy and responsibility: Manage role under guidance in work or study contexts that are usually predictable and where there are many factors involved that cause change and where some factors are interrelated.
(ii) Learning competence: Demonstrate self-direction in learning.
(iii) Professional and vocational competence: Solve problems by integrating information from expert sources, taking account of relevant social and ethical issues. Make suggestions for improvement to outcomes. Supervise routine work of others and take some responsibility for the training of others.
Level 5
Knowledge: Use broad theoretical and practical knowledge that is often specialised within a field and show awareness of limits to the knowledge base.
Skills: Develop strategic and creative responses in researching solutions to well-defined concrete and abstract problems. Demonstrate transfer of theoretical and practical knowledge in creating solutions to problems.
(i) Autonomy and responsibility: Manage projects independently that require problem solving where there are many factors, some of which interact and lead to unpredictable change. Show creativity in developing projects. Manage people and review performance of self and others. Train others and develop team performance.
(ii) Learning competence: Evaluate own learning and identify learning needs necessary to undertake further learning.
(iii) Professional and vocational competence: Demonstrate experience of operational interaction within a field. Make judgements based on knowledge of relevant social and ethical issues.

Level 6
Knowledge: Use detailed theoretical and practical knowledge of a field. Some knowledge is at the forefront of the field and will involve a critical understanding of theories and principles.
Skills: Demonstrate mastery of methods and tools in a complex and specialised field and demonstrate innovation in terms of the methods used. Devise and sustain arguments to solve problems.
(i) Autonomy and responsibility: Demonstrate administrative design, resource and team management responsibilities in work and study contexts that are unpredictable and require that complex problems be solved where there are many interacting factors. Show creativity in developing projects and show initiative in management processes that includes the training of others to develop team performance.
(ii) Learning competence: Consistently evaluate own learning and identify learning needs.
(iii) Professional and vocational competence: Gather and interpret relevant data in a field to solve problems. Demonstrate experience of operational interaction within a complex environment. Make judgements based on social and ethical issues that arise in work or study.

Level 7
Knowledge: Use highly specialised theoretical and practical knowledge, some of which is at the forefront of knowledge in the field. This knowledge forms the basis for originality in developing and/or applying ideas. Demonstrate critical awareness of knowledge issues in the field and at the interface between different fields.
Skills: Formulate responses to abstract and concrete problems. Create a research-based diagnosis to problems by integrating knowledge from new or interdisciplinary fields and make judgements with incomplete or limited information. Develop new skills in response to emerging knowledge and techniques.
(i) Autonomy and responsibility: Demonstrate leadership and innovation in work and study contexts that are unfamiliar, complex and unpredictable and that require solving problems involving many interacting factors. Review the strategic performance of teams.
(ii) Learning competence: Demonstrate autonomy in the direction of learning and a high-level understanding of learning processes.
(iii) Professional and vocational competence: Solve problems by integrating complex knowledge sources that are sometimes incomplete, and in new and unfamiliar contexts. Demonstrate experience of operational interaction in managing change within a complex environment. Respond to social, scientific and ethical issues that are encountered in work or study.
Level 8
Knowledge: Use specialised knowledge to critically analyse, evaluate and synthesize new and complex ideas that are at the most advanced frontier of a field. Extend or redefine existing knowledge and/or professional practice within a field or at the interface between fields.
Skills: Research, conceive, design, implement and adapt projects that lead to new knowledge and new procedural solutions.
(i) Autonomy and responsibility: Demonstrate substantial leadership, innovation and autonomy in work and study contexts that are novel and require the solving of problems that involve many interacting factors.
(ii) Learning competence: Demonstrate capacity for sustained commitment to the development of new ideas or processes and a high-level understanding of learning processes.
(iii) Professional and vocational competence: Demonstrate critical analysis, evaluation and synthesis of new and complex ideas and strategic decision-making based on these processes. Demonstrate experience of operational interaction with strategic decision-making capacity within a complex environment. Promote social and ethical advancement through actions.
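If the eight reference levels were used as a common frame of reference for reporting assessment results, they could also be represented in machine-readable form. The sketch below illustrates one possible representation, abbreviating the knowledge descriptors given above; it is illustrative only, and all names used (EQFLevel, EQF_LEVELS, describe) are hypothetical.

# Illustrative sketch only: the eight EQF reference levels as a simple lookup
# structure, with abbreviated knowledge descriptors. Names are hypothetical.
from typing import NamedTuple


class EQFLevel(NamedTuple):
    level: int
    knowledge: str  # abbreviated descriptor; the full text is given in this appendix


EQF_LEVELS = {
    1: EQFLevel(1, "Recall basic knowledge"),
    2: EQFLevel(2, "Recall and comprehend basic knowledge of a field"),
    3: EQFLevel(3, "Apply knowledge of a field, including some theoretical ideas"),
    4: EQFLevel(4, "Use a wide range of field-specific practical and theoretical knowledge"),
    5: EQFLevel(5, "Use broad, often specialised theoretical and practical knowledge"),
    6: EQFLevel(6, "Use detailed knowledge, partly at the forefront of the field"),
    7: EQFLevel(7, "Use highly specialised knowledge at the forefront of the field"),
    8: EQFLevel(8, "Critically analyse and extend knowledge at the most advanced frontier"),
}


def describe(level: int) -> str:
    """Return the abbreviated knowledge descriptor for a given reference level."""
    return f"EQF level {level}: {EQF_LEVELS[level].knowledge}"


print(describe(5))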