European Journal of Psychological Assessment
Volume 36 / Number 1 / 2020
Editor-in-Chief Samuel Greiff Associate Editors Mark Allen Juan Ramón Barrada Nicolas Becker Gary N. Burns Laurence Claes Marjolein Fokkema Penelope Hasking Dragos Iliescu Stefan Krumm Lena Lämmle Anastasiya Lipnevich Marcus Mund René T. Proyer John F. Rauthmann Ronny Scherer Eunike Wetzel Matthias Ziegler
Official Organ of the European Association of Psychological Assessment
Psychological Test Adaptation and Development Official Open Access Organ of the European Association of Psychological Assessment (EAPA)
“PTAD will be an important outlet for everyone interested in assessment!” Matthias Ziegler, Editor-in-Chief, Humboldt University Berlin
Volume 1 / Number 1 / 2020
Psychological Test Adaptation and Development
Editor-in-Chief Matthias Ziegler
Official Open Access Organ of the European Association of Psychological Assessment
New OA Journal
About the journal
PTAD is the first open access, peer-reviewed journal publishing papers which present the adaptation of tests to specific needs (e.g., cultural), test translations, or the development of existing measures. Moreover, the focus is on the empirical testing of the psychometric quality of these measures. The journal provides a paper template, and registered reports are strongly encouraged. It is a unique outlet for research papers portraying adaptations (e.g., translations) and developments (e.g., state to trait) of individual tests – the backbone of assessment. The expert editor-in-chief is supported by a stellar cast of internationally renowned associate editors. A generous APC waiver program is available for eligible authors.
Benefits for authors:
• Clear guidance on the structure of papers helps you write good papers
• Fast peer review, aided by the clear structure of your paper
• With the optional registered report format you can get expert advice from seasoned reviewers to help improve your research
• Open access publication, with a choice of Creative Commons licenses
• Widest possible dissemination of your paper – and thus of qualified information about your test and your research
• Generous APC waiver program and discounts for members of selected associations
The journal welcomes your submissions! All manuscripts should be submitted online via Editorial Manager, where full instructions to authors are also available: https://eu.hogrefe.com/j/ptad
European Journal of Psychological Assessment
Volume 36 / Number 1 / 2020
Official Organ of the European Association of Psychological Assessment
Editor-in-Chief
Samuel Greiff, Cognitive Science and Assessment, ECCS unit, 11, Porte des Sciences, 4366 Esch-sur-Alzette, Luxembourg (Tel. +352 46 6644-9245, E-mail samuel.greiff@uni.lu)
Editors-in-Chief (past)
Karl Schweizer, Germany (2009–2012), E-mail k.schweizer@psych.uni-frankfurt.de Matthias Ziegler, Germany (2013–2016), E-mail zieglema@hu-berlin.de
Editorial Assistant
Lindie van der Westhuizen, Cognitive Science and Assessment, ECCS unit, 11, Porte des Sciences, 4366 Esch-sur-Alzette, Luxembourg, (Tel. +352 46 6644-5578, E-mail ejpaeditor@gmail.com)
Associate Editors
Mark Allen, Australia; Juan Ramón Barrada, Spain; Nicolas Becker, Germany; Gary N. Burns, USA/Sweden; Laurence Claes, Belgium; Marjolein Fokkema, The Netherlands; Penelope Hasking, Australia; Dragos Iliescu, Romania; Stefan Krumm, Germany; Lena Lämmle, Germany; Anastasiya Lipnevich, USA; Marcus Mund, Germany; René Proyer, Germany; John F. Rauthmann, Germany; Ronny Scherer, Norway; Eunike Wetzel, Germany; Matthias Ziegler, Germany
Editorial Board
Rebecca Pei-Hui Ang, Singapore Roger Azevedo, USA R. Michael Bagby, Canada Yossef S. Ben-Porath, USA Nicholas F. Benson, USA Francesca Borgonovi, France Janine Buchholz, Germany Vesna Busko, Croatia Eduardo Cascallar, Belgium Mary Louise Cashel, USA Carlo Chiorri, Italy Lee Anna Clark, USA Paul De Boeck, USA Scott L. Decker, USA Andreas Demetriou, Cyprus Annamaria Di Fabio, Italy Christine DiStefano, USA Stefan Dombrowski, USA Fritz Drasgow, USA Peter Edelsbrunner, Switzerland Kadriye Ercikan, USA Rocı́o Fernández-Ballesteros, Spain Marina Fiori, France Brian F. French, USA Arthur C. Graesser, USA Patrick Griffin, Australia Jan-Eric Gustafsson, Sweden
Founders
Rocío Fernández-Ballesteros and Fernando Silva
Supporting Organizations
The journal is the official organ of the European Association of Psychological Assessment (EAPA). The EAPA was founded to promote the practice and study of psychological assessment in Europe as well as to foster the exchange of information on this discipline around the world. Members of the EAPA receive the journal in the scope of their membership fees. Further, the Division for Psychological Assessment and Evaluation, Division 2, of the International Association of Applied Psychology (IAAP) is sponsoring the journal: Members of this association receive the journal at a special rate (see below).
Publisher
Hogrefe Publishing, Merkelstr. 3, D-37085 Göttingen, Germany, Tel. +49 551 999-500, Fax +49 551 999-50111, E-mail publishing@hogrefe.com, Web http://www.hogrefe.com North America: Hogrefe Publishing, 361 Newbury Street, 5th Floor, Boston, MA 02115, USA, Tel. +1 866 823-4726, Fax +1 617 354-6875, E-mail customerservice@hogrefe-publishing.com, Web https://www.hogrefe.com
Production
Regina Pinks-Freybott, Hogrefe Publishing, Merkelstr. 3, D-37085 Göttingen, Germany, Tel. +49 551 999-500, Fax +49 551 999-50111, E-mail production@hogrefe.com
Subscriptions
Hogrefe Publishing, Herbert-Quandt-Strasse 4, D-37081 Göttingen, Germany, Tel. +49 551 50688-900, Fax +49 551 50688-998, E-mail zeitschriftenvertrieb@hogrefe.de
Advertising/Inserts
Melanie Beck, Hogrefe Publishing, Merkelstr. 3, D-37085 Göttingen, Germany, Tel. +49 551 999-500, Fax +49 551 999-50111, E-mail marketing@hogrefe.com
ISSN
ISSN-L 1015-5759, ISSN-Print 1015-5759, ISSN-Online 2151-2426
Copyright Information
© 2020 Hogrefe Publishing. This journal as well as the individual contributions and illustrations contained within it are protected under international copyright law. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without prior written permission from the publisher. All rights, including translation rights, reserved.
Publication
Published in 6 issues per annual volume (new in 2017; 4 issues from 2004 to 2016)
Subscription Prices
Calendar year subscriptions only. Rates for 2020: Institutions – from US $483.00/€370.00 (print only; pricing for online access can be found in the journals catalog at hgf.io/journalscatalog); Individuals – US $264.00/€199.00 (print & online). Postage and handling – US $16.00/€12.00. Single copies: US $85.00/€66.50 + postage and handling. Special rates: IAAP/Colegio Oficial de Psicólogos members: €129.00, US $164.00 (+ €18.00, US $24.00 postage and handling); EAPA members: Included in membership
Payment
Payment may be made by check, international money order, or credit card, to Hogrefe Publishing, Merkelstr. 3, D-37085 Göttingen, Germany, or, for North American customers, to Hogrefe Publishing, 361 Newbury Street, 5th Floor, Boston, MA 02115, USA.
Electronic Full Text
The full text of the European Journal of Psychological Assessment is available online at https://econtent.hogrefe.com and in PsycARTICLES.
Abstracting/Indexing Services
The journal is abstracted/indexed in Current Contents / Social & Behavioral Sciences (CC/S&BS), Social Sciences Citation Index (SSCI), Social SciSearch, PsycINFO, Psychological Abstracts, PSYNDEX, ERIH, and Scopus. 2018 Impact Factor 2.225, 5-year Impact Factor 2.447, Journal Citation Reports (Clarivate Analytics, 2019)
Ronald K. Hambleton, USA William Hanson, Canada Sonja Heintz, Switzerland Sven Hilbert, Germany Joeri Hofmans, Belgium Therese N. Hopfenbeck, UK Jason Immekus, USA Jan Henk Kamphuis, The Netherlands David Kaplan, USA James C. Kaufman, USA Eun Sook Kim, USA Muneo Kitajima, Japan Radhika Krishnamurthy, USA Klaus Kubinger, Austria Patrick Kyllonen, USA Kerry Lee, Hong Kong Chung-Ying Lin, Hong Kong Jin Liu, USA Patricia A. Lowe, USA Romain Martin, Luxembourg R. Steve McCallum, USA Helfried Moosbrugger, Germany Kevin R. Murphy, Ireland Janos Nagy, Hungary Tuulia M. Ortner, Austria Marco Perugini, Italy K. V. Petrides, UK
Aaron Pincus, USA Kenneth K. L. Poon, Singapore Ricardo Primi, Brazil Richard D. Roberts, USA Willibald Ruch, Switzerland Leslie Rutkowski, Norway Jesus F. Salgado, Spain Douglas B. Samuel, USA Manfred Schmitt, Germany Heinz Schuler, Germany Martin Sellbom, New Zealand Valerie J. Shute, USA Stephen Stark, USA Jonathan Templin, USA Katherine Thomas, USA Stéphane Vautier, France Michele Vecchione, Italy David Watson, USA Nathan C. Weed, USA Alina von Davier, USA Cilia Witteman, The Netherlands Moshe Zeidner, Israel Johannes Zimmermann, Germany Ada Zohar, Israel Bruno Zumbo, Canada
Contents

Original Articles
Dimensions of Psychopathic Traits in a Community Sample: Implications From Different Measures for Impulsivity and Delinquency – Hedwig Eisenbarth and Luna C. M. Centifanti 1
Longitudinal Measurement Invariance of the Brief Symptom Inventory (BSI)-18 in Psychotherapy Patients – Ruth von Brachel, Angela Bieda, Jürgen Margraf, and Gerrit Hirschfeld 12
Intuitive Eating: A Novel Eating Style? Evidence From a Spanish Sample – Juan Ramón Barrada, Blanca Cativiela, Tatjana van Strien, and Ausiàs Cebolla 19
Further Evidence for Criterion Validity and Measurement Invariance of the Luxembourg Workplace Mobbing Scale – Philipp E. Sischka, Alexander F. Schmidt, and Georges Steffgen 32
Can Serious Games Assess Decision-Making Biases? Comparing Gaming Performance, Questionnaires, and Interviews – Kyoungwon Seo, Hokyoung Ryu, and Jieun Kim 44
Identification and Utility of a Short Form of the Pediatric Symptom Checklist-Youth Self-Report (PSC-17-Y) – Paul Bergmann, Cara Lucke, Theresa Nguyen, Michael Jellinek, and John Michael Murphy 56
Psychometric Properties of the Strengths and Difficulties Questionnaire in Children Aged 5-12 Years Across Seven European Countries – Mathilde M. Husky, Roy Otten, Anders Boyd, Ondine Pez, Adina Bitfoi, Mauro G. Carta, Dietmar Goelitz, Ceren Koç, Sigita Lesinskiene, Zlatka Mihova, and Viviane Kovess-Masfety 65
Pediatric Symptom Checklist-17: Testing Measurement Invariance of a Higher-Order Factor Model Between Boys and Girls – Jin Liu, Christine DiStefano, Yin Burgess, and Jiandong Wang 77
Development and Validation of the Multicontextual Interpersonal Relations Scale (MIRS) – Melissa Simone, Christian Geiser, and Ginger Lockhart 84
Does Speededness in Collecting Reasoning Data Lead to a Speed Factor? – Florian Zeller, Siegbert Reiß, and Karl Schweizer 96
Degrees of Freedom in Multigroup Confirmatory Factor Analyses: Are Models of Measurement Invariance Testing Correctly Specified? – Ulrich Schroeders and Timo Gnambs 105

Multistudy Reports
Brief Reports
Measuring Anxiety-Related Avoidance With the Driving and Riding Avoidance Scale (DRAS) – Joanne E. Taylor, Mark J. M. Sullman, and Amanda N. Stephens 114
The Multidimensional Structure of Math Anxiety Revisited: Incorporating Psychological Dimensions and Setting Factors – Sofie Henschel and Thorsten Roick 123
“Sweet Little Lies”: An In-Depth Analysis of Faking Behavior on Situational Judgment Tests Compared to Personality Questionnaires – Nadine Kasten, Philipp Alexander Freund, and Thomas Staufenbiel 136
Validation of the Short and Extra-Short Forms of the Big Five Inventory-2 (BFI-2) and Their German Adaptations – Beatrice Rammstedt, Daniel Danner, Christopher J. Soto, and Oliver P. John 149
Personality Across the Lifespan: Exploring Measurement Invariance of a Short Big Five Inventory From Ages 11 to 84 – Naemi D. Brandt, Michael Becker, Julia Tetzner, Martin Brunner, Poldi Kuhl, and Kai Maaz 162
A Meta-Analysis of Test Scores in Proctored and Unproctored Ability Assessments – Diana Steger, Ulrich Schroeders, and Timo Gnambs 174
Evaluating the Psychometric Properties of the Short Dark Triad (SD3) in Italian Adults and Adolescents – Antonella Somma, Delroy L. Paulhus, Serena Borroni, and Andrea Fossati 185
I Like Myself, I Really Do (at Least Right Now): Development and Validation of a Brief and Revised (German-Language) Version of the State Self-Esteem Scale – Almut Rudolph, Michela Schröder-Abé, and Astrid Schütz 196
Perfectionism in Italy and the USA: Measurement Invariance and Implications for Cross-Cultural Assessment – Sean P. M. Rice, Yura Loscalzo, Marco Giannini, and Kenneth G. Rice 207
Reexamining the Factorial Validity of the 16-Item Scale Measuring Need for Cognition – Ying Zhang, Eric Klopp, Heike Dietrich, Roland Brünken, Ulrike-Marie Krause, Birgit Spinath, Robin Stark, and Frank M. Spinath 212
Original Article
Dimensions of Psychopathic Traits in a Community Sample: Implications From Different Measures for Impulsivity and Delinquency
Hedwig Eisenbarth (1) and Luna C. M. Centifanti (2)
(1) Department of Psychology, University of Southampton, United Kingdom
(2) Department of Psychological Sciences, University of Liverpool, United Kingdom
Abstract: There are valid measures of psychopathic traits in youth, such as the Youth Psychopathic Traits Inventory (YPI). However, it is unclear how another self-report measure, which is based on a different conceptualization of psychopathy, relates to the YPI in youth and to antisocial behavior. We therefore compared the construct validity of two measures: the personality-based Psychopathic Personality Inventory-Revised (PPI-R) and the YPI, which is based on adult antisocial personality traits. First, both measures showed sufficient model fit and some overlap in their variance, particularly the YPI impulsive-irresponsible and grandiose-manipulative factors with PPI-R self-centered impulsivity, as well as YPI callous-unemotional with PPI-R coldheartedness. We found that although overall delinquency was correlated with the PPI-R and YPI subscales, only the self-centered impulsivity factor of the PPI-R and only the impulsive-irresponsibility domain of the YPI were statistically predictive of self-reported antisocial behavior. Thus, the PPI-R and the YPI both show moderate construct validity and criterion validity for use among young community adults. Keywords: young adults, psychopathic traits, self-report, delinquency
The validity of assessing psychopathy in young people has been the subject of research attention. The aim has been to identify these traits in young people at early stages to aid prevention and intervention specifically for criminal behavior. Much of this research is based on self-report. Yet, although research has shown these measures are valid, the comparisons between measures with different theoretical bases have been less often investigated in youths. We compare a well-validated measure of youth psychopathic traits with another measure that was developed for adults.
Valid Measures of Psychopathy in Young Adults

The Youth Psychopathic Traits Inventory (YPI; Andershed, Kerr, Stattin, & Levander, 2002) was designed for youths between the ages of 12 and 18 years and was modeled on the adult conceptualization of the Psychopathy Checklist-Revised (PCL-R; Hare, 2003) to reflect 10 core personality traits that are relevant for psychopathy: grandiosity, lying, manipulation, callousness, unemotionality, impulsivity, irresponsibility, dishonest charm, remorselessness, and thrill-seeking. Thus, three subscales are behavioral – impulsivity, irresponsibility, and thrill-seeking – and so are likely to be related to criminal behavior. Consistently, the YPI total score as well as the different factors have been shown to be valid predictors of problem behaviors for youths and young adults (e.g., Poythress, Dembo, Wareham, & Greenbaum, 2006).

Relations Between YPI and PPI-R

There is another inventory that has been created to examine psychopathic traits in adults, and it differs in that it focuses on personality traits that are separate from purely behavioral features. The Psychopathic Personality Inventory-Revised (PPI-R; Lilienfeld & Widows, 2005) has item content that taps varied affective and interpersonal personality traits (like the YPI's callous-unemotional and grandiose-manipulative subscales) and also includes behavioral items measuring impulsivity. Although this measure was developed mainly for adult populations, its personality-based approach may arguably be relevant for young individuals because it uses a wide range of temperamental or personality concepts; thus, it may capture personality traits earlier in development, facilitating intervention strategies before antisocial behavior has manifested. On the other hand, young individuals could fail to engage with descriptions of themselves when asked about abstract personality traits, because these require good introspection. The validity of the PPI-R in young adults is unknown and has not been compared to the more widely used (in youths) YPI. There is at least one study that compared the two inventories. In an adult community sample (mean age = 35 years), the PPI-R and the YPI have been shown to be moderately correlated (rs = .20–.70 for the subscales; Uzieblo, Verschuere, Van den Bussche, & Crombez, 2010), suggesting there may be good criterion-related validity for the PPI-R across a wide age range. Theories about the factor structure of psychopathy and the related measures also differ: (1) psychopathy is conceptualized as a personality-based disorder in which antisocial behavior is a byproduct (e.g., the PPI-R), or (2) antisocial behavior is conceptualized as part and parcel of psychopathy (e.g., the YPI; see e.g., Brazil, van Dongen, Maes, Mars, & Baskin-Sommers, in press; Lilienfeld, Watts, Francis Smith, Berg, & Latzman, 2015). These differences contribute to the content of the respective items and to the structure of the measures that have been created. A recent study showed that adolescents' sensation seeking trait measurement varied based on how items were conceptualized from a theoretical perspective (Altmann, Liebe, Schönefeld, & Roth, 2017). Thus, variance in responses to psychopathy questionnaires may be based on different theoretical bases for the construction of item content.
Construct Validity of the YPI and PPI-R

Within a variety of samples, validation studies of self-reported psychopathy, such as the PPI-R and the YPI, rely on their statistical prediction of delinquency (i.e., criminal behavior in young people). This is regardless of the relevance of delinquency for juveniles and adolescents for validating the construct of psychopathy (Fox, Jennings, & Farrington, 2015; see e.g., Salekin & Frick, 2005; Salihovic, Kerr, & Stattin, 2014). In a general population sample of youths (mean age = 16.4 years), the PPI-R total score was moderately correlated with self-reported proactive (r = .59) and reactive aggression (r = .40; Taubner, White, Zimmermann, Fonagy, & Nolte, 2013). Relatedly, in a specialized sample of children (ages 17–19 years) in foster care, PPI-R scores were related to more diverse forms of criminal behavior and subsequent involvement with the criminal justice system (Vaughn, Litschge, Delisi, & Beaver, 2008). Furthermore, DeLisi et al. (2013) found that the PPI sum score
can differentiate youths high on career delinquency, defined as a compendium of antisocial behavior, substance abuse, and criminal justice system involvement, from those low on career delinquency, with most predictive value from the subscales of Blame Externalization, Fearlessness, and Carefree Nonplanfulness. Although the variance explained was low at 21%, these subscales relate to disinhibitory and impulsive traits and so are unsurprisingly related to antisocial behavior. Regarding the YPI, a significant but moderate correlation (r = .35) between the YPI total score and self-reported delinquency was found in a general population of adolescents (Chabrol, Leeuwen, Rodgers, & Séjourné, 2009), as well as small but significant correlations between criminal offenses and the YPI callous-unemotional factor (r = .18) and the impulsive-irresponsible factor (r = .24; Neumann & Pardini, 2014). For a female sample, the affective factor of the YPI was found to be predictive of self-reported criminal and violent behavior (Chauhan et al., 2012). Thus far, the construct validity of both the PPI-R and the YPI has not been investigated in youths. The PPI-R, although designed for adult samples, has been suggested for use in youth as well, but there are only a few investigations of its criterion-related validity regarding delinquency. In addition, it may be that some additive combination of the two inventories could be better at predicting delinquency than any single inventory.
Relating Psychopathic Traits to Delinquency in Juveniles

In this study, we aimed to compare responses to two different self-report measures of psychopathic traits, the PPI-R and the YPI, in a community sample of young adults. To test the overlap and discrepancies, we compare the measures and their factors descriptively and for model fit. To investigate the validity of the two measures, we test whether the strength of the relations between the YPI and self-reported delinquency differs from the strength of the relation between the PPI-R and self-reported delinquency. Finally, we examine the predictive ability of both measures in a single statistical model.
Method

Participants

The sample consisted of 339 students at a vocational training school (79 female, 260 male). After excluding participants who showed inconsistent responding in the PPI-R (IR < 30, IRA < 60; Lilienfeld & Widows, 2005), we used data from 270 participants (69 female, 201 male; age: M = 19.02, SD = 2.51, range: 15–34).
Table 1. Means, standard deviations, and reliabilities for main study variables

Variable             All (SD), N = 308    Male (SD), n = 231    Female (SD), n = 77    p        Cronbach's α
Age                  18.63 (2.12)         19.04 (2.60)          18.74 (2.73)           .39
Study variables
Violent crime        2.17 (2.89)          2.72 (3.06)           0.55 (1.38)            <.001
Property damage      2.07 (3.15)          2.53 (3.42)           0.69 (1.46)            <.001
Burglary             2.34 (2.86)          2.67 (3.05)           1.36 (1.91)            <.001
Drug use             2.21 (3.66)          2.57 (3.97)           1.14 (2.22)            .003
Minor delinquency    4.49 (3.91)          4.90 (4.20)           3.26 (2.50)            .001
Delinquency          8.53 (9.92)          10.48 (10.80)         3.74 (4.89)            <.001    .86
YPI CU               10.38 (2.79)         10.84 (2.85)          8.93 (1.92)            <.001    .84
YPI II               11.50 (2.64)         11.78 (2.78)          10.94 (2.24)           .02      .82
YPI GM               9.99 (2.61)          10.34 (2.74)          8.87 (1.98)            <.001    .88
YPI MEAN             10.57 (2.21)         10.92 (2.29)          9.51 (1.54)            <.001    .88
PPI-R SCI            40.00 (6.00)         40.97 (6.36)          38.57 (5.04)           .003     .87
PPI-R FD             35.61 (4.92)         36.32 (4.88)          33.27 (4.74)           <.001    .88
PPI-R CO             31.90 (7.52)         32.85 (7.48)          28.09 (5.71)           <.001    .85
PPI-R MEAN           37.47 (3.95)         38.21 (3.90)          35.27 (3.23)           <.001    .87

Notes. M = mean, SD = standard deviation, YPI CU = callous unemotional, YPI II = impulsive irresponsible, YPI GM = grandiose manipulative, PPI-R SCI = self-centered impulsivity, PPI-R FD = fearless dominance, PPI-R CO = coldheartedness.
Participants were recruited and gave consent at a vocational training school after informed consent was obtained from their parents. Female and male participants differed in all categories of self-reported delinquency as well as on all YPI and PPI-R factors (see Table 1). For 88.8% of the sample, German was the mother tongue; the remaining participants described their language skills as sufficient. The majority of the sample (93.7%) had received at least 9 years of school education prior to the vocational training school. Most of the participants reported having siblings (n = 264, 86%), and about half reported living with their parents (n = 183, 60%), while 35 participants reported living on their own (12%) and 72 participants reported living with either their father or their mother (24%).
Measures

Youth Psychopathic Traits Inventory (YPI)

The YPI is a 50-item self-report instrument for adolescents, developed for non-referred youth to measure the three personality dimensions of psychopathy: a Grandiose-Manipulative dimension (subscales: Dishonest Charm, Grandiosity, Lying, and Manipulation), a Callous-Unemotional dimension (Callousness, Unemotional, and Remorselessness), and an Impulsive-Irresponsible dimension (Andershed, Kerr, Stattin, & Levander, 2002). Items are answered on a 4-point scale (1 = does not apply at all, 4 = applies very well). The YPI has been validated in different samples, showing positive
relations with self-reported conduct problems (Andershed, Kerr, & Stattin, 2002; Declercq, Markey, Vandist, & Verhaeghe, 2009; Hillege, Das, & de Ruiter, 2010; Neumann, Kosson, Forth, & Hare, 2006). In young adult offenders, the YPI has shown predictable relations with internalizing and externalizing psychopathology and with criminal offenses (Neumann & Pardini, 2014). The internal consistency (Cronbach's α) of the interpersonal dimension has ranged from α = .90 to .91, from α = .57 to .77 for the affective dimension, and from α = .82 to .83 for the behavioral dimension (Sherman, Lynam, & Heyde, 2014). The German version has demonstrated high internal consistency as well as convergent validity (Heinzen, Köhler, & Hinrichs, 2008). Reliabilities in the current sample ranged from α = .92 for the Grandiose-Manipulative factor to α = .82 for the Impulsive-Irresponsible factor (see Table 1).

Psychopathic Personality Inventory-Revised (PPI-R)

This self-report questionnaire (Lilienfeld & Widows, 2005) was developed in student samples to assess psychopathic traits as conceptualized by Cleckley. The 154 items, answered on a 4-point Likert scale, can be assigned to eight subscales and three validity scales designed to detect aberrant responding. The content subscales are Blame Externalization, Rebellious Nonconformity, Coldheartedness, Social Influence, Carefree Nonplanfulness, Fearlessness, Machiavellian Egocentricity, and Stress Immunity. These factor-analysis-derived subscales can be assigned to two
factors: Fearless Dominance (FD) and Self-Centered Impulsivity (SCI), also called Impulsive Antisociality (Lilienfeld & Widows, 2005), a structure that does not include the subscale Coldheartedness (CO) but has been replicated across samples (Benning, Patrick, Hicks, Blonigen, & Krueger, 2003; Ross, Benning, Patrick, Thompson, & Thurston, 2009). The German version (Alpers & Eisenbarth, 2008) has demonstrated good internal consistency, with α = .85 for the total score in student and detained samples (Eisenbarth & Alpers, 2015). Reliabilities in the current sample ranged between α = .85 and .90 (see Table 1). As a measure of the validity of the responses, the PPI-R includes a measure of inconsistent responding. We excluded 69 participants from the analyses based on the suggested cut-off for inconsistent responding (IR < 30, IRA < 60; Lilienfeld & Widows, 2005).

Delinquency

We measured delinquency by asking participants how often in their life they had committed different delinquent acts, based on an unpublished German measure of illegal behavior (Fragebogen zur Legalbewährung; Lewand, 2003). The behaviors we asked about belonged to four categories: violent crimes (threat of violence, actual violence, and threat involving a gun), burglary crimes (burglary, car or bike theft, leaving a restaurant without paying), drug use crimes (use of different types of drugs), and property damage crimes (damage of private or public property, arson). Each item was answered on a scale ranging from "never" (scored as 0) and "not within the last 12 months" (scored as 2) to "more than 10 times" (scored as 3). Sum scores were computed across all items. For the summary variables, means were calculated for each of the four categories.
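As a minimal illustration of this scoring scheme (a sketch with hypothetical item names, not the authors' code), the overall sum score and the per-category means could be computed as follows:

# Illustrative sketch of the delinquency scoring described above: items are
# scored 0-3, a sum score is computed across all items, and a mean is
# computed per category. All column names are hypothetical placeholders.
import pandas as pd

CATEGORIES = {
    "violence": ["viol_threat", "viol_actual", "viol_gun_threat"],
    "burglary": ["burg_break_in", "burg_vehicle_theft", "burg_dine_and_dash"],
    "drug_use": ["drug_type_1", "drug_type_2"],
    "property_damage": ["prop_private", "prop_public", "prop_arson"],
}

def score_delinquency(items: pd.DataFrame) -> pd.DataFrame:
    """Return the overall delinquency sum score and per-category means."""
    all_items = [col for cols in CATEGORIES.values() for col in cols]
    scores = pd.DataFrame(index=items.index)
    scores["delinquency_sum"] = items[all_items].sum(axis=1)   # sum across all items
    for category, cols in CATEGORIES.items():
        scores[f"{category}_mean"] = items[cols].mean(axis=1)  # category summary
    return scores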
Statistical Analyses

A confirmatory factor analysis (CFA) was conducted using Mplus 7.3 (Muthen & Muthen, 2010) with maximum likelihood estimation, which is robust to missing data. Covariance coverage of the data ranged from 0.89 to 1.00, well above the recommended minimum of 0.10. To examine whether the model explained the data well, we used the chi-square test: a nonsignificant chi-square indicates good fit. Yet, with sample sizes as large as that used in the present study, chi-square is often significant even with trivial deviations from a perfect model. Hence, we used three indices of practical fit as suggested by prior research (comparative fit index, CFI, Bentler, 1990; root mean square error of approximation, RMSEA, Browne & Cudeck, 1993; Tucker-Lewis index, TLI, Tucker & Lewis, 1973). CFI and TLI > .90 suggest an acceptable model fit (Bentler & Bonett, 1980), and values > .95 suggest a good model fit. An RMSEA < .08 suggests an acceptable fit, and an RMSEA < .06 suggests a good fit (Browne & Cudeck, 1993).
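These cutoffs translate into a simple decision rule. As a minimal illustration (not part of the original analysis, which was run in Mplus), the following sketch labels a model's practical fit indices according to the thresholds just cited:

# Minimal sketch of the fit-index decision rules described above
# (CFI/TLI > .95 good, > .90 acceptable; RMSEA < .06 good, < .08 acceptable).
# It only encodes the published cutoffs; it does not estimate any model.
def interpret_fit(cfi: float, tli: float, rmsea: float) -> dict:
    def comparative(value: float) -> str:
        if value > 0.95:
            return "good"
        if value > 0.90:
            return "acceptable"
        return "poor"

    def absolute(value: float) -> str:
        if value < 0.06:
            return "good"
        if value < 0.08:
            return "acceptable"
        return "poor"

    return {"CFI": comparative(cfi), "TLI": comparative(tli), "RMSEA": absolute(rmsea)}

# Example with the combined YPI/PPI-R model reported in the Results
# (CFI = .77, TLI = .71, RMSEA = .124): all three indices are flagged as poor.
print(interpret_fit(cfi=0.77, tli=0.71, rmsea=0.124))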
Negative binomial regression analyses were conducted because the data represented frequency counts based on the frequency categories of delinquent activity. The data included a moderate to high proportion of zeros (ranging from .32 to .58 across property crime, violence, and drug use), reflecting a generally low frequency for most items, as would be expected for this cohort. Thus, zero-inflated negative binomial regression in Mplus 7.3 was selected for the analyses, because this approach corrects for severely positively skewed (toward zero) data that are overdispersed (Browne & Cudeck, 1993). The zero-inflated regression analysis generates both a count variable, indicating the variety of delinquency, and a binary latent variable, indicating whether participants endorsed any delinquent activity at any time. Two coefficients were therefore produced by Mplus for each of the three dependent variables: for example, one coefficient for the count variable for crime and one for the binary inflation latent variable – the likelihood of a participant assuming any value except zero (Muthen & Muthen, 2010) – an approach similar to other binary regression techniques, although the beta values are opposite in sign to logistic regression. These analyses provided information about whether psychopathy predicted greater delinquency and, at the same time, about whether psychopathy predicted any engagement in delinquency at all. To examine the construct validity of the PPI-R and YPI in this sample, the first step of the regression regressed the dependent variables (the three delinquency measures) onto age and sex, in order to control for their variance. Differences in model fit (log-likelihood) after the psychopathy scales were entered as predictors were taken as indicating the significance of psychopathy in predicting the delinquency domains. Separate models examined the PPI-R and the YPI. Since scaled log-likelihood estimates (using maximum likelihood with robust standard errors) were employed, the Satorra-Bentler correction (Muthen & Muthen, 2010) was consistently applied to adjust for non-normality. The effect size of the variance explained in delinquent behavior between the models was informed by the change in the proportion of residual variance (i.e., dispersion) between models. To interpret the effect sizes associated with psychopathy, we included confidence intervals of the unstandardized estimates (i.e., betas). See the Electronic Supplementary Materials, ESM 1–11, for the data set (YouthPPI_YPI_EJPA_data.sav) and the output files.
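The original models were estimated in Mplus with robust maximum likelihood. As a rough, hedged analogue only (an assumption, not the authors' workflow), a zero-inflated negative binomial model with the same predictor structure can be sketched in Python with statsmodels, with a plain likelihood-ratio test standing in for the Satorra-Bentler corrected comparison reported below:

# Hedged sketch: zero-inflated negative binomial regression of a delinquency
# count on age, sex, and psychopathy scores, loosely mirroring the model
# structure described above. Column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP
from scipy import stats

def fit_zinb(df: pd.DataFrame, outcome: str, predictors: list):
    exog = sm.add_constant(df[predictors])
    model = ZeroInflatedNegativeBinomialP(
        endog=df[outcome],   # count outcome (e.g., violence frequency score)
        exog=exog,           # count part of the model
        exog_infl=exog,      # zero-inflation (binary) part of the model
        inflation="logit",
    )
    return model.fit(maxiter=200, disp=False)

def lr_test(restricted, full, df_diff: int) -> float:
    """Unscaled likelihood-ratio test (the paper reports Satorra-Bentler scaled tests)."""
    return stats.chi2.sf(2 * (full.llf - restricted.llf), df_diff)

# Usage sketch, assuming a participant-level DataFrame `data`:
# base = fit_zinb(data, "violence_count", ["age", "sex"])
# ppi  = fit_zinb(data, "violence_count", ["age", "sex", "ppi_sci", "ppi_fd", "ppi_co"])
# p_value = lr_test(base, ppi, df_diff=6)  # 3 added predictors x 2 model parts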
Results

Model Fit and Criterion Validity of the PPI-R Within a Young Adult Sample

We examined how well the 7 subscales of the PPI-R and the 10 subscales of the YPI were represented by the latent
factors identified in prior research. That is, without Coldheartedness, which is kept separate in prior research, we examined the factor loadings of the seven subscales onto their respective factors of Self-Centered Impulsivity and Fearless Dominance. In the same confirmatory factor analysis (CFA), we included the three latent factors of Grandiose-Manipulative, Callous-Unemotional, and Impulsive-Irresponsible representing the 10 subscales of the YPI. Including the YPI and PPI-R latent factors in the same CFA aided identification of the underlying factors, since some factors have only three indicators. If only three indicators are used, for example, the loadings must be strong across all three, otherwise identification may be poor; a three-indicator factor may behave as if it had only two indicators if one item shows a weak loading. Furthermore, including all factors of both measures in one model does not reduce correlations between factors due to unreliability of the scales. It also allowed us to investigate shared factor loadings between the two inventories (e.g., the YPI Impulsive-Irresponsible subscale might show an affinity toward loading on the PPI-R Self-Centered Impulsivity factor).

Chi-square as a measure of model fit was significant, and the indices of practical fit suggested that the model tested was of inadequate fit, χ2(109) = 604.60, p < .001; TLI = .71, CFI = .77, RMSEA = .124, 90% CI [.115, .134]. In Table 2, one can see that the average standard errors for each factor differ. Of note, the Fearless Dominance factor had average standard errors (.07) twice the average of the factor with the lowest standard errors (YPI II at .03). To unpack the poor fit of the model, we investigated the fit for the two inventories separately; the full results are beyond the scope of our aims. In brief, the YPI showed good fit, χ2(32) = 105.232, p < .001; TLI = .92, CFI = .94, RMSEA = .086, 90% CI [.068, .104], but the fit for the PPI was poor, χ2(15) = 217.739, p < .001; TLI = .33, CFI = .53, RMSEA = .208, 90% CI [.184, .233]. This model was specified with some constraints on Fearless Dominance: the factor loading and variance for PPI Stress Immunity were set to 1.0, and we equated the factor loadings of the other two indicators. The modification indices showed that the residual variance of Rebellious Nonconformity is associated with the residual variances of the indicators of Fearless Dominance. Specifying these in a revised model resulted in an improved but still far from adequate fit, χ2(13) = 121.718, p < .001; TLI = .59, CFI = .75, RMSEA = .164, 90% CI [.138, .191]. The completely standardized factor loadings are shown in Table 2. None of the standardized factor loadings was under .30, indicating that there were generally moderate to strong relations between indicators and their respective latent factors. The factor correlation between PPI-R Self-centered Impulsivity and YPI Impulsive-Irresponsibility was strong, as would be expected (r = .99), but PPI-R Self-centered Impulsivity was also
Table 2. Loadings and standard errors (SE) of confirmatory factor analysis for PPI-R and YPI

                                Factor loading    SE
PPI-R SCI
  Blame externalization              .34          .06
  Machiavellian egocentricity        .48          .05
  Carefree nonplanfulness            .47          .05
  Rebellious nonconformity           .88          .03
PPI-R FD
  Fearlessness                       .60          .07
  Social influence                   .61          .07
  Stress immunity                    .57          .07
YPI GM
  Dishonest charm                    .84          .03
  Grandiosity                        .57          .05
  Lying                              .64          .04
  Manipulation                       .88          .02
YPI CU
  Callousness                        .70          .04
  Remorselessness                    .78          .04
  Un-emotionality                    .81          .04
YPI II
  Thrill-seeking                     .88          .02
  Impulsiveness                      .73          .03
  Irresponsibility                   .55          .05

Notes. PPI-R SCI = self-centered impulsivity, PPI-R FD = fearless dominance, PPI-R CO = coldheartedness, YPI GM = grandiose manipulative, YPI CU = callous unemotional, YPI II = impulsive irresponsible.
moderately correlated with YPI Callous-Unemotional (r = .61). YPI Grandiose-Manipulative was most highly correlated with PPI-R Self-centered Impulsivity (r = .77) and YPI Impulsive-Irresponsibility (r = .71), but was also correlated with YPI Callous-Unemotional (r = .62). PPI-R Fearless Dominance showed the weakest correlations; it was weakly to moderately correlated with YPI Grandiose-Manipulative (r = .48), YPI Callous-Unemotional (r = .50), YPI Impulsive-Irresponsibility (r = .31), and PPI-R Self-centered Impulsivity (r = .43). In the CFA that included both the YPI and PPI-R, the YPI showed stronger psychometric properties than the PPI-R, yet including all subscales in one CFA resulted in a poor fit to the data overall.
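For reference, the measurement model just described (five correlated factors, with Coldheartedness omitted, and the subscales listed in Table 2 as indicators) can be written compactly in lavaan-style syntax. The sketch below is only an illustration with placeholder variable names (the original model was estimated in Mplus); the commented lines show how a Python SEM package such as semopy could, under that assumption, be used to fit it.

# Lavaan-style specification of the combined YPI/PPI-R CFA described above.
# Subscale variable names are hypothetical placeholders for scored subscales.
COMBINED_CFA = """
SCI =~ blame_externalization + machiavellian_egocentricity + carefree_nonplanfulness + rebellious_nonconformity
FD  =~ fearlessness + social_influence + stress_immunity
GM  =~ dishonest_charm + grandiosity + lying + manipulation
CU  =~ callousness + remorselessness + unemotionality
II  =~ thrill_seeking + impulsiveness + irresponsibility
"""

# Assumed (not verified against the original analysis) estimation sketch:
# import semopy
# model = semopy.Model(COMBINED_CFA)
# model.fit(subscale_scores)           # subscale_scores: pandas DataFrame
# print(semopy.calc_stats(model))      # practical fit indices (CFI, TLI, RMSEA, ...)

print(COMBINED_CFA)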
Self-Reported Delinquency

Participants reported delinquency in five categories: violence, burglary, drug use, property damage, and minor delinquency (such as driving without a permit and calling the police for no reason). The mean categorical frequency scores were highest for the minor delinquency scale (M = 4.49, SD = 3.91, range: 0–18) and lowest for the property damage scale (M = 2.07, SD = 3.15, range: 0–18).
Table 3. Zero-order correlations between main study variables

                    1        2        3        4        5        6        7        8        9        10       11
1  Del              –
2  Age              .02      –
3  Sex              .29**    .05      –
4  PPI-R SCI        .56***   .10      .17***   –
5  PPI-R FD         .13*     .004     .26***   .01      –
6  PPI-R CO         .29***   .01      .28***   .20***   .17**    –
7  PPI-R MEAN       .56***   .08      .32***   .83***   .52***   .47***   –
8  YPI CU           .31***   .001     .30***   .38***   .23***   .59***   .54***   –
9  YPI II           .50***   .07      .14*     .72***   .05      .14**    .62***   .46***   –
10 YPI GM           .36***   .05      .24***   .56***   .29***   .25***   .63***   .49***   .59***   –
11 YPI MEAN         .47***   .05      .28***   .67***   .25***   .39***   .73***   .77***   .82***   .88***   –

Notes. Del = delinquency, PPI-R SCI = self-centered impulsivity, PPI-R FD = fearless dominance, PPI-R CO = coldheartedness, PPI-R MEAN = PPI-R mean score, YPI CU = callous unemotional, YPI II = impulsive irresponsible, YPI GM = grandiose-manipulative, YPI MEAN = YPI mean score. *p < .05, **p < .01, ***p < .001.
Violent behavior was reported with a mean of 2.17 (SD = 2.89, range: 0–14), drug use with a mean of 2.21 (SD = 3.66, range: 0–15), and burglary with a mean of 2.34 (SD = 2.86, range: 0–14). The mean delinquency sum was 8.53 (SD = 9.92), and male participants reported more criminal behavior across all categories compared to female participants, ts(306) = 2.99–6.02, ps ≤ .003 (see Table 1). The overall delinquency score was correlated with gender (male participants reporting higher rates of delinquent behavior), with the PPI-R factors CO and SCI, but not with the FD factor, as well as with all three domains of the YPI (see Table 3).
The Construct Validity of the PPI-R in Predicting Delinquency Over and Above the YPI

Table 3 reports the correlations between the psychopathy scales (created by summing items) and the covariates. The YPI CU and GM domains were positively correlated with all three PPI-R factors, while the YPI II domain was correlated only with the PPI-R SCI factor. PPI-R SCI correlated most highly with the II domain of the YPI; FD of the PPI-R showed small correlations with both the CU and GM domains of the YPI. Coldheartedness showed the highest correlation with YPI CU. We conducted a regression in which the three delinquency variables (three latent count variables and three latent binary variables) were regressed on the three observed PPI-R dimensions. Only age and sex were included in the first step of evaluation of the model. Sex was a significant statistical predictor of all delinquency outcomes, except for the binary drug use variable, suggesting sex did not differentiate those who, regardless of level of use, did or did not use drugs. However, males trended toward greater delinquency across the count
variable of drug use as well as violent delinquency and property delinquency (estimates ranging from 0.52 to 2.08, SEs ranging from .15 to .39). Younger age was associated with a greater tendency to report more incidences of violent delinquency, estimate = −.06, SE = .03, 95% CI [−.11, −.01]. Including the PPI-R dimensions in the second step of the model specification significantly improved model fit, Satorra-Bentler Δχ2(18) = 227.19, p < .001. The proportion of residual variance explained including the PPI-R ranged from 46% for violence to 18% for drug use (37% for property crime). Thus, including psychopathy as measured by the PPI-R was a significant and meaningful addition to the model. As shown in Table 4, Self-centered Impulsivity significantly predicted all forms of delinquency, both count and binary variables, suggesting that greater Self-centered Impulsivity resulted in reports of engaging in delinquency at all (binary variables) and of engaging in more incidences (count variables). Fearless Dominance and Coldheartedness were related to non-violent forms of delinquency. Fearless Dominance significantly predicted reporting any engagement in property crime and reporting any drug use, and Coldheartedness predicted greater levels (count) of property crime and reported drug use. Adding the YPI subscales to the original model (including sex and age) significantly improved model fit, Satorra-Bentler Δχ2(18) = 179.28, p < .001 (see Table 5). Although age statistically predicted property delinquency as above, sex predicted violence (both count and binary) and the count measures of property crime and drug use. However, the proportion of residual variance explained including the YPI domains ranged from 44% for violence to 23% for drug use (and 27% for property crime). Although these effects were lower (in absolute terms) than what was found for the PPI-R, they were significant. The variance explained by all the variables in the model was similar to that using the PPI-R: R2 ranged from .23 to .44 for the YPI compared to the range of .18 to .46 in the PPI-R regression model.
Table 4. Results of negative binomial regressions with count (left of slash) and binary (right of slash) latent variables created from delinquency subscales and predicted from the PPI-R dimensions.
[Estimates, standard errors, and 95% CIs for age, sex, PPI-R SCI, PPI-R CO, and PPI-R FD predicting violence (R2 = .46), property crime (R2 = .37), and drug use (R2 = .18).]
Notes. PPI-R CO = coldheartedness, PPI-R SCI = self-centered impulsivity, PPI-R FD = fearless dominance; binary latent variables are coded inversely. *p < .05.

Table 5. Results of negative binomial regressions with count (left of slash) and binary (right of slash) latent variables created from delinquency subscales and predicted from the YPI subscales.
[Estimates, standard errors, and 95% CIs for age, sex, YPI CU, YPI II, and YPI GM predicting violence (R2 = .44), property crime (R2 = .27), and drug use (R2 = .23).]
Notes. YPI CU = callous unemotional, YPI II = impulsive irresponsible, YPI GM = grandiose-manipulative; binary latent variables are coded inversely. *p < .05.
Figure 1. Results of negative binomial regressions with count and binary (“bin”) latent variables created from delinquency subscales and predicted from both the YPI and PPI subscales. YPI CU = callous unemotional, YPI II = impulsive irresponsible, YPI GM = grandiose-manipulative, PPI-R SCI = self-centered impulsivity, PPI-R CO = coldheartedness, PPI-R FD = fearless dominance; binary latent variables are coded inversely.
Mirroring the findings for the PPI-R, the YPI impulsive-irresponsibility scale was a significant predictor of all delinquency measures, both count and binary, except for the count variable of drug use. No other subscales were significant predictors. Thus, across both measures, unique variance was accounted for by the subscales of psychopathy related to impulsivity, with a few exceptions using the PPI-R. Finally, we examined whether a model with both inventories was better than any single inventory at predicting delinquency. Including both the YPI and the PPI significantly improved fit beyond the PPI alone, Satorra-Bentler Δχ2(18) = 49.681, p < .001, and beyond the YPI alone, Satorra-Bentler Δχ2(18) = 80.402, p < .001. Figure 1 shows the significant unstandardized regression estimates. YPI impulsive-irresponsibility continued to predict many delinquency measures, except violence. Coldheartedness positively predicted violence, and Self-centered Impulsivity was related to a greater likelihood of engaging in violence at all (i.e., it was a negative predictor of the binary violence measure). Fearless Dominance only negatively predicted property delinquency. Thus, the impulsivity factors of the YPI and the PPI are doing the heavy lifting when statistically predicting delinquency.
Discussion

We investigated the construct and criterion validity of two forms of psychopathy self-report assessment measures, the PPI-R and the YPI, which – to our knowledge – have not been previously compared in youths. Despite expected correlations between YPI domains and PPI-R factors, we found a better model fit for the YPI factor specification compared to the PPI-R, but an overall moderate model fit for both measures. However, investigating relations with the construct of delinquency, we found surprisingly similar results across the PPI-R and the YPI. Both the PPI-R and the YPI explained significant variance in delinquency within our community sample. This is important, as it establishes the validity of questionnaires designed for youths and adults, and both predict delinquency. Yet, since these measures differ in their relative focus on personality traits and behavior (e.g., impulsivity), we cannot say whether the variance accounted for by both the PPI-R and the YPI is due to their developmental focus or to their relative focus on personality versus behavior. At least we can suggest that, although the PPI-R self-centered impulsivity and YPI impulsive-irresponsibility factors overlapped considerably, each
had unique relations with delinquency in the final model. Examining the developmental progression of impulsive psychopathic personality would be an important next step to explore in future studies, given the need to identify assessments that can be administered across samples as they age to inform developmental psychopathology. In the separate statistical predictive models, we found a strong association between self-reported violent delinquency and self-centered impulsivity (PPI-R) as well as impulsive-irresponsibility (YPI), with 46% and 44% variance explained, respectively; weaker relations were shown for property crime and drug abuse. Fearless dominance and coldheartedness (both from the PPI-R) as well as the callous-unemotional and grandiose-manipulative factors (both YPI), however, were less predictive for all three crime categories. Examining two self-report measures of psychopathy that differed in their relative focus on personality and behavior, we found that most of the variance accounted for in statistically predicting delinquency was due to the impulsive, irresponsible, thrill-seeking, and self-centered impulsivity factors of both measures. For instance, callous-unemotional-coldheartedness traits and grandiose-manipulative traits did not predict delinquency over and above the impulsivity-dominant dimensions. This is contrary to findings by Ansel, Barry, Gillen, and Herrington (2014), as we did not find a strong relation between the fearless and unemotional aspects of psychopathic traits and self-reported violent delinquency. Although diverging from the findings of this recent study by Ansel et al. (2014), our findings match previous results from delinquent samples, in which the impulsive and behavioral traits of psychopathy were most strongly related to criminal behavior (e.g., Vaughn, Edens, Howard, & Smith, 2009). Also, our findings are consistent with Muñoz, Kerr, and Besic (2008), who found the impulsive dimension of the Antisocial Process Screening Device to be most associated with aggression and conduct problems, at least concurrently. Interestingly, both the YPI and the PPI-R together improve the statistical prediction of delinquency beyond either single measure alone. However, future research should also investigate the predictive validity of the PPI-R and YPI, as the two measures might differ in a prospective design. People who were more delinquent endorsed impulsivity-related items, whether reported on the PPI-R or the YPI, yet the PPI-R showed many more associations with delinquency than was true for the YPI dimensions. The PPI-R is a measure of psychopathy designed to tap the personality-based descriptions of Cleckley (1941) while deemphasizing the role of antisocial behavior, which was seen as a byproduct of the callous, cold, manipulative, and self-centered traits related to psychopathy. Despite the strong and unique association between impulsivity and delinquency, people who were higher on Fearless dominance (measured with the PPI-R) endorsed being delinquent rather than not
(binary measure). Fearlessness, then, may relate to being willing to engage in any delinquency at all in terms of a lower behavioral threshold, while Coldheartedness may relate to a greater engagement in delinquency in terms of a higher number of delinquent and violent activities (the latter shown in the combined predictive model). In terms of the structural validity of the two measures, the CFA-derived model including both measures provided only poor model fit. Although the reliabilities of the dimensions, subscales, and total scores were high, some subscales did not show high factor loadings on their respective dimensions; for example, Blame Externalization, Machiavellian Egocentricity, and Carefree Nonplanfulness were poorly related to Self-centered Impulsivity. This was mainly the case for the subscales of the PPI-R and to a lesser extent for the YPI, which showed a good factor-analytic fit.
Limitations

One major limitation of the present study is the use of self-reported delinquency with a new measure. Including third-party information on the behavior of the young adults would improve the relevance and the interpretability of these findings (Falkenbach, Poythress, & Heide, 2003; Roose, Bijttebier, Claes, Decoene, & Frick, 2010). As instrumental violence has been specifically linked to callous-unemotional traits (Fanti, Demetriou, & Kimonis, 2013; White, Gordon, & Guerra, 2015), it may be useful to examine the motivations for delinquency (as has been done for reactive and proactive aggression), as motivations could be found to be premeditated or proactive in people with high psychopathic traits. In addition, our small sample was not representative and included inconsistent responders (18%), as has been found in other studies (Sorman et al., 2016); this suggests, however, that this juvenile sample might be specifically prone to inconsistent responding, as the rate is higher compared to studies in adults (e.g., 5.33% in Uzieblo et al., 2010).
Summary

In sum, youths' psychopathic traits as reported with both the YPI and the PPI-R – and, importantly, their dimensions – reflect different correlates of psychopathic personality not only in adults but also in younger adults; thus, the present study adds to the support for the downward extension of psychopathy, including the PPI-R (Forth, Hart, & Hare, 1990). Young adults higher in psychopathic traits also reported engaging in delinquent activities, including violence. The PPI-R was as good as the YPI in robustly explaining delinquency. Thus, people who exhibit the personality dimensions related to psychopathy, including fearlessness,
coldhearted behavior, failing to accept blame for one's actions, and being carefree and rebellious report engaging in decision making that results in delinquency and violence.

Electronic Supplementary Materials

The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000478
ESM 1. Data (.sav): YouthPPI_YPI_EJPA_data.
ESM 2. Data (.spv): SPSS output for descriptive results.
ESM 3. Data (.spv): SPSS output for correlations.
ESM 4. Table (.csv): Spreadsheet containing all YPI and PPI subscales as basis for CFA and models.
ESM 5. Table (.csv): Spreadsheet containing all YPI and PPI outliers.
ESM 6. Data (.out): Mplus output file for the PPI CFA.
ESM 7. Data (.out): Mplus output file for the YPI CFA.
ESM 8. Data (.out): Mplus output file for the overall CFA.
ESM 9. Data (.out): Mplus output file for the model predicting delinquency by PPI and YPI.
ESM 10. Data (.out): Mplus output file for the model predicting delinquency by PPI.
ESM 11. Data (.out): Mplus output file for the model predicting delinquency by YPI.
References Alpers, G. W., & Eisenbarth, H. (2008). Psychopathy Personality Inventory Revised - Deutschsprachige Version. Testhandbuch [Psychopathy Personality Inventory Revised, German edition, Manual]. Göttingen, Germany: Hogrefe. Altmann, T., Liebe, N., Schönefeld, V., & Roth, M. (2017). The measure matters: Similarities and Differences of the five most important Sensation Seeking Inventories in an adolescent sample. European Journal of Psychological Assessment. Advance online publication. https://doi.org/10.1027/1015-5759/a000398 Andershed, H., Kerr, M., & Stattin, H. (2002). Understanding the abnormal by studying the normal. Acta Psychiatrica Scandinavica, 106, 75–80. https://doi.org/10.1034/j.1600-0447.106.s412.17.x Andershed, H., Kerr, M., Stattin, H., & Levander, S. (2002). Psychopathic traits in nonreferred youths: A new assessment tool. In E. Blaauw & L. Sheridan (Eds.), Psychopaths: Current International Perspectives (pp. 131–158). The Hague, The Netherlands: Elsevier. Ansel, L., Barry, C., Gillen, C. A., & Herrington, L. (2014). An analysis of four self-report measures of adolescent callousunemotional traits: Exploring unique prediction of delinquency,
aggression, and conduct problems. Journal of Psychopathology and Behavioral Assessment, 37, 207–216. https://doi.org/ 10.1007/s10862-014-9460-z Benning, S. D., Patrick, C. J., Hicks, B. M., Blonigen, D. M., & Krueger, R. F. (2003). Factor structure of the Psychopathic Personality Inventory: Validity and implications for clinical assessment. Psychological Assessment, 15, 340–350. https:// doi.org/10.1037/1040-3590.15.3.340 Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246. https://doi.org/ 10.1037/0033-2909.107.2.238 Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness-of-fit in the analysis of covariance structures. Psychological Bulletin, 88, 588–606. https://doi.org/10.1037/00332909.107.2.238 Brazil, I. A., van Dongen, J. D., Maes, J. H., Mars, R. B., & BaskinSommers, A. R. (in press). Classification and treatment of antisocial individuals: From behavior to biocognition. Neuroscience and Biobehavioral Reviews. https://doi.org/10.1016/j. neubiorev.2016.10.010 Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing fit. In K. A. Bollen & J. S. Long (Eds.), Testing Structural Equation Models (pp. 136–162). Beverly Hills, CA: Sage. Chabrol, H., Leeuwen, N. V., Rodgers, R., & Séjourné, N. (2009). Contributions of psychopathic, narcissistic, Machiavellian, and sadistic personality traits to juvenile delinquency. Personality and Individual Differences, 47, 734–739. https://doi.org/ 10.1016/j.paid.2009.06.020 Chauhan, P., Ragbeer, S. N., Burnette, M. L., Oudekerk, B., Reppucci, N. D., & Moretti, M. M. (2012). Comparing the Youth Psychopathic Traits Inventory (YPI) and the Psychopathy Checklist-Youth Version (PCL-YV) Among Offending Girls. Assessment, 21, 181–194. https://doi.org/10.1177/1073191112460271 Cleckley, H. M. (1941). The mask of sanity. St. Louis, MO: Mosby. Declercq, F., Markey, S., Vandist, K., & Verhaeghe, P. (2009). The Youth Psychopathic Trait Inventory: Factor structure and antisocial behaviour in non-referred 12–17-year-olds. Journal of Forensic Psychiatry & Psychology, 20, 577–594. https://doi. org/10.1080/14789940802651757 DeLisi, M., Angton, A., Vaughn, M. G., Trulson, C. R., Caudill, J. W., & Beaver, K. M. (2013). Not my fault: Blame externalization is the psychopathic feature most associated with pathological delinquency among confined delinquents. International Journal of Offender Therapy and Comparative Criminology. https://doi. org/10.1177/0306624x13496543 Eisenbarth, H., & Alpers, G. W. (2015). Diagnostik psychopathischer Persönlichkeitszüge bei Straftätern: Interne Konsistenz und differenzielle Validität der deutschen Version des PPI-R im Maßregel- und Strafvollzug [Diagnostics of psychopathic traits in offenders: internal consistency and differential validity of the PPI-R for forensic patients and prisoners]. Zeitschrift für Klinische Psychologie und Psychotherapie, 44, 45–53. https:// doi.org/10.1026/1616-3443/a000286 Falkenbach, D. M., Poythress, N. G., & Heide, K. M. (2003). Psychopathic features in a juvenile diversion population: Reliability and predictive validity of two self-report measures. Behavioral Sciences & the Law, 21, 787–805. https://doi.org/ 10.1002/bsl.562 Fanti, K. A., Demetriou, C. A., & Kimonis, E. R. (2013). Variants of callous-unemotional conduct problems in a community sample of adolescents. Journal of Youth and Adolescence, 42, 964–979. https://doi.org/10.1007/s10964-013-9958-9 Forth, A. E., Hart, S. D., & Hare, R. D. (1990). 
Assessment of psychopathy in male young offenders. Psychological Assessment: A Journal of Consulting and Clinical Psychology, 2, 342– 344. https://doi.org/10.1037/1040-3590.2.3.342
Ó 2018 Hogrefe Publishing
H. Eisenbarth & Luna C. M. Centifanti, Psychopathy Dimensions in Young Adults
Fox, B. H., Jennings, W. G., & Farrington, D. P. (2015). Bringing psychopathy into developmental and life-course criminology theories and research. Journal of Criminal Justice, 43, 274–289. https://doi.org/10.1016/j.jcrimjus.2015.06.003 Hare, R. D. (2003). Manual for The Hare Psychopathy ChecklistRevised (2nd ed.). Toronto, Canada: Multi-Health Systems. Heinzen, H., Köhler, D., & Hinrichs, G. (2008, July). Reliability and Validity of the German Youth-Psychopathic-Traits-Inventory (YPI). Paper presented at the Conference Research in Forensic Psychiatry, Regensburg, Germany. Hillege, S., Das, J., & de Ruiter, C. (2010). The youth psychopathic traits inventory: Psycho- metric properties and its relation to substance use and interpersonal style in a Dutch sample of non-referred adolescents. Journal of Adolescence, 33, 83–91. https://doi.org/10.1016/j.adolescence.2009.05.006 Lewand, M. (2003). Fragebogen zur Legalbewährung Questionnaire. Psychology. Würzburg, Germany: University of Würzburg. Lilienfeld, S. O., Watts, A. L., Francis Smith, S., Berg, J. M., & Latzman, R. D. (2015). Psychopathy deconstructed and reconstructed: Identifying and assembling the personality building blocks of Cleckley’s chimera. Journal of Personality, 83, 593– 610. https://doi.org/10.1111/jopy.12118 Lilienfeld, S. O., & Widows, M. R. (2005). Psychopathy Personality Inventory Revised (PPI-R). Professional manual. Lutz, FL: Psychological Assessment Resources. Muñoz, L. C., Kerr, M., & Besic, N. (2008). The peer relationships of youths with psychopathic personality traits: A matter of perspective. Criminal Justice and Behavior, 35, 212–227. https://doi.org/10.1177/0093854807310159 Muthen, L., & Muthen, B. (2010). Mplus user’s guide (6th ed.). Los Angeles, CA: Muthen & Muthen. Neumann, C. S., Kosson, D. S., Forth, A. E., & Hare, R. D. (2006). Factor structure of the Hare Psychopathy Checklist: Youth Version (PCL: YV) in incarcerated adolescents. Psychological Assessment, 18, 142–154. https://doi.org/10.1037/1040-3590. 18.2.142 Neumann, C. S., & Pardini, D. (2014). Factor structure and construct validity of the Self-Report Psychopathy (SRP) Scale and the Youth Psychopathic Traits Inventory (YPI) in young men. Journal of Personality Disorders, 28, 419–433. https://doi.org/ 10.1521/pedi_2012_26_063 Poythress, N. G., Dembo, R., Wareham, J., & Greenbaum, P. E. (2006). Construct validity of the Youth Psychopathic Traits Inventory (YPI) and the Antisocial Process Screening Device (APSD) with justice-involved adolescents. American Association for Correctional and Forensic Psychology, 33, 26–55. https:// doi.org/10.1037/0021-843X.115.2.288 Roose, A., Bijttebier, P., Claes, L., Decoene, S., & Frick, P. J. (2010). Assessing the affective features of psychopathy in adolescence: A further validation of the inventory of callous and unemotional traits. Assessment, 17, 44–57. https://doi.org/ 10.1177/1073191109344153 Ross, S. R., Benning, S. D., Patrick, C. J., Thompson, A., & Thurston, A. (2009). Factors of the Psychopathic Personality Inventory: Criterion-related validity and relationship to the BIS/ BAS and five-factor models of personality. Assessment, 16, 71–87. https://doi.org/10.1177/1073191108322207 Salekin, R. T., & Frick, P. J. (2005). Psychopathy in children and adolescents: The need for a developmental perspective.
Ó 2018 Hogrefe Publishing
11
Journal of Abnormal Child Psychology, 33, 403–409. https://doi. org/10.1007/s10802-005-5722-2 Salihovic, S., Kerr, M., & Stattin, H. (2014). Under the surface of adolescent psychopathic traits: High-anxious and low-anxious subgroups in a community sample of youths. Journal of Adolescence, 37, 681–689. https://doi.org/10.1016/j.adolescence. 2014.03.002 Sherman, E. D., Lynam, D. R., & Heyde, B. (2014). Agreeableness accounts for the factor structure of the Youth Psychopathic Traits Inventory. Journal of Personality Disorders, 28, 262–280. https://doi.org/10.1521/pedi_2013_27_124 Sorman, K., Nilsonne, G., Howner, K., Tamm, S., Caman, S., Wang, H. X., . . . Kristiansson, M. (2016). Reliability and construct validity of the psychopathic personality inventory-revised in a Swedish non-criminal sample – A multimethod approach including psychophysiological correlates of empathy for pain. PLoS One, 11, e0156570. https://doi.org/10.1371/journal.pone.0156570 Taubner, S., White, L. O., Zimmermann, J., Fonagy, P., & Nolte, T. (2013). Attachment-related mentalization moderates the relationship between psychopathic traits and proactive aggression in adolescence. Journal of Abnormal Child Psychology, 41, 929– 938. https://doi.org/10.1007/s10802-013-9736-x Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–10. https://doi.org/10.1007/BF02291170 Uzieblo, K., Verschuere, B., Van den Bussche, E., & Crombez, G. (2010). The validity of the Psychopathic Personality Inventory – Revised in a community sample. Assessment, 17, 334–346. https://doi.org/10.1177/1073191109356544 Vaughn, M. G., Edens, J. F., Howard, M. O., & Smith, S. T. (2009). An investigation of primary and secondary psychopathy in a statewide sample of incarcerated youth. Youth Violence and Juvenile Justice, 7, 172–188. https://doi.org/10.1177/ 1541204009333792 Vaughn, M. G., Litschge, C., Delisi, M., & Beaver, K. M. (2008). Psychopathic personality features and risks for criminal justice system involvement among emancipating forster youth. Children and Youth Services Review, 30, 1101–1110. https://doi. org/10.1016/j.childyouth.2008.02.001 White, B. A., Gordon, H., & Guerra, R. C. (2015). Callous – unemotional traits and empathy in proactive and reactive relational aggression in young women. Personality and Individual Differences, 75, 185–189. https://doi.org/10.1016/j.paid.2014.11.031
Received January 16, 2017
Revision received December 13, 2017
Accepted December 13, 2017
Published online August 3, 2018
EJPA Section/Category: Personality

Hedwig Eisenbarth
Department of Psychology
University of Southampton
University Road
Southampton, SO17 1BJ
United Kingdom
h.eisenbarth@soton.ac.uk
Original Article
Longitudinal Measurement Invariance of the Brief Symptom Inventory (BSI)-18 in Psychotherapy Patients
Ruth von Brachel,1 Angela Bieda,2 Jürgen Margraf,1 and Gerrit Hirschfeld2
1 Mental Health Research & Treatment Center of Ruhr-University Bochum, Germany
2 Faculty of Business Management and Social Sciences, University of Applied Sciences Osnabrück, Germany
Abstract: The Brief Symptom Inventory (BSI)-18 is a widely-used tool to assess changes in general distress in patients despite an ongoing debate about its factorial structure and lack of evidence for longitudinal measurement invariance (LMI). We investigated BSI-18 scores from 1,081 patients from an outpatient clinic collected after the 2nd, 6th, 10th, 18th, and 26th therapy session. Confirmatory factor analysis (CFA) was used to compare models comprising one, three, and four latent dimensions that were proposed in the literature. LMI was investigated using a series of model comparisons, based on chi-square tests, effect sizes, and changes in comparative fit index (CFI). Psychological distress diminished over the course of therapy. A four-factor structure (depression, somatic symptoms, generalized anxiety, and panic) showed the best fit to the data at all measurement occasions. The series of model comparisons showed that constraining parameters to be equal across time resulted in very small decreases in model fit that did not exceed the cutoff for the assumption of measurement invariance. Our results show that the BSI-18 is best conceptualized as a four-dimensional tool that exhibits strict longitudinal measurement invariance. Clinicians and applied researchers do not have to be concerned about the interpretation of mean differences over time. Keywords: Brief Symptom Inventory (BSI)-18, confirmatory factor analysis, longitudinal measurement invariance, psychological distress
The main goal of psychotherapy and psychiatry is to reduce psychological distress and to improve mental health. Physicians use measures of psychological distress to identify patients in need of psychological interventions, therapists measure psychological distress during treatment to monitor their patients’ progress, and researchers measure the development of psychological symptoms over time as a crucial aspect of any clinical trial on psychological interventions. In addition to scales tailored to specific conditions or symptoms, several broad symptom scales have been developed to cover a wider range of symptom domains. The Brief Symptom Inventory (BSI; Derogatis & Melisaratos, 1983) and its short forms (Prinz et al., 2013) measure the overall distress of a person as well as individual symptoms such as anxiety, depression, and somatization. The BSI may be used to identify those medical patients who also need psychological help (Zabora et al., 2001), to monitor progress during therapy (Geisheim et al., 2002), or to measure outcomes in clinical studies (Piersma, Reaume, & Boes, 1994). Since monitoring progress and measuring outcomes involve
comparisons across time, it is vital to establish longitudinal measurement invariance (LMI). The aims of the present study were twofold: first, to investigate the factorial structure of the BSI-18 in a large sample of outpatients and, second, to test its LMI. There is an ongoing debate about the underlying structure of the BSI and its short form, the BSI-18 (for reviews, see Loutsiou-Ladd, Panayiotou, & Kokkinos, 2008; Prinz et al., 2013). The BSI-18 was developed to assess three central symptom domains (somatization, depression, and anxiety) with six items each (Derogatis, 2000). Several studies conclude that these three scales have good psychometric properties and that responses conform to this factorial structure (Abraham, Gruber-Baldini, Harrington, & Shulman, 2017; Derogatis, 2000; Durá et al., 2006; Galdón et al., 2008; Petkus et al., 2010; Prinz et al., 2013; Recklitis et al., 2006; Spitzer et al., 2011; Torres, Miller, & Moore, 2013; Wang et al., 2010; Wiesner et al., 2010; Zabora et al., 2001). There are, however, also authors suggesting that this measure assesses a single “psychological distress” factor (Asner-Self,
Schreiber, & Marotta, 2006; Prelow, Weaver, Swenson, & Bowman, 2005), and other authors claiming that the BSI-18 is best conceptualized as measuring four factors (Andreu et al., 2008; Zabora et al., 2001). The four-factor models retain the somatization and depression factors but split the anxiety scale into two factors, that is, generalized anxiety and panic, each assessed by three items. Resolving this issue has important consequences for the interpretation of scale scores and is an important prerequisite for testing LMI. Our second aim is to test the LMI of the BSI-18. LMI is a special case of measurement invariance testing. While measurement invariance tries to answer the question of whether a specific scale has the same meaning for different groups of participants, LMI tries to answer the question of whether a scale has the same meaning over time (Widaman, Ferrer, & Conger, 2010). Even though LMI seems to be a fundamental property of psychological scales, it is rarely tested (Borsboom, 2006). Several recent studies found that frequently used depression scales such as the Beck Depression Inventory and the Hamilton Rating Scale for Depression lack LMI (Fokkema, Smits, Kelderman, & Cuijpers, 2013; Fried et al., 2016; Wu, 2016), while the Center for Epidemiologic Studies Depression Scale exhibits LMI (Ferro & Speechley, 2013). Thus, LMI has to be established for each individual scale in sufficiently large samples.
Materials and Methods
Participants
Participants in this study were outpatients at the Mental Health Research and Treatment Center at the psychology department of the Ruhr-University Bochum between 1990 and 2012, when the BSI was removed as a standard assessment. All therapists at this facility had at least a Master’s degree in psychology and at least one year of practical training in cognitive-behavioral therapy (CBT). All participants were undergoing CBT for a variety of diagnoses. Participants were asked to fill out questionnaires assessing their general mental health, their symptoms, and their satisfaction with their therapy at different points in treatment as part of routine diagnostic sessions. These sessions were scheduled after the 2nd, 6th, 10th, 18th, and 26th session. This study was approved by the local ethics committee under the number 318.
The Brief Symptom Inventory-18
The BSI (Derogatis & Melisaratos, 1983) in its original format measures symptom severity in nine different domains
(Somatization, Obsession-Compulsion, Interpersonal Sensitivity, Depression, Anxiety, Hostility, Paranoid Ideation, Phobic Anxiety, and Psychoticism). There is a variety of short forms, with the 18-item version (Derogatis, 2000) showing the best psychometric properties (Prinz et al., 2013). The BSI-18 items are rated on a 5-point Likert scale; participants indicate their agreement from 0 = not at all to 4 = extremely. The psychometric properties of the German version of the BSI-18 are good, with internal consistencies ranging from .63 to .93 and item-total correlations of .40. It also correlates moderately to highly with other measures of symptom severity (Prinz et al., 2013; Spitzer et al., 2011).
Data Analysis
Data were analyzed in three steps. First, we inspected individual items over time. Second, we used confirmatory factor analysis (CFA) to decide on the model that should be tested for measurement invariance. Specifically, we tested – for the five measurement occasions separately – the three different models that are discussed in the literature. In all models, each item loaded on only one latent variable and latent variables were correlated with one another. In keeping with studies into the factor structure of the BSI-18 (Recklitis et al., 2006; Torres et al., 2013; Wang et al., 2010; Wiesner et al., 2010), model parameters were estimated using maximum likelihood (ML) with robust standard errors to account for non-normality. We used conventional fit indices to assess overall model fit (Hu & Bentler, 1999) and the Akaike Information Criterion (AIC) to compare models at the individual time points, and we report standardized loadings and covariances. Third, we systematically assessed LMI by fitting a series of increasingly restricted models to the data. The baseline (configural) model for this analysis was based on the parameterization developed by Widaman and colleagues (2010) for testing LMI. In this model, 20 latent variables (four at each of the five measurement occasions) were used to model the responses to the 90 observed variables (18 items at 5 measurement occasions). This parameterization entails, first, setting the means of the latent variables at the first measurement occasion to 0 and their variances to 1. Second, the loadings of the first items on each latent variable were freely estimated and the corresponding loadings of the first items were constrained to be equal across time. Third, the intercepts of the first items on each latent variable were estimated freely but the corresponding intercepts were constrained to be equal across time. The other parameters, that is, intercepts and variances of the latent variables, covariances among latent variables assessed at the same measurement occasion, and covariances between items at different measurement occasions, were
estimated freely. The weak invariance model added across-time invariance constraints on the remaining loadings. Since 4 of the 18 loadings were already constrained in the baseline model, this yielded 56 ((18 – 4) × 4) degrees of freedom. The strong invariance model added across-time invariance constraints on the item intercepts. Again, since 4 of the 18 item intercepts were already constrained, this yielded 56 degrees of freedom. Lastly, the strict invariance model further added constraints on the residual variances of the items, again yielding 56 degrees of freedom. These different models were first compared using scaled chi-square tests (Satorra & Bentler, 2001). We also calculated the chi-square-based effect size w to describe the magnitude of the non-invariance. Because w is equivalent to Pearson’s correlation coefficient, it may be interpreted using the same conventions for small (w = .1), medium (w = .3), and large (w = .5) effect sizes (Newsom, 2015, p. 30). Since these chi-square-based statistics are sensitive to sample size (Cheung & Rensvold, 2002; Little, 1997), we also used changes in fit indices to describe model differences. Specifically, we interpreted decreases in CFI of less than .01 as indicating a similarly good model fit and thus support for the invariance assumption (Cheung & Rensvold, 2002). All analyses were performed in R using the lavaan package (Rosseel, 2012).
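To make the model series concrete, the following minimal sketch shows how such a comparison could be set up with the lavaan package mentioned above. It is not the authors' syntax: it uses one hypothetical factor measured by three hypothetical items (dep1_t1, ..., dep3_t2) at two occasions in a data frame bsi_long, whereas the actual analysis involved four factors, 18 items, five occasions, and a mean structure for the strong and strict models.

```r
# Minimal longitudinal invariance sketch with lavaan (hypothetical variable
# names; the real model used 4 factors x 18 items x 5 occasions).
library(lavaan)

# Configural model: same loading pattern at both occasions; the latent
# variance is fixed to 1 at occasion 1 and free at occasion 2; the loading
# of the first item (label l1) is freely estimated but equated across time,
# which sets the scale of the occasion-2 factor; residuals of the same item
# are allowed to covary across time.
configural <- '
  dep_t1 =~ NA*dep1_t1 + l1*dep1_t1 + dep2_t1 + dep3_t1
  dep_t2 =~ NA*dep1_t2 + l1*dep1_t2 + dep2_t2 + dep3_t2
  dep_t1 ~~ 1*dep_t1
  dep_t2 ~~ dep_t2
  dep_t1 ~~ dep_t2
  dep1_t1 ~~ dep1_t2
  dep2_t1 ~~ dep2_t2
  dep3_t1 ~~ dep3_t2
'

# Weak (metric) invariance: the labels l2 and l3 equate the remaining
# loadings across time as well.
weak <- '
  dep_t1 =~ NA*dep1_t1 + l1*dep1_t1 + l2*dep2_t1 + l3*dep3_t1
  dep_t2 =~ NA*dep1_t2 + l1*dep1_t2 + l2*dep2_t2 + l3*dep3_t2
  dep_t1 ~~ 1*dep_t1
  dep_t2 ~~ dep_t2
  dep_t1 ~~ dep_t2
  dep1_t1 ~~ dep1_t2
  dep2_t1 ~~ dep2_t2
  dep3_t1 ~~ dep3_t2
'

fit_configural <- cfa(configural, data = bsi_long, estimator = "MLR")
fit_weak       <- cfa(weak,       data = bsi_long, estimator = "MLR")

# Scaled chi-square difference test (Satorra & Bentler, 2001)
lavTestLRT(fit_configural, fit_weak)

# Change in CFI used as the invariance criterion (cutoff .01)
fitMeasures(fit_configural, "cfi") - fitMeasures(fit_weak, "cfi")
```

Strong and strict invariance would be obtained analogously by adding equality labels for the item intercepts and the residual variances.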
Results
Sample
Overall, 1,771 patients visited the clinic at least once and participated in diagnostic sessions during the study period. Of these, 590 patients took part in fewer than five diagnostic sessions, indicating short-term therapies or treatment termination. Of the 1,181 patients with at least five BSI scores, no demographic or clinical data could be found for 100, leaving 1,081 patients who were included in the analysis. There were no statistically significant differences in the overall BSI score at baseline between included and excluded participants. Participants were on average 37.07 years old (min = 18; max = 69; SD = 10.64) and mostly female (n = 632; 58%). The five most frequent first diagnoses according to Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) terminology were panic disorder (n = 187; 17%), social phobia (n = 168; 15%), major depression with a recurrent (n = 128; 12%) or a single episode (n = 76; 7%), and obsessive-compulsive disorder (n = 57; 5%).
Descriptive Statistics
As a first step, we calculated the individual item means for each measurement occasion. As can be seen in Table 1,
participants showed lower ratings in later compared with earlier sessions.
Structure of the BSI-18
To determine the structure of the BSI-18 and to yield a valid baseline model for invariance testing, we fitted the one-, three-, and four-dimensional models described above to each of the five measurement occasions separately. The fit indices were good for all models and showed that the best fitting model according to AIC was always the model with four factors. Model fit for the four-factor model was good for all measurement occasions (Table 2). All individual standardized loadings were substantial (> .30) and increased for most items from the first to the last measurement occasion (see Table S1 in the Electronic Supplementary Material, ESM 1). Correlations between factors were substantial in both the three-factor (.48 < r < .79) and the four-factor model (.48 < r < .83; see Tables S2 and S3 in ESM 1). This four-factor solution was also invariant across gender and age (see S4 in ESM 1).
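For a single measurement occasion, the comparison of the competing factor structures could look like the sketch below. The syntax is only illustrative: the indicator names (dep1–dep6, som1–som6, anx1–anx3, pan1–pan3) and the data frame bsi_t1 are placeholders, and the actual item-to-scale assignment should be taken from the BSI-18 manual.

```r
# Illustrative comparison of the three- and four-factor BSI-18 models at one
# occasion (placeholder indicator names; robust ML as in the paper).
library(lavaan)

four_factor <- '
  depression   =~ dep1 + dep2 + dep3 + dep4 + dep5 + dep6
  somatization =~ som1 + som2 + som3 + som4 + som5 + som6
  ganxiety     =~ anx1 + anx2 + anx3
  panic        =~ pan1 + pan2 + pan3
'

three_factor <- '
  depression   =~ dep1 + dep2 + dep3 + dep4 + dep5 + dep6
  somatization =~ som1 + som2 + som3 + som4 + som5 + som6
  anxiety      =~ anx1 + anx2 + anx3 + pan1 + pan2 + pan3
'

fit3 <- cfa(three_factor, data = bsi_t1, estimator = "MLR")
fit4 <- cfa(four_factor,  data = bsi_t1, estimator = "MLR")

# Fit indices and information criteria of the kind reported in Table 2
# (factors are correlated by default in cfa())
sapply(list(three = fit3, four = fit4), fitMeasures,
       fit.measures = c("chisq", "df", "cfi", "rmsea", "srmr", "aic"))
```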
Longitudinal Measurement Invariance
The series of model comparisons showed that adding constraints to the model yielded significantly decreased model fit as measured by chi-square tests (see Table 3); that is, constraining the loadings to be equal across time resulted in a significant decrease in model fit, Δχ²(56) = 455.8; p < .001; w = 0.087; ΔCFI = 0.0041. However, these decreases were all smaller than the threshold of ΔCFI < .01. Similarly, the comparisons for strong and strict invariance showed very small, albeit significant, differences between the constrained and unconstrained models.
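As a quick plausibility check (not part of the original paper), the w values in Table 3 can be recovered almost exactly from the reported Δχ² values if w is computed as the square root of Δχ² divided by N times Δdf; the formula itself is our assumption about how w was obtained, not a statement from the paper.

```r
# Recomputing the effect sizes w in Table 3 from the reported scaled
# chi-square differences, assuming w = sqrt(delta_chisq / (N * delta_df)).
N           <- 1081
delta_df    <- 56
delta_chisq <- c(weak = 455.80, strong = 242.25, strict = 713.47)
round(sqrt(delta_chisq / (N * delta_df)), 3)
#>   weak strong strict
#>  0.087  0.063  0.109   (reported in Table 3: 0.087, 0.064, 0.109)
```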
Discussion
In the present study, the factorial structure and LMI of the BSI-18 during psychotherapy were investigated in a large outpatient sample. We found that a model with four latent dimensions showed the best fit to the data for all five measurement occasions. The latent dimensions were depression, somatization, and two anxiety dimensions relating to panic and general anxiety. We also found that the BSI-18 exhibits strict longitudinal invariance. In the following, we will discuss each of these findings in turn, before describing general methodological aspects and limitations of the present study. An open issue concerning the factor structure of the BSI-18 is whether or not the anxiety factor needs to be split up into two factors (general anxiety and panic).
Table 1. Item means for all measurement occasions

Item                 T1           T2           T3           T4           T5
Nervousness          2.90 (1.20)  2.71 (1.12)  2.55 (1.10)  2.47 (1.08)  2.41 (1.08)
Scared               1.92 (1.15)  1.77 (1.04)  1.68 (0.94)  1.63 (0.92)  1.64 (0.92)
Lonely               2.85 (1.37)  2.50 (1.28)  2.41 (1.24)  2.31 (1.24)  2.28 (1.21)
Blue                 2.75 (1.25)  2.52 (1.17)  2.42 (1.17)  2.36 (1.21)  2.29 (1.15)
No interest          2.53 (1.34)  2.27 (1.23)  2.18 (1.23)  2.07 (1.20)  1.99 (1.13)
Fearful              2.44 (1.27)  2.24 (1.18)  2.13 (1.14)  2.05 (1.12)  2.00 (1.06)
Faintness            1.90 (1.09)  1.79 (1.00)  1.74 (0.96)  1.70 (0.96)  1.69 (0.93)
Nausea               2.08 (1.20)  2.03 (1.18)  1.98 (1.14)  1.91 (1.09)  1.87 (1.08)
Short of breath      1.89 (1.14)  1.77 (1.03)  1.74 (1.05)  1.69 (0.96)  1.68 (0.99)
Numb/tingling        1.76 (1.09)  1.66 (0.98)  1.69 (1.00)  1.66 (0.98)  1.66 (0.97)
Hopelessness         2.96 (1.39)  2.67 (1.32)  2.51 (1.27)  2.43 (1.27)  2.37 (1.26)
Body weakness        2.11 (1.20)  2.02 (1.17)  1.93 (1.14)  1.87 (1.10)  1.86 (1.09)
Tense                3.09 (1.25)  2.91 (1.18)  2.81 (1.15)  2.72 (1.13)  2.66 (1.15)
Panic episodes       2.32 (1.42)  1.97 (1.19)  1.89 (1.12)  1.76 (0.99)  1.78 (1.03)
Restlessness         2.23 (1.26)  2.13 (1.17)  2.11 (1.17)  2.02 (1.14)  1.97 (1.12)
Worthlessness        2.78 (1.45)  2.49 (1.36)  2.33 (1.33)  2.25 (1.31)  2.22 (1.28)
Chest pains          1.84 (1.10)  1.76 (1.01)  1.72 (0.99)  1.67 (0.92)  1.69 (0.97)
Suicidal thoughts    1.73 (1.10)  1.53 (0.92)  1.50 (0.88)  1.49 (0.86)  1.52 (0.91)

Note. Standard deviations in parentheses. T1–T5 = measurement occasions.
Table 2. Model fit indices at the five measurement occasions

Time  Factors  χ²         df   CFI    AIC         RMSEA  SRMR
1     One      1,842.351  135  0.724  58,051.912  0.120  0.089
1     Three      551.048  132  0.933  56,475.422  0.060  0.045
1     Four       482.162  129  0.943  56,397.638  0.055  0.043
2     One      1,680.569  135  0.766  53,671.268  0.119  0.079
2     Three      572.732  132  0.965  35,303.823  0.055  0.037
2     Four       456.591  129  0.951  52,038.380  0.055  0.038
3     One      1,522.678  135  0.788  52,296.144  0.115  0.077
3     Three      509.344  132  0.944  50,872.420  0.060  0.043
3     Four       435.645  129  0.955  50,778.330  0.054  0.041
4     One      1,518.242  135  0.803  50,542.507  0.114  0.075
4     Three      507.702  132  0.948  49,143.787  0.060  0.042
4     Four       417.124  129  0.960  49,024.902  0.053  0.040
5     One      1,383.608  135  0.807  49,982.899  0.113  0.074
5     Three      451.223  132  0.952  48,571.511  0.057  0.037
5     Four       383.586  129  0.962  48,478.828  0.051  0.036

Notes. df = degrees of freedom; CFI = Comparative Fit Index; AIC = Akaike Information Criterion; RMSEA = Root Mean Square Error of Approximation; SRMR = Standardized Root Mean Square Residual.
Unfortunately, the authors who have contributed to this literature not only studied different populations but also used vastly different methods – such as parallel analysis, EFA, and CFA – to decide for or against a specific number of factors. Studies supporting a three-dimensional structure did not explicitly test for the number of dimensions to retain, or did not compare the three- to the four-dimensional structure (Prinz et al., 2013; Spitzer et al., 2011; Torres et al., 2013). Studies supporting the single-factor model also did not
directly compare the different models (Prelow et al., 2005) or compared a single-factor model with post hoc modifications to an unaltered three-factor model (Asner-Self et al., 2006). In the present study, we directly compared the proposed models using CFA and found that a four-factor model with two anxiety dimensions fit the data at all measurement occasions better than the three- or one-dimensional models. While the correlations between the panic and generalized anxiety scores were high, they were
Table 3. Model comparisons for LMI testing

Model       χ²         df     CFI    RMSEA  SRMR   Δχ²     w      ΔCFI
Configural  8,593.846  3,669  0.931  0.035  0.218  –       –      –
Weak        8,942.657  3,725  0.927  0.036  0.237  455.80  0.087  0.0041
Strong      9,114.393  3,781  0.926  0.036  0.240  242.25  0.064  0.0016
Strict      9,852.268  3,837  0.916  0.038  0.239  713.47  0.109  0.0095

Note. Δχ² values were calculated based on the scaled test statistic (Satorra & Bentler, 2001).
still smaller than suggested thresholds for factor correlations (Kline, 2015). Importantly, all other studies directly comparing the three- and four-factor models found that the four-factor solution had a better fit to the data than the three-factor solution (Abraham et al., 2017; Durá et al., 2006; Galdón et al., 2008; Petkus et al., 2010; Recklitis et al., 2006; Wang et al., 2010; Wiesner et al., 2010). The authors all indicate that these small improvements may be due to chance and name theoretical reasons for their preference for the three-dimensional structure. While we believe it is very prudent to retain a factor model even if the results of a single study point in a different direction, especially if it is a small study (Durá et al., 2006; Galdón et al., 2008; Hirschfeld, von Brachel, & Thielsch, 2014; Petkus et al., 2010), we believe that there is now converging evidence from several large-scale studies with several thousand participants across several countries and with diverse backgrounds (Abraham et al., 2017; Recklitis et al., 2006; Wiesner et al., 2010), all indicating that the four-dimensional model has a better fit to the empirical data than the three-factor model. This is also in line with recent work in clinical psychology showing that panic disorder and generalized anxiety disorder are clearly distinct regarding their general distress and their psychophysiological patterns (McTeague & Lang, 2012) as well as their long-term course (Bruce et al., 2005). Both, however, share common etiological features such as intolerance of uncertainty (Carleton et al., 2014) and are often comorbid, which may explain the high intercorrelation between the two factors in this study. Although our investigation of LMI showed some significant decreases in model fit when LMI was assumed, the magnitude of these decreases was smaller than established thresholds (Cheung & Rensvold, 2002) and appears to be smaller than in studies of the BDI (Fokkema et al., 2013). Thus, it is appropriate to use BSI-18 scores for pre-post comparisons (Fokkema et al., 2013; Newsom, 2015; Widaman et al., 2010). Since measurement invariance is a property of the measurement scale rather than of the construct under investigation, it is of little use to speculate on differences to LMI studies that used different measures. However, reviewing other LMI studies raises a more general question about the methods that are used to investigate LMI. Researchers studying LMI (Ferro & Speechley, 2013; Fokkema et al., 2013) – and also MI (Raghavan,
Rosenfeld, & Rasmussen, 2015; Torres et al., 2013; Wang et al., 2010; Wiesner et al., 2010) – differ in their choice of parameterization of the model (via loadings of items or setting latent variable variances), estimation procedure (ML, robust ML, weighted least squares), and criterion for LMI (chi-square tests, changes in CFI). While all of these choices have an impact on the conclusions that might be drawn, there are no simulation studies in the context of LMI testing to guide these decisions. In addition to establishing evidence-based guidelines for LMI testing, it may be instructive to take a more graded approach to invariance testing. Rather than assuming that LMI is either present – if the decrease in model fit is smaller than some threshold – or absent – if the decrease in model fit is larger than some threshold – it may be more useful to think about the magnitude of these differences and their effect on clinical research questions. The present study has some limitations that need to be kept in mind. First, there is no information concerning the detailed content of the treatment (other than that all patients received CBT), medication, or duration of illness, which may have influenced patients’ view of their illness and consequently their answers to the BSI-18. Second, since the data were collected over 20 years, it is also possible that cohort effects or a changing Zeitgeist caused differences in how the BSI items were answered. Third, the sample consisted of psychotherapy outpatients in a naturalistic setting, which allows generalization to people with mild to moderate mental illness rather than to psychiatric inpatients or patients with somatic medical conditions such as cancer. It would be interesting to investigate whether the BSI-18 possesses LMI for patients in somatic medical treatments. Furthermore, replication of LMI in other cultural settings would be beneficial. In summary, we find that the BSI-18 is best conceptualized as a four-dimensional screening instrument assessing depression, general anxiety, panic, and somatic symptoms. Furthermore, our finding of strict invariance supports the use of simple sum scores to assess individual change in symptoms and distress in outpatient samples during psychotherapy.
Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000480
ESM 1. Tables (.pdf). The tables give the standardized factor loadings over time (S1), the factor intercorrelations for the three- (S2) and four-factor (S3) solutions, and the results of cross-sectional invariance tests.
References Abraham, D. S., Gruber-Baldini, A. L., Harrington, D., & Shulman, L. M. (2017). The factor structure of the Brief Symptom Inventory-18 (BSI-18) in Parkinson disease patients. Journal of Psychosomatic Research, 96, 21–26. https://doi.org/ 10.1016/j.jpsychores.2017.03.002 Andreu, Y., Galdón, M. J., Dura, E., Ferrando, M., Murgui, S., García, A., & Ibáñez, E. (2008). Psychometric properties of the Brief Symptoms Inventory-18 (BSI-18) in a Spanish sample of outpatients with psychiatric disorders. Psicothema, 20, 844–850. Asner-Self, K. K., Schreiber, J. B., & Marotta, S. A. (2006). A crosscultural analysis of the Brief Symptom Inventory-18. Cultural Diversity & Ethnic Minority Psychology, 12, 367–375. https:// doi.org/10.1037/1099-9809.12.2.367 Borsboom, D. (2006). When does measurement invariance matter? Medical Care, 44, S176–S181. Bruce, S. E., Yonkers, K. A., Otto, M. W., Eisen, J. L., Weisberg, R. B., Pagano, M., . . . Keller, M. B. (2005). Influence of psychiatric comorbidity on recovery and recurrence in generalized anxiety disorder, social phobia, and panic disorder: A 12-year prospective study. The American Journal of Psychiatry, 162, 1179–1187. Carleton, R. N., Duranceau, S., Freeston, M. H., Boelen, P. A., McCabe, R. E., & Antony, M. M. (2014). “But it might be a heart attack”: Intolerance of uncertainty and panic disorder symptoms. Journal of Anxiety Disorders, 28, 463–470. Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-offit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. Derogatis, L. R. (2000). The Brief Symptom Inventory-18 (BSI-18): Administration, scoring, and procedures manual. Minneapolis, MN: National Computer Systems. Derogatis, L. R., & Melisaratos, N. (1983). The Brief Symptom Inventory: An introductory report. Psychological Medicine, 13, 595–605. Durá, E., Andreu, Y., Galdón, M. J., Ferrando, M., Murgui, S., Poveda, R., & Jimenez, Y. (2006). Psychological assessment of patients with temporomandibular disorders: Confirmatory analysis of the dimensional structure of the Brief Symptoms Inventory 18. Journal of Psychosomatic Research, 60, 365–370. Ferro, M. A., & Speechley, K. N. (2013). Factor structure and longitudinal invariance of the Center for Epidemiological Studies Depression Scale (CES-D) in adult women: Application in a population-based sample of mothers of children with epilepsy. Archives of Women’s Mental Health, 16, 159–166. Fokkema, M., Smits, N., Kelderman, H., & Cuijpers, P. (2013). Response shifts in mental health interventions: An illustration of longitudinal measurement invariance. Psychological Assessment, 25, 520–531. https://doi.org/10.1037/a0031669 Fried, E. I., van Borkulo, C. D., Epskamp, S., Schoevers, R. A., Tuerlinckx, F., & Borsboom, D. (2016). Measuring depression over time... or not? Lack of unidimensionality and longitudinal measurement invariance in four common rating scales of depression. Psychological Assessment, 28, 1354–1367. https://doi.org/10.1037/pas0000275 Galdón, M. J., Durá, E., Andreu, Y., Ferrando, M., Murgui, S., Pérez, S., & Ibañez, E. (2008). Psychometric properties of the
Brief Symptom Inventory-18 in a Spanish breast cancer sample. Journal of Psychosomatic Research, 65, 533–539. Geisheim, C., Hahlweg, K., Fiegenbaum, W., Frank, M., Schröder, B., & von Witzleben, I. (2002). Das Brief Symptom Inventory (BSI) als Instrument zur Qualitätssicherung in der Psychotherapie [The German version of the Brief Symptom Inventory (BSI): Reliability and validity in a sample of outpatient psychotherapy patients]. Diagnostica, 48, 28–36. Hirschfeld, G., von Brachel, R., & Thielsch, M. (2014). Selecting items for Big Five questionnaires: At what sample size do factor loadings stabilize? Journal of Research in Personality, 53, 54–63. https://doi.org/10.1016/j.jrp.2014.08.003 Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. Kline, R. B. (2015). Principles and practice of structural equation modeling. New York, NY: Guilford Press. Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53–76. Loutsiou-Ladd, A., Panayiotou, G., & Kokkinos, C. M. (2008). A review of the factorial structure of the Brief Symptom Inventory (BSI): Greek evidence. International Journal of Testing, 8, 90–110. McTeague, L. M., & Lang, P. J. (2012). The anxiety spectrum and the reflex physiology of defense: From circumscribed fear to broad distress. Depression and Anxiety, 29, 264–281. Newsom, J. T. (2015). Longitudinal structural equation modeling. New York, NY: Routledge. Petkus, A. J., Gum, A. M., Small, B., Malcarne, V. L., Stein, M. B., & Wetherell, J. L. (2010). Evaluation of the factor structure and psychometric properties of the Brief Symptom Inventory-18 with homebound older adults. International Journal of Geriatric Psychiatry, 25, 578–587. Piersma, H. L., Reaume, W. M., & Boes, J. L. (1994). The Brief Symptom Inventory (BSI) as an outcome measure for adult psychiatric inpatients. Journal of Clinical Psychology, 50, 555–563. Prelow, H. M., Weaver, S. R., Swenson, R. R., & Bowman, M. A. (2005). A preliminary investigation of the validity and reliability of the Brief-Symptom Inventory-18 in economically disadvantaged Latina American mothers. Journal of Community Psychology, 33, 139–155. Prinz, U., Nutzinger, D. O., Schulz, H., Petermann, F., Braukhaus, C., & Andreas, S. (2013). Comparative psychometric analyses of the SCL-90-R and its short versions in patients with affective disorders. BMC Psychiatry, 13, 1. Raghavan, S. S., Rosenfeld, B., & Rasmussen, A. (2015). Measurement invariance of the Brief Symptom Inventory in survivors of torture and trauma. Journal of Interpersonal Violence, 32, 1708–1729. https://doi.org/10.1177/0886260515619750 Recklitis, C. J., Parsons, S. K., Shih, M.-C., Mertens, A., Robison, L. L., & Zeltzer, L. (2006). Factor structure of the Brief Symptom Inventory-18 in adult survivors of childhood cancer: results from the childhood cancer survivor study. Psychological Assessment, 18, 22–32. https://doi.org/10.1037/1040-3590.18.1.22 Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36. Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment structure analysis. Psychometrika, 66, 507–514. Spitzer, C., Hammer, S., Löwe, B., Grabe, H. J., Barnow, S., Rose, M., . . . Franke, G. H. (2011). 
The short version of the Brief Symptom Inventory (BSI-18): preliminary psychometric properties of the German translation. Fortschritte der Neurologie – Psychiatrie, 79, 517–523. https://doi.org/10.1055/s-0031-1281602 Torres, L., Miller, M. J., & Moore, K. M. (2013). Factorial invariance of the Brief Symptom Inventory-18 (BSI-18) for adults of
Mexican descent across nativity status, language format, and gender. Psychological Assessment, 25, 300–305. Wang, J., Kelly, B. C., Booth, B. M., Falck, R. S., Leukefeld, C., & Carlson, R. G. (2010). Examining factorial structure and measurement invariance of the Brief Symptom Inventory (BSI)-18 among drug users. Addictive Behaviors, 35, 23–29. https://doi. org/10.1016/j.addbeh.2009.08.003 Widaman, K. F., Ferrer, E., & Conger, R. D. (2010). Factorial invariance within longitudinal structural equation models: Measuring the same construct across time. Child Development Perspectives, 4, 10–18. Wiesner, M., Chen, V., Windle, M., Elliott, M. N., Grunbaum, J. A., Kanouse, D. E., & Schuster, M. A. (2010). Factor structure and psychometric properties of the Brief Symptom Inventory-18 in women: A MACS approach to testing for invariance across racial/ethnic groups. Psychological Assessment, 22, 912–922. Wu, P.-C. (2016). Response shifts in depression intervention for early adolescents: Response shifts in depression intervention. Journal of Clinical Psychology, 72, 663–675. https://doi.org/ 10.1002/jclp.22291
Zabora, J., Brintzenhofeszoc, K., Jacobsen, P., Curbow, B., Piantadosi, S., Hooker, C., . . . Derogatis, L. (2001). A new psychosocial screening instrument for use with cancer patients. Psychosomatics, 42, 241–246.

Received September 3, 2017
Revision received January 1, 2018
Accepted January 8, 2018
Published online August 3, 2018
EJPA Section/Category: Clinical Psychology

Gerrit Hirschfeld
Quantitative Methods
University of Applied Sciences Osnabrück
Caprivistr. 30A
49074 Osnabrück
Germany
g.hirschfeld@hs-osnabrueck.de
Original Article
Intuitive Eating: A Novel Eating Style? Evidence From a Spanish Sample
Juan Ramón Barrada,1 Blanca Cativiela,1 Tatjana van Strien,2,3 and Ausiàs Cebolla4,5
1 Facultad de Ciencias Sociales y Humanas, Universidad de Zaragoza, Teruel, Spain
2 Institute for Gender Studies and Behavioral Science Institute, Radboud University, Nijmegen, The Netherlands
3 Faculty of Earth and Life Sciences, Institute of Health Sciences, VU University Amsterdam, The Netherlands
4 Facultad de Psicología, Universitat de València, Spain
5 CIBER Fisiopatología Obesidad y Nutrición (CB06/03), Instituto Carlos III, Santiago de Compostela, Spain
Abstract: Intuitive eating is defined as an adaptive way of eating that maintains a strong connection with the internal physiological signs of hunger and satiety. It has four elements: unconditional permission to eat whenever and whatever food is desired, eating for physical rather than for emotional reasons, reliance on hunger and satiety cues to determine when and how much to eat, and body-food choice congruence. In this study, we assessed the differences and similarities between intuitive eating, as measured with the Intuitive Eating Scale-2 (IES-2), and eating styles (restrained, emotional, and external eating), assessed with the Dutch Eating Behavior Questionnaire (DEBQ). Using a Spanish sample of mainly university students (n = 1,095) we found that (a) unconditional permission to eat presented a large negative correlation with restrained eating, r = –.82; (b) eating for physical reasons had a large negative correlation with emotional eating, r = –.70; (c) the dimensions of intuitive eating only showed very small correlations with positive and negative affect, satisfaction with life, body dissatisfaction or weight control behavior after restrained, emotional, and external eating had been partialled out. Altogether, the present results suggest that two of the dimensions of intuitive eating as assessed with the IES-2 are not very new or innovative. The most promising new dimension of intuitive eating seems to be body-food choice congruence. Keywords: intuitive eating, eating styles, validation, DEBQ, IES-2
Eating behavior has commonly been studied from a negative point of view (e.g., Tylka & Wilcox, 2006) with the use of words like risk factors, disordered eating, illness or pathology (i.e., Striegel-Moore & Bulik, 2007). Recently, an entirely different approach has emerged: Health At Every Size (HAES; Bombak, 2014; Miller, 2005). HAES focuses on health and adaptation, in contrast to weight maintenance or loss of body weight, and supports the dependency on internal processes of regulation of hunger and satiety (Bacon & Aphramor, 2011). A core concept of HAES is intuitive eating, defined as an adaptive way of eating that maintains a strong connection with the internal physiological signs of hunger and satiety (Tribole & Resch, 1995; Tylka, 2006). Intuitive eating has three main elements, namely: (a) unconditional permission to eat when hungry and to eat whatever food is desired, (b) eating for physical rather than emotional reasons, and (c) reliance on internal hunger and satiety cues to determine when and how much to eat. People who engage in intuitive eating are both well aware of their internal signals of hunger and satiety and trust these signals to guide their eating behavior (Tribole & Resch, 1995). According to Tylka (2006), adaptive eating (of which intuitive eating is one of the facets) is more than the absence of a preoccupation with food, binge eating, and dietary restriction: “Adaptive
eating may be negatively related to but not solely defined by the absence of eating disorder symptoms” (p. 226). So, intuitive eating is supposed to be a new eating style which should be considered in addition to other, more pathology-focused eating styles (Tylka, 2006). Intuitive eating has been related to several relevant constructs associated with eating behavior and body image: negatively, with body mass index (BMI; Gast, Madanat, & Campbell Nielson, 2012; Smith & Hawks, 2006; Tylka, 2006), dieting behavior (Denny, Loth, Eisenberg, & Neumark-Sztainer, 2013), eating disorder symptomatology (Tylka & Wilcox, 2006), body dissatisfaction, and internalization of the thin ideal (Augustus-Horvath & Tylka, 2011; Tylka, 2006); and positively, with well-being (Tylka & Wilcox, 2006). A recent review of psychosocial correlates of intuitive eating among adult women can be found in Bruce and Ricciardelli (2016). At face value, the three dimensions of intuitive eating seem to resemble the already described eating styles of restrained eating (eating less than desired to maintain or lose body weight), emotional eating (the desire to eat in response to negative emotions), and external eating (eating in response to sensory cues – sight, smell, and taste of food – regardless of internal signals of hunger or satiety; van Strien, Frijters, Bergers, & Defares, 1986). All three
dimensions of intuitive eating can be conceptualized as the opposite pole of these existing eating styles: (a) Unconditional permission to eat seems to be the reverse of restrained eating; (b) eating for physical rather than for emotional reasons can be considered the opposite of emotional eating; and (c) reliance on hunger satiety cues can be considered to be similar, although in the opposite direction, to external eating. This possible overlap casts doubts about the appropriateness of developing a new theoretical framework (intuitive eating) and questionnaire (Intuitive Eating Scale, IES; Tylka, 2006) when previous theories and instruments have already been developed and tested.
Assessment of Intuitive Eating To overcome the excessive emphasis on the negative aspects of eating behavior, Tylka (2006) developed the Intuitive Eating Scale (IES), which measures the dimensions Unconditional Permission to Eat (9 items – 8 of them reverse scored – with statements like “I try to avoid certain foods high in fat, carbohydrates, or calories”), Eating for Physical Rather than Emotional Reasons (6 items – 5 of them reversescored – such as “I use food to help me soothe my negative emotions”) and Reliance on Internal Hunger/Satiety Cues (6 items such as “I can tell when I’m slightly full”). The IES was developed and tested in four studies with university women from the USA (Tylka, 2006) and showed promising psychometric properties. The next question that Tylka and Wilcox (2006) addressed was whether intuitive eating indeed implied more than the absence of eating pathology. Specifically, they tested whether the different subscales of the IES increased the percentage of variance explained of constructs such as positive affect or self-esteem over the variance explained by the 26-item version of the Eating Attitudes Test (EAT-26; Garner, Olmsted, Bohr, & Garfinkel, 1982), a test for screening eating disorders. They found positive evidence for this incremental validity. In spite of these results, this initial version of the scale showed some limitations (Tylka & Kroon Van Diest, 2013), such as the presence of a high number of reversescored items or a Cronbach’s α for the Reliance on Hunger and Satiety Cues scale at the low end of the acceptable limit (i.e., .70). This led to the development of the Intuitive Eating Scale-2 (IES-2; Tylka & Kroon Van Diest, 2013). A new dimension was added, Body-Food Choice Congruence, which measures the extent to which individuals match their food choices with their bodies’ needs, and is assessed with just three items (e.g., “I mostly eat foods that give my body energy and stamina”). As in Tylka (2006) with the IES,
the IES-2 offered a statistically significant increment over the EAT-26 in the percentage of explained variance for several variables. Recently, the IES-2 has been adapted to French by Camilleri et al. (2015), with some problems replicating the original four-factor structure: The Body-Food Choice Congruence factor was removed from this version. Carbonneau et al. (2016) have adapted the IES-2 to French-Canadian. They recovered the four factors of the IES-2, but the uniquenesses of the pairs of items 13-14 and 22-23 had to be allowed to correlate. Van Dyck, Herbert, Happ, Kleveman, and Vögele (2016) have adapted the questionnaire to German. They have also found evidence favoring the four-factor solution, although some unclearly specified correlations between item uniquenesses had to be freed. They found that Restrained Eating as assessed with the Dutch Eating Behavior Questionnaire (DEBQ; van Strien et al., 1986) correlated –.68 with the dimension Unconditional Permission to Eat, whereas the DEBQ Emotional Eating correlated –.77 with Eating for Physical rather than Emotional Reasons. Ruzanska and Warschburger (2017) also adapted the IES-2 to German. They found the same pattern of results: evidence favoring the four-factor solution, although some unclearly specified correlations between item uniquenesses had to be freed. They found that Restrained Eating, assessed with the DEBQ, correlated –.61 with the dimension Unconditional Permission to Eat, whereas the DEBQ Emotional Eating correlated –.83 with Eating for Physical rather than Emotional Reasons. The relevant correlations between Unconditional Permission to Eat – from the IES-2 – and Restrained Eating – from the DEBQ – can be explained, at least in part, by the strong overlap between both constructs, as indicated by their item wording. For instance, Item 16 of the IES-2 reads “I allow myself to eat what food I desire at the moment”, while Item 11 of the DEBQ reads “Do you try to eat less at mealtimes than you would like to eat?”. The same can be said about Eating for Physical rather than Emotional Reasons (IES-2; e.g., Item 2, “I find myself eating when I’m feeling emotional (e.g., anxious, depressed, sad), even when I’m not physically hungry”) and Emotional Eating (DEBQ, e.g., Item 5, “Do you have a desire to eat when you are depressed or discouraged?” and Item 20, “Do you get the desire to eat when you are anxious, worried or tense?”).
Purpose of the Study
One of the first steps when developing a new theoretical framework is to justify the novelty and need for it. If there are some previous theories or models that tap overlapping
constructs, the incremental validity of the new proposal must be assessed (Haynes & Lench, 2003; Hunsley & Meyer, 2003). From our point of view, that was not done in the case of intuitive eating. Showing that intuitive eating is different from disordered eating as measured with the EAT-26 (Tylka & Kroon Van Diest, 2013; Tylka & Wilcox, 2006) is not the same as showing that intuitive eating is a new perspective with regard to eating styles. This can only be done when intuitive eating and the three existing eating styles (restrained, emotional, and external eating) are simultaneously evaluated. That check of the novelty of intuitive eating over and above restrained, emotional, and external eating is the goal of the present study. All through the paper, we will consider intuitive eating and its commonly used measure (IES-2) as basically interchangeable. The best way to understand what a theory or a construct is, is to evaluate the way it is operationalized, especially when there appears to be a clear consensus about the method of assessment. For this purpose, the adaptation of the IES-2 to the Spanish language was a necessary first step.
Method
Intuitive Eating Scale-2 (Tylka & Kroon Van Diest, 2013) As previously described, this scale comprises 23 items grouped in four different subscales: Unconditional Permission to Eat (6 items, three of them reverse-scored), Eating for Physical rather than Emotional Reasons (8 items, four reverse-scored), Reliance on Hunger and Satiety Cues (6 items), and Body-Food Choice Congruence (3 items). Responses are provided on a scale ranging from 1 = strongly disagree to 5 = strongly agree. Scores for each subscale of the IES-2 are computed as the mean response of the items belonging to that dimension. The IES-2 was translated from English to Spanish following four steps: (1) The first and second authors of this study independently translated the IES-2; (2) Each version was sent to the other translator and each translator independently evaluated both versions, chose between the two translations for each item and could rewrite a new version; (3) The two translators met to discuss and agreed on a proposal; and (4) This proposal was sent to the fourth author for new comments, which were integrated into the final version.
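A minimal sketch of how such mean-based subscale scores, with reverse-keyed items recoded first, could be computed in R is shown below. The function and the commented example item sets are ours and purely illustrative; the real item-to-subscale assignment and the reverse-keyed items must be taken from the IES-2 scoring key.

```r
# Illustrative IES-2 subscale scoring: reverse-key the indicated items on the
# 1-5 response scale, then average the items of each subscale
# (placeholder item/column names).
score_ies2 <- function(data, items, reversed = character(0),
                       lowest = 1, highest = 5) {
  x <- data[, items, drop = FALSE]
  if (length(reversed) > 0) {
    # Reverse-keying on a 1-5 scale: 1 <-> 5, 2 <-> 4, 3 stays 3
    x[, reversed] <- (lowest + highest) - x[, reversed]
  }
  rowMeans(x)   # subscale score = mean of its items
}

# Hypothetical usage for one 6-item subscale with 3 reverse-keyed items
# (column names ies1, ies3, ... are assumptions, not the published key):
# upe <- score_ies2(ies2_data,
#                   items    = c("ies1", "ies3", "ies4", "ies9", "ies16", "ies17"),
#                   reversed = c("ies1", "ies3", "ies4"))
```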
Participants and Procedure The battery of questionnaires was administered through the Internet. The link was distributed through social nets (mainly Facebook and Twitter) and the e-mail distribution lists of the students from the university of the first two authors. Participants provided informed consent after reading the description of the study, where the anonymity of the responses was clearly stated. Participants had to be 18 years old or older to take the survey. A total of 1,095 participants completed the measures, 809 women (73.9%) and 286 men (26.1%). The mean age was 24.86 years (SD = 7.30, range [18, 65]). Concerning educational level, 0.2% of the sample reported not having completed primary studies, 2.4% completed secondary studies, 67.1% were university students, and 30.3% had completed university studies. The BMI, computed with self-reported height and weight, had a mean of 22.46 (SD = 3.39, range [14.30, 41.77]).
Measures Sociodemographics, Weight, and Height Participants reported their sex, age, education level, and nationality. They also reported their weight (to the nearest kilogram) and height (to the nearest centimeter).
Test translation followed the International Test Commission Guidelines (Muñiz, Elosua, & Hambleton, 2013). Dutch Eating Behavior Questionnaire (DEBQ; van Strien et al., 1986) Although there are several scales available for assessing restrained, emotional, and external eating styles, the DEBQ is the only questionnaire that simultaneously covers all three eating styles and was developed in community samples. The DEBQ comprises 33 items, responded to on a Likert-type scale ranging from 1 = seldom to 5 = very often. The Emotional Eating scale contains 13 items (e.g., “Do you have the desire to eat when you are irritated?”), the External Eating scale has 10 items (e.g., “Do you eat more than usual when you see others eating?”), and the Restraint scale contains 10 items (e.g., “Do you deliberately eat less in order to not become heavier?”). We used the Spanish version (Cebolla, Barrada, van Strien, Oliver, & Baños, 2014). Body Dissatisfaction Subscale of the Eating Disorder Inventory-2 (EDI-2; Garner, 1991) This subscale has nine items, with wordings like “I feel satisfied with the shape of my body”, intended to measure overall body dissatisfaction by asking respondents to rate
European Journal of Psychological Assessment (2020), 36(1), 19–31
22
on a 6-point Likert scale, from 1 = never to 6 = always, their dissatisfaction with their figure or specific parts of the body. The Spanish version was presented by Garner (1998). Positive and Negative Affect Schedule (PANAS; Watson, Clark, & Tellegen, 1988) The PANAS has 20 items measuring both positive and negative affect, with 10 items per dimension. Participants are asked to rate on a 5-point Likert scale, from 1 = very slightly or not at all to 5 = extremely, how much they experience different feelings and emotions, such as “Enthusiastic” for positive affect or “Nervous” for negative affect. We used the Spanish version of Moral de la Rubia (2011). When incorporating the PANAS into the web-survey, we incorrectly did not include one item per dimension, so our inadvertently shortened version only had 18 items. Satisfaction With Life Scale (SWLS; Diener, Emmons, Larsen, & Griffin, 1985) The SWLS assesses satisfaction with life through 5 items, such as “I am satisfied with my life,” responded to on a 7-point Likert scale ranging from 1 = strongly disagree to 7 = strongly agree. We used the Spanish version of the scale of Vázquez, Duque, and Hervás (2013). Weight Control Behavior Checklist (WCB; NeumarkSztainer, Wall, Larson, Eisenberg, & Loth, 2011) We asked the participants if they had engaged in 15 different behaviors (e.g., “used laxatives” or “skipped meals”) in order to reduce or control their weight during the last year. Responses were coded as No = 0 and Yes = 1. We used the Spanish version administered in the MABIC Project by Sánchez-Carracedo et al. (2013). For all the questionnaires, higher scores are interpreted as higher levels in the construct that lends its name to the scale or subscale.
Analyses We followed four steps to analyze the data. First, we computed descriptive statistics of the different subscales, associations between variables (Pearson correlations between numerical variables; Cohen’s d between sex and the rest of variables), and Cronbach’s alpha for all the dimensions. In this phase, we assumed that all the theoretical dimensions of the instruments would hold sound. Second, we tested the dimensional structure of the IES-2 scores and the DEBQ scores separately. For the IES-2, we tested two different confirmatory factor analysis (CFA) models: one, without correlated uniquenesses (Tylka & Kroon Van Diest, 2013); the other, where the uniquenesses of the pairs Item 13 – Item 14 and Item 22 – Item 23 were allowed to correlate (Carbonneau et al., 2016). By testing only previously published models, we discarded problems
of capitalization on chance with model respecifications. We compared the fit of the best fitting CFA model with the fit of an exploratory structural equation model (ESEM). In this way, we could evaluate the adequacy of not fixing all the secondary loadings to zero. For the DEBQ, we tested an ESEM model with the correlated uniquenesses described in Barrada, van Strien, and Cebolla (2016) and Cebolla et al. (2014). In these papers, an ESEM was the preferred method to model the inter-item correlations of the 33 items of the DEBQ. Third, we analyzed the factor structure of both the items of the IES-2 (four theoretical factors) and the DEBQ (three theoretical factors). For this purpose, we used two approaches. In the first one, the inter-item correlations of the IES-2 items were modeled with the model that provided the best fit in the previous step, and the inter-item correlations of the DEBQ were modeled with the described ESEM (Barrada et al., 2016; Cebolla et al., 2014). By doing so, no cross-loadings between the IES-2 and the DEBQ factors were allowed. We considered it unlikely that the assumption of no relevant cross-loadings would hold. Not incorporating relevant cross-loadings in the model can distort the inter-factor correlations (Asparouhov & Muthén, 2009). Considering this, in the second approach all the items were simultaneously submitted to an ESEM analysis, which allows for cross-loadings. If the IES-2 and the DEBQ are assessing conceptually distinguishable – albeit related – constructs, a solution with seven factors should show an adequate fit and a clear structure. If two dimensions are so related – as indicated by their correlation based on summed scores or latent factors – that they can be statistically collapsed, a lower number of dimensions would be required to explain the inter-item correlations. If the inter-item correlations of two sets of items, each set operationalizing a supposedly different construct, can be explained by a single latent dimension, it becomes difficult to argue that those two constructs are, in fact, different. For all the factor models we interpreted the standardized solution (STDYX solution in Mplus). Goodness of fit of all the derived models was assessed with the common cutoff values for the fit indices (Hu & Bentler, 1999): CFI and TLI values greater than .95 and RMSEA values less than .06 are indicative of a satisfactory fit. We localized areas of ill fit through the inspection of modification indices (MI). For all the models, the weighted least squares means and variance adjusted (WLSMV) estimator was used. By using this estimator we were able to maintain the categorical nature of the responses (Finney & DiStefano, 2006). For the ESEM models, we used target rotation. As described by Asparouhov and Muthén (2009),
“[c]onceptually, target rotation can be said to lie in between the mechanical approach of EFA [exploratory
factor analysis] rotation and the hypothesis-driven CFA model specification. In line with CFA, target loading values are typically zeros representing substantively motivated restrictions. Although the targets influence the final rotated solution, the targets are not fixed values as in CFA, but zero targets can end up large if they do not provide good fit” (p. 409). Fourth, partial correlations were computed. We assessed the relations of positive and negative affect, body dissatisfaction, satisfaction with life, and weight control behaviors, on the one hand, with the four dimensions of the IES-2, on the other hand, while simultaneously controlling for restrained eating, emotional eating, and external eating. In this way, we could evaluate the incremental validity of the newly proposed constructs after removing the variance explained by the DEBQ. ESEM and CFA models were estimated with Mplus 7.4 (Muthén & Muthén, 1998–2015). The rest of the analyses were performed with R 3.4.1 (R Core Team, 2017). We used the packages psych version 1.6.12 (Revelle, 2017) and MplusAutomation version 0.6-4 (Hallquist & Wiley, 2016). No missing data were present in our database. Data and all syntax files needed to reproduce the analyses are available as Electronic Supplementary Materials, ESM 1–10.
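A minimal R sketch of the descriptive and partial-correlation steps is given below. It is not the authors' syntax (their Mplus and R files are provided in ESM 1–10); the data frame `scores`, its column names, and the folder of Mplus input files are hypothetical, while the psych and MplusAutomation packages are the ones named above.

```r
library(psych)            # descriptives, alpha, correlations, partial correlations
library(MplusAutomation)  # running and reading the Mplus CFA/ESEM models

# Hypothetical data frame with one column per summed subscale score
ies  <- c("ies_uncp", "ies_eatp", "ies_relh", "ies_bfc")
debq <- c("debq_emot", "debq_restr", "debq_exter")
crit <- c("panas_na", "panas_pa", "swls", "edi_bd", "wcb")

# Step 1: descriptive statistics, zero-order correlations, and reliability
describe(scores[, c(ies, debq, crit)])
lowerCor(scores[, c(ies, debq, crit)])
alpha(ies_uncp_items)  # Cronbach's alpha, repeated for each subscale's item set

# Step 4: partial correlations of the criteria with the IES-2 dimensions,
# controlling for the three DEBQ eating styles
partial.r(scores, x = c(ies, crit), y = debq)

# Steps 2-3 (CFA/ESEM with the WLSMV estimator and target rotation) were run
# in Mplus; the .inp files in ESM 4-10 could be executed and collected with:
runModels("esm_mplus_models/")            # hypothetical folder of .inp files
fits <- readModels("esm_mplus_models/")   # fit indices, loadings, modification indices
```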
Results

Reliabilities and Correlations
The Cronbach's α for the assessed dimensions, descriptive statistics, and associations for the different variables can be seen in Table 1. The reliabilities of the scales, as measured by Cronbach's α, were adequate for our research purposes, as they ranged from .72 (Body-Food Choice Congruence) to .95 (Emotional Eating). We will not comment on all the associations. Age was basically unrelated to all the variables, |r| ≤ .11, except for BMI, r = .25. The highest correlation for BMI was with Body Dissatisfaction, r = .31. Regarding sex, we will only indicate medium-high differences, d ≥ 0.50. Men presented higher means in Eating for Physical rather than Emotional Reasons, d = 0.59, and a higher mean BMI, d = 0.77; women presented a higher mean in Body Dissatisfaction, d = 0.66, and in Emotional Eating, d = 0.51. The correlations among the different dimensions of the IES-2 subscales were small, ranging from –.18 to .29. Most importantly, the IES-2 subscales presented high correlations for the two pairs of dimensions that we expected to be highly overlapping: Eating for Physical rather than Emotional Reasons with Emotional Eating, r = –.82; and Unconditional
Permission to Eat with Restrained Eating, r = –.70. Contrary to our expectation, Reliance on Hunger and Satiety Cues and External Eating were essentially independent, r = –.03. The p-values of all the reported associations were < .001, with the exception of the last correlation, p = .300.
Factor Structure of the IES-2
We started by fitting a CFA model without correlated errors. The fit of this and the following models can be seen in Table 2. For this model, the fit was clearly below our proposed cut points, CFI = .938, TLI = .930, RMSEA = .089. In the next model, the uniquenesses of two pairs of items were allowed to correlate, which led to an improvement of model fit, CFI = .953, TLI = .947, RMSEA = .078, although the TLI and, especially, the RMSEA values were not in the satisfactory range. Two modification indices stood out, both indicating the adequacy of allowing a cross-loading on the Eating for Physical rather than Emotional Reasons factor: Item 4, MI = 176.6, and Item 7, MI = 161.1. The ESEM model, where the items loaded on all the factors, did not present a relevant improvement in model fit, especially when we examined the fit indices that consider model complexity, CFI = .962, TLI = .942, RMSEA = .081. Taking this into account, we considered that the best fitting solution for the IES-2 was the CFA model with correlated uniquenesses. In this model, the unsigned loadings (|λ|), which can be seen in Table 3, were medium-high for all the items: for Eating for Physical rather than Emotional Reasons, M|λ| = .76, range [.56, .87]; for Unconditional Permission to Eat, M|λ| = .68, range [.59, .81]; for Reliance on Hunger and Satiety Cues, M|λ| = .75, range [.60, .84]; and for Body-Food Choice Congruence, M|λ| = .75, range [.59, .93]. The correlations between uniquenesses were high: for Item 22 – Item 23, equal to .77; for Item 13 – Item 14, equal to .42. For the DEBQ scores, the ESEM model provided an adequate fit, although the RMSEA was slightly over the cut point, CFI = .966, TLI = .958, RMSEA = .063.
Factor Structure of the IES-2 and the DEBQ
We tested three different models. The first one had seven factors (four for the IES-2 scores and three for the DEBQ scores). The IES-2 items were modeled with a CFA and the DEBQ items with an ESEM. The second model had the same seven factors and all the items were submitted to an ESEM. In the final model, following the correlations observed between scales, we tested an ESEM with five
Table 1. Descriptive statistics, associations, partial correlations, and Cronbach’s α for the assessed dimensions 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Pearson correlations 1. IES UncP 2. IES EatP
.13
3. IES RelH
.29
4. IES B-FC
–.18
.21 .26
.15
5. DEBQ Emot
–.13
–.82
–.22
6. DEBQ Restr
–.70
–.29
–.24
.06
.31
7. DEBQ Exter
.14
–.45
–.03
–.17
.53
–.20 .08
8. PANAS NA
–.05
–.31
.00
–.15
.30
.15
.21
9. PANAS PA
.03
.18
.02
.17
–.16
–.02
–.01
–.13
10. SWLS
.07
.19
.02
.15
–.18
–.10
–.02
–.36
.36
11. EDI BD
–.28
–.46
–.22
–.22
.42
.49
.23
.26
–.19
12. WCB
–.53
–.33
–.19
.05
.29
.73
.07
.16
–.03
–.07
.46
13. Age
–.07
.00
–.06
–.03
.02
.08
–.11
–.08
.05
–.01
–.03
–.01
14. BMI
–.12
–.16
–.16
–.10
.11
.18
.00
.00
–.03
–.09
.31
.18
.25
–0.11
0.21
–0.07
–0.66
–0.44
0.16
0.77
–.27
Cohen’s d 15. Sex (men = 1)
0.09
0.14
–0.51
–0.42
–0.09
0.21
0.59
8. PANAS NA
.03
–.11
.07
–.10
9. PANAS PA
.01
.10
–.02
.15
–.00
.08
–.03
.13
.06
–.21
–.08
–.23
12. WCB
–.03
–.18
–.01
.02
M
3.39
3.51
3.15
3.56
28.09
26.05
31.19
16.07
22.26
22.84
28.84
5.36
24.86
22.46
0.26
SD
0.82
0.89
0.82
0.71
10.53
8.64
6.58
4.69
4.55
6.59
10.74
2.98
7.30
3.39
0.44
.79
.88
.86
.72
.95
.91
.85
.82
.78
.87
.90
.81
Partial correlations controlling for DEBQ Emot, DEBQ Restr, DEBQ Exter
10. SWLS 11. EDI BD
α
Notes. IES = Intuitive Eating Scale-2; UncP = Unconditional Permission to Eat; EatP = Eating for Physical rather than Emotional Reasons; RelH = Reliance on Hunger and Satiety Cues; B-FC = Body-Food Choice Congruence; DEBQ = Dutch Eating Behavior Questionnaire; Emot = Emotional Eating; Restr = Restrained Eating; Exter = External Eating; PANAS = Positive and Negative Affect Schedule; NA = Negative Affect; PA = Positive Affect; SWLS = Satisfaction with Life Scale; EDI = Eating Disorder Inventory-2; BD = Body Dissatisfaction; WCB = Weight Control Behaviors; BMI = Body Mass Index. Italicized values correspond to statistically significant correlations (p < .05). Shaded cells correspond to the pairs of dimensions that were expected to be highly similar. Sex was coded with a dummy variable, where 0 = women and 1 = men.
Table 2. Goodness of fit indices for the different models

Model | χ²† | df | CFI | TLI | RMSEA
M1. CFA IES-2 | 2,153.4 | 224 | .938 | .930 | .089
M2. CFA IES-2 CU | 1,687.7 | 222 | .953 | .947 | .078
M3. ESEM IES-2 CU | 1,358.2 | 165 | .962 | .942 | .081
M4. ESEM DEBQ CU | 2,254.3 | 423 | .966 | .958 | .063
M5. CFA IES-2 CU & ESEM DEBQ CU | 6,479.0 | 1,392 | .937 | .930 | .058
M6. ESEM IES-2 CU & DEBQ CU 7 FACTORS | 3,827.1 | 1,158 | .967 | .956 | .046
M7. ESEM IES-2 CU & DEBQ CU 5 FACTORS | 6,448.3 | 1,259 | .935 | .921 | .061

Notes. df = degrees of freedom; TLI = Tucker-Lewis index; CFI = comparative fit index; RMSEA = root mean square error of approximation; CFA = confirmatory factor analysis; ESEM = exploratory structural equation modeling; CU = correlated uniquenesses. †All p-values for the χ² test were < .001.
factors, where Emotional Eating was expected to collapse with Eating for Physical rather than Emotional Reasons (correlation between observed scores = –.82), as was Restrained Eating with Unconditional Permission to Eat (correlation between observed scores = –.70). We maintained the correlated uniquenesses from previous models.
Table 3. Factor loadings and inter-factor correlations for the IES-2 scores (confirmatory factor analysis with correlated uniquenesses)

Factor loadings (M2)

EatP
I10. I use food to help me soothe my negative emotions. –.87
I02. I find myself eating when I'm feeling emotional (e.g., anxious, depressed, sad), even when I'm not physically hungry. –.86
I11. I find myself eating when I am stressed out, even when I'm not physically hungry. –.84
I05. I find myself eating when I am lonely, even when I'm not physically hungry. –.77
I15. I find other ways to cope with stress and anxiety than by eating. .77
I14. When I am lonely, I do NOT turn to food for comfort. .71
I12. I am able to cope with my negative emotions (e.g., anxiety, sadness) without turning to food for comfort. .68
I13. When I am bored, I do NOT eat just for something to do. .56

UncP
I16. I allow myself to eat what food I desire at the moment. .81
I09. I have forbidden foods that I don't allow myself to eat. –.73
I17. I do NOT follow eating rules or dieting plans that dictate what, when, and/or how much to eat. .66
I01. I try to avoid certain foods high in fat, carbohydrates, or calories. –.65
I04. I get mad at myself for eating something unhealthy. –.62
I03. If I am craving a certain food, I allow myself to have it. .59

RelH
I21. I rely on my hunger signals to tell me when to eat. .84
I08. I trust my body to tell me how much to eat. .81
I23. I trust my body to tell me when to stop eating. .77
I06. I trust my body to tell me when to eat. .77
I22. I rely on my fullness (satiety) signals to tell me when to stop eating. .72
I07. I trust my body to tell me what to eat. .60

B-FC
I19. I mostly eat foods that make my body perform efficiently (well). .93
I20. I mostly eat foods that give my body energy and stamina. .72
I18. Most of the time, I desire to eat nutritious foods. .59

Inter-factor correlations
      EatP  UncP  RelH
UncP  .17
RelH  .25   .36
B-FC  .31   –.25  .19

Notes. EatP = Eating for Physical rather than Emotional Reasons; UncP = Unconditional Permission to Eat; RelH = Reliance on Hunger and Satiety Cues; B-FC = Body-Food Choice Congruence. Shaded cells indicate the factor where the item theoretically belongs. Loadings in bold indicate unsigned loadings over |.30|. Items ordered by unsigned loading.
For the seven-factor CFA-ESEM model, the fit to the data was below the recommended thresholds, CFI = .937, TLI = .930, RMSEA = .058. When we inspected the modification indices, we found values as high as 408.4, indicating the convenience of allowing Item 4 of the IES-2 (Eating for Physical rather than Emotional Reasons dimension) to load on the Emotional Eating dimension of the DEBQ. In this model, the Emotional Eating and Eating for Physical rather than Emotional Reasons factors correlated .90; the Restrained Eating and Unconditional Permission to Eat factors correlated .82. These large correlations should be interpreted with caution given the presence of relevant specification errors. The ESEM seven-factor solution provided an adequate fit to the data, CFI = .967, TLI = .956, RMSEA = .046. Item loadings for this and the next model can be seen in Table 4. For this model, the problem was its interpretability. Applying the threshold of |λ| ≥ .30, 15 items showed relevant cross-loadings, mainly between the pairs of dimensions Emotional Eating – Eating for Physical rather than Emotional Reasons and Restrained Eating – Unconditional Permission to Eat. In the Unconditional Permission to Eat factor, M|λ| was rather small, equal to .33, so we consider that the content of the items belonging to this factor was better recovered by the Restrained Eating factor. The Eating for Physical rather than Emotional Reasons factor consisted of the items related to eating in response to boredom (e.g., Item 13 of the IES-2 – "When I am bored, I do NOT eat just for something to do" – or Item 3 of the DEBQ – "Desire to eat when nothing to do. . ." –), although not all the items that loaded on that factor tap this content.
Table 4. Factor loadings and inter-factor correlations for the DEBQ and IES-2 scores (exploratory structural equation models with correlated uniquenesses, seven- and five-factor solutions)

Factor loadings. Each row lists the Seven-Factor Solution (M6) loadings, followed by the Five-Factor Solution (M7) loadings.

Item | M6: Emot Restr Exter EatP UncP RelH B–FC | M7: Emot Restr RelH Exter B–FC
D01  | .68 .05 .07 .07 .08 –.02 .05 | .75 .02 –.01 .05 .13
D03* | .00 .00 .41 .53 –.05 –.09 .01 | .42 .03 –.05 .39 –.17
D05* | .58 .12 .04 .31 .20 –.05 .08 | .86 .05 .02 .02 –.02
D08* | .43 –.03 .13 .41 –.13 .01 –.04 | .74 .02 .02 .08 –.08
D10  | .77 –.06 .08 .09 –.15 .01 –.05 | .86 –.03 –.03 .02 .14
D13  | .84 –.05 .07 .01 –.03 –.03 .01 | .87 –.06 –.04 .03 .19
D16  | .86 –.02 .06 –.08 .00 –.05 –.07 | .85 –.06 –.07 .01 .17
D20  | .63 .08 –.01 .26 .26 –.06 .10 | .88 .00 .02 –.03 .01
D23  | .87 .01 –.02 .09 .04 –.03 .01 | .98 –.03 –.03 –.06 .16
D25  | .89 .00 .06 .02 .04 .02 .02 | .95 –.05 .01 .02 .21
D28* | .23 –.02 .29 .44 –.03 –.07 .03 | .57 .01 –.04 .26 –.10
D30  | .85 –.12 .05 –.12 –.24 .03 –.09 | .78 –.06 –.03 –.03 .20
D32  | .89 –.04 .09 –.02 –.15 .02 –.03 | .90 –.02 –.02 .02 .22
I02  | .27 .24 –.01 .56 .31 –.02 .07 | .77 .12 .10 –.02 –.20
I05  | .15 .07 .13 .64 –.08 .05 .00 | .64 .10 .09 .10 –.20
I10* | .42 .17 .00 .44 .13 –.07 .02 | .78 .13 .00 –.02 –.12
I11* | .38 .17 –.03 .46 .35 –.04 .10 | .81 .05 .08 –.03 –.13
I12  | –.29 –.02 .15 –.48 .02 .02 .09 | –.67 –.02 .00 .18 .18
I13  | .14 .06 –.26 –.67 .18 .07 .07 | –.37 –.01 .05 –.24 .24
I14  | –.15 .10 .02 –.66 .24 .03 .13 | –.65 .01 .02 .07 .23
I15  | –.21 .02 .14 –.62 .10 .07 .18 | –.70 .00 .05 .18 .29
D04  | –.09 .86 .04 .02 .02 .04 –.07 | –.09 .84 .08 .10 –.13
D07  | –.02 .75 –.03 .09 –.20 –.05 .03 | .02 .86 –.01 –.02 –.03
D11  | –.01 .73 .09 .07 –.09 –.09 –.02 | .02 .77 –.05 .12 –.07
D14* | –.05 .43 .01 –.08 –.29 –.08 .36 | –.16 .66 –.05 .01 .32
D17  | .12 .58 .04 .05 –.20 .01 –.03 | .14 .66 .02 .03 –.01
D19  | .03 .75 .06 –.09 .01 .06 –.01 | –.06 .74 .10 .10 –.03
D22  | –.03 .90 .05 .04 –.01 –.04 –.03 | –.02 .91 .02 .11 –.11
D26  | –.04 .78 .07 .03 –.07 –.07 –.06 | –.03 .81 –.03 .11 –.11
D29  | .03 .61 –.05 .17 –.11 .18 –.26 | .16 .58 .18 –.05 –.24
D31  | .08 .72 .06 –.04 –.17 –.01 .00 | .02 .81 .01 .07 .00
I01* | –.05 .42 –.11 –.02 –.21 –.13 .30 | –.11 .62 –.09 –.10 .24
I03  | –.07 –.21 .33 .01 .40 .13 .02 | –.02 –.42 .13 .38 –.04
I04* | .02 .44 –.01 .17 –.28 .04 .15 | .13 .60 .07 –.03 .09
I09* | .07 .34 –.12 .04 –.45 –.03 .19 | .06 .61 –.02 –.17 .20
I16* | –.14 –.39 .16 .09 .33 .24 –.07 | –.04 –.60 .22 .19 –.11
I17  | –.14 –.28 .07 .06 .30 .20 –.13 | –.06 –.48 .19 .09 –.16
D02  | –.10 .04 .61 .16 .21 –.09 .03 | .06 –.07 –.06 .63 –.07
D06  | .04 .00 .58 .08 .14 –.05 .02 | .13 –.08 –.04 .58 –.02
D09  | .01 .05 .79 –.13 .05 –.02 .01 | –.07 .00 –.03 .79 .05
D12  | –.03 .03 .54 .04 .14 –.02 –.05 | .03 –.08 –.02 .55 –.07
D15  | .14 .14 .61 –.27 .04 –.02 –.06 | –.06 .08 –.04 .62 .04
D18  | .08 –.09 .64 .03 –.17 .06 .03 | .11 –.04 .03 .60 .07
D21  | .05 .13 .70 .06 .11 .01 –.03 | .12 .03 .01 .71 –.05
D24  | .27 .10 .64 –.23 –.07 .03 –.08 | .10 .07 .00 .61 .07
D27  | .21 .01 .50 .05 –.18 .08 –.02 | .25 .06 .05 .46 .05
D33  | .03 –.02 .49 –.01 .09 –.03 .09 | .04 –.05 –.02 .49 .07
I06  | .05 .09 –.04 –.04 –.04 .82 –.01 | .00 .04 .79 –.07 .04
I07  | .03 .01 .01 .16 –.05 .69 –.04 | .15 –.04 .66 –.02 –.04
I08  | –.03 .06 –.01 .08 .04 .84 .01 | .01 –.03 .82 –.03 .00
I21  | .02 .07 .07 –.09 .06 .83 .11 | –.07 –.01 .81 .06 .13
I22  | –.04 –.01 –.07 –.05 .10 .64 .09 | –.09 –.10 .62 –.06 .08
I23  | –.03 .03 –.01 –.04 .07 .73 .04 | –.08 –.06 .71 –.02 .05
I18  | .01 –.05 .03 –.08 –.05 .11 .54 | –.08 .10 .13 .05 .50
I19  | –.02 –.09 –.03 –.06 –.11 .00 .92 | –.12 .20 .06 .00 .78
I20  | .01 –.23 –.03 .00 –.11 .14 .71 | –.02 –.01 .17 –.01 .66

Inter-factor correlations, Seven-Factor Solution (M6)
       Emot  Restr  Exter  EatP  UncP  RelH
Restr  .32
Exter  .41   –.01
EatP   .65   .21    .41
UncP   –.02  –.29   .14   .07
RelH   –.23  –.30   .01   –.17  .12
B–FC   –.09  .23    –.14  –.15  –.09  .01

Inter-factor correlations, Five-Factor Solution (M7)
       Emot  Restr  RelH  Exter
Restr  .30
RelH   –.18  –.25
Exter  .48   –.01   .04
B–FC   –.21  .07    .03   –.17

Notes. Emot = Emotional Eating; Restr = Restrained Eating; Exter = External Eating; EatP = Eating for Physical rather than Emotional Reasons; UncP = Unconditional Permission to Eat; RelH = Reliance on Hunger and Satiety Cues; B-FC = Body-Food Choice Congruence. Factor labels correspond to the expected content, not to the found content. Item numbering starting with I corresponds to the IES-2, starting with D to the DEBQ. Shaded cells indicate the factor where the item theoretically belongs. Loadings in bold indicate unsigned loadings over |.30|. Italicized values indicate cross-loadings over |.30|. Items with an asterisk indicate problematic items due to two loadings over |.30| in the seven-factor solution.
The other three factors were more clearly recovered. In this solution, the largest modification index corresponded to the correlation between the uniquenesses of DEBQ Items 18 and 27 – both measuring External Eating – MI = 173.1. Importantly, in the ESEM seven-factor solution the factor labels correspond to the expected content, not to the found content. It is doubtful that the content of all the recovered factors corresponds to the theoretically expected content. In line with this, the factors labeled Restrained Eating and Unconditional Permission to Eat correlated –.29, while the correlation based on summed scores was –.70. In the ESEM five-factor solution, all five factors could be clearly theoretically interpreted, with a low presence of relevant cross-loadings – only three secondary loadings were ≥ .30 – but the model fit was worse, CFI = .935, TLI = .921, RMSEA = .061. In this model, the highest modification index corresponded to the correlation between the uniquenesses of IES-2 Items 19 and 20 – both measuring Body-Food Choice Congruence – MI = 422.3.
Partial Correlations
We computed partial correlations between five dependent variables and IES-2 scores while controlling for the three
eating styles assessed by the DEBQ. As can be seen in Table 1, for three IES-2 dimensions, the sizes of the partial correlations were greatly reduced in comparison with the zero-order correlations. The maximum zero-order correlation was .53; for partial correlations, the maximum was .23. In the case of Unconditional Permission to Eat, the mean (unsigned) correlation dropped from .19 to .03 (maximum partial correlation = .06); for Eating for Physical rather than Emotional Reasons, from .29 to .14 (maximum = .21); for Reliance on Hunger and Satiety Cues, from .09 to .04 (maximum = .08). The exception was Body-Food Choice Congruence, where the mean of zero-order correlations was .15 and the mean for the partial correlations was .13 (maximum = .23). In spite of these reductions, several of the partial correlations remained statistically significant.
Discussion and Conclusions
Is the concept of intuitive eating really novel? Some of the proposed dimensions of intuitive eating seem to closely resemble the eating styles with a long scientific tradition, namely, emotional, external, and restrained eating. Our goal was to evaluate the incremental validity of intuitive eating over and above these already existing eating styles.
Similar to Tylka and Kroon Van Diest (2013), we found that the Spanish translation of the IES-2 had a satisfactory dimensional validity and adequate internal consistency. The inclusion of two correlated uniquenesses, as in Carbonneau et al. (2016), markedly improved the model fit. In spite of this improvement, the RMSEA of the final model was slightly over the proposed threshold. It is not uncommon for the interpretation of different fit indices like RMSEA and CFI to disagree (Lai & Green, 2016). Following our expectations and considering summed scores, Emotional Eating from the DEBQ and Eating for Physical rather than Emotional Reasons from the IES-2 presented a high correlation; the same can be said about Restrained Eating from the DEBQ and Unconditional Permission to Eat from the IES-2. Contrary to our hypothesis, External Eating from the DEBQ and Reliance on Hunger and Satiety Cues from the IES-2 were essentially independent. A simultaneous factor analysis of the DEBQ and IES-2 showed some interesting findings. The CFA-ESEM seven-factor solution provided a fit below the recommended thresholds. Although the correlations between the Emotional Eating – Eating for Physical rather than Emotional Reasons and the Restrained Eating – Unconditional Permission to Eat factors were in line with our expectations (rs > .80), we consider that these results should be disregarded. The modification indices pointed to the convenience of allowing cross-loadings between the IES-2 and the DEBQ factors. Not incorporating relevant cross-loadings in a model can distort to a large degree the estimation of inter-factor correlations (Asparouhov & Muthén, 2009). The ESEM seven-factor solution showed satisfactory model fit. As Morin, Marsh, and Nagengast (2013) noted: "ESEM should generally be preferred to ICM-CFA when the factors are appropriately identified by ESEM, the goodness of fit is meaningfully better than for ICM-CFA, and factor correlations are meaningfully smaller than for ICM-CFA" (p. 430; where ICM-CFA refers to the independent cluster model CFA, that is, CFAs where each item is allowed to load on a single factor, the common practice). These conditions were met in our results. Considering this, our ESEM seven-factor solution should be preferred over the CFA-ESEM solution. However, the ESEM seven-factor solution showed a large number of relevant cross-loadings that hampered the interpretation of the obtained results. The loadings for the Unconditional Permission to Eat factor were low in general, and the loadings for the Eating for Physical rather than Emotional Reasons factor were reduced in comparison with a model with only the IES-2 items. It is not clear whether this latent factor should be interpreted as Eating for Physical rather than Emotional Reasons. As found in Cebolla et al. (2014) with the DEBQ, the items related to boredom seem to be conceptually distinguishable from other items tapping emotional eating. In the solution with five factors, the model fit, although a little
worse than that of the seven-factor solution and below the recommended thresholds, was still in line with the mean fit of published models (Jackson, Gillaspy, & Purc-Stephenson, 2009), but this time the solution could be clearly interpreted because two highly related pairs of subscales were found to collapse into just two factors. After inspecting the modification indices for this model, we consider that there were no substantial specification problems, as the main areas of strain were correlated uniquenesses not freed. With the results of the better fitting ESEM model being more difficult to interpret than those of the worse-fitting ESEM model, neither the results of the seven-factor solution nor those of the five-factor solution are ideal. Nevertheless, these results, based on latent modeling, point to problems in interpreting two out of the four dimensions of the IES-2 as clearly distinguishable from the earlier eating styles. There are two options: if the seven-factor model is preferred, two of the dimensions of intuitive eating no longer represent what they were supposed to represent; if the five-factor model is preferred, two of the dimensions of intuitive eating can be collapsed with two dimensions of previously considered eating styles. We also tested the novelty or utility of the intuitive eating dimensions with partial correlations computed with summed scores. This allowed us to complement the analysis of latent variables with observed scores. It could be possible that two sets of items tap the same dimension, although each set is better suited for measuring a different extreme of the continuum. In this case, a single factor would emerge in a factor analysis, but the improvement in reliability due to conditional reliability could lead to incremental validity. Apparently, this is not the case here. For the five included criterion variables – constructs that have previously been used in research about intuitive eating (Bruce & Ricciardelli, 2016) – the associations with three out of four IES-2 scales were greatly reduced after controlling for the three DEBQ eating styles. The associations of Reliance on Hunger and Satiety Cues with body dissatisfaction and weight control behavior, for example, dropped from –.22 and –.19 to –.08 and –.01, respectively. An exception was Body-Food Choice Congruence, which was almost unaffected by the control variables. Taken together, the findings suggest that two factors from the IES-2, namely, Eating for Physical rather than Emotional Reasons and, more clearly, Unconditional Permission to Eat, offer little incremental validity over and above the already existing, earlier eating styles. The Eating for Physical rather than Emotional Reasons factor apparently covers some aspects of emotional eating that are not fully captured by the Emotional Eating DEBQ factor, with elements such as eating when bored deserving further research. The comparison of the DEBQ, the IES-2, and measures that assess
eating when bored (Koball, Meers, Storfer-Isser, Domoff, & Musher-Eizenman, 2012) could shed further light on this matter. The dimension Reliance on Hunger and Satiety Cues, although not showing any overlap with DEBQ External Eating, had only small partial correlations with our five criterion variables. The most promising scale of the IES-2 is the three-item Body-Food Choice Congruence scale. Some limitations of our study should be noted. First, we used a convenience sample of mainly Spanish young adults, in which women with higher education were overrepresented. Further research is needed with more representative samples. Second, we have assumed that the IES-2 validly measures the intuitive eating construct. In case of problems with the content validity of the IES-2, we could be missing some relevant aspects of intuitive eating, although, to our knowledge, the IES-2 is the most commonly used measure for assessing intuitive eating. Third, we inadvertently shortened the PANAS questionnaire, as we omitted one item per subscale. Fourth, we did not use the latest version of the Eating Disorders Inventory, the EDI-3 (Garner, 2004), but the EDI-2. In the EDI-3, a new item is assigned to the Body Dissatisfaction subscale. In both cases, considering the high Cronbach's alphas for the questionnaires used and the fact that even shorter versions of the PANAS have been proposed (Mackinnon et al., 1999; Thompson, 2007), we consider this a minor problem. Fifth, there is an ongoing debate about the validity of assessing eating styles by means of questionnaires (e.g., Jansen et al., 2011; van Strien, Herman, & Anschutz, 2012; van Strien, Herman, Anschutz, Engels, & de Weerth, 2012). The critical evaluation of the evidence and arguments of the different positions is clearly beyond the scope of the present paper. Sixth, the data used to compute BMI were self-reported, although studies have found a high correlation between self-reported and actually measured body dimensions (e.g., McAdams, Van Dam, & Hu, 2007). Seventh, we have not computed conditional reliabilities. It is possible that the DEBQ and the IES-2 are better suited to different ranges of the trait levels. In spite of this, some relevant, albeit tentative, conclusions can be drawn. The novelty of two of the dimensions of the intuitive eating construct as operationalized with the IES-2 (Eating for Physical rather than Emotional Reasons and Unconditional Permission to Eat) seems not as high as claimed. The other two dimensions of the IES-2 (Reliance on Hunger and Satiety Cues and Body-Food Choice Congruence) can be considered as new eating styles, mainly the second one. We have provided evidence about this, both with latent and observed variables. We are not suggesting here that intuitive eating is conceptually empty or irrelevant. We share the idea that a fresh look is needed in the area of eating behavior and the relation between health and weight (Bacon & Aphramor, 2011; Mann et al., 2007). But efforts in this line would clearly benefit from
incorporating what is already known in the area of eating styles, specifically emotional and restrained eating.

Acknowledgments
This research was supported by a grant from the Fundación Universitaria Antonio Gargallo and the Obra Social de Ibercaja. CIBERobn is an initiative of the ISCIII.

Electronic Supplementary Materials
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000482
ESM 1. Text (.txt). Readme.
ESM 2. Data (.dat). Data file (variables included in ESM 3).
ESM 3. Data (.Rmd). R Markdown file.
ESM 4. Data (.inp). Mplus syntax for model 1.
ESM 5. Data (.inp). Mplus syntax for model 2.
ESM 6. Data (.inp). Mplus syntax for model 3.
ESM 7. Data (.inp). Mplus syntax for model 4.
ESM 8. Data (.inp). Mplus syntax for model 5.
ESM 9. Data (.inp). Mplus syntax for model 6.
ESM 10. Data (.inp). Mplus syntax for model 7.
References Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling, 16, 397– 438. https://doi.org/10.1080/10705510903008204 Augustus-Horvath, C. L., & Tylka, T. L. (2011). The acceptance model of intuitive eating: A comparison of women in emerging adulthood, early adulthood, and middle adulthood. Journal of Counseling Psychology, 58, 110–125. https://doi.org/10.1037/ a0022129 Bacon, L., & Aphramor, L. (2011). Weight science: Evaluating the evidence for a paradigm shift. Nutrition Journal, 2011, 1–13. https://doi.org/10.1186/1475-2891-10-9 Barrada, J. R., van Strien, T., & Cebolla, A. (2016). Internal structure and measurement invariance of the Dutch Eating Behavior Questionnaire (DEBQ) in a (nearly) representative Dutch community sample. European Eating Disorders Review, 24, 503–509. https://doi.org/10.1002/erv.2448 Bombak, A. (2014). Obesity, health at every size, and public health policy. American Journal of Public Health, 104, 60–67. https:// doi.org/10.2105/AJPH.2013.301486 Bruce, L. J., & Ricciardelli, L. A. (2016). A systematic review of the psychosocial correlates of intuitive eating among adult women.
Appetite, 96, 454–472. https://doi.org/10.1016/j.appet.2015. 10.012 Carbonneau, E., Carbonneau, N., Lamarche, B., Provencher, V., Bégin, C., Bradette-Laplante, M., . . . Lemieux, S. (2016). Validation of a French-Canadian adaptation of the Intuitive Eating Scale-2 for the adult population. Appetite, 105, 37–45. https://doi.org/10.1016/j.appet.2016.05.001 Camilleri, G. M., Méjean, C., Bellisle, F., Andreeva, V. A., Sautron, V., Hercberg, S., & Péneau, S. (2015). Cross-cultural validity of the Intuitive Eating Scale-2. Psychometric evaluation in a sample of the general French population. Appetite, 84, 34–42. https://doi.org/10.1016/j.appet.2014.09.009 Cebolla, A., Barrada, J. R., van Strien, T., Oliver, E., & Baños, R. (2014). Validation of the Dutch Eating Behavior Questionnaire (DEBQ) in a sample of Spanish women. Appetite, 73, 58–64. https://doi.org/10.1016/j.appet.2013.10.014 Denny, K. N., Loth, K., Eisenberg, M. E., & Neumark-Sztainer, D. (2013). Intuitive eating in young adults. Who is doing it, and how is it related to disordered eating behaviors? Appetite, 60, 13– 19. https://doi.org/10.1016/j.appet.2012.09.029 Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction with Life Scale. Journal of Personality Assessment, 49, 71–75. https://doi.org/10.1207/s15327752jpa4901_13 Finney, S. J., & DiStefano, C. (2006). Non-normal and categorical data in structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course. Greenwich, CT: Information Age Publishing. Garner, D. M. (1991). Eating Disorders Inventory-2: Professional manual. Lutz, FL: Psychological Assessment Resources. Garner, D. M. (1998). Inventario de Trastornos de la Conducta Alimentaria-2 [Eating Disorders Inventory-2]. Madrid, Spain: Tea Ediciones. Garner, D. M. (2004). EDI-3 Eating Disorders Inventory-3: Professional manual. Odessa, FL: Psychological Assessment Resources. Garner, D. M., Olmsted, M. P., Bohr, Y., & Garfinkel, P. E. (1982). The Eating Attitudes Test: Psychometric features and clinical correlates. Psychological Medicine, 12, 871–878. https://doi. org/10.1017/S0033291700049163 Gast, J., Madanat, H., & Campbell Nielson, A. (2012). Are men more intuitive when it comes to eating and physical activity? American Journal of Men’s Health, 6, 164–171. https://doi.org/ 10.1177/1557988311428090 Hallquist, M., & Wiley, J. (2016). Package “MplusAutomation”. Retrieved from at https//cran.r-project.org/web/packages/ MplusAutomation/ Haynes, S. N., & Lench, H. C. (2003). Incremental validity of new clinical assessment measures. Psychological Assessment, 15, 456–466. https://doi.org/10.1037/1040-3590.15.4.456 Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. https://doi. org/10.1080/10705519909540118 Hunsley, J., & Meyer, G. J. (2003). The incremental validity of psychological testing and assessment: Conceptual, methodological, and statistical issues. Psychological Assessment, 15, 446–455. https://doi.org/10.1037/1040-3590.15.4.446 Jackson, D. L., Gillaspy, J. A. Jr., & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: An overview and some recommendations. Psychological Methods, 14, 6–23. https://doi.org/10.1037/a0014694 Jansen, A., Nederkoorn, C., Roefs, A., Bongers, P., Teugels, T., & Havermans, R. (2011). 
The proof of the pudding is in the eating: Is the DEBQ-external eating scale a valid measure of external eating? International Journal of Eating Disorders, 44, 164–168. https://doi.org/10.1002/eat.20799
Koball, A. M., Meers, M. R., Storfer-Isser, A., Domoff, S. E., & Musher-Eizenman, D. R. (2012). Eating when bored: Revision of the Emotional Eating Scale with a focus on boredom. Health Psychology, 31, 521–524. https://doi.org/10.1037/a0025893 Lai, K., & Green, S. B. (2016). The problem with having two watches: Assessment of fit when RMSEA and CFI disagree. Multivariate Behavioral Research, 51, 220–239. https://doi.org/ 10.1080/00273171.2015.1134306 McAdams, M. A., Van Dam, R. M., & Hu, F. B. (2007). Comparison of self-reported and measured BMI as correlates of disease markers in US adults. Obesity, 15, 188–196. https://doi.org/ 10.1038/oby.2007.504 Mackinnon, A., Jorm, A. F., Christensen, H., Korten, A. E., Jacomb, P. A., & Rodgers, B. (1999). A short form of the Positive and Negative Affect Schedule: Evaluation of factorial validity and invariance across demographic variables in a community sample. Personality and Individual differences, 27, 405–416. https://doi.org/10.1016/S0191-8869(98)00251-7 Mann, T., Tomiyama, A. J., Westling, E., Lew, A. M., Samuels, B., & Chatman, J. (2007). Medicare’s search for effective obesity treatments: Diets are not the answer. The American Psychologist, 62, 220–233. https://doi.org/10.1037/0003-066X.62.3.220 Miller, W. C. (2005). The weight-loss-at-any-cost environment: How to thrive with a health-centered focus. Journal of Nutrition Education and Behavior, 37, 89–93. https://doi.org/10.1016/ S1499-4046(06)60205-4 Moral de la Rubia, J. (2011). La Escala de Afecto Positivo y Negativo (PANAS) en parejas casadas mexicanas [The Positive and Negative Affect Scale in married Mexican couples]. Ciencia Ergo Sum, 18, 117–125. Morin, A. J. S., Marsh, H. W., & Nagengast, B. (2013). Exploratory structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling. A second course (2nd ed., pp. 395–436). Charlotte, NC: Information Age Publishing. Muñiz, J., Elosua, P., & Hambleton, R. K. (2013). Directrices para la traducción y adaptación de los tests: Segunda edición [International Test Commission Guidelines for test translation and adaptation: Second edition]. Psicothema, 25, 151–157. https://doi.org/10.7334/psicothema2013.24 Muthén, L. K., & Muthén, B. O. (1998–2015). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén. Neumark-Sztainer, D., Wall, M., Larson, N. I., Eisenberg, M. E., & Loth, K. (2011). Dieting and disordered eating behaviors from adolescence to young adulthood: Findings from a 10-year longitudinal study. Journal of the American Dietetic Association, 111, 1004–1011. https://doi.org/10.1016/j.jada.2011.04.012 R Core Team. (2017). R: A language and environment for statistical computing Retrieved from https//www.R-project.org/. Vienna, Austria: R Foundation for Statistical Computing Revelle, W. (2017). Package “psych”. Retrieved from https//cran.rproject.org/web/packages/psych/psych.pdf Ruzanska, U. A., & Warschburger, P. (2017). Psychometric evaluation of the German version of the Intuitive Eating Scale-2 in a community sample. Appetite, 117, 126–134. https://doi.org/ 10.1016/j.appet.2017.06.018 Sánchez-Carracedo, D., López-Guimerà, G., Fauquet, J., Barrada, J. R., Pàmias, M., Puntí, J., . . . Trepat, E. (2013). A school-based program implemented by community providers previously trained for the prevention of eating and weight-related problems in secondary-school adolescents: The MABIC study protocol. BMC Public Health, 13, 955. https://doi.org/10.1186/ 1471-2458-13-955 Smith, T., & Hawks, S. R. (2006). 
Intuitive eating, diet composition, and the meaning of food in healthy weight promotion. American Journal of Health Education, 37, 130–136. https://doi.org/ 10.1080/19325037.2006.10598892
Striegel-Moore, R. H., & Bulik, C. M. (2007). Risk factors for eating disorders. The American Psychologist, 62, 181–198. https://doi. org/10.1037/0003-066X.62.3.181 Thompson, E. R. (2007). Development and validation of an internationally reliable short-form of the Positive and Negative Affect Schedule (PANAS). Journal of Cross-Cultural Psychology, 38, 227–242. https://doi.org/10.1177/0022022106297301 Tribole, E., & Resch, E. (1995). Intuitive eating: A recovery book for the chronic dieter. New York, NY: St. Martin’s Press. Tylka, T. L. (2006). Development and psychometric evaluation of a measure of intuitive eating. Journal of Counseling Psychology, 53, 226–240. https://doi.org/10.1037/0022-0167.53.2.226 Tylka, T. L., & Kroon Van Diest, A. M. K. (2013). The Intuitive Eating Scale-2: Item refinement and psychometric evaluation with college women and men. Journal of Counseling Psychology, 60, 137–153. https://doi.org/10.1037/a0030893 Tylka, T. L., & Wilcox, J. A. (2006). Are intuitive eating and eating disorder symptomatology opposite poles of the same construct? Journal of Counseling Psychology, 53, 474–485. https:// doi.org/10.1037/0022-0167.53.4.474 Van Dyck, Z., Herbert, B. M., Happ, C., Kleveman, G. V., & Vögele, C. (2016). German version of the Intuitive Eating Scale: Psychometric evaluation and application to an eating disordered population. Appetite, 105, 798–807. https://doi.org/ 10.1016/j.appet.2016.07.019 van Strien, T., Frijters, J. E. R., Bergers, G. P. A., & Defares, P. B. (1986). The Dutch Eating Behavior Questionnaire (DEBQ) for assessment of restrained, emotional and external eating behavior. International Journal of Eating Disorders, 5, 295–315. https://doi.org/10.1002/1098-108X(198602)5:2<295:: AID-EAT2260050209>3.0.CO;2-T
van Strien, T., Herman, C. P., & Anschutz, D. (2012). The predictive validity of the DEBQ-external eating scale for eating in response to food commercials while watching television. International Journal of Eating Disorders, 45, 257–262. https://doi.org/10.1002/eat.20940 van Strien, T., Herman, C. P., Anschutz, D. J., Engels, R. C., & de Weerth, C. (2012). Moderation of distress-induced eating by emotional eating scores. Appetite, 58, 277–284. https://doi.org/ 10.1016/j.appet.2011.10.005 Vázquez, C., Duque, A., & Hervás, G. (2013). Satisfaction with Life Scale in a representative sample of Spanish adults: Validation and normative data. Spanish Journal of Psychology, 16, E82. https://doi.org/10.1017/sjp.2013.82 Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54, 1063–1070. https://doi.org/10.1037/0022-3514.54.6.1063 Received January 2, 2016 Revision received December 30, 2017 Accepted February 2, 2018 Published online August 3, 2018 EJPA Section/Category Clinical Psychology Juan Ramón Barrada Facultad de Ciencias Sociales y Humanas Universidad de Zaragoza 44003 Teruel Spain barrada@unizar.es
Original Article
Further Evidence for Criterion Validity and Measurement Invariance of the Luxembourg Workplace Mobbing Scale

Philipp E. Sischka1, Alexander F. Schmidt2, and Georges Steffgen1

1 Institute for Health and Behavior, Research Group of Health Promotion and Aggression Prevention, University of Luxembourg, Luxembourg
2 Institute of Psychology, Social & Legal Psychology, Johannes Gutenberg University Mainz, Mainz, Germany
Abstract: Workplace mobbing has various negative consequences for targeted individuals and is costly to organizations. At present it is debated whether gender, age, or occupation are potential risk factors. However, empirical data remain inconclusive, as measures of workplace mobbing so far lack measurement invariance (MI) testing – a prerequisite for meaningful manifest between-group comparisons. To close this research gap, the present study sought to further elucidate MI of the recently developed brief Luxembourg Workplace Mobbing Scale (LWMS; Steffgen, Sischka, Schmidt, Kohl, & Happ, 2016) across gender, age, and occupational groups and to test whether these factors represent important risk factors of workplace mobbing. Furthermore, we sought to expand data on criterion validity of the LWMS with different self-report criterion measures such as psychological health (e.g., work-related burnout, suicidal thoughts), physiological health problems, organizational behavior (i.e., subjective work performance, turnover intention, and absenteeism), and with a self-labeling mobbing index. Data were collected via computer-assisted telephone interviews (CATI) in a representative sample of 1,480 employees working in Luxembourg (aged from 16 to 66; 45.7% female). Confirmatory factor analyses revealed scalar MI across gender and occupation as well as partial scalar invariance across age groups. None of these factors impacted on the level of workplace mobbing. Correlation and receiver operating characteristic (ROC) analyses strongly support the criterion validity of the LWMS. Due to its brevity, its robustness across language, age, gender, and occupational groups, and its meaningful criterion validity, the LWMS is particularly attractive for large-scale surveys as well as for single-case assessment; thus, general percentile norms are reported in the Electronic Supplementary Materials. Keywords: workplace mobbing, measurement invariance testing, working conditions
Workplace mobbing is a serious phenomenon that has various negative consequences for the targeted employees' health (e.g., depression, burnout), attitudes (e.g., lower job commitment), and work-related behavior (e.g., absence; Bowling & Beehr, 2006; Nielsen & Einarsen, 2012). There exist several measures of workplace mobbing (for an overview, see Nielsen, Matthiesen, & Einarsen, 2010). However, these measures suffer from several shortcomings: First, they are rather long and therefore less economical (e.g., the Leymann Inventory of Psychological Terror with 45 items; Leymann, 1996). Second, they are confounded with behaviors not relevant to workplace mobbing (e.g., respecting tight deadlines), which compromises construct validity (Agervold, 2007). Third, many scales have only been tested
in selective samples, thus limiting generalizability (e.g., Simons, Stark, & DeMarco, 2011). Finally, only limited data exist on these measures' psychometric properties (Einarsen, Hoel, & Notelaers, 2009). With the aim of overcoming these weaknesses, a brief 5-item workplace mobbing measure – the Luxembourg Workplace Mobbing Scale (LWMS) – has recently been published (Steffgen, Sischka, Schmidt, Kohl, & Happ, 2016). The LWMS has good psychometric properties concerning its reliability, its one-factorial structure, and measurement invariance (MI) across three different language versions (German, French, and Luxembourgish). Furthermore, indications of criterion validity were reported as the LWMS was meaningfully associated with job satisfaction, respect at
work, communication and feedback, cooperation, appraisal of work, mental strain at work, burnout, and psychological stress (Steffgen et al., 2016). However, MI of the LWMS – a prerequisite for meaningful mean level comparisons – across gender, age, and occupational groups, which are frequently compared subsamples in the workplace mobbing literature, remains untested to date. Furthermore, criterion validity so far rests on a restricted number of ad hoc designed self-report scales. Accordingly, the main purpose of the present study was to test the LWMS for MI across gender, age, and occupational groups and to expand its nomological net with relevant psychological and physiological health measures as well as important organizational criteria (i.e., work performance, turnover intention, absenteeism).
Workplace Mobbing Prevalence and Measurement Invariance
A critical task in workplace mobbing research concerns the estimation of prevalence rates for differential groups and the identification of possible risk groups (e.g., Mikkelsen & Einarsen, 2001; Ortega, Høgh, Pejtersen, & Olsen, 2009). For instance, it is discussed whether women have a higher victim prevalence than men (e.g., Salin & Hoel, 2013), whether younger employees have a higher prevalence rate than older employees (e.g., Quine, 1999), and whether there are different workplace mobbing prevalence rates for different work sectors (e.g., Niedhammer, David, & Degioanni, 2007). Comparisons of these risk groups hinge on the assumption that manifest mean levels are meaningfully comparable across subgroups. Critically, none of the existing workplace mobbing scales has been tested for MI across these groups, which is a necessary prerequisite for such direct comparisons. In order to meaningfully compare constructs, the measurement structures of the latent factors and their corresponding manifest items need to be stable across the compared research units (e.g., Vandenberg & Lance, 2000). If MI has not been tested, differences between groups cannot be unambiguously attributed to ‘real’ differences or to differences in the measurement attributes (Steinmetz, 2013). A lack of MI for different workplace mobbing assessment instruments across different groups is plausible, as the same negative behavior might be perceived differently across groups. These perceptual differences could stem from differential socialization, group norms, social expectations, as well as different perceptions of power imbalance (Cortina & Magley, 2009; Salin & Hoel, 2013). For instance, men's masculinity concepts may affect the perception of what behavior depicts “being ridiculed” (Einarsen &
Raknes, 1997), and this could substantially differ from women's perceptions. Therefore, gender differences due to differential sensitivity in classifying certain experiences as mobbing behavior are plausible (e.g., Salin, 2003). For the same reason, one could also expect a lack of MI for different age groups due to cohort socialization effects. Furthermore, different occupational norms might influence the perception of appropriate behavior standards across different occupational fields (Parzefall & Salin, 2010; Salin & Hoel, 2011). Usually, MI is tested across a hierarchical set of increasingly stringent invariance assumptions (Widaman & Reise, 1997). Testing starts with configural invariance, followed by metric invariance, and finally scalar invariance (Vandenberg & Lance, 2000). Configural invariance refers to the same pattern of loadings of the items on the constructs in each group. Configural invariance supposes that the same indicators measure the same latent constructs in the compared groups (i.e., have the same meaning in all groups). This is supported if the same factor model structure fits the data well in all groups (Little, 2013). Configural invariance is a necessary prerequisite for further MI tests and is used as the baseline model to evaluate further invariance tests (Little, 2013). Metric invariance indicates that the indicators have the same metric in all groups. In other words, metric invariance can be assumed when changes in the latent variable lead to the same expected changes on the indicators in all groups (Vandenberg & Lance, 2000). The metric invariance assumption holds when the factor loadings can be constrained to have the same value in each group without a substantial deterioration of fit indices compared to the configural model. Finally, scalar invariance indicates that the meaning of the construct and the levels of the underlying items are equal across groups and that groups are comparable on the latent variable (van de Schoot, Lugtig, & Hox, 2012). Hence, in order to test for scalar invariance, in addition to the constraints of equal factor loadings (i.e., metric invariance), the intercepts of each item are also equated across groups. If this equality constraint does not lead to a substantial deterioration of fit indices compared to the metric invariance model, the assumption of scalar invariance holds. Because invariance of all indicators (full MI) is a very strict assumption that is seldom fulfilled, Byrne, Shavelson, and Muthén (1989) introduced the concept of partial invariance. Partial invariance requires that only a subset of factor loadings and/or indicator intercepts be invariant, whereas others are allowed to vary between the compared groups. (Partial) scalar invariance allows for meaningful level comparisons between different groups, because the observed indicators have identical (i.e., invariant) quantitative relationships with the latent variable within each group (e.g., Widaman & Reise, 1997).
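As a concrete illustration of this configural–metric–scalar sequence, here is a minimal R sketch using the lavaan package. lavaan is an assumption made for illustration only (the present article does not state which software was used for its confirmatory factor analyses), and the item names lw1–lw5, the grouping variable, and the data frame `dat` are hypothetical.

```r
library(lavaan)

# One-factor model for the five LWMS items (hypothetical item names lw1-lw5)
model <- 'mobbing =~ lw1 + lw2 + lw3 + lw4 + lw5'

# Increasingly stringent invariance models across a grouping variable
configural <- cfa(model, data = dat, group = "gender")
metric     <- cfa(model, data = dat, group = "gender",
                  group.equal = "loadings")
scalar     <- cfa(model, data = dat, group = "gender",
                  group.equal = c("loadings", "intercepts"))

# Invariance holds at each step if fit does not deteriorate substantially
# relative to the less constrained model
lavTestLRT(configural, metric, scalar)

# Partial scalar invariance: free a single non-invariant item intercept
partial_scalar <- cfa(model, data = dat, group = "gender",
                      group.equal   = c("loadings", "intercepts"),
                      group.partial = "lw3 ~ 1")
```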
Criterion Validity of Workplace Mobbing
Based on recent findings on workplace mobbing (e.g., Bowling & Beehr, 2006; Nielsen & Einarsen, 2012), we used several measures to expand analyses on criterion validity and the nomological net of the LWMS (Cronbach & Meehl, 1955). Because Steffgen et al. (2016) had already tested the relationship with some job attitudes (work satisfaction, perceived respect), we now shifted our focus to psychological and physiological health as well as organizational criteria with some objective (i.e., behavioral) indicators. In the following, we present the outcome constructs and the theoretical reasons why workplace mobbing should be related to them. It can be hypothesized that prolonged exposure to workplace mobbing threatens fundamental psychological needs (e.g., sense of belonging; Aquino & Thau, 2009). The victim is at the receiving end of negative social behavior that aims to stigmatize, to repress, and to belittle accomplishments (e.g., being ignored, ridiculed, criticized). In consequence, this fosters feelings of isolation, ostracism, oppression, incompetence, and self-doubt (Trépanier, Fernet, & Austin, 2015) in the mobbing victim. Self-determination theory states that such violations of basic psychological needs influence employees' functioning and well-being (Deci & Ryan, 2008). Accordingly, the violation of these psychological needs has been associated with lower work engagement and performance, increased burnout levels, and poor psychological health (Fernet, Austin, Trépanier, & Dussault, 2013; Van den Broeck, Vansteenkiste, De Witte, & Lens, 2008; Van den Broeck, Vansteenkiste, De Witte, Soenens, & Lens, 2010) as well as turnover intention (Richer, Blanchard, & Vallerand, 2002). Similarly, negative consequences could also be due to mobbing victims' attributions (Cortina & Magley, 2009). If mobbing victims attribute their experience to the organization (e.g., absence of protective conditions that could help them to cope with this stressor), they are likely to develop feelings of resentment toward not only the mobbing perpetrator but also the organization itself (e.g., due to felt psychological breaching and violation of contract; Parzefall & Salin, 2010). Among others, the negative consequences of such perceived breach are higher turnover intention and lower individual performance (Zhao, Wayne, Glibkowski, & Bravo, 2007). Indeed, workplace mobbing research showed associations between workplace mobbing and work engagement, performance, burnout, decreased general psychological health, and turnover intention (e.g., Bowling & Beehr, 2006; Nielsen & Einarsen, 2012). As sickness absence is a common indicator of impaired health (Steers & Rhodes, 1984; Ortiz & Samaniego, 1995), absence at work has been empirically related to workplace mobbing (Ortega, Christensen, Hogh, Rugulies, & Borg, 2011).
In line with psychological stress theories (e.g., Lazarus & Folkman, 1984), workplace mobbing can be seen as a prolonged stressor that is systematically and persistently directed toward the mobbing victim (Hauge, Skogstad, & Einarsen, 2010). As a prolonged stressor, it is conceivable that workplace mobbing leads to cognitive arousal (e.g., worrying, difficulty controlling thoughts), which in turn leads to sleeping problems (Hansen, Hogh, Garde, & Persson, 2014). Furthermore, the Interpersonal Theory of Suicide (Van Orden et al., 2010) states that when people perceive themselves, over a prolonged period, to be socially alienated from others (e.g., through social exclusion) and feel that they are a burden on others (e.g., feelings of incompetence), they can develop a desire to die that is expressed in suicidal ideation. Therefore, a link between workplace mobbing and suicidal ideation seems conceivable. Tension reduction theory states that people use psychoactive substances such as alcohol (Cooper, Frone, Russell, & Mudar, 1995) or nicotine (Breslau, Peterson, Schultz, Chilcoat, & Andreski, 1998) under conditions of psychological distress in order to decrease negative affective states. Therefore, a direct link between alcohol/nicotine consumption and workplace mobbing is to be expected. Furthermore, the concept of emotional eating describes the tendency to eat in response to negative emotions in order to reduce emotional stress (Macht, 2008). All of these hypothesized links have received empirical support: Workplace mobbing was associated with sleeping problems (Hansen, Hogh, Garde, & Persson, 2014), suicidal thoughts (Nielsen, Einarsen, Notelaers, & Nielsen, 2016), alcohol (Richman, Rospenda, Flaherty, & Freels, 2001) and nicotine use (Quine, 1999), and weight gain (Kivimäki et al., 2006). Finally, when negative social behavior reaches a certain threshold, the target of these negative acts should perceive it as mobbing behavior (Parzefall & Salin, 2010). Workplace mobbing exposure has indeed been shown to be related to employees' perception of being victimized (Agervold, 2007; Einarsen et al., 2009).
Based on the above-mentioned detrimental consequences of workplace mobbing, we hypothesized that the LWMS is negatively related to subjective psychological well-being, work engagement, sleeping hours, and work performance, and positively related to physiological health problems, alcohol and smoking consumption, body mass index (BMI), suicidal thoughts, turnover intention, absenteeism, and self-labeling as a mobbing victim.
Method

Data Collection
The LWMS was evaluated as part of a larger longitudinal research project on the quality of work and its effects on health and well-being in Luxembourg (Sischka & Steffgen, 2016). This project has been implemented by the University of Luxembourg in collaboration with the Luxembourg Chamber of Labor (a council that aims to defend employees' rights with regard to legislation) as a yearly assessment since 2014. Data for the present research were collected via Computer-Assisted Telephone Interviews (CATI) with employees from Luxembourg's working population in the 2016 wave. The LWMS exists in four language versions: Luxembourgish, French, German, and Portuguese (of which the first three exhibit at least partial scalar invariance; Steffgen et al., 2016). All data reported in the present research are cross-sectional.
Participants
The initial sample consisted of 1,506 employees working in Luxembourg who were randomly chosen from the working population.1 Due to incomplete data, 1.7% (n = 26) of the participants had to be excluded from the analyses. The effective sample therefore consisted of 1,480 employees (45.7% females, n = 676). Included were Luxembourg residents (59.9%, n = 886) and commuters from Belgium (10.3%, n = 152), France (20.3%, n = 301), and Germany (9.5%, n = 141) who received wages for working at least 10 hrs/week. The interviewees' age ranged from 16 to 66 years (M = 45.7, SD = 8.9). The majority of participants had completed an apprenticeship (33.4%, n = 495) or held an academic degree (37.9%, n = 561). Employees' occupations were classified according to the International Standard Classification of Occupations (ISCO-08; International Labour Organization, 2012). Most participants worked as professionals (26.7%, n = 395), followed by technicians and associate professionals (25.1%, n = 371), clerical support workers (12.8%, n = 190), service and sales workers (10.8%, n = 160), craft and related trades workers (9.5%, n = 141), managers (5.3%, n = 78), plant and machine operators and assemblers (4.5%, n = 66), elementary occupations (3.6%, n = 54), and others (1.7%, n = 25). Women were more likely than men to work as clerical support workers (17.3%, n = 117, vs. 9.1%, n = 73; χ² = 22.219, df = 1, p < .001) and service and sales workers (15.4%, n = 104, vs. 7.0%, n = 56; χ² = 26.998, df = 1, p < .001), and less likely to work as plant and machine operators (1.0%, n = 7, vs. 7.3%, n = 59; χ² = 34.240, df = 1, p < .001) and craft and related trades workers (0.6%, n = 4, vs. 17.0%, n = 137; χ² = 115.260, df = 1, p < .001). Employees working as associate professionals (academic degree: 87.6%, n = 346) and managers (academic degree: 68.0%, n = 53) were more highly educated than employees working as plant and machine operators (academic degree: 4.6%, n = 3; χ² = 207.870, df = 2, p < .001).
Measures

Luxembourg Workplace Mobbing Scale (LWMS)
The LWMS (Steffgen et al., 2016) contains five items ("criticized," "ignored," "absurd duties," "ridiculed," "conflicts"). The response scale is a 5-point Likert scale ranging from 1 (= never) to 5 (= almost at all times). Scores on the LWMS were calculated as the mean across the items, thus ranging from 1 to 5, with higher scores reflecting a higher level of mobbing exposure. The reliability of the scale for the total sample is satisfactory (α = .72, ω = .73).

WHO-5 Well-Being Index
The 5-item scale (α = .85, ω = .85) is a well-validated brief general index of subjective psychological well-being (World Health Organization, 1998; Topp, Østergaard, Søndergaard, & Bech, 2015) with a response format ranging from 1 (= at no time) to 6 (= all of the time). A sample item is "Over the past two weeks I have felt cheerful and in good spirits."

Work-Related Burnout
We used six items of the 7-item2 subscale of the Copenhagen Burnout Inventory (Kristensen, Borritz, Villadsen, & Christensen, 2005; α = .85, ω = .86). A sample item is "Do you feel that every working hour is tiring for you?". The response scale is a 5-point Likert scale ranging from 1 (= never) to 5 (= almost at all times).

Vigor
The 3-item vigor subscale (α = .71, ω = .71) of the Utrecht Work Engagement Scale (Schaufeli, Bakker, & Salanova, 2006) captures high levels of energy and the willingness to invest effort in one's work, even in the face of difficulties and problems. A sample item is "At my work, I feel bursting with energy." The response format ranges from 1 (= never) to 5 (= almost at all times).
The following scales were designed ad hoc for validation purposes.
1 Notably, due to the longitudinal design of the overarching research project, 488 (32.4%) respondents from the 2016 wave also participated in Steffgen et al. (2016; wave 2014).
2 Unfortunately, due to a programming error in the CATI scheme we lost one item of the original Copenhagen Burnout Inventory (i.e., "Do you have enough energy for family and friends during leisure time?").
Subjective Physiological Health Problems
This 7-item index (α = .73, ω = .74) covers physiological health problems (general health problems, headaches, heart problems, back problems, joint problems, stomach pain, sleeping problems). Higher scores signify more physiological health problems. A sample item is "How often do you suffer from headaches?". The response scale is a 5-point Likert scale ranging from 1 (= never) to 5 (= almost at all times).

Sleeping Hours
The respondents were asked how many hours they sleep per day.

Alcohol Consumption
Respondents were asked how often they drink at least one glass of alcohol within a week, with a response scale ranging from 1 (= never) to 5 (= each day or nearly each day). If respondents did not answer "never," they were asked how many standard drinks they typically drink within a day, with a response format ranging from 1 (= one or two) to 5 (= ten or more). From these items, we calculated the number of glasses of alcohol per week.

Smoking
Participants were asked whether they smoke. The response format was dichotomous with 0 (= no) and 1 (= yes). If they stated "yes," they were asked how many cigarettes they smoke per day.

Body Mass Index
The respondents were asked about their body weight and body height. With this information, we calculated the body mass index (BMI).

Suicidal Thoughts
Participants were also asked whether they had suicidal thoughts during the last 12 months. The response format was dichotomous with 0 (= no) and 1 (= yes).

Subjective Work Performance
Work performance was assessed with two items (α = .68, ω = .69): "How do you evaluate your general work performance compared to your colleagues?" and "How does your supervisor evaluate your general work?". The response format ranged from 1 (= below average) to 5 (= above average).

Turnover Intention
The respondents were asked whether they planned to change their workplace in the near future. The response format was dichotomous with 0 (= no) and 1 (= yes).

Absenteeism
Participants were asked how many days they had been absent from work during the last 12 months.
Mobbing Self-Labeling
A 2-item mobbing self-labeling index with a dichotomous response format was constructed. First, respondents were given the following definition of mobbing: "Mobbing takes place, when a person is repeatedly treated badly or bullied by one or more persons with the intention to harm. In order to call a behavior mobbing, it has to be continued over a longer period and the affected person usually has difficulties to defend herself/himself. Singular conflicts and factually justified disputes do not represent mobbing". Subsequently, respondents were asked whether they considered themselves actual victims of mobbing by their colleagues (item 1) or by their supervisor (item 2). If respondents stated that they felt victimized by their colleagues and/or by their supervisor, they were counted as mobbing victims. The index therefore takes the values 0 (= non-mobbing victim) or 1 (= mobbing victim).
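As a hedged illustration of how the LWMS scores and the reliability coefficients reported in this section could be computed (the data frame d, the item names lwms1–lwms5, and the use of the psych package are assumptions, not the authors' code):

```r
# Hedged sketch: scoring the LWMS and estimating its reliability.
# d and lwms1-lwms5 are illustrative names for the data and the five items.
library(psych)

lwms_items <- d[, c("lwms1", "lwms2", "lwms3", "lwms4", "lwms5")]

# Scale score: mean across the five items (possible range 1-5)
d$lwms <- rowMeans(lwms_items, na.rm = TRUE)

# Cronbach's alpha
alpha(lwms_items)

# McDonald's omega from a one-factor solution
omega(lwms_items, nfactors = 1)
```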
Statistical Analysis
Given that the indicators' distributions have a strong influence on the estimation results of confirmatory factor analyses (CFAs), the univariate and multivariate distributions of the items were analyzed first. Subsequently, the factorial structure of the LWMS was tested for each subgroup separately with CFAs to check whether the one-factor model fitted adequately in all subgroups before evaluating more stringent MI models in the next steps. Satorra-Bentler scaled χ² statistics and robust SEs (Satorra & Bentler, 2001) were calculated, as they provide more accurate parameter estimates for items with five answer categories and under deviations from univariate and multivariate normality (Finney & DiStefano, 2013). The effects-coding method was used for scale setting to estimate each construct's latent variance in a non-arbitrary metric (Little, Slegers, & Card, 2006). Therefore, the latent LWMS has a theoretical range from 1 to 5 (similar to the manifest items). Model fit was evaluated with the root mean square error of approximation (RMSEA), the standardized root mean square residual (SRMR), the comparative fit index (CFI), and the Tucker-Lewis index (TLI). In a first step, we used the cutoff values from Hu and Bentler (1999) to evaluate the model fit for each group (RMSEA ≤ .06; SRMR ≤ .08; CFI ≥ .95; TLI ≥ .95). We used multigroup CFAs to test for MI across the compared groups. The ΔCFI was used to assess the goodness of fit of the MI models, as it has been shown to perform reasonably well in detecting (lack of) MI (e.g., Chen, 2007). A ΔCFI ≤ .01 between a baseline model and the resulting model indicates MI (e.g., Little, 2013). If metric or scalar invariance was not supported, we switched to fixed-factor scale setting.
Table 1. Fit indices for single CFAs and measurement invariance models across gender

Model                    df    χ²        RMSEA   RMSEA 90% CI    SRMR   CFI     TLI
Men (n = 804)             5     6.926     .022    [.000; .051]    .017   0.994   0.989
Women (n = 676)           5     4.320     .000    [.000; .036]    .016   1.000   1.004
Configural invariance    10    10.784     .010    [.000; .035]    .017   0.999   0.998
Metric invariance        14    15.235     .011    [.000; .032]    .026   0.998   0.997
Scalar invariance        18    19.986     .012    [.000; .032]    .029   0.997   0.997

Notes. No model is significant. χ² values are Satorra-Bentler corrected.
In this case, we freed one parameter at a time and conducted χ²-difference tests (Satorra & Bentler, 2001) between the nested models (i.e., the model with vs. without the equality constraint). To avoid Type I error inflation, we used a Bonferroni-adjusted α-level. Once the differentially functioning parameter was identified, we switched back to the effects-coding method. Finally, to test for mean differences, we set the latent means of the LWMS in the scalar (or partial scalar) invariance model to be equal across groups. We used χ²-difference tests to examine whether the equated latent means led to a substantial deterioration in model fit compared to the scalar (or partial scalar) invariance model without equated latent means. Criterion validity was assessed with intercorrelations (Pearson's r and point-biserial correlations as effect sizes). However, because point-biserial correlations in particular are base-rate-dependent and become substantially attenuated with increasing deviation from a 50% base rate (Babchishin & Helmus, 2016), we also report areas under the curve (AUCs) from receiver operating characteristic (ROC) analyses (Swets, 1986; AUCs of .50, .65, .75, and .90 correspond to unbiased rpb values of 0, .26, .43, and .67, respectively; Rice & Harris, 2005). Because AUCs are independent of base rates, they correct for associations that are deflated by floor effects in the criterion base rates of the frequency measures.
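For the dichotomous criteria, such base-rate-independent AUCs could be obtained along the following lines. This is a hedged sketch assuming the pROC package and hypothetical variable names (d, self_label coded 0/1, and the LWMS mean score lwms); the article does not specify which package was used for the ROC analyses.

```r
# Hedged sketch: ROC analysis for a dichotomous criterion.
# d, self_label, and lwms are illustrative names.
library(pROC)

roc_obj <- roc(response  = d$self_label,  # 0 = non-victim, 1 = self-labeled victim
               predictor = d$lwms,        # LWMS mean score
               levels = c(0, 1), direction = "<")

auc(roc_obj)     # area under the ROC curve
ci.auc(roc_obj)  # 95% confidence interval for the AUC
```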
Results
The software R version 3.4.3 (R Core Team, 2017) was used to conduct all analyses. Software inputs and outputs for all analyses are provided in the Electronic Supplementary Material (see ESM 1).
Preliminary Analysis
Regarding the univariate distributions for the whole sample, item means ranged from 1.27 to 2.25 (SDs between 0.61 and 0.99), skewness between 0.61 and 2.65, and kurtosis between .06 and 7.93. Furthermore, the items violated multivariate normality (Mardia's multivariate skewness: γ̂1,5 = 10.36, χ² = 2,555.40, p < .001; Mardia's multivariate kurtosis: γ̂2,5 = 58.28, z = 53.52, p < .001). Therefore, we estimated all model parameters, SEs, fit indices, and χ²-statistics according to Satorra and Bentler (2001).
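A hedged sketch of how such distributional checks might be run in R (the psych package's describe() and mardia() functions are one option; the data frame and item names are illustrative):

```r
# Hedged sketch: univariate and multivariate normality checks for the items.
# d and lwms1-lwms5 are illustrative names.
library(psych)

lwms_items <- d[, c("lwms1", "lwms2", "lwms3", "lwms4", "lwms5")]

describe(lwms_items)   # per-item means, SDs, skewness, kurtosis
mardia(lwms_items)     # Mardia's multivariate skewness and kurtosis tests
```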
Factor Structure
The reliability of the LWMS for the total sample was satisfactory (α = .72, ω = .73). The one-factor structure established in Steffgen et al. (2016) fitted very well (χ² = 6.527; df = 5; p = .258; RMSEA = .014; 90% CI = [.000; .035]; SRMR = .013; CFI = .998; TLI = .995).3 Regarding gender, reliability was satisfactory for men (α = .70; ω = .70) and women (α = .74; ω = .75). Table 1 shows the results for the tests of different forms of MI across gender. Scalar invariance was confirmed. Men and women did not differ in their mobbing levels (men: M = 1.85, SD = 0.46; women: M = 1.81, SD = 0.51; Δχ²(1) = 1.959, p = .162). Reliability for all age groups fell into an acceptable to satisfactory range (α range from .68 to .75; ω range from .68 to .76). Table 2 shows the results for the different MI tests across age groups. Configural and metric invariance were confirmed, but scalar invariance was rejected. The χ²-difference tests with Bonferroni corrections revealed no clear non-invariant parameter (see ESM 2). Therefore, we freed the intercepts with the highest influence on model fit (i.e., items 1 and 2), thus leading to an acceptable deterioration in model fit compared to the metric invariance model. We used this partial scalar invariance model to test for mean differences between the age groups. Importantly, factor mean differences were thus exclusively based on the three items whose intercepts were fixed (items 3–5).
3 As a control analysis, restricting the sample to the respondents (n = 992) who were not included in Steffgen et al. (2016) still supported the one-factor structure (χ² = 7.236; df = 5; p = .204; RMSEA = .021; 90% CI = [.000; .045]; SRMR = .017; CFI = .995; TLI = .990).
Table 2. Fit indices for single CFAs and measurement invariance models across age groups

Model                        df    χ²         RMSEA   RMSEA 90% CI    SRMR   CFI     TLI
16–34 (n = 175)               5     5.217      .016    [.000; .090]    .034   0.997   0.994
35–44 (n = 430)               5    11.271*     .054    [.023; .085]    .034   0.969   0.938
45–54 (n = 641)               5     1.799      .000    [.000; .000]    .009   1.000   1.021
55+ (n = 234)                 5     3.704      .000    [.000; .066]    .021   1.000   1.023
Configural invariance        20    22.920      .020    [.000; .044]    .021   0.996   0.992
Metric invariance            32    40.534      .027    [.000; .045]    .041   0.988   0.985
Scalar invariance            44    64.498*     .035    [.018; .050]    .046   0.970   0.973
Partial scalar invariance    41    54.682      .030    [.007; .046]    .044   0.980   0.981

Notes. *p < .05. χ² values are Satorra-Bentler corrected. Partial scalar invariance: intercepts of items 1 and 2 freely estimated.
Table 3. Fit indices for single CFAs and measurement invariance models across occupational groups

Model                                                 df    χ²        RMSEA   RMSEA 90% CI    SRMR   CFI     TLI
Professionals (n = 395)                                5     2.060     .000    [.000; .028]    .013   1.000   1.035
Clerical support workers (n = 190)                     5     5.977     .032    [.000; .099]    .029   0.992   0.984
Service and sales workers (n = 160)                    5     7.507     .056    [.000; .108]    .041   0.967   0.934
Technicians and associate professionals (n = 371)      5     5.086     .007    [.000; .065]    .020   1.000   0.999
Configural invariance                                 20    21.846     .018    [.000; .049]    .022   0.997   0.993
Metric invariance                                     32    33.164     .011    [.000; .041]    .041   0.998   0.997
Scalar invariance                                     44    50.046     .022    [.000; .044]    .046   0.989   0.990

Notes. No model is significant. χ² values are Satorra-Bentler corrected.
This model showed no difference between the age groups regarding the workplace mobbing level (16–34 years: M = 1.82, SD = 0.46; 35–44 years: M = 1.86, SD = 0.52; 45–54 years: M = 1.77, SD = 0.48; 55+ years: M = 1.83, SD = 0.43; Δχ²(3) = 4.531, p = .210).
MI tests across different occupational groups were based on subgroups with a substantial sample size (n ≥ 150 respondents, in order to guarantee sufficient power for detecting at least moderately non-invariant items; Meade & Bauer, 2007). These groups were: professionals (n = 395), technicians and associate professionals (n = 371), clerical support workers (n = 190), and service and sales workers (n = 160). Reliability for all four occupational groups was in a satisfactory range (α range from .71 to .75; ω range from .71 to .76). The results indicated that the single-factor model presented a good fit to the data for all tested groups (Table 3). Configural, metric, and scalar invariance were confirmed across occupational groups. The latent means and SDs of the LWMS for the occupational groups were: professionals, M = 1.84, SD = 0.45; clerical support workers, M = 1.82, SD = 0.48; service and sales workers, M = 1.86, SD = 0.55; technicians and associate professionals, M = 1.84, SD = 0.47. The χ²-difference test indicated no differences between the four occupational groups (Δχ²(3) = 0.412, p = .938).
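A hedged sketch of how a partial scalar invariance model and the latent mean comparison could be specified in lavaan (the grouping variable, item names, and choice of freed intercepts are illustrative; the article does not show its model syntax):

```r
# Hedged sketch: partial scalar invariance and latent mean comparison.
# d, lwms1-lwms5, and age_group are illustrative names.
library(lavaan)

model <- 'mobbing =~ lwms1 + lwms2 + lwms3 + lwms4 + lwms5'

# Partial scalar model: intercepts of items 1 and 2 freed across groups,
# Satorra-Bentler (MLM) estimation as in the article
fit_partial <- cfa(model, data = d, group = "age_group", estimator = "MLM",
                   group.equal   = c("loadings", "intercepts"),
                   group.partial = c("lwms1~1", "lwms2~1"))

# Same model with the latent means additionally constrained to equality
fit_means <- cfa(model, data = d, group = "age_group", estimator = "MLM",
                 group.equal   = c("loadings", "intercepts", "means"),
                 group.partial = c("lwms1~1", "lwms2~1"))

# Scaled chi-square difference test for the latent mean comparison
lavTestLRT(fit_partial, fit_means)
```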
Criterion Validity
Table 4 shows the intercorrelations (Pearson or point-biserial correlations, Cramer's V) between the LWMS and the criterion measures. The WHO-5 as well as vigor were negatively correlated with the LWMS with moderate effect sizes. Sleeping hours and subjective work performance were also negatively correlated with the LWMS, but with weaker effect sizes. In contrast, work-related burnout showed a strong positive association with the LWMS. Similarly, subjective health problems showed a moderate positive effect. Further weak links were found for BMI and absenteeism. Alcohol and smoking consumption were not related to the LWMS. The LWMS showed a moderate correlation with mobbing self-labeling. Regarding the dichotomous variables, we found considerable deviation from a 50% base rate for turnover intention (14.6%), mobbing self-labeling (5.7%), and suicidal thoughts (3.3%), making the interpretation of AUCs more appropriate than point-biserial correlations (Babchishin & Helmus, 2016). Regarding AUCs, the LWMS showed moderate links with turnover intention (AUC = .66, p < .001, 95% CI [.62; .70]) and suicidal thoughts (AUC = .69, p < .001, 95% CI [.61; .77]), and a very strong link with self-labeled mobbing victim status (AUC = .87, p < .001, 95% CI [.82; .91]).
Table 4. Means, standard deviations, intercorrelations, and 95% confidence intervals of intercorrelations
Variables: 1. Sex; 2. Age; 3. LWMS; 4. WHO-5; 5. Work-related burnout; 6. Vigor; 7. Subjective physiological health problems; 8. Sleeping hours; 9. Alcohol; 10. Smoking; 11. BMI; 12. Suicidal thoughts; 13. Subjective work performance; 14. Turnover intention; 15. Absenteeism; 16. Mobbing self-labeling.
Notes. *p < .05, **p < .01, ***p < .001; higher values depict female, suicidal thoughts, turnover intention, and mobbing victim, respectively. Cronbach's α in the main diagonal.
Discussion
Our study aims were to replicate the factor structure of the recently developed 5-item LWMS (Steffgen et al., 2016) and to test for MI across subsamples that are frequently compared in the workplace mobbing literature (e.g., Einarsen & Skogstad, 1996; Mikkelsen & Einarsen, 2001; Ortega et al., 2009; Salin & Hoel, 2013). The one-factorial structure of the LWMS was replicated in an independent sample. Additionally, the evaluation of different MI models confirmed invariance across all compared groups.4 This corroborates that the LWMS is suitable for the frequently analyzed manifest subgroup comparisons. Notably, from a theoretical perspective, our results suggest that neither age, gender, nor the most frequent areas of occupation in Luxembourg represent important risk factors for workplace mobbing. Empirical evidence for gender (e.g., Niedhammer et al., 2007; Zapf, Escartín, Einarsen, Hoel, & Vartia, 2011) and age differences (e.g., Einarsen & Skogstad, 1996; Einarsen & Raknes, 1997; Hauge, Skogstad, & Einarsen, 2009) in the prevalence of workplace mobbing has hitherto been mixed. While part of the explanation for these inconclusive findings may lie in (a) the lack of a common method (self-labeling with or without definition vs. behavioral approach), (b) varying measures of workplace mobbing, or (c) possible country-specific effects (Nielsen et al., 2010), an additional explanation rests on the absence of MI testing for these measures. Even if the first three aspects were held constant, this latter criticism would still call results from manifest group comparisons into question. Therefore, although several studies found manifest occupational differences (e.g., Einarsen & Skogstad, 1996; Niedhammer et al., 2007), it cannot be ruled out that these differences reflect differences in the measurement properties of the instruments used. Within this study, however, we show that the LWMS is invariant across these subgroups and can therefore be used to study possible differences across these groups.
In order to further evaluate the criterion validity of the LWMS, theoretically meaningful correlations with measures of psychological health (i.e., well-being, burnout, vigor, suicidal thoughts), subjective physiological health problems, sleeping hours, alcohol and smoking consumption, BMI, various important organizational criteria (i.e., absenteeism, subjective work performance, turnover intention), and self-labeled mobbing victim status were explored. With the exception of alcohol and smoking consumption, all proposed psychological well-being and organizational criteria were meaningfully associated with the LWMS, further corroborating the criterion validity of the scale.
Limitations and Outlook
Since the data were collected via CATI, it remains unclear whether other data collection methods (e.g., paper-and-pencil or online surveys) would replicate the reported MI properties of the LWMS. Future research might also test MI for a wider range of less frequent occupational groups that could not be tested in this study due to sample size restrictions. Because of the cross-sectional design of the study, the correlations between the LWMS and the different criteria cannot be interpreted causally. Finally, all results are based exclusively on self-report data. Since the LWMS is a new instrument that has now passed a series of thorough psychometric tests, future studies should focus on divergent validity to further elucidate its construct validity. In summary, we think that due to its brevity, its meaningful criterion validity and generally good psychometric properties, and its robustness across language, gender, age, and occupational groups, the LWMS is a measure of workplace mobbing that is particularly attractive for various (large-scale) research and applied contexts. Hence, to aid research and applied purposes that might profit from normative comparisons, whole-sample percentile norms are reported in ESM 3.
Acknowledgments
This research was supported by a grant from the Luxembourg Chamber of Labor. The authors would like to thank Sylvain Hoffmann and David Büchel.
Electronic Supplementary Materials
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000483
ESM 1. Syntax (.pdf). LWMS syntax.
ESM 2. Table (.pdf). Age groups measurement invariance models: χ²-difference tests with Bonferroni corrections.
ESM 3. Table (.pdf). Percentile norms of the LWMS.
4 For the age groups, full scalar invariance was not confirmed, but the Bonferroni-corrected χ²-difference tests did not flag any of the items as non-invariant; the non-invariance indicated by ΔCFI could thus be due to random error. Nevertheless, we relaxed the two constraints in the model that had the highest influence on model misfit. Relaxing constraints without a strong theoretical or statistical justification bears the risk of capitalizing on chance. However, using the full scalar model to estimate the latent means and mean differences yielded very similar results (16–34 years: M = 1.81, SD = 0.46; 35–44 years: M = 1.86, SD = 0.52; 45–54 years: M = 1.77, SD = 0.48; 55+ years: M = 1.83, SD = 0.43; Δχ²(3) = 4.414, p = .220). Moreover, the demonstrated partial invariance with three out of five invariant indicators still allows for meaningful level comparisons (e.g., Steenkamp & Baumgartner, 1998).
References Agervold, M. (2007). Bullying at Work: A discussion of definitions and prevalence, based on an empirical study. Scandinavian Journal of Psychology, 48, 161–172. https://doi.org/10.1111/ j.1467-9450.2007.00585.x Aquino, K., & Thau, S. (2009). Workplace victimization: Aggression from the target’s perspective. Annual Review of Psychology, 60, 717–741. https://doi.org/10.1146/annurev.psych.60.110707. 163703 Babchishin, K. M., & Helmus, L.-M. (2016). The influence of base rates on correlations: An evaluation of proposed alternative effect sizes with real-world data. Behavior Research Methods, 48, 1021–1031. https://doi.org/10.3758/s13428-015-0627-7 Bowling, N. A., & Beehr, T. A. (2006). Workplace harassment from the victim’s perspective: A theoretical model and metaanalysis. Journal of Applied Psychology, 91, 998–1012. https://doi.org/10.1037/0021-9010.91.5.998 Breslau, N., Peterson, E. L., Schultz, L. R., Chilcoat, H. D., & Andreski, P. (1998). Major depression and stages of smoking. A longitudinal investigation. Archives of General Psychiatry, 55, 161–166. https://doi.org/10.1001/archpsyc.55.2.161 Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466. https://doi.org/10.1037/0033-2909.105.3.456 Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14, 464–504. https://doi.org/10.1080/10705510701301834 Cooper, M. L., Frone, M. R., Russell, M., & Mudar, P. (1995). Drinking to regulate positive and negative emotions: A motivational model of alcohol use. Journal of Personality and Social Psychology, 69, 990–1005. https://doi.org/10.1037/0022-3514.69.5.990 Cortina, L. M., & Magley, V. J. (2009). Patterns and profiles of response to incivility in the workplace. Journal of Occupational Health Psychology, 14, 272–288. https://doi.org/10.1037/a0014934 Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. https://doi.org/10.1037/h0040957 Deci, E. L., & Ryan, R. M. (2008). Facilitating optimal motivation and psychological well-being across life’s domains. Canadian Psychology, 49, 14–23. https://doi.org/10.1037/0708-5591.49.1.14 Einarsen, S., Hoel, H., & Notelaers, G. (2009). Measuring exposure to bullying and harassment at work: Validity, factor structure and psychometric properties of the Negative Acts Questionnaire-Revised. Work & Stress, 23, 24–44. https://doi.org/ 10.1080/02678370902815673 Einarsen, S., & Raknes, B. J. (1997). Harassment in the workplace and the victimization of men. Violence and Victims, 12, 247–263. Einarsen, S., & Skogstad, A. (1996). Bullying at work: Epidemiological findings in public and private organizations. European Journal of Work and Organizational Psychology, 5, 185–201. https://doi.org/10.1080/13594329608414854 Fernet, C., Austin, S., Trépanier, S.-G., & Dussault, M. (2013). How do job characteristics contribute to burnout? Exploring the distinct mediating roles of perceived autonomy, competence, and relatedness. European Journal of Work and Organizational Psychology, 22, 123–137. https://doi.org/10.1080/1359432X. 2011.632161 Finney, S. J., & DiStefano, C. (2013). Nonnormal and categorical data in structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling – a second course (2nd ed., pp. 439–492). 
Charlotte, NC: Information Age Publishing. Hansen, A. M., Hogh, A., Garde, A. H., & Persson, R. (2014). Workplace bullying and sleep difficulties: A 2-year follow-up
study. International Archives of Occupational and Environmental Health, 87, 285–294. https://doi.org/10.1007/s00420-013-0860-2 Hansen, A. M., Hogh, A., Persson, R., Karlson, B., Garde, A. H., & Ørbæk, P. (2006). Bullying at Work, Health Outcomes, and Physiological Stress Response. Journal of Psychosomatic Research, 60, 63–72. https://doi.org/10.1016/j.jpsychores.2005.06.078 Hauge, J. H., Skogstad, A., & Einarsen, S. (2009). Individual and situational predictors of workplace bullying: Why do perpetrators engage in the bullying of others? Work & Stress, 23, 349–358. https://doi.org/10.1080/02678370903395568 Hauge, J. H., Skogstad, A., & Einarsen, S. (2010). The relative impact of workplace bullying as a social stressor at work. Scandinavian Journal of Psychology, 51, 426–433. https://doi.org/10.1111/j.1467-9450.2010.00813.x Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. https://doi.org/10.1080/10705519909540118 International Labour Organization. (Eds.). (2012). International standard classification of occupations: ISCO-08. Vol. 1. Structure, group definitions and correspondence tables. Geneva, Switzerland: International Labour Office. Kivimäki, M., Head, J., Ferrie, J. E., Shipley, M. J., Brunner, E., Vahtera, J., & Marmot, M. G. (2006). Work stress, weight gain and weight loss: Evidence for bidirectional effects of job strain on body mass index in the Whitehall II study. International Journal of Obesity, 30, 982–987. https://doi.org/10.1038/sj.ijo.0803229 Kristensen, T. S., Borritz, M., Villadsen, E., & Christensen, K. B. (2005). The Copenhagen Burnout Inventory: A new tool for the assessment of burnout. Work & Stress, 19, 192–207. https://doi.org/10.1080/02678370500297720 Lazarus, R. S., & Folkman, S. (1984). Stress, appraisal and coping. New York, NY: Springer. Leymann, H. (1996). Handanleitung für den LIPT-Fragebogen [Leymann Inventory of Psychological Terror]. Tübingen, Germany: Deutsche Gesellschaft für Verhaltenstherapie Verlag. Little, T. D. (2013). Longitudinal structural equation modeling. New York, NY: Guilford Press. Little, T. D., Slegers, D. W., & Card, N. A. (2006). A non-arbitrary method of identifying and scaling latent variables in SEM and MACS models. Structural Equation Modeling, 13, 59–72. https://doi.org/10.1207/s15328007sem1301_3 Macht, M. (2008). How emotions affect eating: A five-way model. Appetite, 50, 1–11. https://doi.org/10.1016/j.appet.2007.07.002 Meade, A. W., & Bauer, D. J. (2007). Power and precision in confirmatory factor analytic tests of measurement invariance. Structural Equation Modeling, 14, 611–635. https://doi.org/10.1080/10705510701575461 Mikkelsen, E. G., & Einarsen, S. (2001). Bullying in Danish work-life: Prevalence and health correlates. European Journal of Work and Organizational Psychology, 10, 393–413. https://doi.org/10.1080/13594320143000816 Niedhammer, I., David, S., & Degioanni, S. (2007). Economic activities and occupations at high risk for workplace bullying: Results from a large-scale cross-sectional survey in the general working population in France. International Archives of Occupational and Environmental Health, 80, 346–353.
https://doi.org/10.1007/s00420-006-0139-y
Nielsen, M. B., & Einarsen, S. (2012). Outcomes of exposure to workplace bullying: A meta-analytic review. Work & Stress, 26, 309–332. https://doi.org/10.1080/02678373.2012.734709 Nielsen, M. B., Einarsen, S., Notelaers, G., & Nielsen, G. H. (2016). Does exposure to bullying behaviors at the workplace contribute to later suicidal ideation? A three-wave longitudinal study. Scandinavian Journal of Work, Environment & Health, 42, 246–250. https://doi.org/10.5271/sjweh.3554 Nielsen, M. B., Matthiesen, S. B., & Einarsen, S. (2010). The impact of methodological moderators on prevalence rates of workplace bullying. A meta-analysis. Journal of Occupational and Organizational Psychology, 83, 955–979. https://doi.org/ 10.1348/096317909X481256 Ortega, A., Christensen, K. B., Hogh, A., Rugulies, R., & Borg, V. (2011). One-year prospective study on the effect of workplace bullying on long-term sickness absence. Journal of Nursing Management, 19, 752–759. https://doi.org/10.1111/j.13652834.2010.01179.x Ortega, A., Høgh, A., Pejtersen, J. H., & Olsen, O. (2009). Prevalence of workplace bullying and risk groups: A representative population study. International Archives of Occupational and Environmental Health, 82, 417–426. https://doi.org/10.1007/ s00420-008-0339-8 Ortiz, Y., & Samaniego, C. (1995). A test of Steers and Rhodes’ Model of employees’ absence. In A. González, A. de la Torre, & y. J. de Elena (Eds.), Work and organizational psychology. Human resources management and new technologies (pp. 237– 246). Salamanca, Spain: Eudema. Parzefall, M.-R., & Salin, D. M. (2010). Perceptions of and reactions to workplace bullying: A social exchange perspective. Human Relations, 63, 761–780. https://doi.org/10.1177/ 0018726709345043 Quine, L. (1999). Workplace bullying in NHS community trust: Staff questionnaire survey. British Medical Journal, 318, 228–232. https://doi.org/doi.org/10.1136/bmj.318.7178.228 Rice, M. E., & Harris, G. T. (2005). Comparing effect sizes in follow-up studies: ROC area, Cohen’s d, and r. Law and Human Behavior, 29, 615–620. https://doi.org/10.1007/s10979-0056832-7 Richer, S. F., Blanchard, C., & Vallerand, R. J. (2002). A motivational model of work turnover. Journal of Applied Social Psychology, 32, 2089–2113. https://doi.org/10.1111/j.15591816.2002.tb02065.x Richman, J. A., Rospenda, K. M., Flaherty, J. A., & Freels, S. (2001). Workplace harassment, active coping, and alcoholrelated outcomes. Journal of Substance Abuse, 13, 347–366. https://doi.org/10.1016/s0899-3289(01)00079-7 R Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Salin, D. (2003). The significance of gender in the prevalence, forms and perceptions of bullying. Nordiske Organisasjonsstudier, 5, 30–50. Retrieved from http://hdl.handle.net/10227/ 290 Salin, D., & Hoel, H. (2011). Organisational causes of workplace bullying. In S. Einarsen, H. Hoel, D. Zapf, & C. L. Cooper (Eds.), Bullying and harassment in the workplace. Developments in theory, research, and practice (2nd ed., pp. 227–243). Boca Raton, FL: CRC Press. Salin, D., & Hoel, H. (2013). Workplace bullying as a gendered phenomenon. Journal of Managerial Psychology, 28, 235–251. https://doi.org/10.1108/02683941311321187 Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment structure analysis. Psychometrika, 66, 507–514. https://doi.org/10.1007/BF02296192
Schaufeli, W. B., Bakker, A. B., & Salanova, M. (2006). The measurement of work engagement with a short questionnaire a cross-national study. Educational and Psychological Measurement, 66, 701–716. https://doi.org/10.1177/0013164405282471 Simons, S. R., Stark, R. B., & DeMarco, R. F. (2011). A new, fouritem instrument to measure workplace bullying. Research in Nursing & Health, 34, 132–140. https://doi.org/10.1002/ nur.20422 Sischka, P., & Steffgen, G. (2016). Quality of Work-Index. 2. Forschungsbericht zur Weiterentwicklung des Arbeitsqualitätsindexes in Luxembourg [2nd Research report for the enhancement of the quality of work index in Luxembourg] (Working Paper). Luxembourg: Universität Luxemburg. Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90. https://doi.org/ 10.1086/209528 Steers, R. M., & Rhodes, S. R. (1984). Knowledge and speculation about absenteeism. In P. S. Goodman & R. S. Atkin (Eds.), Absenteeism: New approaches to understanding, measuring and managing absence (pp. 229–275). San Francisco, CA: Jossey-Bass. Steinmetz, H. (2013). Analyzing observed composite differences across groups. Is partial measurement invariance enough? Methodology, 9, 1–12. https://doi.org/10.1027/1614-2241/ a000049 Steffgen, G., Sischka, P., Schmidt, A. F., Kohl, D., & Happ, C. (2016). The Luxembourg Workplace Mobbing Scale. Psychometric properties of a short instrument in three different languages. European Journal of Psychological Assessment. Advance online publication. https://doi.org/10.1027/10155759/a000381 Swets, J. A. (1986). Indices of discrimination or diagnostic accuracy: Their ROCs and implied models. Psychological Bulletin, 99, 100–117. https://doi.org/10.1037/0033-2909.99.1.100 Topp, C. W., Østergaard, S. D., Søndergaard, S., & Bech, P. (2015). The WHO-5 Well-Being Index: A systematic review of the literature. Psychotherapy and Psychosomatics, 84, 167–176. https://doi.org/10.1159/000376585 Trépanier, S.-G., Fernet, C., & Austin, S. (2015). A longitudinal investigation of workplace bullying, basic need satisfaction, and employee functioning. Journal of Occupational Health Psychology, 20, 105–116. https://doi.org/10.1037/a0037726 Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. https://doi.org/10.1177/ 109442810031002 Van den Broeck, A., Vansteenkiste, M., De Witte, H., & Lens, W. (2008). Explaining the relationships between job characteristics, burnout, and engagement: The role of basic psychological need satisfaction. Work & Stress, 22, 277–294. https://doi.org/ 10.1080/02678370802393672 Van den Broeck, A., Vansteenkiste, M., De Witte, H., Soenens, B., & Lens, W. (2010). Capturing autonomy, competence, and relatedness at work: Construction and initial validation of the Workrelated Basic Need Satisfaction scale. Journal of Occupational and Organizational Psychology, 83, 981–1002. https://doi.org/ 10.1348/096317909X481382 van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9, 486–492. https://doi.org/10.1080/ 17405629.2012.686740 Van Orden, K. A., Wite, T. K., Cukrowciz, K. C., Braithwaite, S. R., Selby, E. A., & Joiner, T. E. (2010). The interpersonal theory of
suicide. Psychological Review, 117, 575–600. https://doi.org/ 10.1037/a0018697 Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association. World Health Organization. Regional Office for Europe. (1998). Well-Being measures in primary health care: The DiabCare Project. Consensus meeting, Stockholm, Sweden. Zapf, D., Escartín, J., Einarsen, S., Hoel, H., & Vartia, M. (2011). Empirical findings on prevalence and risk groups of bullying in the workplace. In S. Einarsen, H. Hoel, D. Zapf, & C. L. Cooper (Eds.), Bullying and harassment in the workplace. Developments in theory, research, and practice (2nd ed., pp. 75–105). Boca Raton, FL: CRC Press. Zhao, H., Wayne, S. J., Glibkowski, B. C., & Bravo, J. (2007). The impact of psychological contract breach on work-related
outcomes: A meta-analysis. Personnel Psychology, 60, 647–680. https://doi.org/10.3724/sp.j.1042.2012.01296

Received January 30, 2017
Revision received February 19, 2018
Accepted February 22, 2018
Published online September 18, 2018

EJPA Section/Category: Short Scales

Philipp Sischka
Integrative Research Unit Social and Individual Development (INSIDE)
Health Promotion and Aggression Prevention
University of Luxembourg
Maison des Sciences Humaines
11, Porte des Sciences
4366 Esch-sur-Alzette
Luxembourg
philipp.sischka@uni.lu
Original Article
Can Serious Games Assess Decision-Making Biases? Comparing Gaming Performance, Questionnaires, and Interviews Kyoungwon Seo, Hokyoung Ryu, and Jieun Kim Imagine X Lab, Hanyang University, Seoul, Republic of Korea
Abstract: The limitations of self-report questionnaires and interview methods for assessing individual differences in human cognitive biases have become increasingly apparent. These limitations have led to a renewed interest in alternative modes of assessment that capture both implicit and explicit aspects of human behavior (i.e., dual-process theory). Acknowledging this, the present study was conducted to develop and validate a serious game, "Don Quixote," for measuring two specific cognitive biases: the bandwagon effect and optimism bias. We hypothesized that the implicit and explicit game data would mirror the results from an interview and a questionnaire, respectively. To examine this hypothesis, participants (n = 135) played the serious game and completed a questionnaire and an interview in random order for cross-validation. The results demonstrated that the implicit game data (e.g., response time) were highly correlated with the interview data. In contrast, the explicit game data (e.g., game score) were comparable to the results from the questionnaire. These findings suggest that the serious game and the intrinsic nature of its underlying game mechanics (i.e., evoking instant responses under time pressure) are of importance for the further development of cognitive bias measures in both academia and practice. Keywords: assessment, serious game, dual-process theory, cognitive bias
Cognitive bias refers to a systematic pattern of deviations in judgment and decision-making resulting from a lack of appropriate information acquisition or a limited information-processing capacity (Haselton, Nettle, & Murray, 2005; Reece & Matthews, 1993). Such biases may enable faster decisions when timeliness is more valuable than accuracy; however, they sometimes introduce severe and systematic errors. The most well-known approach to assessing individual differences in human cognitive biases comprises self-report questionnaires and interviews (Ariely, 2008; Hilbert, 2012). However, these conventional methods have come under renewed scrutiny in the last decades. In a review of 19 questionnaire-interview comparison studies (Harris & Brown, 2010), researchers confirmed a discrepancy between self-report and interview outcomes. Dual-process theory provides a theoretical rationale for this measurement discrepancy by positing that two distinct mental processes underlie behavioral responses: the implicit process and the explicit process (Evans & Stanovich, 2013; Kahneman, 2011). An implicit process is an unintentional, effortless, uncontrollable, or unconscious
process that is assumed to yield our automatic default responses (Gawronski & Creighton, 2013). In comparison, an explicit process supports our controlled hypothetical thinking and is characterized by an intentional, effortful, controllable, or conscious process (Evans & Stanovich, 2013). In an interview, the interviewer takes a third-person perspective and is much more focused on the implicit, internalized, often unconscious process, which is not open to introspection (Furman & Flanagan, 1997). In contrast, a questionnaire is an explicit measure that evaluates people's analytic and controlled responses (Schaeffer, 2000). Correlational studies show that implicit and explicit measures each have causal connections with different aspects of behavior (Petty, Fazio, & Briñol, 2012; Wittenbrink & Schwarz, 2007). Explicit measures have been shown to correlate better with actual behaviors, whereas implicit measures have shown strength in incremental validity for behavior; that is, they explain variance in a behavior over and above what is explained by explicit measures (Richetin, Perugini, Prestwich, & O'Gorman, 2007). In this context, measuring both the implicit and
explicit aspects of cognitive bias is crucial for a comprehensive understanding of the phenomenon.
Rise of the Serious Game as a New Assessment Paradigm
Attempts at developing a new method to assess cognitive biases have been made recently (Jasper & Ortner, 2015). Serious game studies offer novel and interesting ways to comprehensively assess cognitive biases (Choliz, 2010; Peng, Liu, & Mou, 2008). Note that a serious game is a game designed for a primary purpose other than pure entertainment, such as learning or training (Cowley, Fantato, Jennett, Ruskov, & Ravaja, 2014). For instance, Van Herpen, Pieters, and Zeelenberg (2009) developed a shopping game and examined the player's response-time latency while shopping (e.g., time spent examining the shelf tags). Response time revealed the strength of the association between the player's implicit bias and the presented stimuli (see the Implicit Association Test in Greenwald, Nosek, & Banaji, 2003). Lejuez and colleagues (2002) examined a rather different aspect of the game. They developed a balloon-popping game and analyzed the score at which players stopped pumping up the balloon. If players stopped pumping up the balloon when the game score was high, their self-reported bias was accordingly higher. Such controlled action choices in response to visual elements during the game (e.g., the score) are more closely related to the explicit aspects of cognitive bias. In the Sirius research program, 4-year-long multidisciplinary studies were conducted to verify the effectiveness of games as a training tool for teaching about and mitigating cognitive biases (Bush, 2017). The researchers developed several game genres, including an adventure game, "MACBETH," a puzzle game, "CYCLES," a mystery game, "Missing," and a sci-fi game, "Heuristica," to investigate whether a game could be an effective mechanism for training adults to identify and mitigate their cognitive biases (Dunbar et al., 2013; Mullinix et al., 2013; Symborski et al., 2014). Other serious games, "Wasabi Waiter" and "Balloon Brigade," have shown that intuitive behaviors in games can be used to identify a player's systematic decision patterns, such as risk aversion, empathy, or responsiveness (Jacob, 2013), with the results being applied directly to the human resource division of a company for learning and the allocation of job positions. Despite the potential applications of serious games, previous studies have focused more on the effectiveness of games as a training tool, and few studies have been conducted to verify the content validity of serious game-based assessments. For the adoption of serious games for both training and assessment, it is crucial to ensure that the quality of an assessment using a serious game equals (or outperforms) that of conventional assessment methods.

The Present Study
Two cognitive biases were investigated in this study: the bandwagon effect and optimism bias. When the bandwagon effect co-occurs with the optimism bias, people easily accept risky decisions without proper consideration and precaution. The bandwagon effect makes people easily accept an unproven but popular decision without proper consideration, due to sensitivity to majority or influential opinions (Bornstein & Emler, 2001). Majority opinion sensitivity is the desire to belong to the major social group (e.g., "It is important that others like the products and brands I buy"), and influential opinion sensitivity refers to how easily individuals are affected by influential people's opinions (e.g., "I often consult other people to help me choose the best alternative available from a product class"). The optimism bias leads people to exaggerate the perceived benefit and to underestimate the perceived risk of making a risky decision (Shepperd, Carroll, Grace, & Terry, 2002). Perceived benefit indicates the benefits that an individual would obtain from a given situation, and perceived risk indicates how risky an individual perceives the same situation to be. People with optimism bias believe that they are at less risk of experiencing a negative event compared to others; as such, they engage in risky behaviors and do not take precautionary measures for safety. A serious game called "Don Quixote" was designed to address and measure the bandwagon effect and optimism bias (see Figure 1). The Don Quixote game was developed in app form using four basic game elements (i.e., theme, challenge, reward, and progress; Flatla, Gutwin, Nacke, Bateman, & Mandryk, 2011). The main theme is the famous novel "Don Quixote" written by Miguel de Cervantes. The game player takes the role of Don Quixote, who is obsessed with chivalrous ideals and decides to bring justice to the world. Several fictive players join the game alongside the player. The main goal is to collect as many points as possible in two stages. Stage 1 was designed to measure the bandwagon effect (i.e., the rate of uptake of beliefs) through a set of "True or False Quizzes." Stage 2, "Scooping Water," is similar to a balloon-popping game and was designed to assess the effects of optimism bias under risky situations. During the game, each player's implicit and explicit behavior data were collected. A full set of the experimental data is given in the Electronic Supplementary Material, ESM 1. To summarize, the present study aims at developing and validating a serious game to comprehensively assess both the implicit and explicit aspects of cognitive bias.
Figure 1. Screenshots of the serious game "Don Quixote." (A) Stage 1 "True or False Quizzes" for the bandwagon effect. (B) Stage 2 "Scooping Water" for the optimism bias.
We hypothesized that (1) the implicit game data (e.g., response time) would correlate with the interview outcomes and (2) the explicit game data (i.e., responses to visual elements in the game, such as the score or the number of players) would correlate with the results from the questionnaire. To investigate this, a comparative study of gaming performance, questionnaires, and interviews was conducted. Both implicit and explicit game data were compared to the questionnaire and interview results to verify the validity of the serious game as an assessment of cognitive biases.
Materials and Methods

Experiment 1: Bandwagon Effect

Participants
The participants were 135 college students (65 men and 70 women) between the ages of 22 and 28 years (M = 25.55; SD = 2.11). They were recruited from two departments – Industrial Engineering and Applied Systems – of Hanyang University. The participants were taking human-computer interaction classes and were selected randomly to take part in the study as a course assignment. All participants received course credit for taking part in the experiment. Upon completion, all participants were given a gift voucher as a reward.

Procedure
All participants were introduced to the experimental procedure and completed consent forms. Two experiments were conducted with the three measures in a random order: (i) a self-report questionnaire (about 10 min), (ii) the Don Quixote serious game (about 10 min), and (iii) an interview (about 40 min), with a 5-min rest pause between the measures. The questionnaire was administered by one researcher with more than 10 years of experience in psychology. For the interview sessions, two independent interviewers with 13 and 17 years of experience in psychometrics were recruited. The Don Quixote game contains two stages: Stage 1, True or False Quizzes (bandwagon effect), and Stage 2, Scooping Water (optimism bias). The mechanics of the game (e.g., time pressure) motivated participants to focus solely on the game with no distractions. The game data (e.g., clicks, response times) were logged automatically during game play.
Self-Report Questionnaire
A self-report questionnaire for the bandwagon effect was administered, consisting of 12 items rated on a 7-point scale ranging from 1 (= not at all) to 7 (= very much so) (Bearden, Netemeyer, & Teel, 1989). Two bandwagon effect variables, majority opinion sensitivity and influential opinion sensitivity, were assessed. The internal consistency was 0.85 (Cronbach's α).

Game: True or False Quizzes (Stage 1)
A game stage for the bandwagon effect was designed as shown in Table A1 of the Appendix. The stage consists of true-or-false quizzes with 10 trivia questions; the player has to solve each question within 10 s, and a correct answer earns 1,000 points. We intentionally inserted several fictive players into the game and examined how the majority opinion of the fictive players (the number of players on each side) and the influential opinion (the top-scoring player) affected the player's answers. The explicit game data were the "number of fictive players involved in the majority opinion when changing answer" and the "number of changes due to the influential opinion." The implicit game data were the "time taken to change answer via the majority opinion" and the "time taken to change answer via influential opinion." The retest reliability was 0.82 (Pearson's r).

Interview
First, one interviewer carried out a semi-structured interview to examine a participant's bandwagon effect with various hypothetical situations. Questions included, for example, "How much are you concerned about others' preferences when dining out?" Using these projected hypothetical scenarios, the interviewer rated the level of bandwagon effect on a 7-point scale ranging from 1 (= not at all) to 7 (= very much so). All interviews were video-recorded. The next day, the other interviewer visited the experimental room and independently rated the level of bandwagon effect of each participant on the same 7-point scale from the video recording obtained during the experiment. The inter-rater agreement was 0.77 (Cohen's κ).
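The reliability statistics reported above (Cronbach's α for the questionnaire, Cohen's κ for the two interviewers) can be reproduced from raw data with a few lines of code. The sketch below is only an illustration under assumed inputs: the response matrix and rater vectors are hypothetical placeholders, not the authors' dataset, and the κ shown is the unweighted coefficient.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a participants x items response matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the sum score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Illustrative data: 135 participants x 12 bandwagon items on a 1-7 scale
rng = np.random.default_rng(0)
items = rng.integers(1, 8, size=(135, 12))
print("Cronbach's alpha:", round(cronbach_alpha(items), 2))

# Inter-rater agreement: two raters' 7-point interview ratings (placeholder values)
rater1 = rng.integers(1, 8, size=135)
rater2 = rater1.copy()
rater2[:20] = rng.integers(1, 8, size=20)       # introduce some disagreement
print("Cohen's kappa:", round(cohen_kappa_score(rater1, rater2), 2))
```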
Experiment 2: Optimism Bias

Self-Report Questionnaire
Optimism bias was surveyed using eight items rated on a 5-point scale ranging from 1 (= not at all) to 5 (= very much so) (Weber, Blais, & Betz, 2002). Two optimism bias variables, perceived benefit and perceived risk, were assessed by the questionnaire. The internal consistency was 0.88 (Cronbach's α).

Game: Scooping Water (Stage 2)
The Scooping Water game consists of five rounds for assessing the optimism bias. The goal is to fill a 20 L pot with as much water as possible, and the player accumulates points as the water level rises. In the game, the amount of each scoop was randomized from 1 L to 5 L. If the water exceeds 20 L, the pot cracks and the player loses all accumulated points. There is thus a trade-off between the water level and the points to be collected: the point reward rises with the water level, and so does the risk (scoreboard: 1 L = 100 points, 15 L = 1,500 points, and 20 L = 20,000 points). More details are given in Table A2 of the Appendix. The player's decision-making patterns with regard to optimism bias were examined to assess how perceived benefit and perceived risk in each situation affected the player's decisions. The implicit game data were the "time taken to decide to do further scooping" and the "time taken to decide to do no more scooping." The explicit game data were the "water level when the participant stopped scooping" and the "number of decision changes via a confirmation check." The retest reliability was 0.73 (Pearson's r).

Interview
The process of the interview was the same as in Experiment 1. The first interviewer carried out a semi-structured interview with hypothetical questions, such as "How much do you prefer to invest for high risk and high return in China's stock-exchange market?" Interviewers rated the level of optimism bias on a 5-point scale ranging from 1 (= not at all) to 5 (= very much so). The inter-rater agreement was 0.74 (Cohen's κ).

Analysis
In order to examine the validity of the game as an evaluation method for cognitive biases, Pearson correlation analyses were run using a statistical software package (SPSS Statistics 21). In both experiments, a comparison was made between the results of the questionnaire, the game, and the interview. The overall descriptive and correlation results are provided in Tables A3 and A4 of the Appendix.
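As a minimal illustration of the correlation analysis described above (the authors used SPSS Statistics 21), the following sketch computes pairwise Pearson correlations and p values between per-participant scores; the variable names and values are placeholders, not the study's data.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-participant scores (placeholder values only)
df = pd.DataFrame({
    "majority_sensitivity":  [3.1, 4.0, 2.5, 5.2, 3.8],
    "time_to_change_answer": [6.2, 4.1, 7.5, 3.0, 5.6],
    "interview_bandwagon":   [4.0, 5.0, 3.0, 6.0, 4.5],
})

# Pearson r and p value for every pair of measures
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r, p = stats.pearsonr(df[a], df[b])
        print(f"{a} vs {b}: r = {r:.2f}, p = {p:.3f}")
```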
Results

Results of Experiment 1: Bandwagon Effect

Table 1 shows that the game data selectively corresponded with the bandwagon effect assessed by either the self-report questionnaire or the interview. The implicit game data, such as the "time taken to change answer via the majority opinion" and the "time taken to change answer via influential opinion," were significantly associated with the results from the interview. Participants with a strong bandwagon effect in the interview took less time to change their answer via the majority or influential opinion. In the case of the explicit game data, the "number of fictive players involved in the majority opinion when changing answer" was negatively correlated with majority opinion sensitivity from the questionnaire, and the "number of changes due to the influential opinion" was positively related to influential opinion sensitivity. Participants with strong majority opinion sensitivity changed their answer when a smaller number of fictive players was involved in the majority opinion. In addition, participants with strong influential opinion sensitivity changed their answers more often in response to the influential player. There was no association between the results from the questionnaire and the interview.
Results of Experiment 2: Optimism Bias

Table 2 also shows that there was no association between the questionnaire and the interview.
Table 1. Comparison between the questionnaire, the game, and the interview to assess the bandwagon effect (n = 135)

Measures | Majority opinion sensitivity (questionnaire) | Influential opinion sensitivity (questionnaire) | Bandwagon effect interview
Questionnaire: Majority opinion sensitivity | 1.00 | |
Questionnaire: Influential opinion sensitivity | 0.24 | 1.00 |
Game data (implicit): Time taken to change answer via the majority opinion | 0.13 | 0.10 | 0.91*
Game data (implicit): Time taken to change answer via influential opinion | 0.23 | 0.02 | 0.85*
Game data (explicit): Number of fictive players involved in the majority opinion when changing answer | 0.66* | 0.16 | 0.39
Game data (explicit): Number of changes due to the influential opinion | 0.24 | 0.68* | 0.30
Interview: Bandwagon effect interview | 0.14 | 0.05 | 1.00

Note. *p < 0.01. The significant correlations are shown in bold.
Table 2. Comparison between the questionnaire, the game, and the interview to assess the optimism bias (n = 135)

Measures | Perceived benefit (questionnaire) | Perceived risk (questionnaire) | Optimism bias interview
Questionnaire: Perceived benefit | 1.00 | |
Questionnaire: Perceived risk | 0.81* | 1.00 |
Game data (implicit): Time taken to decide to do further scooping | 0.03 | 0.01 | 0.65*
Game data (implicit): Time taken to decide to do no more scooping | 0.17 | 0.19 | 0.76*
Game data (explicit): Water level when the participant stopped scooping | 0.78* | 0.64* | 0.08
Game data (explicit): Number of decision changes via a confirmation check | 0.15 | 0.03 | 0.07
Interview: Optimism bias interview | 0.07 | 0.12 | 1.00

Note. *p < 0.01. The significant correlations are shown in bold.
Furthermore, the implicit game data, such as the "time taken to decide to do further scooping" and the "time taken to decide to do no more scooping," were significantly associated with the results from the interview. Participants with a strong optimism bias in the interview took less time to decide to do further scooping and more time to decide to do no more scooping. The explicit game data of the "water level when the participant stopped scooping" were significantly associated with perceived benefit and perceived risk from the questionnaire. When the perceived benefit was high, participants scooped up water to a high water level; conversely, when the perceived risk was high, participants stopped scooping even at a low water level. Another type of explicit game data, the "number of decision changes via a confirmation check," showed no relationship with the conventional methods. The
two measures in the questionnaire were negatively correlated with one another, meaning that the perceived benefit of a risky activity and the assessment of the riskiness of the same situation are inversely related.
Discussion and Conclusions

Discrepancy of Self-Reported Outcome With Interview

Tables 1 and 2 show that there was a discrepancy between the self-report questionnaire and the interview. Dual-process theory can provide a theoretical rationale for this measurement discrepancy by dividing the realm of mental
processes into two general categories depending on whether they operate automatically or in a controlled fashion (i.e., an implicit and an explicit process; Evans & Stanovich, 2013; Kahneman, 2011). Waehrens, Bliddal, Danneskiold-Samsøe, Lund, and Fisher (2012) showed that a self-report questionnaire, an expert interview, and behavior observations yielded rather different results because people respond differently according to the characteristics of each measure. While responding to the self-report questionnaire, participants carefully processed the questions before answering, using their explicit process (Schaeffer, 2000). On the contrary, the interview evoked an implicit process; as such, participants were more likely to give immediate and automatic responses (Kahneman, 2011). The correlation between implicit and explicit measures varies widely across studies (Ajzen, 2001; Hofmann, Gawronski, Gschwendner, Le, & Schmitt, 2005; Nosek, 2005). In our serious game, the player's behavior data could be used to comprehensively assess both implicit and explicit processes.
Implicit and Explicit Game Data for Evaluating Cognitive Biases In the serious game, the implicit game data represent automatic and unconscious behaviors, such as the time taken to make a decision. The implicit behavior sufficiently accounted for the player’s cognitive biases in a similar manner to the interview. In the bandwagon effect stage, if players took less time to change their answers due to the opinion of the majority, their interviewed bandwagon effect was higher. Similarly, if players took less time to change their answers due to the top-scoring player’s choice, their interviewed bandwagon effect was also higher. In the optimism bias stage, if players took less time to decide to scoop up the water, their optimism bias was higher, whereas if players took more time to decide to stop scooping up the water (i.e., hesitate to stop), their optimism bias was also higher. These high correlations are in line with previous studies which have supported the existence of the links between an individual’s response times and their implicit process (Greenwald et al., 2003; Van Herpen et al., 2009). The serious game and the underlying mechanics (i.e., evoking instant responses under time pressure) may contribute to these high correlations (Bush, 2017). In essence, the current study demonstrated that the implicit response time from the serious game could be applicable to revealing the implicit aspect of an individual’s cognitive biases. Unlike the implicit game data, the explicit game data indicate controlled and conscious decision-making outcomes in accordance with the visual elements during the
game, which is similar to the self-report questionnaire results (Asendorpf, Banse, & Mücke, 2002; Dovidio, Kawakami, Johnson, Johnson, & Howard, 1997). Note that the explicit game data represent the number of visual elements when changing decisions, count of decision changes, and total number of clicks. In the bandwagon effect stage, if players changed their answers when a smaller number of fictive players was on the opposite side (i.e., visual elements), their self-reported bandwagon effect was higher accordingly. In addition, when players changed their answers more often due to the top-scoring influential player’s choice (i.e., count of decision changes), their bandwagon effect was higher. In the optimism bias stage, when the current water level was close to the limit (e.g., 19 L; maximum = 20 L), and the players decided to continue to scoop up water, their optimism bias levels were very high. In this regard, our serious game, Don Quixote, might be of value for interpreting both implicit and explicit behaviors, that is, the dual-process of decision-making. Evans and Stanovich (2013) proposed an integrative way to understand both implicit and explicit outcomes by considering the role of working memory. For example, decisions made about problematic behaviors were better predicted by explicit measures when conscious control resources from working memory were available, but were better predicted by implicit measures when control resources had been experimentally depleted (Friese, Hofmann, & Wänke, 2008; see Experiment 2 in Gibson, 2008). The integrative understanding of both implicit and explicit measures can support a holistic evaluation of the manifestation of cognitive biases.
Implications and Future Research

Multiple aspects of gaming performance (e.g., response times, decision-making patterns, reactions to visual elements) can be employed to interpret an individual's psychological biases. From the perspective of assessment, having a comprehensive understanding of implicit and explicit behavior data from games can expand our knowledge about dual-process mechanisms in various decision-making processes, such as racial prejudice in presidential elections (Payne et al., 2010), the relationship of self-esteem with depression and loneliness (Creemers, Scholte, Engels, Prinstein, & Wiers, 2012), and the role of motivation in health-related behaviors (Keatley, Clarke, & Hagger, 2012). Don Quixote is a specific genre of serious game that deals only with cognitive biases. Additional replications in other research areas would help to establish the generalizability of the game-based assessment method. It is hoped that by applying this assessment method to other conditions and contexts, researchers can gain the insights needed to make their understanding more fruitful.
Acknowledgments
This research was supported by the MSIT (Ministry of Science, ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-20182017-0-01637) supervised by the IITP (Institute for Information & Communications Technology Promotion).

Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000485
ESM 1. Experimental data (.xlsx)
References Ajzen, I. (2001). Nature and operation of attitudes. Annual Review of Psychology, 52, 27–58. https://doi.org/10.1146/annurev. psych.52.1.27 Ariely, D. (2008). Predictably irrational. New York, NY: HarperCollins Press. Asendorpf, J. B., Banse, R., & Mücke, D. (2002). Double dissociation between implicit and explicit personality self-concept: The case of shy behavior. Journal of Personality and Social Psychology, 83, 380–393. https://doi.org/10.1037/00223514.83.2.380 Bearden, W. O., Netemeyer, R. G., & Teel, J. E. (1989). Measurement of consumer susceptibility to interpersonal influence. Journal of Consumer Research, 15, 473–481. https://doi.org/ 10.1086/209186 Bornstein, B. H., & Emler, A. C. (2001). Rationality in medical decision making: A review of the literature on doctors’ decisionmaking biases. Journal of Evaluation in Clinical Practice, 7, 97– 107. https://doi.org/10.1046/j.1365-2753.2001.00284.x Bush, R. M. (2017). Serious play: An introduction to the Sirius Research Program. Games and Culture, 12, 227–232. https:// doi.org/10.1177/1555412016675728 Choliz, M. (2010). Cognitive biases and decision making in gambling. Psychological Reports, 107, 15–24. https://doi.org/ 10.2466/02.09.18.22.PR0.107.4.15-24 Cowley, B., Fantato, M., Jennett, C., Ruskov, M., & Ravaja, N. (2014). Learning when serious: Psychophysiological evaluation of a technology-enhanced learning game. Educational Technology & Society, 17, 3–16. Retrieved from https://search. proquest.com/docview/1502989081?accountid=11283 Creemers, D. H., Scholte, R. H., Engels, R. C., Prinstein, M. J., & Wiers, R. W. (2012). Implicit and explicit self-esteem as concurrent predictors of suicidal ideation, depressive symptoms, and loneliness. Journal of Behavior Therapy and Experimental Psychiatry, 43, 638–646. https://doi.org/10.1016/j. jbtep.2011.09.006 Dovidio, J. F., Kawakami, K., Johnson, C., Johnson, B., & Howard, A. (1997). On the nature of prejudice: Automatic and controlled processes. Journal of Experimental Social Psychology, 33, 510– 540. https://doi.org/10.1006/jesp.1997.1331 Dunbar, N. E., Wilson, S. N., Adame, B. J., Elizondo, J., Jensen, M. L., Miller, C. H., . . . Burgoon, J. K. (2013). MACBETH: Development of a training game for the mitigation of cognitive bias. International Journal of Game-Based Learning, 3, 7–26. https:// doi.org/10.4018/ijgbl.2013100102
Evans, J. S. B., & Stanovich, K. E. (2013). Dual-process theories of higher cognition: Advancing the debate. Perspectives on Psychological Science, 8, 223–241. https://doi.org/10.1177/ 1745691612460685 Flatla, D. R., Gutwin, C., Nacke, L. E., Bateman, S., & Mandryk, R. L. (2011). Calibration games: Making calibration tasks enjoyable by adding motivating game elements. In J. Pierce, M. Agrawala, & S. Klemmer (Eds.), Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (pp. 403–412). New York, NY: ACM. Friese, M., Hofmann, W., & Wänke, M. (2008). When impulses take over: Moderated predictive validity of explicit and implicit attitude measures in predicting food choice and consumption behaviour. The British Journal of Social Psychology, 47, 397– 419. https://doi.org/10.1348/014466607X241540 Furman, W., & Flanagan, A. S. (1997). The influence of earlier relationships on marriage: An attachment perspective. Hoboken, NJ: Wiley. Gawronski, B., & Creighton, L. A. (2013). Dual-process theories. In E. C. Donal (Ed.), The Oxford handbook of social cognition (pp. 282–312). Oxford, UK: Oxford University Press. Gibson, B. (2008). Can evaluative conditioning change attitudes toward mature brands? New evidence from the implicit association test. Journal of Consumer Research, 35, 178–188. https://doi.org/10.1086/527341 Greenwald, A. G., Nosek, B. A., & Banaji, M. R. (2003). Understanding and using the implicit association test: I. An improved scoring algorithm. Journal of Personality and Social Psychology, 85, 197–216. https://doi.org/10.1037/0022-3514.85.2.197 Harris, L. R., & Brown, G. T. L. (2010). Mixing interview and questionnaire methods: Practical problems in aligning data. Practical Assessment, Research & Evaluation, 15, e1–e19. Retrieved from http://hdl.handle.net/123456789/2867 Haselton, M. G., Nettle, D., & Murray, D. R. (2005). The evolution of cognitive bias. In M. B. David (Ed.), The handbook of evolutionary psychology (pp. 724–746). Hoboken, NJ: Wiley. Hilbert, M. (2012). Toward a synthesis of cognitive biases: How noisy information processing can bias human decision making. Psychological Bulletin, 138, 211–237. https://doi.org/10.1037/ a0025940 Hofmann, W., Gawronski, B., Gschwendner, T., Le, H., & Schmitt, M. (2005). A meta-analysis on the correlation between the implicit association test and explicit self-report measures. Personality and Social Psychology Bulletin, 31, 1369–1385. https://doi.org/10.1177/0146167205275613 Jacob, M. (2013). Want to work here? Play this game first!. Forbes,. Retrieved from http://www.forbes.com/sites/jacobmorgan/ 2013/12/17/want-to-work-here-play-this-game-first/ Jasper, F., & Ortner, T. M. (2015). The tendency to fall for distracting information while making judgments. European Journal of Psychological Assessment, 30, 193–207. https:// doi.org/10.1027/1015-5759/a000214 Kahneman, D. (2011). Thinking, fast and slow. London, UK: Macmillan Press. Keatley, D., Clarke, D. D., & Hagger, M. S. (2012). Investigating the predictive validity of implicit and explicit measures of motivation on condom use, physical activity, and healthy eating. Psychology & Health, 27, 550–569. https://doi.org/10.1080/ 08870446.2011.605451 Lejuez, C. W., Read, J. P., Kahler, C. W., Richards, J. B., Ramsey, S. E., Stuart, G. L., . . . Brown, R. A. (2002). Evaluation of a behavioral measure of risk taking: The Balloon Analogue Risk Task (BART). Journal of Experimental Psychology: Applied, 8, 75–84. 
https://doi.org/10.1037/1076-898X.8.2.75 Mullinix, G., Gray, O., Colado, J., Veinott, E., Leonard, J., Papautsky, E. L., . . . Todd, P. M. (2013). Heuristica: Designing a serious
game for improving decision making. In Games Innovation Conference 2013 IEEE International (pp. 250–255). IEEE. Nosek, B. A. (2005). Moderators of the relationship between implicit and explicit evaluation. Journal of Experimental Psychology: General, 134, 565–584. https://doi.org/10.1037/00963445.134.4.565 Payne, B. K., Krosnick, J. A., Pasek, J., Lelkes, Y., Akhtar, O., & Tompson, T. (2010). Implicit and explicit prejudice in the 2008 American presidential election. Journal of Experimental Social Psychology, 46, 367–374. https://doi.org/10.1016/ j.jesp.2009.11.001 Peng, W., Liu, M., & Mou, Y. (2008). Do aggressive people play violent computer games in a more aggressive way? Individual difference and idiosyncratic game-playing experience. CyberPsychology & Behavior, 11, 157–161. https://doi.org/ 10.1089/cpb.2007.0026 Petty, R. E., Fazio, R. H., & Briñol, P. (Eds.). (2012). Attitudes: Insights from the new implicit measures. New York, NY: Psychology Press. Reece, W., & Matthews, L. (1993). Evidence and uncertainty in subjective prediction: Influences on optimistic judgment. Psychological Reports, 72, 435–439. https://doi.org/10.2466/ pr0.1993.72.2.435 Richetin, J., Perugini, M., Prestwich, A., & O’Gorman, R. (2007). The IAT as a predictor of food choice: The case of fruits versus snacks. International Journal of Psychology, 42, 166–173. https://doi.org/10.1080/00207590601067078 Schaeffer, N. C. (2000). Asking questions about threatening topics: A selective overview. In A. S. Arthur, A. B. Christine, B. J. Jared, S. K. Howard, & S. C. Virginia (Eds.), The science of self-report: Implications for research and practice (pp. 105– 122). New York, NY: Psychology Press. Shepperd, J. A., Carroll, P., Grace, J., & Terry, M. (2002). Exploring the causes of comparative optimism. Psychologica Belgica, 42, 65–98. Retrieved from http://citeseerx.ist.psu.edu/ viewdoc/summary?doi=10.1.1.507.9932 Symborski, C., Barton, M., Quinn, M., Morewedge, C., Kassam, K., Korris, J. H., & Hollywood, C. A. (2014). Missing: A serious game
for the mitigation of cognitive biases. Proceedings of the Interservice/Industry Training, Simulation, Education Conference, 14295, 1–13. Retrieved from http://www.iitsec.org/about/publicationsproceedings/documents/bp_trng_14295_paper.pdf Van Herpen, E., Pieters, R., & Zeelenberg, M. (2009). When demand accelerates demand: Trailing the bandwagon. Journal of Consumer Psychology, 19, 302–312. https://doi.org/10.1016/j.jcps.2009.01.001 Waehrens, E. E., Bliddal, H., Danneskiold-Samsøe, B., Lund, H., & Fisher, A. G. (2012). Differences between questionnaire- and interview-based measures of activities of daily living (ADL) ability and their association with observed ADL ability in women with rheumatoid arthritis, knee osteoarthritis, and fibromyalgia. Scandinavian Journal of Rheumatology, 41, 95–102. https://doi.org/10.3109/03009742.2011.632380 Weber, E. U., Blais, A. R., & Betz, N. E. (2002). A domain-specific risk-attitude scale: Measuring risk perceptions and risk behaviors. Journal of Behavioral Decision Making, 15, 263–290. https://doi.org/10.1002/bdm.414 Wittenbrink, B., & Schwarz, N. (Eds.). (2007). Implicit measures of attitudes. New York, NY: Guilford Press.

Received January 11, 2017
Revision received February 28, 2018
Accepted February 28, 2018
Published online September 18, 2018

Jieun Kim
Imagine X Lab
Hanyang University
701 ho, Multidisciplinary Lecture Hall
222 Wangsimni-ro, Seongdong-gu
Seoul 04763
Republic of Korea
jkim2@hanyang.ac.kr
Appendix

Detailed Information of the Serious Game "Don Quixote"

Table A1. Stage 1 – True or False Quizzes (bandwagon effect)

Description/Measure | Data collected
Once the player makes an initial answer, fictive players are progressively shown on the opposite side from the player. The movement of fictive players, who seem to be virtual competitors in the game, may influence the player's initial belief. In order to assess this majority opinion sensitivity, the "time taken to change answer via the majority opinion" (implicit data) and the "number of fictive players involved in the majority opinion when changing answer" (explicit data) were recorded. | Implicit, Explicit
The player's influential opinion sensitivity was measured when one fictive top-scoring player with a badge appears in the game. We measured the "time taken to change answer via influential opinion" (implicit data) and the "number of changes due to the influential opinion" (explicit data). | Implicit, Explicit
A tutorial for the stage is provided. | –
A set of trivia quiz questions is given. A player had 10 s to answer and could change their answers as many times as they want within the time limit. | –
The player has 10 s to give his/her answer and check the correct answer. Ten rounds were played in the same manner. | –
After completing the game, a pop-up window summarized the virtual points collected. Each question offers 1,000 points if the player responds correctly. | –
Table A2. Stage 2 – Scooping Water (optimism bias)

Description/Measure | Data collected
A tutorial for the stage is provided. | –
The players start to fill up the water pots. The players accumulate points each time they scoop up the water, but the amount of each scoop was randomized from 1 L to 5 L. Players were informed of the cumulated current water level in the water pot. | –
The player could collect more points by filling five pots with as much water as possible. Once the water level in a pot exceeded 20 L, the pot would break and all points obtained would be lost. Based on the player's perceived benefit and risk, the player decided whether to scoop up further or not. The player's behavior data were recorded as follows: the "time taken to decide to do further scooping" (implicit data), the "time taken to decide to do no more scooping" (implicit data), and the "water level when the participant stopped scooping" (explicit data). | Implicit, Explicit
At each decision-making point, a pop-up window asks the player whether (s)he will continue or cease to scoop up water by clicking on the "MORE" or "STOP" button, respectively. A confirmation question is asked after clicking on the "MORE" or "STOP" button once the water level exceeds 16 L: "Are you really sure about your choice?" This question was used to record the "number of decision changes via a confirmation check" (explicit data). | Explicit
After completing the game, a pop-up window summarized the virtual points collected. Each water pot offers points depending on the water level: 1 L = 100 points; ...; 15 L = 1,500 points; 16 L = 2,000 points; 17 L = 3,000 points; 18 L = 5,000 points; 19 L = 9,000 points; 20 L = 20,000 points. | –
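To make the risk–reward trade-off concrete, here is a minimal sketch of the Scooping Water round logic as described above (random 1–5 L scoops, points rising with the water level, everything lost above 20 L). Only the levels printed in the article's scoreboard are taken as given; the 2–14 L rewards are assumed to follow the same 100-points-per-litre pattern, and the stopping rule is a placeholder, not the game's actual implementation.

```python
import random

# Points by water level: 1 L and 15-20 L come from the article's scoreboard;
# levels 2-14 L are assumed to follow the 100-points-per-litre pattern.
REWARD = {**{lvl: lvl * 100 for lvl in range(1, 16)},
          16: 2_000, 17: 3_000, 18: 5_000, 19: 9_000, 20: 20_000}
LIMIT = 20  # pot capacity in litres; exceeding it breaks the pot

def play_round(stop_at: int, rng: random.Random) -> int:
    """Scoop 1-5 L at a time until the (placeholder) stopping level is reached or the pot overflows."""
    level = 0
    while level < stop_at:
        level += rng.randint(1, 5)   # each scoop adds a random 1-5 L
        if level > LIMIT:
            return 0                 # pot cracked: all points lost
    return REWARD[level]

rng = random.Random(1)
# A more "optimistic" stopping rule keeps scooping closer to the 20 L limit.
for stop_at in (12, 16, 19):
    points = [play_round(stop_at, rng) for _ in range(1_000)]
    print(f"stop at {stop_at:>2} L: mean points over 1,000 rounds = {sum(points) / len(points):,.0f}")
```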
Table A3. Overall descriptive statistics for the questionnaire, the game, and the interview

Measures | M ± SD | Min | Max
Experiment 1: Questionnaire
  Majority opinion sensitivity | 3.4 ± 1.2 | 2.00 | 6.25
  Influential opinion sensitivity | 4.2 ± 1.4 | 2.00 | 6.75
Experiment 1: Game data
  (Implicit) Time taken to change answer via the majority opinion | 5.7 ± 2.3 | 2.00 | 10.00
  (Implicit) Time taken to change answer via influential opinion | 6.1 ± 2.5 | 1.00 | 10.00
  (Explicit) Number of fictive players involved in the majority opinion when changing answer | 7.6 ± 1.6 | 4.00 | 10.00
  (Explicit) Number of changes due to the influential opinion | 2.0 ± 1.5 | 0.00 | 7.00
Experiment 1: Interview
  Bandwagon effect interview | 4.5 ± 1.6 | 1.00 | 7.00
Experiment 2: Questionnaire
  Perceived benefit | 2.6 ± 0.6 | 1.25 | 3.88
  Perceived risk | 3.5 ± 1.2 | 1.00 | 4.88
Experiment 2: Game data
  (Implicit) Time taken to decide to do further scooping | 6.7 ± 2.1 | 1.00 | 10.00
  (Implicit) Time taken to decide to do no more scooping | 4.1 ± 2.7 | 0.00 | 10.00
  (Explicit) Water level when the participant stopped scooping | 18.4 ± 0.9 | 16.00 | 20.00
  (Explicit) Number of decision changes via a confirmation check | 1.4 ± 1.2 | 0.00 | 4.00
Experiment 2: Interview
  Optimism bias interview | 3.3 ± 0.9 | 1.00 | 5.00
Table A4. Overall correlations of the questionnaire, the game, and the interview

Lower-triangular correlation matrix; columns follow the same order as the rows (E1-Q1 through E2-I).

E1-Q1: 1.000
E1-Q2: 0.126, 1.000
E1-G1: 0.151, 0.072, 1.000
E1-G2: 0.119, 0.132, 0.822*, 1.000
E1-G3: 0.659*, 0.114, 0.173, 0.207, 1.000
E1-G4: 0.110, 0.676*, 0.014, 0.035, 0.104, 1.000
E1-I: 0.138, 0.047, 0.906*, 0.849*, 0.163, 0.006, 1.000
E2-Q1: 0.019, 0.020, 0.043, 0.123, 0.021, 0.006, 0.067, 1.000
E2-Q2: 0.064, 0.115, 0.030, 0.108, 0.074, 0.090, 0.042, 0.807*, 1.000
E2-G1: 0.115, 0.117, 0.013, 0.112, 0.230, 0.005, 0.078, 0.024, 0.095, 1.000
E2-G2: 0.043, 0.147, 0.005, 0.054, 0.112, 0.104, 0.088, 0.021, 0.038, 0.733*, 1.000
E2-G3: 0.063, 0.048, 0.066, 0.172, 0.053, 0.083, 0.094, 0.781*, 0.642*, 0.014, 0.002, 1.000
E2-G4: 0.025, 0.013, 0.181, 0.109, 0.084, 0.055, 0.164, 0.074, 0.071, 0.016, 0.019, 0.004, 1.000
E2-I: 0.091, 0.124, 0.123, 0.055, 0.011, 0.093, 0.043, 0.060, 0.089, 0.653*, 0.763*, 0.041, 0.001, 1.000

Notes. Values are Pearson correlation coefficients, r. *p < .01. The significant correlations are shown in bold. E1-Q1: Majority opinion sensitivity questionnaire; E1-Q2: Influential opinion sensitivity questionnaire; E1-G1: (Implicit game data) Time taken to change answer via the majority opinion; E1-G2: (Implicit game data) Time taken to change answer via influential opinion; E1-G3: (Explicit game data) Number of fictive players involved in the majority opinion when changing answer; E1-G4: (Explicit game data) Number of changes due to the influential opinion; E1-I: Bandwagon effect interview; E2-Q1: Perceived benefit questionnaire; E2-Q2: Perceived risk questionnaire; E2-G1: (Implicit game data) Time taken to decide to do further scooping; E2-G2: (Implicit game data) Time taken to decide to do no more scooping; E2-G3: (Explicit game data) Water level when the participant stopped scooping; E2-G4: (Explicit game data) Number of decision changes via a confirmation check; E2-I: Optimism bias interview.
Original Article
Identification and Utility of a Short Form of the Pediatric Symptom Checklist-Youth Self-Report (PSC-17-Y)

Paul Bergmann (1), Cara Lucke (2), Theresa Nguyen (3), Michael Jellinek (4), and John Michael Murphy (2, 4)

(1) Research Department, Foresight Logic, Saint Paul, MN, USA
(2) Child Psychiatry Service, Massachusetts General Hospital, Boston, MA, USA
(3) Policy and Programs, Mental Health America, Alexandria, VA, USA
(4) Psychiatry-Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
Abstract: The Pediatric Symptom Checklist-Youth self-report (PSC-Y) is a 35-item measure of adolescent psychosocial functioning that uses the same items as the original parent report version of the PSC. Since a briefer (17-item) version of the parent PSC has been validated, this paper explored whether a subset of items could be used to create a brief form of the PSC-Y. Data were collected on more than 19,000 youth who completed the PSC-Y online as a self-screen offered by Mental Health America. Exploratory factor analyses (EFAs) were first conducted to identify and evaluate candidate solutions and their factor structures. Confirmatory factor analyses (CFAs) were then conducted to determine how well the data fit the candidate models. Tests of measurement invariance across gender were conducted on the selected solution. The EFAs and CFAs suggested that a three-factor short form with 17 items is a viable and most parsimonious solution and met criteria for scalar invariance across gender. Since the 17 items used on the parent PSC short form were close to the best fit found for any subsets of items on the PSC-Y, the same items used on the parent PSC-17 are recommended for the PSC-Y short form. Keywords: psychosocial, screening, PSC-Y, Pediatric Symptom Checklist
Mental health problems are common among children and adolescents, with approximately 13% estimated to have a problem that impairs functioning (Gardner et al., 1999; Jellinek et al., 1999; Semansky, Koyanagi, & Vandivort-Warren, 2003). Research continues to show that only about half of these children are identified (Sayal & Taylor, 2004; Sheldrick, Merchant, & Perrin, 2011; Simonian & Tarnowski, 2001) and only a fraction of them receive mental health services (O'Connell, Boat, & Warner, 2009). Since studies have shown that brief assessment tools can improve rates of identification of mental health problems in primary care (Cassidy & Jellinek, 1998; Hacker, Penfold, et al., 2014; Kolko, Campo, Kelleher, & Cheng, 2010), blue ribbon panels (Hogan, 2003; O'Connell et al., 2009), professional organizations (American Academy of Pediatrics, 2010a, 2010b), and insurers (Kuhlthau et al., 2011; Mann, 2013) have recommended that screening for psychosocial problems be a required component of routine well child care. Over the past two decades the Pediatric Symptom Checklist (PSC) has been one of the most frequently recommended measures (Jellinek, Murphy, & Burns,
1986; Jellinek et al., 1999; Semansky et al., 2003). The original PSC is a 35-item parent completed form. It has been translated into more than two dozen languages and has been widely used in both research and clinical settings. In order to make the PSC easier for pediatricians to use, Gardner et al. (1999) used a nationally representative sample of more than 20,000 pediatric outpatients to create a shorter form of the PSC based on an exploratory factor analysis of the PSC items (Gardner et al., 1999). This resulted in a 17-item version of the parent completed PSC form, which contained three subscales based on clusters of items as well as a total score based on all items. The PSC-17 demonstrated high internal consistency for the overall scale as well as the internalizing, externalizing, and attention subscales (α = 0.89, 0.79, 0.83, and 0.83, respectively; Gardner et al., 1999), and has also demonstrated concurrent validity when semi-structured psychiatric interviews (Schedule for Affective Disorders and Schizophrenia for School-Age Children – Present and Lifetime Version) and other validated questionnaires like the Children's Depression Inventory and Reynolds Child
Manifest Anxiety Scale were used as gold standards (Gardner, Lucas, Kolko, & Campo, 2007). The PSC-17 has been used in more than 40 published studies (Murphy, 2016) and has not only demonstrated risk rates that are comparable to the PSC-35 (Gardner et al., 1999), but it has also proven to be a useful tool due to its brevity and the fact that it has subscale scores that can assess three different domains of functioning. A recent paper confirmed the validity of the PSC-17 and its three-factor structure using confirmatory factor analysis (CFA) in a large (N = 80,000+) national sample of pediatric outpatients, aged 4–15 years (Murphy et al., 2016). In addition to the parent-report PSC-35 and parent-report PSC-17 forms, a youth self-report version of the 35-item PSC has been created and has also seen wide use. The PSC-Y-35 has the same 35 items as the PSC parent 35-item form, and is used to evaluate self-reported general psychosocial functioning among youth aged 11–17 years. The PSC-Y-35 has been validated against a number of standards, including the parent-reported PSC, teacher reports, and the youth's self-report on the Children's Depression Inventory and Reynolds Child Manifest Anxiety Scale (Pagano, Cassidy, Little, Murphy, & Jellinek, 2000). The same subscales created for the parent-report PSC have also been used in at least one published study of the PSC-Y (Montaño, Mahrer, Nager, Claudius, & Gold, 2011). Since the PSC-Y was created in 2000, it has been used in more than 20 studies, with most reporting rates of positive screening ranging from 4.2% to 20.0% in diverse samples from schools, outpatient pediatric practices, and other populations (Claudius, Mahrer, Nager, & Gold, 2012; Gall, Pagano, Desmond, Perrin, & Murphy, 2000; Hacker, Arsenault, et al., 2014; Kleinman et al., 2002; Montaño et al., 2011; Okuda et al., 2013; Pagano et al., 2000). Most recently, Mental Health America (MHA), the nation's oldest mental health advocacy organization, made the 35-item PSC-Y and a number of brief mental health screening instruments for adults available online for no cost (http://www.MHAScreening.org). Since a briefer (17-item) parent-report version of the PSC with three subscales has been validated and is widely used, the current paper explored whether the same or another subset of items could be used as a brief form of the youth self-report version of the PSC.
Methods

Data for this study were collected from youth who had filled out the 35-item PSC-Y online as a self-screen offered by Mental Health America (MHA). Respondents are asked to voluntarily answer any or all of a small number of background questions along with each PSC-Y form. For all questionnaires, feedback is provided instantaneously online
upon completion of the form with standard output that explains the total and subscale scores and provides links to additional information and referrals. MHA provided data to the authors for the analyses reported here. MHA does not maintain IP addresses or any other protected information from the screens so they are completely anonymous and untraceable. Since the analyses used only de-identified data, the study was approved as exempt by the IRB.
Sample

This study used 19,158 complete PSC-Y questionnaires filled out by youth aged 11–17 years through the MHA website from May 15, 2015 (when the 35-item version of the PSC-Y was first posted) to May 14, 2016. This sample of convenience and the public service screening project that produced it are described in another paper (Murphy et al., 2017). The MHA website provides information about the characteristics of the more than 1.5 million respondents who have completed questionnaires thus far as well as more details about the screening program (called B4Stage4). Demographic characteristics of the sample used in the current study are presented in Table 1. MHA inspected the data for completeness and anonymity prior to sending it to the analytic team, who performed all analyses.
Measures

The PSC-Y is a 35-item self-report measure designed to evaluate general psychosocial functioning among youth aged 11–17 years (Jellinek et al., 1988). Respondents are asked to indicate the frequency of each symptom on a 3-point Likert scale with the options 0 = never, 1 = sometimes, 2 = often, and the weighted scores are summed to create a total score ranging from 0 to 70. Total scores are recoded dichotomously to indicate overall mental health risk (or lack thereof) based on a cutoff score of 30 or higher on the global scale. The PSC-Y has demonstrated acceptable test-retest reliability (r = 0.45; κ = 0.50) in a sample of 90 children (Pagano et al., 2000) as well as strong internal consistency for the overall scale (α = 0.86–0.90) in samples ranging from 348 to 2,513 children in infectious disease clinics, emergency departments, and primary schools (Lowenthal et al., 2011; Montaño et al., 2011; Okuda et al., 2013). Prior research also demonstrated moderate internal consistency for the PSC-Y-35's internalizing (α = 0.76), externalizing (α = 0.73), and attention (α = 0.69) subscales (Montaño et al., 2011).
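The scoring rule described above is easy to express in code. The sketch below is a minimal illustration (the example responses are hypothetical): it sums the 35 item scores (0 = never, 1 = sometimes, 2 = often) and applies the overall at-risk cutoff of 30 reported in the text.

```python
from typing import Sequence

CUTOFF_TOTAL = 30  # overall at-risk cutoff for the 35-item PSC-Y, per the text

def score_pscy(responses: Sequence[int]) -> dict:
    """Score a completed 35-item PSC-Y: total score 0-70 plus a dichotomous risk flag."""
    if len(responses) != 35 or any(r not in (0, 1, 2) for r in responses):
        raise ValueError("expected 35 responses coded 0 (never), 1 (sometimes), 2 (often)")
    total = sum(responses)
    return {"total": total, "at_risk": total >= CUTOFF_TOTAL}

# Example: a respondent answering 'sometimes' to most items and 'often' to five
example = [1] * 30 + [2] * 5
print(score_pscy(example))  # {'total': 40, 'at_risk': True}
```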
Table 1. Demographic characteristics of sample (full sample)

Characteristic | N | % | 95% CI (%)
Gender
  Male | 3,222 | 16.8 | 16.3–17.3
  Female | 14,841 | 77.5 | 76.9–78.1
  Missing | 1,095 | 5.7 | 5.4–6.0
Race/Ethnicity
  White | 11,427 | 59.6 | 59.0–60.3
  Black | 1,142 | 6.0 | 5.6–6.3
  Hispanic/Latino | 2,060 | 10.8 | 10.3–11.2
  Asian/Pacific | 1,304 | 6.8 | 6.4–7.2
  Native American | 174 | 0.9 | 0.8–1.0
  Multiple/Other | 1,845 | 9.6 | 9.2–10.0
  Missing | 1,206 | 6.3 | 6.0–6.6
Household Income
  ≤ 30,999 | 3,263 | 17.0 | 16.5–17.6
  40,000–79,999 | 2,797 | 14.6 | 14.1–15.1
  80,000–149,999 | 1,849 | 9.7 | 9.2–10.1
  ≥ 150,000 | 849 | 4.4 | 4.1–4.7
  Missing | 10,400 | 54.3 | 53.6–55.0
Emotional/Behavioral Problem
  No | 9,037 | 47.2 | 46.5–47.9
  Yes | 10,121 | 52.8 | 52.1–53.5
Full PSC-Y At-Risk
  Overall | 15,168 | 79.2 | 78.6–79.7
  Attention | 9,756 | 50.9 | 50.2–51.6
  Internalizing | 17,290 | 90.2 | 89.8–90.7
  Externalizing | 3,703 | 19.3 | 18.8–19.9

Analyses

Prior to the factor analyses the dataset was divided into two samples in a two-step process: (1) we used the Stata command runiform(1, 2) to assign a random value ranging
between 1 and 2 to each case and (2) then assigned cases with a random value less than 1.5 to the EFA sample and cases with a random value greater than or equal to 1.5 to the CFA sample (see Electronic Supplementary Material, ESM 2). Next, exploratory factor analyses of ordinal variables (EFAs) were conducted on the first subsample to identify candidate factor structures and the semantics of those factors (Stage 1) (see ESM 3). In Stage 2, confirmatory factor analysis (CFA) and tests of measurement invariance across gender (MI) were conducted using the second subsample to determine how well the data fit the factor structure of the selected model derived from the EFAs (see ESM 4–10). All EFA, CFA, and MI analyses were conducted using the statistical software applications, PRELIS 9.3 and LISREL 9.3 (Jöreskog & Sörbom, 2017). All other analyses were conducted using Stata 14.2 (StataCorp, 2015). While multiple factor structures were considered, we sought the simplest solution that was most similar to the factor structure and items of the short form parent PSC (PSC-17). The practical benefits such as ease of use, scoring, and interpretation that would ensue if parent and youth versions European Journal of Psychological Assessment (2020), 36(1), 56–64
were the same were determined to outweigh potential minor incremental improvements in loadings, reliabilities, and correlations.
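For readers who do not use Stata, the following sketch re-expresses the described 50/50 split in Python; the uniform draw and the 1.5 threshold mirror the procedure above, but the data frame, column name, and seed are placeholders rather than the study's actual records.

```python
import numpy as np
import pandas as pd

def split_efa_cfa(data: pd.DataFrame, seed: int = 42):
    """Randomly split cases into an EFA half and a CFA half, following the described two-step procedure."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(1.0, 2.0, size=len(data))   # analogue of Stata's runiform(1, 2)
    efa_sample = data[u < 1.5]
    cfa_sample = data[u >= 1.5]
    return efa_sample, cfa_sample

# Placeholder data frame standing in for the 19,158 PSC-Y records
demo = pd.DataFrame({"pscy01": np.random.randint(0, 3, size=1_000)})
efa, cfa = split_efa_cfa(demo)
print(len(efa), len(cfa))
```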
Stage 1: Ordinal Exploratory Factor Analyses

We performed two exploratory factor analyses of ordinal variables (EFA) with logistic link functions and promax-rotated factor loadings to generate unweighted least squares solutions fitted to polychoric correlation matrices, using the first randomly selected subsample of 9,585 cases (see ESM 3–5, 10). A scree plot based on the 35-item pool showed a clear reduction in slope after 3 factors (see Figure 1 in ESM 1). For both EFAs, we specified three-factor solutions with the hypothesis that the resulting factor structures would be similar between the parent and youth short forms. The first EFA utilized all 35 items in the PSC-Y (see Tables 6–11 in ESM 1), yielding a 25-item model.
Table 2. Promax factor loadings of 16/17-item three-factor EFA solution

Item | Content | Attention | Internalizing | Externalizing | Unique variance
pscy09 | Distract easily | 0.962 | 0.132 | 0.029 | 0.190
pscy14 | Have trouble concentrating | 0.776 | 0.056 | 0.011 | 0.363
pscy08 | Daydream too much | 0.529 | 0.076 | 0.005 | 0.680
pscy04 | Fidgety, unable to sit still | 0.526 | 0.057 | 0.084 | 0.704
pscy07 | Act as if driven by motor | 0.249 | 0.162 | 0.146 | 0.816
pscy11 | Feel sad, unhappy | 0.108 | 0.932 | 0.016 | 0.211
pscy13 | Feel hopeless | 0.045 | 0.897 | 0.005 | 0.228
pscy19 | Down on yourself | 0.008 | 0.867 | 0.042 | 0.264
pscy27 | Seem to be having less fun | 0.018 | 0.647 | 0.112 | 0.533
pscy22 | Worry a lot | 0.153 | 0.558 | 0.124 | 0.614
pscy34 | Take things that do not belong to you | 0.004 | 0.044 | 0.728 | 0.481
pscy35 | Refuse to share | 0.090 | 0.012 | 0.725 | 0.523
pscy32 | Tease others | 0.058 | 0.083 | 0.648 | 0.616
pscy33 | Blame others for your troubles | 0.082 | 0.078 | 0.589 | 0.670
pscy29 | Do not listen to rules | 0.186 | 0.046 | 0.581 | 0.552
pscy16 | Fight with other children | 0.086 | 0.067 | 0.549 | 0.630
pscy31 | Do not understand other people's feelings | 0.010 | 0.044 | 0.512 | 0.733

Notes. Bold values indicate the factor to which the item is attributed (loading ≥ 0.4). The item pscy07 (in italics) is the 17th item in the parent short form PSC. Its factor loading was < 0.4 for Attention and it was therefore not included in the 16-item model.
The second EFA utilized only those PSC-Y items that correspond to the 17 items in the PSC-17, which yielded a 16-item model (see Table 2; also see Tables 12–15 in ESM 1). We used a promax-rotated factor loading of 0.4 as the minimum required loading for item inclusion in a factor.
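The ordinal EFAs above were run in PRELIS/LISREL on polychoric correlations. As a rough approximation in Python, the factor_analyzer package can fit a three-factor, promax-rotated solution; note that by default it works from Pearson rather than polychoric correlations, so loadings will only approximate those in Table 2, and the item data frame below is a placeholder.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # pip install factor_analyzer

def efa_three_factor(items: pd.DataFrame) -> pd.DataFrame:
    """Three-factor EFA with promax rotation (an approximation of the paper's ordinal EFA)."""
    fa = FactorAnalyzer(n_factors=3, rotation="promax", method="minres")
    fa.fit(items)
    # Column labels here are illustrative only; in practice each factor must be
    # interpreted from its loadings before it can be named.
    loadings = pd.DataFrame(fa.loadings_, index=items.columns,
                            columns=["Factor1", "Factor2", "Factor3"])
    # The paper retains an item on a factor only if its rotated loading is >= 0.4.
    return loadings

# usage (placeholder): loadings = efa_three_factor(pscy_items_dataframe)
```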
Stage 2: Confirmatory Factor Analyses

The three-factor models proposed in Stage 1 were evaluated through confirmatory factor analyses utilizing polychoric correlation matrices, asymptotic covariance matrices, and a robust unweighted least squares estimation method (Forero, Maydeu-Olivares, & Gallardo-Pujol, 2009; Muthén, 1993) on the second subsample of 9,573 cases (see ESM 6, 7, and 10). We also evaluated a 17-item model which directly corresponded to the parent-reported PSC-17. This 17-item PSC-Y was created by adding item 7 of the PSC-Y, "Act as if driven by a motor," to the 16-item model, in spite of its relatively low factor loading. Multiple indices were used to evaluate different aspects of model fit. Absolute fit was evaluated by using the Satorra-Bentler scaled chi-square statistic (χ²) and the standardized root mean squared residual (SRMR). The root mean squared error of approximation (RMSEA) and its 90% confidence interval (90% CI) provided a measure of fit adjusting for model parsimony (Browne, Cudeck, Bollen, & Long, 1993). The comparative fit index (CFI) and the Tucker-Lewis index (TLI) provided measures of comparative fit. Hu and Bentler (1999) suggest an acceptable model fit is defined by the following criteria: χ² (p > .05), RMSEA (≤ 0.06, with the 90% CI lower bound ≤ 0.06), SRMR (≤ 0.08), CFI (≥ 0.95), and TLI (≥ 0.95).
Measurement invariance was evaluated through four nested levels, each with increasing equality constraints: (i) configural invariance, (ii) metric invariance, (iii) scalar invariance, and (iv) strict invariance (Meredith, 1993). Configural invariance requires that the same factor model specification (i.e., the same structure) holds across groups. It investigates whether members of different groups are responding to the test’s items within the same conceptual framework. Metric invariance requires cross-group equality in the structure and factor loadings, and investigates whether a unit change in an item score is scaled to an equal unit change in the factor score across groups. Scalar invariance requires cross-group equality in structure, loadings, and intercepts, and investigates whether the meaning of the construct and the levels of the underlying items are equal in both groups. If so, the latent variable scores of the groups can be compared. Strict invariance requires cross-group equality in structure, loadings, intercepts, and residual variances. When these conditions are met, the latent variable is measured identically across groups. The following criteria for testing measurement invariance were considered together: w2 (p > .05); Δw2 (p > .05); RMSEA 0.06; SRMR 0.08; CFI 0.095; TLI 0.95; ΔCFI > 0.01; and ΔTLI > 0.02. Δw2 is the European Journal of Psychological Assessment (2020), 36(1), 56–64
test of the χ² difference of the nested models, and ΔCFI and ΔTLI are the differences in CFI and TLI, respectively, between the two nested models (Wu, Li, & Zumbo, 2007) (see ESM 8–10).
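Once each level's fit statistics are in hand, the nested-model comparisons reduce to simple differences. The sketch below applies the Δ criteria described above to the configural and metric values reported in Table 5; it is only an arithmetic illustration, since the scaled Δχ² reported in the article is computed by the SEM software rather than by naive subtraction.

```python
from scipy.stats import chi2

def compare_nested(chisq_0, df_0, chisq_1, df_1, cfi_0, cfi_1, tli_0, tli_1):
    """Delta statistics between two nested invariance models (less vs. more constrained)."""
    d_chisq, d_df = chisq_1 - chisq_0, df_1 - df_0
    return {
        "delta_chisq": round(d_chisq, 3),
        "delta_df": d_df,
        "p(delta_chisq)": chi2.sf(d_chisq, d_df),  # naive (unscaled) difference test
        "delta_CFI": round(cfi_1 - cfi_0, 3),
        "delta_TLI": round(tli_1 - tli_0, 3),
    }

# Configural vs. metric model, fit values taken from Table 5
print(compare_nested(1955.560, 232, 2084.083, 246,
                     cfi_0=0.982, cfi_1=0.981, tli_0=0.979, tli_1=0.979))
```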
Results
Stage 1: Exploratory Factor Analyses of Ordinal Variables

Model 1 – 35-Item Pool
EFA of all 35 PSC-Y items yielded a 25-item, three-factor solution (see Tables 6 and 7 in ESM 1). Semantics of the three latent variables were consistent with the three factors of the 35-item PSC-Y. Factor correlations were moderately strong between Attention and Internalizing (r = 0.53) and between Attention and Externalizing (r = 0.51), and weak between Internalizing and Externalizing (r = 0.31). The Externalizing factor of the 25-item solution included exactly the same items as the corresponding PSC-17 Externalizing factor. The Internalizing factor included the same five items as the PSC-17 Internalizing factor and added an additional six items – all of which are arguably internalizing behaviors. The Attention factor included all but one of the items in the PSC-17 Attention factor (Item 7, "Act as if driven by a motor," had a factor loading of only 0.233), and added three additional behaviors that are arguably associated with ADHD. Internal consistency reliabilities of the resultant factors and the overall score of the 25-item solution are presented in Tables 8–11 in ESM 1. Cronbach's α indicated good internal consistency for the Overall Score and the Internalizing factor (.85 and .84, respectively) and acceptable internal consistency for the Attention and Externalizing factors (.74 and .74, respectively).

Model 2 – 17-Item Pool
Seeking to maximize similarities between youth and parent short forms of the PSC, we also conducted an EFA using only the 17 items of the PSC-Y that correspond to the parent PSC-17. The EFA yielded a 16-item, three-factor solution (see Table 2). Item 7, "Act as if driven by a motor," had a factor loading of only 0.249 (below the 0.4 threshold) on the Attention factor. Semantics of the factor structures were similar to those of the 25-item solution, though more limited in scope for the Attention and Internalizing factors. As a result, factor correlations were slightly lower for Attention-Internalizing and Attention-Externalizing (r = 0.44 and r = 0.42, respectively), and substantially lower for Internalizing-Externalizing (r = 0.17).

We also compared internal consistency reliabilities with and without Item 7 (see Tables 12–15 in ESM 1). When Item 7 was included, Cronbach's α for the Overall Score and the Internalizing factor indicated good internal consistency (.79 and .81, respectively), and α for the Attention and Externalizing factors indicated acceptable internal consistency (.68 and .74, respectively). Removing Item 7 had a slight impact on the Overall Score and the Attention factor (α = .78 and α = .69, respectively) and no impact on the other factors.
Stage 2: Confirmatory Factor Analyses

For each model, all items loaded on exactly one factor, and all measurement error was presumed to be uncorrelated. The latent variables were allowed to correlate. As a result, the 25-item, 17-item, and 16-item models were overidentified with 272, 116, and 101 degrees of freedom, respectively. Goodness-of-fit measures for all three models are presented in Table 3. As expected given the large sample size, the χ² tests were overpowered and rejected the null hypothesis for all three models. The 25-item model failed to meet any of the prescribed criteria for acceptable model fit. Both the 16-item and 17-item models met criteria for SRMR, CFI, and TLI. Differences in indices between the 16-item and 17-item models were relatively small. Completely standardized parameter estimates from the solutions are presented in Tables 16–19 in ESM 1. All freely estimated parameters were statistically significant (p < .001). Factor loading estimates revealed that the indicators for the 25-item model were moderately to strongly related to their purported factors (R²s = 0.154–0.581); indicators for the 16-item and 17-item models were more strongly related to their factors (R²s = 0.260–0.748 and 0.263–0.743, respectively; see Table 19 in ESM 1). Estimates from the solutions indicate very strong relationships between the three dimensions in the 25-item model and, for the 16-item and 17-item models, moderately strong relationships between Attention and both Internalizing and Externalizing, but weak relationships between Internalizing and Externalizing (see Table 4).
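The degrees of freedom quoted above are consistent with counting the unique inter-item correlations and the freely estimated parameters (one loading per item plus three factor correlations, with factor variances fixed). The quick check below is an illustration of that arithmetic, not a description of the authors' software output.

```python
def cfa_df(n_items: int, n_factors: int = 3) -> int:
    """Degrees of freedom for a simple-structure CFA fitted to a correlation matrix."""
    unique_correlations = n_items * (n_items - 1) // 2
    free_parameters = n_items + n_factors * (n_factors - 1) // 2  # loadings + factor correlations
    return unique_correlations - free_parameters

for p in (25, 17, 16):
    print(p, "items ->", cfa_df(p), "df")   # 272, 116, 101, matching the reported values
```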
Measurement Invariance

None of the models met criteria for the χ², Δχ², and RMSEA. Like the χ² test for a single-model CFA, the Δχ² is sensitive to sample size, often rejecting trivial differences when the sample size is large. All models did meet criteria for CFI, ΔCFI, TLI, and ΔTLI. In addition, the differences in RMSEA and SRMR were small for all nested models except for strict invariance.
Table 3. Goodness-of-fit statistics for CFA models

Statistic | 25-item model | 16-item model | 17-item model
Satorra-Bentler (1988) scaled χ² | 20,738.20* | 1763.49* | 2372.99*
Standardized root mean squared residual (SRMR) | 0.1203 | 0.0505 | 0.0526
Root mean squared error of approximation (RMSEA) | 0.1370 | 0.0709 | 0.0754
RMSEA 90% confidence interval lower bound | 0.1360 | 0.0692 | 0.0739
Comparative fit index (CFI) | 0.8090 | 0.9736 | 0.9654
Tucker-Lewis fit index (TLI) | 0.7894 | 0.9686 | 0.9595

Note. *p < .001.
Table 4. CFA factor correlations

Factor pair | 25-item model (r, SE, z) | 16-item model (r, SE, z) | 17-item model (r, SE, z)
Attention–Internalizing | 0.870, 0.008, 111.479 | 0.442, 0.013, 34.265 | 0.487, 0.013, 38.500
Attention–Externalizing | 0.678, 0.012, 56.309 | 0.431, 0.013, 33.382 | 0.464, 0.013, 36.126
Internalizing–Externalizing | 0.578, 0.011, 53.114 | 0.171, 0.014, 12.274 | 0.172, 0.014, 12.328
Table 5. Fit indices of measurement invariance models

Model | Satorra-Bentler scaled χ² (df) | RMSEA | SRMR | Δχ² (df) | CFI | ΔCFI | TLI | ΔTLI
Configural | 1,955.560 (232)* | 0.076 | 0.051 | – | 0.982 | – | 0.979 | –
Metric | 2,084.083 (246)* | 0.075 | 0.055 | 92.533 (14)* | 0.981 | 0.001 | 0.979 | 0.000
Scalar | 2,206.668 (260)* | 0.073 | 0.061 | 122.585 (14)* | 0.980 | 0.001 | 0.979 | 0.000
Strict | 2,360.625 (283)* | 0.069 | 0.084 | 153.957 (23)* | 0.979 | 0.001 | 0.980 | 0.001

Note. *p < .05.
Discussion
Using data from a large national dataset, the current study provides evidence of the viability of a 17-item short form of the PSC-Y that possesses the same subscales and items as the parent-reported PSC-17. The initial EFA utilizing all 35 items of the PSC-Y yielded a three-factor, 25-item solution with semantics consistent with the full 35-item PSC-Y. Internal consistency reliability estimates of the overall score and the Internalizing subscale were very good; estimates for the Attention and Externalizing subscales were adequate. However, CFA revealed a poor fit of the model to the data, with an unexpectedly high Internalizing-Externalizing factor correlation and low item-factor correlations for the majority of items. These findings suggest the 25-item model is not a viable candidate for a short form PSC-Y (Brown, 2014; Browne et al., 1993; MacCallum, Browne, & Sugawara, 1996). The EFA utilizing the same 17 items as found on the PSC-17 yielded a three-factor, 16-item solution.
As with the 25-item solution, the semantics of the factors were consistent with the PSC-Y. Internal consistency reliability estimates were good to very good for the overall and all three subscale scores, and CFA indicated the model satisfied three of six goodness-of-fit criteria and nearly satisfied two others. As expected given the large sample size, the model did not satisfy the χ² test for goodness of fit, simply indicating that the model estimates do not exactly reproduce the sample variances and covariances. The strong item-factor relationships found in the CFA of the 16-item solution, combined with the more intuitive moderate Attention-Internalizing and Attention-Externalizing and low Internalizing-Externalizing factor correlations, suggest the items are reliable indicators of the purported constructs and adequately differentiate the constructs from each other. These findings suggest the 16-item model is a viable candidate for a short form PSC-Y (Brown, 2014; Browne et al., 1993; MacCallum et al., 1996). The only item from the parent PSC-17 that did not load in either the 25-item or the 16-item solution was Item 7, "Acts as if driven by a motor." Since an important consideration in the development of the PSC, PSC-Y, and PSC-17 has been to keep the measure as simple as possible for respondents and clinicians to complete, score, and interpret, and since the parent- and youth-reported short forms
were identical except for this one item, we felt that it was important to investigate whether the inclusion of Item 7 would adversely affect the psychometric properties of a short form PSC-Y. We found that inclusion of Item 7 had a negligible impact on the internal consistency of overall and subscale scores, and three of six CFA fit criteria were satisfied (as they had been in the 16-item version that did not include this item). Further, while CFA factor loadings were slightly lower for Attention, they remained strong for Internalizing and Externalizing, and the strength of item-factor relationships remained essentially the same, as did the factor correlations. As a result, we elected to add Item 7 to the 16-item model and recommend a 17-item short form of the PSC-Y that uses the same 17 items on the same three subscales as the parent-reported PSC-17. Tests of measurement invariance across gender suggest that this solution adequately meets criteria for scalar invariance, allowing for meaningful comparison of factor scores between males and females. The analyses reported in this study support a 17-item short form of the PSC-Y and provide preliminary evidence of its reliability and utility. Further research is needed, including additional CFA on samples more representative of community and primary care screening populations, and evaluations of temporal stability (e.g., test-retest reliability), construct validity, and screening performance (e.g., sensitivity, specificity, predictive values, etc.). There are limitations to this study. First, the data were obtained from an online screen completed by self-selecting respondents who reported being 11–17 years old and were predominantly female. Second, a disproportionately high number of PSC-Y overall and subscale scores fell into the At-Risk category, suggesting that the current sample is more representative of a case-finding population than a general screening population. This limitation is mitigated to some extent by findings from one previous study (Montaño et al., 2011) that used the PSC-Y with the identical three factors as reported in this paper. In a much less impaired sample of 358 patients seen in a pediatric emergency department and their parents, Montaño and associates compared total and subscale scores on the PSC-Y with parents' reports on the same scales on the parent PSC and found very similar mean scores and rates of impairment on the attention, externalizing, and internalizing subscales (8%, 8%, 18% on the PSC-Y compared to 9%, 8%, 14% on the parent PSC) and overall score (14% vs. 13%, respectively), comparably high Cronbach's α (.90 vs. .97), and a moderate level of agreement on total score caseness (κ = .41). The above limitations notwithstanding, it seems likely that this short form of the PSC-Y will be able to fill a need in clinical practice and in research for a brief general self-report psychosocial screen for adolescents.
As future studies are conducted using this 17-item short form of the PSC-Y (the PSC-17-Y), they should be able to fill in the gaps noted here.
Electronic Supplementary Materials
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000486
ESM 1. Figures and Tables (.pdf). Figure 1 and Tables 6–19: Scree plot and tables supporting exploratory and confirmatory factor analyses.
ESM 2. Stata Data File (.dta). Prepared data in Stata format.
ESM 3. LISREL Input and Output Files (.pdf). LISREL OUT files containing syntax and results of EFAs, CFAs, and gender MI.
ESM 4. LISREL Data – 35-Item EFA (.LSF). LISREL data system file containing all 35 PSC-Y items for EFA (n = 9585).
ESM 5. LISREL Data – 17-Item EFA (.LSF). LISREL data system file containing 17 PSC-Y items for EFA (n = 9585).
ESM 6. LISREL ACM – 25-Item CFA (.acm). Asymptotic covariance matrix for CFA of the 25-item model.
ESM 7. LISREL ACM – 17/16-Item CFA (.acm). Asymptotic covariance matrix for CFA of the 17-item model.
ESM 8. LISREL ACM – Short Form – Females (.acm). Asymptotic covariance matrix for gender MI (females).
ESM 9. LISREL ACM – Short Form – Males (.acm). Asymptotic covariance matrix for gender MI (males).
ESM 10. LISREL PCM and MNS (.txt). Polychoric correlation matrices and means for EFAs, CFAs, and gender MI.
References

American Academy of Pediatrics. (2010). The case for routine psychosocial screening. Pediatrics, 125, s133–s139. https://doi.org/10.1542/peds.2010-0788J American Academy of Pediatrics. (2010). Enhancing pediatric mental health care: Strategies for preparing a community. Pediatrics, 125(Supplement 3), S75–S86. https://doi.org/10.1542/peds.2010-0788D Brown, T. A. (2014). Confirmatory factor analysis for applied research. New York, NY: Guilford Press. Browne, M. W., Cudeck, R., Bollen, K. A., & Long, S. J. (1993). Alternative ways of assessing model fit. SAGE Focus Editions, 154, 136–136. Cassidy, L. J., & Jellinek, M. S. (1998). Approaches to recognition and management of childhood psychiatric disorders in pediatric primary care. Pediatric Clinics, 45, 1037–1052. https://doi.org/10.1016/S0031-3955(05)70061-4
Claudius, I., Mahrer, N., Nager, A. L., & Gold, J. I. (2012). Occult psychosocial impairment in a pediatric emergency department population. Pediatric Emergency Care, 28, 1334–1337. https:// doi.org/10.1097/PEC.0b013e318276b0bc Forero, C. G., Maydeu-Olivares, A., & Gallardo-Pujol, D. (2009). Factor analysis with ordinal indicators: A Monte Carlo study comparing DWLS and ULS estimation. Structural Equation Modeling, 16, 625–641. https://doi.org/10.1080/ 10705510903203573 Gall, G., Pagano, M. E., Desmond, S. M., Perrin, J. M., & Murphy, J. M. (2000). Utility of psychosocial screening at a school-based health center. Journal of School Health, 70, 292–298. https:// doi.org/10.1111/j.1746-1561.2000.tb07254.x Gardner, W., Lucas, A., Kolko, D. J., & Campo, J. V. (2007). Comparison of the PSC-17 and alternative mental health screens in an at-risk primary care sample. Journal of the American Academy of Child and Adolescent Psychiatry, 46, 611– 618. https://doi.org/10.1097/chi.0b013e318032384b Gardner, W., Murphy, M., Childs, G., Kelleher, K., Pagano, M. E., Jellinek, M., . . . Chiappetta, L. (1999). The PSC-17: A brief pediatric symptom checklist with psychosocial problem subscales. A report from PROS and ASPN. Ambulatory Child Health, 5, 225–236. Hacker, K., Arsenault, L., Franco, I., Shaligram, D., Sidor, M., Olfson, M., & Goldstein, J. (2014). Referral and follow-up after mental health screening in commercially insured adolescents. Journal of Adolescent Health, 55, 17–23. https://doi.org/ 10.1016/j.jadohealth.2013.12.012 Hacker, K., Penfold, R., Arsenault, L., Zhang, F., Murphy, M., & Wissow, L. (2014). Screening for behavioral health issues in children enrolled in Massachusetts Medicaid. Pediatrics, 133, 46–54. https://doi.org/10.1542/peds.2013-1180 Hogan, M. F. (2003). New Freedom Commission report: The President’s New Freedom Commission: Recommendations to transform mental health care in America. Psychiatric Services, 54, 1467–1474. Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6, 1–55. https://doi.org/10.1080/10705519909540118 Jellinek, M. S., Murphy, J. M., & Burns, B. J. (1986). Brief psychosocial screening in outpatient pediatric practice. The Journal of Pediatrics, 109, 371–378. https://doi.org/10.1016/ S0022-3476(86)80408-5 Jellinek, M. S., Murphy, J. M., Little, M., Pagano, M. E., Comer, D. M., & Kelleher, K. J. (1999). Use of the Pediatric Symptom Checklist to screen for psychosocial problems in pediatric primary care: A national feasibility study. Archives of Pediatrics & Adolescent Medicine, 153, 254–260. https://doi.org/10.1001/ archpedi.153.3.254 Jellinek, M. S., Murphy, J. M., Robinson, J., Feins, A., Lamb, S., & Fenton, T. (1988). Pediatric Symptom Checklist: Screening school-age children for psychosocial dysfunction. The Journal of Pediatrics, 112, 201–209. https://doi.org/10.1016/S00223476(88)80056-8 Jöreskog, K., & Sörbom, D. (2017). LISREL 9.3 for Windows [Computer software]. Skokie, IL: Scientific Software International, Inc Kleinman, R. E., Hall, S., Green, H., Korzec-Ramirez, D., Patton, K., Pagano, M. E., & Murphy, J. M. (2002). Diet, breakfast, and academic performance in children. Annals of Nutrition & Metabolism, 46(Suppl. 1), 24–30. https://doi.org/10.1159/ 000066399 Kolko, D. J., Campo, J. V., Kelleher, K., & Cheng, Y. (2010). 
Improving access to care and clinical outcome for pediatric behavioral problems: A randomized trial of a nurse-adminis-
tered intervention in primary care. Journal of Developmental and Behavioral Pediatrics: JDBP, 31, 393. https://doi.org/ 10.1097/DBP.0b013e3181dff307 Kuhlthau, K., Jellinek, M. S., White, G., VanCleave, J., Simons, J., & Murphy, J. M. (2011). Increases in behavioral health screening in pediatric care for Massachusetts Medicaid patients. Archives of Pediatrics & Adolescent Medicine, 165, 660–664. https://doi. org/10.1001/archpediatrics.2011.18 Lowenthal, E., Lawler, K., Harari, N., Moamogwe, L., Masunge, J., Masedi, M., . . . Murphy, J. M. (2011). Validation of the pediatric symptom checklist in HIV-infected Batswana. Journal of Child and Adolescent Mental Health, 23, 17–28. https://doi.org/ 10.2989/17280583.2011.594245 MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130. Mann, C. (2013). CMCS Informational Bulletin: Prevention and early identification of mental health and substance use conditions. Retrieved from https://www.medicaid.gov/federal-policy-guidance/downloads/cib-03-27-2013.pdf Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543. https://doi. org/10.1007/bf02294825 Montaño, Z., Mahrer, N. E., Nager, A. L., Claudius, I., & Gold, J. I. (2011). Assessing psychosocial impairment in the pediatric emergency department: child/caregiver concordance. Journal of Child and Family Studies, 20, 473–477. https://doi.org/ 10.1007/s10826-010-9414-3 Murphy, J. M. (2016). Review of Research on the PSC-17 Pediatric Symptom Checklist. Retrieved from https://www.massgeneral. org/ psychiatry/services/psc_17 Murphy, J. M., Bergmann, P., Chiang, C., Sturner, R., Howard, B., Abel, M. R., & Jellinek, M. S. (2016). The PSC-17: Subscale scores, reliability, and factor structure in a new national sample. Pediatrics, 138(3), e20160038. https://doi.org/ 10.1542/peds.2016-0038 Murphy, J. M., Nguyen, T., Lucke, C., Chiang, C., Plasencia, N., & Jellinek, M. S. (2017). Adolescent Self-screening for mental health problems; demonstration of an internet-based approach. Academic Pediatrics, 18, 59–65. https://doi.org/ 10.1016/j.acap.2017.08.013 Muthén, B. O. (1993). Goodness of fit with categorical and other nonnormal variables. SAGE Focus Editions, 154, 205– 205. O’Connell, M. E., Boat, T., & Warner, K. E. (2009). Preventing mental, emotional, and behavioral disorders among young people: Progress and possibilities. Washington, DC: National Academies Press. Okuda, M., Sekiya, M., Okuda, Y., Kunitsugu, I., Yoshitake, N., & Hobara, T. (2013). Psychosocial functioning and self-rated health in Japanese school-aged children: A cross-sectional study. Nursing & Health Sciences, 15, 157–163. https://doi.org/ 10.1111/nhs.12005 Pagano, M. E., Cassidy, L. J., Little, M., Murphy, J. M., & Jellinek, M. S. (2000). Identifying psychosocial dysfunction in school-age children: The Pediatric Symptom Checklist as a self-report measure. Psychology in the Schools, 37, 91–106. https://doi. org/10.1002/(SICI)1520-6807(200003)37:2<91::AID-PITS1>3.0. CO;2-3 Sayal, K., & Taylor, E. (2004). Detection of child mental health disorders by general practitioners. The British Journal of General Practice, 54, 348–352. Semansky, R. M., Koyanagi, C., & Vandivort-Warren, R. (2003). Behavioral health screening policies in Medicaid programs nationwide. Psychiatric Services, 54, 736–739. https://doi.org/ 10.1176/appi.ps.54.5.736
Sheldrick, R. C., Merchant, S., & Perrin, E. C. (2011). Identification of developmental-behavioral problems in primary care: A systematic review. Pediatrics, 128, 356–363. https://doi.org/10.1542/peds.2010-3261 Simonian, S. J., & Tarnowski, K. J. (2001). Utility of the Pediatric Symptom Checklist for behavioral screening of disadvantaged children. Child Psychiatry and Human Development, 31, 269–278. StataCorp. (2015). Stata statistical software: Release 14. College Station, TX: StataCorp LP. Wu, A. D., Li, Z., & Zumbo, B. D. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: A demonstration with TIMSS data. Practical Assessment, Research and Evaluation, 12, 1–26. Retrieved from http://pareonline.net/getvn.asp?v=12&n=3

History
Received May 31, 2017
Revision received February 13, 2018
Accepted February 28, 2018
Published online December 19, 2018
EJPA Section/Category: Short scales
Acknowledgments
The authors have no financial relationships relevant to this article to disclose.

Funding
All phases of this study were supported by the Fuss Family Foundation.

Conflict of Interest
Theresa Nguyen is a paid staff member of MHA. The other authors have no conflicts of interest to disclose.

John Michael Murphy
Child Psychiatry Service
Massachusetts General Hospital
Yawkey 6A
Boston, MA 02114
USA
mmurphy6@partners.org
Original Article
Psychometric Properties of the Strengths and Difficulties Questionnaire in Children Aged 5–12 Years Across Seven European Countries

Mathilde M. Husky,1 Roy Otten,2,3,4 Anders Boyd,5 Ondine Pez,6 Adina Bitfoi,7 Mauro G. Carta,8 Dietmar Goelitz,9 Ceren Koç,10 Sigita Lesinskiene,11 Zlatka Mihova,12 and Viviane Kovess-Masfety6,13

1 Laboratoire de Psychologie EA 4139, Institut Universitaire de France, Université de Bordeaux, France
2 Behavioural Science Institute, Radboud University Nijmegen, The Netherlands
3 Pluryn Research & Development, Nijmegen, The Netherlands
4 REACH-Institute, Department of Psychology, Arizona State University, Phoenix, AZ, USA
5 INSERM, UMR_S1136, Institut Pierre Louis d'Epidémiologie et de Santé Publique, Paris, France
6 École des Hautes Études en Santé Publique, Sorbonne Paris Cite, Paris, France
7 The Romanian League for Mental Health, Bucharest, Romania
8 Centro di Psichiatria di Consulenza e Psicosomatica, Azienda Ospedaliero Universitaria di Cagliari, Italy
9 Department of Humanities, Social Sciences and Theology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
10 Yeniden Health and Education Society, Istanbul, Turkey
11 Psychiatry Clinic, School of Medicine, University of Vilnius, Lithuania
12 Department of Psychology, New Bulgarian University, Sophia, Bulgaria
13 Paris Descartes University EA 4057, Paris, France
Abstract: The Strengths and Difficulties Questionnaire (SDQ) has been used extensively to screen for possible mental disorders in epidemiological studies around the world. The present study aimed to compare the internal consistency of both the parent- and teacher-SDQ across seven European countries (Italy, Germany, the Netherlands, Lithuania, Bulgaria, Romania, and Turkey) and to determine the ability of the SDQ to discriminate cases from non-cases of disorders against the well-established Development and Well-Being Assessment (DAWBA). The sample included 541 assessments of children aged 5–12 years. Internal consistency ranged from .74 to .85 for the teacher-SDQ and from .60 to .85 for the parent-SDQ, with significant between-country differences. The SDQ further proved to be an adequate screening instrument for the detection of any mental disorder (area under the receiver operating characteristic curve [AUROC] = .74, 95% CI: .69–.78) and for externalizing disorders in particular (AUROC = .80, 95% CI: .76–.84). There were no differences in AUROC between countries (p = .09), yet sample sizes were limited, thus restricting our ability to detect between-country differences in AUROCs. The results reinforce existing research on the SDQ and support its use in detecting probable cases of psychiatric disorders in children across Europe.
Keywords: SDQ, DAWBA, cross-national, psychometrics, validation
Child mental health problems are associated with academic difficulties as well as psychiatric disorders and functional impairment through adolescence and into adulthood (Costello, Mustillo, Erkanli, Keeler, & Angold, 2003; Vander
Stoep, Weiss, Saldanha, Cheney, & Cohen, 2003). Due to the burden of mental health problems on society (Wittchen et al., 2011), the European Union (EU) invested in identifying instruments that allowed for the assessment of mental
health in children across its member states (Kovess-Masfety et al., 2016). For large-scale use, it is paramount that such screening instruments be short, easily administered, available in a number of different languages, and validated in each of the populations intended for use. An EU-funded project, aimed at providing a kit for measuring child mental health status in diverse EU countries, proposed to calibrate a selected set of instruments across seven countries, namely Italy, Germany, the Netherlands, Lithuania, Bulgaria, Romania, and Turkey, with which large-sample surveys among primary school children could be conducted (Kovess et al., 2015). The present study represents the validation part of this project, of which the primary objective was to examine the psychometric properties of the parent and teacher versions of the Strengths and Difficulties Questionnaire (SDQ; R. Goodman, 1997; R. Goodman, 2001) and to determine its capacity to discriminate cases from non-cases of specific disorders. The SDQ has been used extensively to screen for possible mental disorders in epidemiological studies around the world. It is a short self-report instrument available in a parent-, teacher-, and child-version and can easily be incorporated in large-scale studies. The SDQ has been translated into numerous languages and used extensively in Europe, including Denmark (Niclasen et al., 2012), Norway (Rønning, Handegaard, Sourander, & Mørch, 2004), the Netherlands (Muris, Meesters, & van den Berg, 2003), Finland (Koskelainen, Sourander, & Kaljonen, 2000), Germany (Klasen et al., 2000), France (Shojaei, Wazana, Pitrou, & Kovess, 2009), and several Southern European countries (Marzocchi et al., 2004). While many of the aforementioned countries have published psychometric information on the parent- or teacher-version of the SDQ, or on both versions, among children of various age ranges (Meltzer, Gatward, Goodman, & Ford, 2000; Stone, Otten, Engels, Vermulst, & Janssens, 2010), an examination of the internal consistency of each subscale in samples of the same age, recruited using the same procedures, has yet to be carried out across a large number of countries and would considerably strengthen the available psychometric information on the SDQ. In addition to the scale's basic psychometrics, it is important to determine how the SDQ performs in its capacity to discriminate cases from non-cases of mental disorders, which only a minority of studies have examined, as reported in a recent review (Stone et al., 2010). In order to achieve this, well-established structured clinical interviews must be used as a gold standard. Indeed, clinicians from various countries are likely to have different practices in terms of identifying and labeling childhood disorders. The Development and Well-Being Assessment (DAWBA; R. Goodman, Ford, Richards, Gatward, & Meltzer, 2000) combines
interviews with expert ratings and has the advantage of being available in numerous languages. However, few studies have examined this important point. Such studies have been conducted both in clinical and population-based samples in the UK (R. Goodman, Ford, Richards, et al., 2000; R. Goodman, Ford, Simmons, Gatward, & Meltzer, 2000) and in Germany (Becker, Woerner, Hasselhorn, Banaschewski, & Rothenberger, 2004). These studies yielded positive results regarding the SDQ's ability to identify cases, in particular for externalizing disorders. A more recent international study conducted in Bangladesh, Brazil, Britain, India, Norway, Russia, and Yemen cast doubt on the possibility of using the SDQ in widely diverse cultures and called for additional research that would go beyond comparing SDQ scores cross-nationally (A. Goodman et al., 2012). The present study was designed to collect comparable samples of children using similar methods, as well as data from parent and teacher informants, in seven European countries in order to achieve the following: (1) examine the internal consistency of both the parent- and teacher-SDQ across countries, (2) evaluate the capacity of the SDQ to discriminate cases from non-cases of disorders as assessed by the DAWBA, and (3) determine the accuracy of the SDQ in detecting cases with and without considering the level of impairment, as measured by the supplemental impact section of the SDQ.
Methods

Participants
The present study was part of the School Children Mental Health in Europe (SCMHE) study, a cross-sectional survey of European school children aged 5–12 years. The SCMHE study was conducted in Germany, Italy, the Netherlands, Lithuania, Romania, Bulgaria, and Turkey. The present study was designed to provide data on the validity of the SDQ in the context of large cross-national mental health surveys in Europe. Data were collected in 2009 and 2010. In each country, similar procedures were used to identify a sample that included cases of externalizing disorders and internalizing disorders as well as non-cases, so as to determine the ability of the SDQ to discriminate cases from non-cases. In order to achieve this, each country was instructed to enroll approximately 100 children: 40 children with externalizing problems, including oppositional defiant disorder (ODD), conduct disorder (CD), or attention-deficit/hyperactivity disorder (ADHD); 40 children with internalizing problems, including specific phobia, major depressive disorder, separation anxiety, or generalized anxiety; and 20
children with no psychiatric diagnosis. Children with a psychiatric disorder were recruited from child psychiatric clinics or other such clinical settings. Patients were included until target numbers were obtained. Because the SDQ was designed to screen for externalizing and internalizing disorders, and because the purpose of the present study was to provide data that would support the use of the SDQ in large population-based surveys, children were excluded if they were diagnosed with a severe mental disorder, including psychotic disorder or autism spectrum disorders, or if they suffered from intellectual disabilities. Children who did not have psychiatric, behavioral, or emotional problems and who were seen for a routine medical appointment were enrolled from general practitioners or pediatricians. Each country strived to reach the target goal of approximately 80 cases and 20 non-cases in the allotted time. Once eligible children were identified in participating settings through an examination of medical records and/or their treating physician, parents were approached by a member of the research staff and provided with information on the study as well as an opportunity to sign a written consent form and schedule an appointment. During the session, parents completed the parent-SDQ and provided basic socio-demographic information. They were then invited to meet with a member of the research staff to obtain information regarding the parent-DAWBA. In addition, parents were given a paper version of the teacher-SDQ and teacher-DAWBA for them to distribute to their child's teacher, and a prepaid envelope to return teacher forms to the research staff. The teacher-DAWBA was administered online. After having received both parent- and teacher-DAWBA reports, a trained clinician was asked to review the combined parent- and teacher-DAWBA results to generate standardized clinical diagnoses for each child. Each country received approval by the relevant ethical committees. The sample used in the present study included 541 assessments with data on both parent-SDQ and parent-DAWBA. Some countries did not reach the target number, mainly because they failed to obtain the SDQ from parents; the SDQ from teachers was even more difficult to obtain since we relied on parents to obtain it from teachers. Children were 5–12 years old (Mage = 8.7, SD = 1.4) and 62.8% of the sample were male. Countries were similar with regard to gender distribution (χ² = 5.25, df = 6, p = .512) and mean age (F = 1.66, df = 6, p = .129; Table 1).
Materials

Strengths and Difficulties Questionnaire (SDQ)
Child psychopathology was assessed using the parent- and teacher-versions of the SDQ (R. Goodman, 1997, 2001). At the time of the study, the SDQ was available in all required
languages: Dutch (Van Widenfelt, Goedhart, Treffers, & Goodman, 2003), German (Klasen, Woerner, Rothenberger, & Goodman, 2003), Italian (Di Riso et al., 2010), Romanian, Turkish (Güvenir et al., 2008), Lithuanian (Gintilienė et al., 2004), and Bulgarian. The parent- and teacher-SDQ each contain 25 items. Each item is scored as "not true" (0 or 2), "somewhat true" (1), or "certainly true" (2 or 0). The questionnaire consists of five subscales of five items each: hyperactivity/inattention, emotional problems, conduct problems, peer problems, and prosocial behaviors. The SDQ can also assess the level of impairment. This "impact supplement" contains items reflecting overall distress and impairment. These items are summed to generate an impact score that ranges from 0 to 10 for the parent-SDQ and from 0 to 6 for the teacher-SDQ. In addition, a total difficulties score is computed representing the sum of the first four subscales listed above (emotional symptoms, conduct, hyperactivity/inattention, and peer relationship problems). There were two ways in which SDQ cases were determined. First, SDQ cases were determined based on score cutoffs proposed by the instrument's author for each subscale (https://www.sdqinfo.org). These cutoffs identify "normal," "borderline," and "abnormal" scores, which can then be recoded to represent the absence (normal or borderline score) or presence (abnormal score) of each class of disorders: internalizing disorders (emotional problems subscale), attention deficit hyperactivity disorder (hyperactivity/inattention subscale), and conduct disorders (conduct problems subscale). Conduct disorders and hyperactivity disorders were combined into a broader category of externalizing disorders reflecting the presence of either disorder or both. Second, we determined SDQ cases taking into account the above-mentioned scales and the level of impairment as measured by the impact subscale. In order to do so, probable cases of disorders were limited to those who also had an impact score above the cutoff for the abnormal range.

Development and Well-Being Assessment (DAWBA)
The DAWBA (R. Goodman, Ford, Richards, et al., 2000) is a structured computerized interview designed to generate DSM-IV (American Psychiatric Association, 1994) psychiatric diagnoses for 5- to 17-year-old children and adolescents (http://dawba.info/a0.html). The disorders assessed for this study include separation anxiety, specific phobia, generalized anxiety disorder, major depression, ADHD/hyperkinesis, ODD, and CD. The DAWBA was administered to parents by a member of the research team, while the teacher-DAWBA was self-administered by teachers. A trained clinician reviewed the results of both parent- and teacher-DAWBA reports and indicated whether the child in fact met criteria for any given disorder, thereby identifying DAWBA cases.
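To make the SDQ scoring rules described above concrete, here is a minimal sketch of how subscale scores, the total difficulties score, and abnormal-range caseness could be derived from a 25-item response vector. The item-to-subscale mapping and the cutoff values are placeholders; the actual assignments and normal/borderline/abnormal bands are those published at https://www.sdqinfo.org.

```python
import numpy as np

# Placeholder item-to-subscale mapping (indices 0-24); the real key is the published SDQ scoring key.
SUBSCALES = {
    "emotional":     [0, 1, 2, 3, 4],
    "conduct":       [5, 6, 7, 8, 9],
    "hyperactivity": [10, 11, 12, 13, 14],
    "peer":          [15, 16, 17, 18, 19],
    "prosocial":     [20, 21, 22, 23, 24],
}
ABNORMAL_CUTOFFS = {"emotional": 5, "conduct": 4, "hyperactivity": 7, "peer": 4}  # placeholders only

def score_sdq(items: np.ndarray) -> dict:
    """items: 25 responses coded 0/1/2, already reverse-scored where required."""
    scores = {name: int(items[idx].sum()) for name, idx in SUBSCALES.items()}
    scores["total_difficulties"] = sum(
        scores[s] for s in ("emotional", "conduct", "hyperactivity", "peer")
    )
    return scores

def abnormal_cases(scores: dict) -> dict:
    """Flag probable 'abnormal' caseness per subscale, mirroring the cutoff rule described above."""
    return {name: scores[name] >= cut for name, cut in ABNORMAL_CUTOFFS.items()}

example = np.array([2, 1, 2, 2, 1, 0, 1, 0, 0, 1, 2, 2, 1, 2, 1, 0, 1, 1, 0, 0, 2, 2, 1, 2, 2])
print(score_sdq(example))
print(abnormal_cases(score_sdq(example)))
```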
Table 1. Sample characteristics

Characteristic | Total (n = 541) | Italy (n = 66) | Germany (n = 49) | The Netherlands (n = 90) | Lithuania (n = 99) | Bulgaria (n = 66) | Romania (n = 69) | Turkey (n = 100)
Gender, n (%)
Boy | 340 (62.8) | 46 (69.7) | 28 (57.1) | 54 (57.4) | 57 (57.6) | 45 (68.2) | 42 (60.9) | 67 (67.0)
Girl | 201 (37.2) | 20 (30.3) | 21 (42.9) | 36 (42.6) | 42 (42.4) | 21 (31.8) | 27 (39.1) | 33 (33.0)
Mean age (SD) | 8.66 (1.43) | 8.74 (1.53) | 8.76 (1.53) | 8.24 (1.74) | 8.70 (1.17) | 8.64 (1.54) | 8.77 (1.20) | 8.83 (1.28)
Data availability, n (%)
Parent-SDQ | 541 (100.0) | 66 (100.0) | 49 (100.0) | 90 (100.0) | 99 (100.0) | 66 (100.0) | 69 (100.0) | 100 (100.0)
Teacher-SDQ | 445 (82.3) | 66 (100.0) | 29 (59.2) | 73 (81.1) | 95 (96.0) | 66 (100.0) | 23 (33.3) | 91 (91.0)
DAWBA | 541 (100.0) | 66 (100.0) | 49 (100.0) | 90 (100.0) | 99 (100.0) | 66 (100.0) | 69 (100.0) | 100 (100.0)
Parent-SDQ scales, M (SD)
Total difficulties | 13.95 (7.27) | 13.62 (6.25) | 15.27 (7.07) | 12.59 (7.90) | 15.65 (6.65) | 14.47 (7.57) | 14.67 (8.55) | 12.19 (6.43)
Emotional problems | 3.40 (2.65) | 2.91 (2.41) | 3.52 (2.56) | 3.28 (2.86) | 4.02 (2.54) | 3.23 (2.81) | 3.80 (2.72) | 3.00 (2.55)
Conduct problems | 2.64 (2.30) | 2.85 (2.06) | 3.69 (2.20) | 2.31 (2.37) | 3.09 (2.42) | 2.35 (1.96) | 2.91 (2.77) | 1.81 (1.81)
Hyperactivity/inattention | 5.22 (2.87) | 5.63 (2.62) | 5.49 (2.72) | 4.92 (3.16) | 5.67 (2.82) | 5.18 (2.70) | 5.26 (3.25) | 4.60 (2.72)
Peer problems | 2.70 (2.22) | 2.23 (2.26) | 2.57 (2.29) | 2.08 (2.06) | 2.87 (2.34) | 3.71 (2.56) | 2.70 (2.32) | 2.78 (1.73)
Prosocial behavior | 7.43 (2.19) | 7.50 (2.08) | 7.39 (1.92) | 7.90 (1.80) | 6.90 (2.44) | 7.03 (2.27) | 6.20 (2.15) | 8.69 (1.64)
Teacher-SDQ scales, M (SD)
Total difficulties | 10.38 (8.33) | 11.80 (7.80) | 7.62 (8.32) | 8.68 (7.81) | 12.29 (7.98) | 13.97 (7.31) | 4.77 (8.02) | 11.93 (7.94)
Emotional problems | 2.43 (2.59) | 2.64 (2.48) | 1.39 (2.37) | 2.16 (2.78) | 3.21 (2.50) | 3.06 (2.44) | 1.04 (1.97) | 2.83 (2.65)
Conduct problems | 1.93 (2.34) | 2.42 (2.36) | 1.78 (2.44) | 1.32 (1.85) | 2.37 (2.64) | 2.35 (2.41) | .98 (2.10) | 2.17 (2.25)
Hyperactivity/inattention | 3.81 (3.31) | 4.53 (3.08) | 2.88 (3.33) | 3.38 (3.48) | 4.18 (3.17) | 5.12 (2.98) | 1.71 (3.03) | 4.41 (3.06)
Peer problems | 2.20 (2.28) | 2.21 (2.31) | 1.57 (2.26) | 1.82 (2.23) | 2.53 (2.46) | 3.44 (2.10) | 1.03 (1.81) | 2.52 (2.03)
Prosocial behavior | 5.45 (3.49) | 7.21 (2.46) | 3.39 (3.47) | 5.80 (3.52) | 6.57 (2.99) | 6.17 (2.52) | 2.23 (3.45) | 5.63 (3.30)
In the present study, one or two clinicians from each country received specific training directly from the author of the instrument (Dr. Goodman) in order to ensure that the interviews would be properly rated. Additional information on the clinical rating, including an online manual, is available on the instrument's website (http://dawba.info/manual/m0.html). DAWBA diagnoses have been shown to have high inter-rater reliability and to discriminate effectively between clinical and non-clinical cases (Fleitlich-Bilyk & Goodman, 2004; Ford, Goodman, & Meltzer, 2003). Specifically, one study reported that agreement between clinical raters was 0.93 for any disorder, 0.91 for internalizing disorders, and 1.0 for externalizing disorders (Fleitlich-Bilyk & Goodman, 2004). At the time of the study, the DAWBA was available in some of the languages necessary for the study: Dutch, German, Italian, and Lithuanian. The Turkish version had been initiated (Dursun et al., 2013) and was finalized in collaboration with our team and the author of the instrument. The Romanian and Bulgarian versions were prepared following WHO recommendations for instrument translation, which required translation and back-translation. DAWBA clinical diagnoses were categorized into internalizing disorders, which
included all anxiety disorders or depression, and externalizing disorders, which included ADHD, ODD, or CD.

Data Analysis
Internal consistency was examined using Cronbach's α (Cohen, 1977) and its 95% confidence interval (CI), with both overall and country-stratified estimates. For individual scales, we statistically compared alphas between countries using a method developed by Diedenhofen and Musch (2016). In subsequent analyses, the SDQ was considered as the predictor (or questionnaire to be evaluated) and the DAWBA was used to define "true" presence of a disorder (or gold standard). Sensitivity (Se), specificity (Sp), and kappa inter-rater agreement were then calculated. Area under the receiver operating characteristic curve (AUROC), defined here as (Se + Sp)/2, was used to estimate the diagnostic capacity of the SDQ to establish DAWBA cases, regardless of comorbid diagnoses.
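Once the SDQ caseness flag and the DAWBA diagnosis are both coded 0/1, sensitivity, specificity, and the (Se + Sp)/2 quantity used as AUROC here reduce to simple counts. A minimal sketch with fabricated vectors:

```python
import numpy as np

def screening_accuracy(sdq_case: np.ndarray, dawba_case: np.ndarray) -> dict:
    """Sensitivity, specificity, and (Se + Sp)/2 for binary screen and reference flags."""
    tp = int(np.sum((sdq_case == 1) & (dawba_case == 1)))
    fn = int(np.sum((sdq_case == 0) & (dawba_case == 1)))
    tn = int(np.sum((sdq_case == 0) & (dawba_case == 0)))
    fp = int(np.sum((sdq_case == 1) & (dawba_case == 0)))
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return {"Se": round(se, 2), "Sp": round(sp, 2), "AUROC": round((se + sp) / 2, 2)}

# Fabricated example flags, one entry per child
sdq = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 0])
dawba = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0])
print(screening_accuracy(sdq, dawba))
```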
Data Analysis Internal consistency was examined using Cronbach’s α (Cohen, 1977) and its 95% confidence interval (CI) with both overall and country-stratified estimates. For individual scales, we statistically compared alphas between countries using a method developed by Diedenhofen and Musch (2016). In subsequent analyses, the SDQ was considered as the predictor (or questionnaire to be evaluated) and the DAWBA was used to define “true” presence of a disorder (or gold standard). Sensitivity (Se), specificity (Sp), and kappa inter-rater agreement were then calculated. Area under the receiving operator characteristic (AUROC) was used to estimate the diagnostic capacity, defined as (Se + Sp)/2, of the SDQ to establish DAWBA cases, regardless of comorbid diagnoses. Individual country AUROCs were compared using a test of equality from an algorithm Ó 2018 Hogrefe Publishing
Table 2. Internal consistency of the parent- and teacher-SDQ scales by country

Scale | Total (n = 541) | Italy (n = 66) | Germany (n = 49) | The Netherlands (n = 90) | Lithuania (n = 99) | Bulgaria (n = 66) | Romania (n = 69) | Turkey (n = 100)
Parent-SDQ
Total difficulties* | .84 (.82–.86) | .76 (.67–.84) | .85 (.78–.91) | .88 (.84–.91) | .80 (.73–.85) | .87 (.82–.91) | .89 (.85–.92) | .78 (.72–.84)
Emotional problems* | .74 (.71–.77) | .65 (.50–.77) | .74 (.61–.84) | .81 (.75–.87) | .71 (.61–.79) | .81 (.72–.87) | .76 (.65–.84) | .69 (.58–.77)
Conduct problems* | .73 (.69–.76) | .62 (.45–.75) | .74 (.61–.84) | .79 (.71–.85) | .74 (.65–.81) | .65 (.50–.77) | .81 (.72–.87) | .62 (.49–.72)
Hyperactivity/inattention* | .78 (.75–.80) | .65 (.50–.77) | .82 (.72–.89) | .84 (.78–.89) | .77 (.69–.84) | .82 (.74–.88) | .83 (.76–.89) | .72 (.62–.80)
Peer problems* | .60 (.55–.65) | .60 (.42–.73) | .67 (.50–.80) | .67 (.55–.77) | .67 (.55–.76) | .72 (.60–.82) | .62 (.46–.75) | .41 (.21–.58)
Prosocial behavior(a) | .73 (.69–.76) | .55 (.36–.70) | .71 (.55–.82) | .66 (.53–.76) | .75 (.66–.82) | .73 (.61–.82) | .83 (.76–.89) | .62 (.48–.72)
Teacher-SDQ
Total difficulties* | .85 (.83–.87) | .87 (.81–.91) | .85 (.78–.90) | .86 (.81–.90) | .85 (.80–.89) | .85 (.79–.90) | .86 (.81–.91) | .85 (.81–.89)
Emotional problems* | .76 (.73–.79) | .71 (.58–.81) | .85 (.77–.91) | .85 (.80–.90) | .70 (.59–.78) | .79 (.69–.86) | .70 (.58–.80) | .76 (.68–.83)
Conduct problems* | .75 (.72–.78) | .72 (.60–.82) | .75 (.62–.85) | .68 (.56–.77) | .79 (.72–.85) | .75 (.64–.84) | .81 (.73–.87) | .73 (.63–.80)
Hyperactivity/inattention(b) | .81 (.79–.83) | .79 (.70–.86) | .87 (.80–.92) | .90 (.86–.93) | .78 (.70–.84) | .82 (.74–.88) | .82 (.75–.88) | .75 (.67–.82)
Peer problems(c) | .66 (.61–.70) | .70 (.57–.80) | .83 (.74–.89) | .72 (.62–.80) | .70 (.59–.78) | .49 (.27–.66) | .68 (.54–.79) | .51 (.34–.65)
Prosocial behavior(d) | .82 (.79–.84) | .71 (.58–.81) | .80 (.69–.87) | .75 (.66–.82) | .85 (.79–.89) | .83 (.75–.88) | .79 (.70–.86) | .88 (.84–.92)

Notes. SDQ = Strengths and Difficulties Questionnaire. 95% confidence intervals are provided in parentheses. Cronbach's α compared using the method of Diedenhofen and Musch (2016). *No significant differences between countries. (a) Significant difference between Romania and Italy (p = .001). (b) Significant difference between the Netherlands and Turkey (p = .0008). (c) Significant differences between Germany versus Bulgaria (p = .002) and Turkey (p = .002). (d) Significant difference between Italy and Turkey (p = .0008).
Individual country AUROCs were compared using a test of equality from an algorithm developed by DeLong and collaborators (DeLong, DeLong, & Clarke-Pearson, 1988). All analyses were performed for the SDQ first without and then with the impact supplement score taken into account, as suggested by prior research (R. Goodman, 1999); this score reflects the level of impairment associated with symptoms. We also performed additional analyses using only information from the parent-SDQ in predicting DAWBA diagnoses, as the parent-SDQ is often used without the teacher-SDQ, although the reverse is far less likely. Analyses were performed using SPSS (v20.0, IBM Corp., Armonk, NY), Stata (v12.1, College Station, TX), and R (v3.2.0, R Core Team, Vienna, Austria) statistical software. P-values were corrected for multiple hypothesis testing using Šidák's correction, such that a p-value < .0024 was considered significant when comparing between countries (for 21 comparisons overall) and < .0073 when individual countries were compared to all other countries (for seven comparisons overall).
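The two Šidák-adjusted thresholds quoted above follow directly from the number of comparisons; a quick check:

```python
def sidak_threshold(alpha: float, m: int) -> float:
    """Per-comparison p-value threshold under the Sidak correction for m tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

print(round(sidak_threshold(0.05, 21), 4))  # ~.0024 for the 21 between-country comparisons
print(round(sidak_threshold(0.05, 7), 4))   # ~.0073 for one country versus all others
```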
Results

Internal Consistency
Internal consistency reliability estimates are presented in Table 2 for the parent- and teacher-SDQ, both overall and separately by country. Overall, the results show that the internal consistency of the total difficulties scale for
the parent-SDQ (.84) is close to that of the teacher-SDQ (.85). The alphas in the overall sample for specific scales follow a similar pattern for parent- and teacher-versions with peer problems having the lowest αs, with .60 for parents and .66 for teachers. All other scales have αs greater than .70 for both parent- and teacher-versions. Six countries display a similar pattern, in which alphas from the parent-SDQ are the lowest for peer problems. One exception is Italy, where the alpha is lower for prosocial behavior than for peer problems. The highest alphas are obtained in the total difficulties score for all countries, except Turkey with a greater alpha for the hyperactivity/ inattention subscale. Overall there were no significant between-country differences in parent-SDQ alphas for most scales with the exception of prosocial behavior. For the teacher-version, again, the majority of countries have the lowest alphas in the peer problems subscale, with the exception of Germany, the Netherlands, and Turkey with the lowest alphas in conduct problems. Turkey has higher alphas for the teacher-version as compared to the alphas obtained in the parent-version for each scale.
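Table 2 pairs every α with a 95% confidence interval. One simple, assumption-light way to obtain such an interval is a bootstrap over respondents, sketched below on simulated data (this illustrates the general idea and is not necessarily the interval construction used in the article).

```python
import numpy as np

def cronbach_alpha(x: np.ndarray) -> float:
    k = x.shape[1]
    return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

def bootstrap_alpha_ci(x: np.ndarray, n_boot: int = 2000, seed: int = 0):
    """Percentile 95% CI for alpha, resampling respondents with replacement."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    boots = [cronbach_alpha(x[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    return tuple(np.round(np.percentile(boots, [2.5, 97.5]), 2))

# Simulated 0-2 responses for a five-item subscale
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
items = np.clip(np.round(latent + rng.normal(scale=0.8, size=(300, 5)) + 1), 0, 2)
print(round(cronbach_alpha(items), 2), bootstrap_alpha_ci(items))
```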
Capacity of SDQ Cases Alone to Discriminate DAWBA Cases From Non-Cases
Table 3. Accuracy of parent- or teacher-SDQ cases without impact in predicting behavioral disorders across countries

Classification probabilities | SDQ cases(b) | DAWBA cases(c) | Se | Sp | AUC | 95% CI | κ | SE
Any disorder (n = 541) | 366 | 307 | .88 | .59 | .74 | .69–.78 | .49 | .04
Italy (n = 66) | 45 | 43 | .79 | .52 | .66 | .51–.80 | .32 | .12
Germany (n = 49) | 34 | 20 | 1.00 | .52 | .76 | .62–.89 | .48 | .10
The Netherlands (n = 90) | 56 | 34 | .97 | .59 | .78 | .68–.87 | .50 | .08
Lithuania (n = 99) | 80 | 76 | .89 | .48 | .69 | .55–.82 | .40 | .11
Bulgaria (n = 66) | 42 | 43 | .86 | .78 | .82 | .71–.94 | .64 | .10
Romania (n = 69) | 43 | 35 | .94 | .70 | .82 | .72–.93 | .65 | .09
Turkey (n = 100) | 66 | 56 | .82 | .54 | .68 | .57–.79 | .38 | .09
Internalizing disorders (n = 541) | 215 | 179 | .67 | .74 | .70 | .65–.75 | .39 | .04
Italy (n = 66) | 19 | 24 | .50 | .83 | .67 | .52–.81 | .35 | .12
Germany (n = 49)(a) | 21 | 3 | – | – | – | – | – | –
The Netherlands (n = 90) | 30 | 18 | .78 | .78 | .78 | .65–.90 | .44 | .10
Lithuania (n = 99) | 53 | 60 | .65 | .64 | .65 | .53–.76 | .28 | .10
Bulgaria (n = 66) | 22 | 26 | .54 | .80 | .67 | .53–.81 | .35 | .12
Romania (n = 69) | 30 | 15 | .93 | .70 | .82 | .71–.93 | .47 | .10
Turkey (n = 100) | 40 | 33 | .73 | .76 | .74 | .64–.85 | .46 | .09
Externalizing disorders (n = 541) | 293 | 205 | .91 | .68 | .80 | .76–.84 | .55 | .03
Italy (n = 66) | 37 | 28 | .93 | .71 | .82 | .71–.92 | .61 | .09
Germany (n = 49) | 30 | 17 | 1.00 | .60 | .80 | .67–.92 | .50 | .10
The Netherlands (n = 90) | 45 | 25 | .92 | .66 | .79 | .69–.89 | .47 | .08
Lithuania (n = 99) | 64 | 57 | .91 | .71 | .81 | .72–.91 | .64 | .08
Bulgaria (n = 66) | 34 | 24 | .92 | .71 | .81 | .71–.92 | .58 | .09
Romania (n = 69) | 33 | 28 | .93 | .83 | .88 | .79–.97 | .74 | .08
Turkey (n = 100) | 50 | 26 | .81 | .61 | .71 | .60–.82 | .32 | .08
ADHD/Hyperactivity (n = 541) | 234 | 147 | .88 | .73 | .80 | .76–.84 | .51 | .04
Italy (n = 66) | 31 | 24 | .92 | .78 | .85 | .75–.95 | .66 | .09
Germany (n = 49) | 24 | 12 | 1.00 | .67 | .84 | .73–.95 | .50 | .11
The Netherlands (n = 90) | 37 | 14 | 1.00 | .70 | .85 | .77–.93 | .42 | .08
Lithuania (n = 99) | 48 | 48 | .77 | .78 | .78 | .68–.87 | .55 | .08
Bulgaria (n = 66) | 29 | 16 | .87 | .70 | .79 | .66–.91 | .45 | .10
Romania (n = 69) | 29 | 23 | .91 | .83 | .87 | .78–.96 | .69 | .09
Turkey (n = 100) | 36 | 10 | .90 | .70 | .80 | .67–.93 | .28 | .08
ODD/Conduct disorders (n = 541) | 222 | 128 | .85 | .72 | .78 | .74–.83 | .46 | .04
Italy (n = 66) | 26 | 15 | .80 | .72 | .76 | .62–.90 | .42 | .11
Germany (n = 49)(a) | 26 | 6 | – | – | – | – | – | –
The Netherlands (n = 90) | 29 | 11 | .82 | .75 | .78 | .64–.93 | .33 | .10
Lithuania (n = 99) | 48 | 47 | .85 | .85 | .85 | .77–.93 | .70 | .07
Bulgaria (n = 66) | 27 | 16 | .81 | .72 | .77 | .63–.90 | .43 | .11
Romania (n = 69) | 26 | 21 | .90 | .85 | .88 | .79–.97 | .71 | .09
Turkey (n = 100) | 40 | 12 | .92 | .67 | .79 | .68–.91 | .29 | .08

Notes. SDQ = Strengths and Difficulties Questionnaire; ODD = Oppositional Defiant Disorder; ADHD = Attention Deficit/Hyperactivity Disorder; DAWBA = Development and Well-Being Assessment; Se = Sensitivity; Sp = Specificity; AUC = Area Under the Curve; SE = Standard Error. (a) Germany was excluded due to the small number of DAWBA cases. (b) Parent- or teacher-SDQ cases represent children for whom either the parent or the teacher reported a score above the cutoff for abnormality on the SDQ subscale of interest. (c) DAWBA cases represent children who received a clinical diagnosis regarding each of the disorders examined.
Table 3 presents the capacity of the SDQ cases, without taking into account the SDQ-reported impact of the disorder, to discriminate cases and non-cases of DAWBA clinical diagnoses regarding any disorder, internalizing disorders, externalizing disorders, ADHD, and ODD. For any disorder, the overall AUC value is .74. No significant differences in AUCs between countries are present
(p = .09). The best AUC values are obtained for externalizing disorders, with .80 overall, ranging from .71 to .88 across the seven countries with no significant differences (p = .2). Overall, internalizing disorders, as predicted by
SDQ cases of emotional problems, are associated with lower AUC values at .70, ranging from .65 to .82, again with no significant differences between countries (p = .09). When SDQ emotional disorder cases are computed using emotional problems or peer relation problems rather than emotional problems alone, AUC values are substantially lower (.66 overall). Regarding specific externalizing problems, ADHD obtains good AUC values at .80 overall, ranging from .78 in Lithuania to .87 in Romania with no significant between-country differences (p = .7). Conduct problems also show good AUC values with .78 overall, ranging from .76 in Italy to .88 in Romania, again with no significant between-country differences (p = .4).
Capacity of SDQ Cases With Impact Supplement to Discriminate DAWBA Cases From Non-Cases
Table 4 presents the capacity of SDQ cases to discriminate cases and non-cases of DAWBA clinical diagnoses when the SDQ-reported impact of the disorder is taken into account. Overall, SDQ cases with impact yield AUC values of .72 for any disorder, .61 for internalizing disorders, and .78 for externalizing disorders. AUC values for any disorder range from .71 to .83 between countries, with the exception of Turkey (.61). A significantly lower AUC was observed in Turkey (p = .005) compared to all other countries. As was seen with SDQ cases without impact, the capacity of the SDQ to discriminate is greater for externalizing disorders than for internalizing disorders. For internalizing disorders, the overall ROC comparison does not yield significant between-country differences (p = .16). For externalizing disorders, only Romania differed significantly from all other countries, with a higher AUC (p < .001). With regard to the externalizing disorder subscales, the AUC in diagnosing hyperactivity was significantly higher in Romania (p < .001) compared to all other countries. The AUCs for diagnosing conduct disorder are comparable across countries, yet Romania had a significantly higher AUC compared to all other countries (p < .001).
Sensitivity and Specificity
Overall, the sensitivity of SDQ cases for any disorder without impairment is .88, highest for externalizing disorders (.91) and lowest for internalizing problems (.67). Specificity is generally lower, at .59 for any disorder, .74 for internalizing problems, and .68 for externalizing disorders.
SDQ cases with impairment yield substantially lower sensitivity with .60 for any disorder, .69 for externalizing disorders, and .39 for internalizing disorders. However, the specificity is higher than what was observed without impact scores: .84 for any disorder, .83 for internalizing disorders, and .86 for externalizing disorders.
Agreement Between SDQ Cases and DAWBA Cases
For any disorder, when SDQ cases without impairment are considered, overall κ is .49 (SE = .04). Country-specific analyses show that kappas in Italy, Turkey, and Lithuania are lower than in Bulgaria and Romania (Table 3). However, the kappas in Germany and the Netherlands are not substantially different from those of all other countries. For internalizing problems, the κs range from .28 (SE = .10) in Lithuania to .47 (SE = .10) in Romania, and no substantial between-country differences are observed. Externalizing disorders yield an overall κ of .55 (SE = .03). Turkey obtains lower kappas than Bulgaria, Italy, Lithuania, and Romania. The Netherlands has a lower kappa compared to Lithuania or Romania. The ADHD and CD subscales obtain overall κs of .51 (SE = .04) and .46 (SE = .04), respectively, with evidence of cross-national differences between the lower and higher kappas. When impairment is considered (Table 4), the general performance of the SDQ is lower, with κs of .41 (SE = .04) for any disorder, ranging from .21 in Turkey to .65 in Romania. For internalizing disorders, κs are low overall (.23), ranging from .14 in Turkey to .40 in Bulgaria and Romania. For externalizing disorders, κs are .56 (SE = .04) overall and vary widely between countries (.31 in Turkey to .88 in Romania).
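Cohen's κ, used throughout this section, contrasts observed SDQ-DAWBA agreement with the agreement expected by chance. A compact sketch with fabricated flags:

```python
import numpy as np

def cohens_kappa(a: np.ndarray, b: np.ndarray) -> float:
    """Unweighted Cohen's kappa for two binary (0/1) classifications of the same children."""
    p_observed = float(np.mean(a == b))
    p_chance = np.mean(a) * np.mean(b) + (1 - np.mean(a)) * (1 - np.mean(b))
    return (p_observed - p_chance) / (1 - p_chance)

sdq = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
dawba = np.array([1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
print(round(cohens_kappa(sdq, dawba), 2))
```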
Accuracy of Parent-SDQ Cases Without and With Impact in Predicting DAWBA Cases
Table 5 replicates Tables 3 and 4 for the main categories of disorders in the overall population using the parent-SDQ cases only. The results without impact generally show greater agreement between the SDQ and the DAWBA, with kappas slightly higher than those using parent- or teacher-SDQ cases (.52 for any disorder and .58 for externalizing disorders). The results also show greater sensitivity in detecting any disorder, internalizing disorders, or externalizing disorders. This is also true for the SDQ cases with impact, yet specificity is only greater for any disorder with and without impact, and for externalizing disorders without impact.
Table 4. Accuracy of parent- or teacher-SDQ cases with impact in predicting behavioral disorders across countries

Classification probabilities | SDQ cases with impact(b) | DAWBA cases(c) | Se | Sp | AUC | 95% CI | κ | SE
Any disorder (n = 541) | 221 | 307 | .60 | .84 | .72 | .67–.76 | .41 | .04
Italy (n = 66) | 24 | 43 | .51 | .91 | .71 | .59–.84 | .36 | .09
Germany (n = 49) | 34 | 20 | 1.00 | .52 | .76 | .62–.89 | .47 | .10
The Netherlands (n = 90) | 39 | 34 | .79 | .78 | .79 | .69–.89 | .56 | .09
Lithuania (n = 99) | 43 | 76 | .54 | .91 | .73 | .62–.83 | .30 | .07
Bulgaria (n = 66) | 30 | 43 | .67 | .96 | .81 | .71–.92 | .56 | .09
Romania (n = 69) | 29 | 35 | .74 | .91 | .83 | .72–.93 | .65 | .09
Turkey (n = 100) | 22 | 56 | .32 | .91 | .61 | .51–.72 | .21 | .07
Internalizing disorders (n = 541) | 132 | 179 | .39 | .83 | .61 | .56–.66 | .23 | .04
Italy (n = 66) | 9 | 24 | .25 | .93 | .59 | .44–.74 | .21 | .11
Germany (n = 49)(a) | 21 | 3 | – | – | – | – | – | –
The Netherlands (n = 90) | 23 | 18 | .55 | .82 | .69 | .54–.84 | .34 | .11
Lithuania (n = 99) | 27 | 60 | .35 | .85 | .60 | .49–.71 | .17 | .08
Bulgaria (n = 66) | 18 | 26 | .50 | .87 | .69 | .55–.82 | .40 | .11
Romania (n = 69) | 21 | 15 | .67 | .80 | .73 | .58–.88 | .40 | .12
Turkey (n = 100) | 13 | 33 | .21 | .91 | .56 | .44–.68 | .14 | .09
Externalizing disorders (n = 541) | 189 | 205 | .69 | .86 | .78 | .73–.82 | .56 | .04
Italy (n = 66) | 22 | 28 | .68 | .92 | .80 | .68–.92 | .62 | .09
Germany (n = 49) | 29 | 17 | 1.00 | .62 | .81 | .69–.93 | .54 | .10
The Netherlands (n = 90) | 31 | 25 | .76 | .81 | .79 | .68–.90 | .54 | .09
Lithuania (n = 99) | 40 | 57 | .63 | .90 | .77 | .67–.86 | .51 | .08
Bulgaria (n = 66) | 25 | 24 | .71 | .81 | .76 | .63–.89 | .51 | .10
Romania (n = 69) | 24 | 28 | .86 | 1.00 | .93 | .85–1.00 | .88 | .06
Turkey (n = 100) | 18 | 26 | .38 | .89 | .64 | .50–.77 | .31 | .11
ADHD/Hyperactivity (n = 541) | 157 | 147 | .68 | .85 | .77 | .72–.82 | .52 | .04
Italy (n = 66) | 19 | 24 | .67 | .93 | .80 | .67–.92 | .62 | .10
Germany (n = 49) | 19 | 12 | .83 | .76 | .79 | .65–.94 | .49 | .13
The Netherlands (n = 90) | 27 | 14 | .93 | .81 | .87 | .78–.97 | .54 | .10
Lithuania (n = 99) | 31 | 48 | .52 | .88 | .70 | .60–.81 | .41 | .09
Bulgaria (n = 66) | 23 | 16 | .75 | .78 | .76 | .62–.90 | .46 | .11
Romania (n = 69) | 23 | 23 | .87 | .93 | .90 | .81–.99 | .80 | .08
Turkey (n = 100) | 13 | 10 | .40 | .90 | .65 | .45–.85 | .26 | .14
ODD/Conduct disorders (n = 541) | 150 | 128 | .65 | .84 | .75 | .69–.80 | .47 | .04
Italy (n = 66) | 16 | 15 | .67 | .88 | .77 | .62–.93 | .54 | .12
Germany (n = 49)(a) | 25 | 6 | – | – | – | – | – | –
The Netherlands (n = 90) | 22 | 11 | .64 | .81 | .72 | .55–.90 | .31 | .12
Lithuania (n = 99) | 31 | 47 | .60 | .94 | .77 | .67–.87 | .55 | .08
Bulgaria (n = 66) | 20 | 16 | .62 | .80 | .71 | .56–.87 | .39 | .12
Romania (n = 69) | 21 | 21 | .90 | .96 | .93 | .85–1.00 | .86 | .07
Turkey (n = 100) | 14 | 12 | .42 | .90 | .66 | .47–.84 | .29 | .13

Notes. SDQ = Strengths and Difficulties Questionnaire; ODD = Oppositional Defiant Disorder; ADHD = Attention Deficit/Hyperactivity Disorder; DAWBA = Development and Well-Being Assessment; Se = Sensitivity; Sp = Specificity; AUC = Area Under the Curve; SE = Standard Error. (a) Germany was excluded due to the small number of DAWBA cases. (b) Parent- or teacher-SDQ cases represent children for whom either the parent or the teacher reported a score above the cutoff for abnormality on the SDQ subscale of interest and on the impact supplement. (c) DAWBA cases represent children who received a clinical diagnosis regarding each of the disorders examined.
Table 5. Accuracy of parent-SDQ cases without and with impact in predicting behavioral disorders with parent informant only across countries (n = 110)

Without impact | Parent-SDQ cases(b) | DAWBA cases(c) | Se | Sp | AUC | 95% CI | κ | SE
Any disorder | 64 | 45 | .91 | .65 | .78 | .69–.87 | .52 | .07
Internalizing disorders | 41 | 18 | .83 | .72 | .77 | .66–.89 | .36 | .08
Externalizing disorders | 53 | 36 | .92 | .73 | .82 | .74–.90 | .58 | .07
ADHD/Hyperactivity | 38 | 27 | .85 | .82 | .84 | .74–.93 | .59 | .08
ODD/Conduct disorders | 42 | 22 | .86 | .74 | .80 | .70–.90 | .45 | .08

With impact | Parent-SDQ cases with impact(b) | DAWBA cases(c) | Se | Sp | AUC | 95% CI | κ | SE
Any disorder | 46 | 45 | .71 | .78 | .75 | .65–.84 | .49 | .08
Internalizing disorders | 18 | 32 | .61 | .77 | .69 | .55–.83 | .29 | .10
Externalizing disorders | 41 | 36 | .78 | .82 | .80 | .71–.89 | .58 | .08
ADHD/Hyperactivity | 31 | 27 | .74 | .87 | .80 | .70–.91 | .58 | .09
ODD/Conduct disorders | 35 | 22 | .77 | .79 | .78 | .67–.90 | .46 | .09

Notes. SDQ = Strengths and Difficulties Questionnaire; ODD = Oppositional Defiant Disorder; ADHD = Attention Deficit/Hyperactivity Disorder; DAWBA = Development and Well-Being Assessment; Se = Sensitivity; Sp = Specificity; AUC = Area Under the Curve; SE = Standard Error. (b) Parent-SDQ cases represent children for whom the parent reported a score above the cutoff for abnormality on the SDQ subscale of interest (and, where indicated, on the impact supplement). (c) DAWBA cases represent children who received a clinical diagnosis regarding each of the disorders examined.
Discussion
The present study aimed to investigate the psychometric properties of the parent- and teacher-SDQ in samples of children aged 5–12 years across seven European countries and its capacity to discriminate cases from non-cases in comparable samples. Several important findings were obtained. First, the internal consistency of the SDQ in the overall sample was satisfactory for most subscales, with the exception of the peer problems subscale. Second, variations in the internal consistency of parent and teacher subscales were observed across countries. Third, the results showed the ability of the SDQ to adequately discriminate cases from non-cases of disorder, in particular for externalizing disorders. Importantly, this held true across all countries considered. Finally, using the impact score of the SDQ yielded mixed results and was associated with between-country variation in the ability of the SDQ to discriminate cases from non-cases. The internal consistency established in the present sample for both parent- and teacher-SDQ was very similar to what has been observed with the original version of the SDQ (R. Goodman, 2001). A 2010 review of the psychometric properties of the SDQ summarizing 48 studies also reported similar results, with the peer problems subscale as the only scale with α below .70 in both the parent- and teacher-SDQ (Stone et al., 2010). However, unlike what was reported in the review, the emotional, conduct, and prosocial subscales of the parent-version all had α greater than or equal to .70, suggesting adequate internal consistency of the parent-version with regard to these specific subscales. Furthermore, while the review suggested marked
differences in the internal consistency of the parent- and teacher-versions, our results suggest that the two versions were very similar, with the exception of prosocial behavior, which had good internal consistency for teachers and .70 for parents. As in prior work (Alyahri & Goodman, 2006), the lowest internal consistency was observed in the peer problems subscale, in particular in the parent-version and to a lesser degree in the teacher-version, a pattern observed previously (R. Goodman, 2001). In the parent-version, while most countries behaved in a similar pattern, Turkey and Italy stood out with lower alphas, in particular very poor consistency for peer problems. In the teacher-version, both countries yielded higher alphas, though Turkey obtained poor alphas on the conduct and peer problems subscales. Prior research using the Italian version of the teacher-SDQ had shown internal consistency ranging from .73 to .89 (Marzocchi et al., 2004). One study of the Turkish version of the parent-SDQ in a somewhat older population (children aged 4–16 years) revealed quite similar results: alphas of .84 for the total difficulties score, .73 for emotional problems, .80 for ADHD, .73 for prosocial behaviors, .65 for conduct problems, and .37 for peer problems (Güvenir et al., 2008). The results suggest that further work is needed on the use of the instrument in Turkey in order to better understand what prevented it from performing as well as it did in other populations. Furthermore, differences across countries may be interpreted in light of cultural differences, methodological issues stemming from country samples, or issues in the transcultural adaptation of instruments (Gudmundsson, 2009; Van Widenfelt, Treffers, De Beurs, Siebelink, & Koudijs,
2005). Taken together, these results also suggest that it may be preferable to exclude peer problems from the estimation of internalizing disorders, as suggested in previous work showing that using higher-order categories may be useful in certain situations but not in others (A. Goodman, Lamping, & Ploubidis, 2010). The present findings show that the SDQ's externalizing problems subscales were good indicators of the presence of a clinical diagnosis of externalizing disorder. The performance of both the hyperactivity and the conduct problems subscales was good in identifying related disorders. In the case of internalizing disorders, however, the performance of the emotional subscale was not satisfactory. The superiority of the externalizing problems scale has been discussed in prior research (Becker et al., 2004). The poor performance of the SDQ internalizing disorders scale, and evidence that the Child Behavior Checklist's (CBCL; Achenbach & Edelbrock, 1983) composite internalizing problems score may perform better in the detection of internalizing disorders (Becker et al., 2004), may suggest the need to complement the SDQ with a more thorough assessment of internalizing symptoms. This poor performance of the internalizing subscale also further raises the issue of informant discrepancies in the reporting of child psychopathology (De Los Reyes et al., 2015). As children have been shown to report internalizing problems more frequently than do parents or teachers (Kuijpers, Otten, Krol, Vermulst, & Engels, 2013), self-reported screening is therefore important to capture internalizing disorders, which tend to attract less attention from parents and teachers. It is important to note that the data from Germany could not be used to calculate the capacity of the SDQ to discriminate cases from non-cases, as there were too few cases of internalizing disorders identified by the DAWBA in that sample. That being said, studies conducted in German samples have demonstrated that the SDQ externalizing problems scale is a good predictor of externalizing disorders (Becker et al., 2004). Importantly, the ability of the SDQ to discriminate cases from non-cases did not vary significantly between countries. This finding fully supports the use of the SDQ in the countries examined despite the differences observed in internal consistency. Overall, adding the impact score did not significantly improve the ability of the SDQ to identify cases. Including the impact supplement has been discussed in prior research and has been shown to improve the capacity of the SDQ to discriminate community from psychiatric samples (R. Goodman, 1999). In clinical samples, however, it may not be necessary to include impact, as it is more likely than not to be significant, given that help-seeking for a behavioral or emotional problem was prompted by the burden of the problem on the child and/or his or her family. Another possible explanation for this finding is that the assessment of impairment adds
subjective estimates which do not override the strength of the SDQ subscales taken separately. In light of these results, it may not be useful to use the impact scale of the SDQ when screening children and young adolescents for probable mental disorders. However, future research should examine this question further in larger samples. Importantly, when the impact supplement was used, the ability of the SDQ to discriminate cases from non-cases varied significantly between countries for any disorder and for externalizing disorders. Taken together, these findings suggest that it may be best to use the standard SDQ when performing cross-national comparisons. Finally, the present study sought to examine the effect of using information from the same informant for the SDQ and the DAWBA, by selecting the subsample of individuals for whom the DAWBA was completed only by the parent and not by the teacher and comparing it with the parent version of the SDQ. Our results showed that, although the performance of the SDQ was slightly better in this subsample, with or without impairment, the performance of the emotional problems subscale remained insufficient. However, it has been shown that when DAWBA information from both informants is available, clinicians are 6% more likely to diagnose a disorder (Meltzer et al., 2000), suggesting that our results regarding parent-only SDQ and DAWBA may have been affected by this phenomenon. Furthermore, the results regarding the parent-SDQ and parent-DAWBA should be interpreted with caution, as 43.6% of these data were from Romania, 28.2% from Germany, and 20.0% from the Netherlands, while the other countries had very few to no cases. Certain limitations should be considered when interpreting the results. First, the sample sizes, though reasonable, might not have been sufficient for some of the analyses presented. In post hoc simulations, statistical comparisons between the countries with the largest absolute differences in AUROCs did not reach power > 0.80 to detect a significant difference for any disorder, emotional disorders, or externalizing disorders. Larger samples would be needed to firmly establish any between-country differences. Second, a portion of the teacher measures was not completed for children for whom parent data were available. Having data on all three informants for every participant is likely to have improved the SDQ's ability to identify cases (R. Goodman, Ford, Simmons, et al., 2000). This suggests that our data likely underestimated rather than overestimated the capacity of the SDQ to correctly identify cases. Nevertheless, the present findings should be replicated in larger clinical samples across Europe. Third, our clinical sample excluded children diagnosed with psychotic disorders, autism spectrum disorders, or intellectual disabilities. Consequently, the present findings do not extend to children with these
conditions. The decision not to include children with these disorders was based on the fact that the SDQ was not designed to identify disorders among children who already suffer from severe psychiatric disorders. Finally, the selection procedures for non-cases might not have been representative of the target population. The present study, however, was designed to determine the adequacy of the SDQ as an instrument to screen for externalizing or internalizing problems among children, so that it could later be used in general population settings in the context of large cross-national surveys. To conclude, these results further encourage the use of this short instrument to perform cross-national comparisons of the presence of probable mental disorders in children aged 5–12 years, in particular for externalizing disorders. Cross-national comparisons using the SDQ should nonetheless be cautious when considering the emotional problems subscale, as its ability to adequately identify cases of internalizing disorders in the present international sample was only moderate. The present study also adds to the available research supporting the use of the SDQ to detect probable cases of psychiatric disorders in children in seven European countries. The diversity of the countries studied here, added to prior work in the UK and Nordic countries, underlines the potential of this instrument for use across Europe to provide meaningful comparisons of risk factors and access to care in countries where mental health resources vary greatly (Kovess-Masfety et al., 2017).
Acknowledgment
This study was funded by the European Union, Grant Number 2006336 (V. Kovess-Masfety).
References
Achenbach, T. M., & Edelbrock, C. S. (1983). Manual for the child behavior checklist and revised child behavior profile. Burlington, VT: Department of Psychiatry, University of Vermont. Alyahri, A., & Goodman, R. (2006). Validation of the Arabic Strengths and Difficulties Questionnaire and the development and well-being assessment. Eastern Mediterranean Health Journal, 12(2), 138–146. American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: American Psychiatric Association. Becker, A., Woerner, W., Hasselhorn, M., Banaschewski, T., & Rothenberger, A. (2004). Validation of the parent and teacher SDQ in a clinical sample. European Child & Adolescent Psychiatry, 13, ii11–ii16. https://doi.org/10.1007/s00787-004-2003-5 Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Revised ed.). New York, NY: Academic Press. Costello, E. J., Mustillo, S., Erkanli, A., Keeler, G., & Angold, A. (2003). Prevalence and development of psychiatric disorders in
childhood and adolescence. Archives of General Psychiatry, 60, 837–844. https://doi.org/10.1001/archpsyc.60.8.837 De Los Reyes, A., Augenstein, T. M., Wang, M., Thomas, S. A., Drabick, D. A., Burgers, D. E., & Rabinowitz, J. (2015). The validity of the multi-informant approach to assessing child and adolescent mental health. Psychological Bulletin, 141, 858. https://doi.org/10.1037/a0038498 DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics, 43, 837–845. Di Riso, D., Salcuni, S., Chessa, D., Raudino, A., Lis, A., & Altoè, G. (2010). The Strengths and Difficulties Questionnaire (SDQ): Early evidence of its reliability and validity in a community sample of Italian children. Personality and Individual Differences, 49, 570–575. https://doi.org/10.1016/j.paid.2010.05.005 Diedenhofen, B., & Musch, J. (2016). Cocron: A web interface and R package for the statistical comparison of Cronbach's alpha coefficients. International Journal of Internet Science, 11, 51–60. Dursun, O., Guvenir, T., Aras, S., Ergin, C., Mutlu, C., Baydur, H., . . . Iscanli, L. (2013). A new diagnostic approach for Turkish-speaking populations: DAWBA Turkish version. Epidemiology and Psychiatric Sciences, 22, 275–282. https://doi.org/10.1017/S2045796012000479 Fleitlich-Bilyk, B., & Goodman, R. (2004). Prevalence of child and adolescent psychiatric disorders in southeast Brazil. Journal of the American Academy of Child and Adolescent Psychiatry, 43, 727–734. https://doi.org/10.1097/01.chi.0000120021.14101.ca Ford, T., Goodman, R., & Meltzer, H. (2003). The British child and adolescent mental health survey 1999: The prevalence of DSM-IV disorders. Journal of the American Academy of Child and Adolescent Psychiatry, 42, 1203–1211. https://doi.org/10.1097/00004583-200310000-00011 Gintilienė, G., Girdzijauskienė, S., Černiauskaitė, D., Lesinskienė, S., Povilaitis, R., & Pūras, D. (2004). Lietuviškas SDQ – standartizuotas mokyklinio amžiaus vaikų „galių ir sunkumų klausimynas“ [Lithuanian SDQ – The standardized Strengths and Difficulties Questionnaire]. Psichologija, 29, 88–105. Goodman, A., Heiervang, E., Fleitlich-Bilyk, B., Alyahri, A., Patel, V., Mullick, M. I., . . . Goodman, R. (2012). Cross-national differences in questionnaires do not necessarily reflect comparable differences in disorder prevalence. Social Psychiatry and Psychiatric Epidemiology, 47, 1321–1331. https://doi.org/10.1007/s00127-011-0440-2 Goodman, A., Lamping, D. L., & Ploubidis, G. B. (2010). When to use broader internalising and externalising subscales instead of the hypothesised five subscales on the Strengths and Difficulties Questionnaire (SDQ): Data from British parents, teachers and children. Journal of Abnormal Child Psychology, 38, 1179–1191. https://doi.org/10.1007/s10802-010-9434-x Goodman, R. (1997). The Strengths and Difficulties Questionnaire: A research note. Journal of Child Psychology and Psychiatry, 38, 581–586. https://doi.org/10.1111/j.1469-7610.1997.tb01545.x Goodman, R. (1999). The extended version of the Strengths and Difficulties Questionnaire as a guide to child psychiatric caseness and consequent burden. Journal of Child Psychology and Psychiatry, 40, 791–799. Goodman, R. (2001). Psychometric properties of the Strengths and Difficulties Questionnaire. Journal of the American Academy of Child and Adolescent Psychiatry, 40, 1337–1345. https://doi.org/10.1097/00004583-200111000-00015 Goodman, R., Ford, T., Richards, H., Gatward, R., & Meltzer, H. (2000). The development and well-being assessment: Description and initial validation of an integrated assessment of child and adolescent psychopathology. Journal of Child Psychology and Psychiatry, 41, 645–655.
Goodman, R., Ford, T., Simmons, H., Gatward, R., & Meltzer, H. (2000). Using the Strengths and Difficulties Questionnaire (SDQ) to screen for child psychiatric disorders in a community sample. The British Journal of Psychiatry, 177, 534–539. Gudmundsson, E. (2009). Guidelines for translating and adapting psychological instruments. Nordic Psychology, 61, 29–45. Güvenir, T., Özbek, A., Baykara, B., Arkar, H., Şentürk, B., & İncekaş, S. (2008). Psychometric properties of the Turkish version of the Strengths and Difficulties Questionnaire (SDQ). Çocuk ve Gençlik Ruh Sağlığı Dergisi/Turkish Journal of Child and Adolescent Mental Health, 15, 65–74. Klasen, H., Woerner, W., Rothenberger, A., & Goodman, R. (2003). German version of the Strengths and Difficulties Questionnaire (SDQ-German) – Overview and evaluation of initial validation and normative results. Praxis der Kinderpsychologie und Kinderpsychiatrie, 52, 491–502. Klasen, H., Woerner, W., Wolke, D., Meyer, R., Overmeyer, S., Kaschnitz, W. E., . . . Goodman, R. (2000). Comparing the German versions of the Strengths and Difficulties Questionnaire (SDQ-Deu) and the Child Behavior Checklist. European Child & Adolescent Psychiatry, 9, 271–276. Koskelainen, M., Sourander, A., & Kaljonen, A. (2000). The Strengths and Difficulties Questionnaire among Finnish school-aged children and adolescents. European Child & Adolescent Psychiatry, 9, 277–284. Kovess-Masfety, V., Husky, M. M., Keyes, K., Hamilton, A., Pez, O., Bitfoi, A., . . . Mihova, Z. (2016). Comparing the prevalence of mental health problems in children 6–11 across Europe. Social Psychiatry and Psychiatric Epidemiology, 51, 1093–1103. https://doi.org/10.1007/s00127-016-1253-0 Kovess-Masfety, V., Van Engelen, J., Stone, L., Otten, R., Carta, M. G., Bitfoi, A., . . . Mihova, Z. (2017). Unmet need for specialty mental health services among children across Europe. Psychiatric Services, 68, 789–795. https://doi.org/10.1176/appi.ps.201600409 Kovess, V., Carta, M. G., Pez, O., Bitfoi, A., Koç, C., Goelitz, D., . . . Otten, R. (2015). The school children mental health in Europe (SCMHE) project: Design and first results. Clinical Practice and Epidemiology in Mental Health, 11, 113–123. https://doi.org/10.2174/1745017901511010113 Kuijpers, R. C., Otten, R., Krol, N. P., Vermulst, A. A., & Engels, R. C. (2013). The reliability and validity of the Dominic Interactive: A computerized child report instrument for mental health problems. Child & Youth Care Forum, 42(1), 35–52. https://doi.org/10.1007/s10566-012-9185-7 Marzocchi, G., Capron, C., Di Pietro, M., Duran Tauleria, E., Duyme, M., Frigerio, A., . . . Thérond, C. (2004). The use of the Strengths and Difficulties Questionnaire (SDQ) in southern European countries. European Child & Adolescent Psychiatry, 13, ii40–ii46. https://doi.org/10.1007/s00787-004-2007-1 Meltzer, H., Gatward, R., Goodman, R., & Ford, T. (2000). Mental health of children and adolescents in Great Britain. London, UK: TSO. Muris, P., Meesters, C., & van den Berg, F. (2003). The Strengths and Difficulties Questionnaire (SDQ). European Child & Adolescent Psychiatry, 12, 1–8. https://doi.org/10.1007/s00787-003-0298-2
Niclasen, J., Teasdale, T. W., Andersen, A.-M. N., Skovgaard, A. M., Elberling, H., & Obel, C. (2012). Psychometric properties of the Danish Strengths and Difficulties Questionnaire: The SDQ assessed for more than 70,000 raters in four different cohorts. PLoS One, 7, e32025. https://doi.org/10.1371/journal.pone.0032025 Rønning, J., Handegaard, B., Sourander, A., & Mørch, W.-T. (2004). The Strengths and Difficulties Self-Report Questionnaire as a screening instrument in Norwegian community samples. European Child & Adolescent Psychiatry, 13, 73–82. https://doi.org/10.1007/s00787-004-0356-4 Shojaei, T., Wazana, A., Pitrou, I., & Kovess, V. (2009). The Strengths and Difficulties Questionnaire: Validation study in French school-aged children and cross-cultural comparisons. Social Psychiatry and Psychiatric Epidemiology, 44, 740–747. https://doi.org/10.1007/s00127-008-0489-8 Stone, L. L., Otten, R., Engels, R. C., Vermulst, A. A., & Janssens, J. M. (2010). Psychometric properties of the parent and teacher versions of the Strengths and Difficulties Questionnaire for 4- to 12-year-olds: A review. Clinical Child and Family Psychology Review, 13, 254–274. https://doi.org/10.1007/s10567-010-0071-2 Van Widenfelt, B. M., Goedhart, A. W., Treffers, P. D., & Goodman, R. (2003). Dutch version of the Strengths and Difficulties Questionnaire (SDQ). European Child & Adolescent Psychiatry, 12, 281–289. https://doi.org/10.1007/s00787-003-0341-3 Van Widenfelt, B. M., Treffers, P. D., De Beurs, E., Siebelink, B. M., & Koudijs, E. (2005). Translation and cross-cultural adaptation of assessment instruments used in psychological research with children and families. Clinical Child and Family Psychology Review, 8, 135–147. https://doi.org/10.1007/s10567-005-4752-1 Vander Stoep, A., Weiss, N., Saldanha, E., Cheney, D., & Cohen, P. (2003). What proportion of failure to complete secondary school in the US population is attributable to adolescent psychiatric disorder? Journal of Behavioral Health Services & Research, 30, 119–124. Wittchen, H.-U., Jacobi, F., Rehm, J., Gustavsson, A., Svensson, M., Jönsson, B., . . . Faravelli, C. (2011). The size and burden of mental disorders and other disorders of the brain in Europe 2010. European Neuropsychopharmacology, 21, 655–679. https://doi.org/10.1016/j.euroneuro.2011.07.018
Received October 14, 2016
Revision received February 9, 2018
Accepted March 7, 2018
Published online September 25, 2018
EJPA Section/Category: Clinical psychology
Mathilde M. Husky
Laboratoire de Psychologie EA 4139
Institut Universitaire de France
Université de Bordeaux
3 ter, place de la Victoire
33076 Bordeaux
France
mathilde.husky@u-bordeaux.fr
Original Article
Pediatric Symptom Checklist-17: Testing Measurement Invariance of a Higher-Order Factor Model Between Boys and Girls
Jin Liu,¹ Christine DiStefano,¹ Yin Burgess,¹ and Jiandong Wang²
¹ Department of Educational Studies, University of South Carolina, Columbia, SC, USA
² Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
Abstract: The Pediatric Symptom Checklist-17 (PSC-17) is a screener designed to measure children’s behavioral and emotional problems. The measurement invariance of the scale’s higher-order factor structure was investigated in the current study. Gender invariance was established through a series of tests for configural invariance (baseline model), metric invariance, scalar invariance, residual variance invariance of items, higher-order factor loadings invariance, intercepts invariance of first-order factors, disturbances invariance of first-order factors, and factor variance invariance of a higher-order factor. The latent mean difference of the higher-order factor indicates that boys exhibited more problems with a strong effect size (d = .870). As invariance holds, the PSC-17 may be an option to identify preschool children’s behavioral and emotional problems in Response to Intervention programs in school-based settings. Keywords: higher-order factor model, gender invariance, ordered categorical data, tutorial, preschool
The last four decades have witnessed an increase in the number of preschoolers attending public schools and center-based programs (Barnett et al., 2016). This increase provides an opportunity for more young children to engage in early academic learning and adjust to the school environment. As enrollment increases, the number of children in preschool exhibiting behavioral and social-emotional difficulties also increases (Conroy & Brown, 2004). Therefore, schools have begun offering intervention programs and services to children demonstrating social/emotional problems after such behaviors are identified (Levitt, Saka, Romanelli, & Hoagwood, 2007). The Response to Intervention (RtI) framework is a multi-tiered framework for providing prevention and early intervention services to school-aged children (Duda, Fixsen, & Blasé, 2013), in which the intensity of intervention and services is matched to students' needs as necessary. This framework has recently been extended to the preschool level (Carta & Greenwood, 2013). An RtI framework includes three tiers. The first tier commonly includes a universal screening method applied to all children to identify those with behavioral and emotional problems, that is, those at behavioral risk. In Tier 2, targeted interventions are applied to children identified in Tier 1 as at high risk for behavioral and emotional problems. Intensive intervention for children with significant behavioral and emotional difficulties is provided in Tier 3.
Screeners are often used in the first tier, where instruments are given to all children to identify those with emerging or pronounced behavioral risk. Psychometrically sound screening tools are essential to identify children in need of intervention in Tiers 2 and 3. The RtI framework is highly dependent on general education classroom teachers, as they can assess children's behaviors relative to their peers in the classroom (DiStefano & Kamphaus, 2007). The current study focused on teacher-completed scales used as part of a school-wide universal screening program. The Pediatric Symptom Checklist-17 (PSC-17) was selected for the current study. The scale is free and available online (http://www.massgeneral.org/psychiatry/assets/PSC17_English.pdf). The PSC-17 was developed by shortening the full PSC form (Gardner et al., 1999) in primary care settings, where parents provided ratings for their children (aged 4–15 years) during primary care visits. The PSC-17 has been well-validated in primary care settings (e.g., Blucker et al., 2014; Murphy et al., 2016; Stoppelbein, Greening, Moll, Jordan, & Suozzi, 2012). There are many reasons why the PSC-17 may be appropriate for universal screening in the school environment. First, the full 35-item PSC child self-report form has been administered in a school environment (Pagano, Cassidy, Little, Murphy, & Jellinek, 2000) and was shown to be an "easily administered tool for large-scale mental health screening in schools" (Pagano et al., 2000, p. 91). This recommendation may also apply to the short
version of the scale, the PSC-17. In addition, similar items are included in the PSC-17 as compared with other screening forms typically used in the school environment, such as the Behavioral and Emotional Screening System (BESS; Kamphaus & Reynolds, 2015) and the Strengths and Difficulties Questionnaire (SDQ; Goodman, 2001). In addition, there is a need for a brief, free, teacher-completed scale for preschool screening purposes, given cost-benefit considerations. Many scales that are currently available are lengthy and expensive for universal screening. The length of the PSC-17 (i.e., 17 items) does not place an undue burden upon teachers completing many forms simultaneously. Also, the scale is available online and incurs minimal costs for schools. These features make the PSC-17 a good choice for schools interested in conducting universal screening for behavioral and emotional risk. While the PSC-17 may be an attractive option, validation of the form completed by teachers in the school environment is lacking. To date, only one study (DiStefano, Liu, & Burgess, 2017) has been conducted to validate the factor structure of the PSC-17 in the preschool environment with teacher ratings as used for universal screening. The authors indicated that future studies on invariance testing for different subgroups are needed to determine how ratings differ across subgroups. Measurement invariance is a significant component of psychometric quality, especially when researchers are interested in comparing group differences, as invariance means that the scale has the same measurement and scaling properties across groups (Van de Schoot, Lugtig, & Hox, 2012). If measurement invariance holds, then researchers can interpret the results across groups in the same way. On the other hand, without invariance many researchers suggest that group comparisons are not meaningful, because measures are not comparable without such evidence (e.g., Byrne & Watkins, 2003; Chen, Keith, Weiss, Zhu, & Li, 2010; Chen, Sousa, & West, 2005). It is well known that boys and girls behave differently from a very early age; thus, gender differences in behavioral and emotional problems are of interest to researchers and educators. For instance, behavioral rating scales such as the Behavioral and Emotional Screening System (Kamphaus & Reynolds, 2015) consider gender differences in their scoring procedures. However, only a few studies have discussed gender invariance of preschoolers' behavioral problems (e.g., Sette, Baumgartner, & MacKinnon, 2015; Vancraeyveldt, Verschueren, Wouters, Van Craeyevelt, & Colpin, 2014). No studies have been conducted to investigate the gender invariance of the PSC-17 or the full PSC. Testing measurement invariance usually consists of examining a series of successively restrictive models. The process starts with establishing the best factor structure model of the target scale. As our validation is conducted in
the preschool environment, the higher-order factor model supported by DiStefano and colleagues (2017) was selected as an alternative to the previously validated three-factor model of the PSC-17 in primary care settings (Blucker et al., 2014; Gardner et al., 1999; Murphy et al., 2016). The higher-order factor model includes an additional overarching higher-order factor that influences the first-order factors. In other words, the first-order factors are aggregated into a general factor. The higher-order factor accounts for covariations among lower-order factors in an alternative manner to correlated confirmatory factor models (Gignac, 2008). The higher-order factor model has been used in psychology-related fields to denote the underlying structure of similar behavioral screening instruments (e.g., BESS; Wiesner & Schanding, 2013). Invariance tests involving a higher-order factor model include additional steps that are needed to test the higher-order factor invariance. Furthermore, the PSC-17 uses three response categories at the item level. An inappropriate estimation method could lead to biased results in model fit, parameter estimates, associated significance tests, and the theory being tested (Finney & DiStefano, 2013, p. 439). Thus, researchers should consider estimation techniques appropriate for ordinal data (Finney & DiStefano, 2013). Previous validations of the PSC-17 have taken the ordinal nature of the data into account by using weighted least squares with mean and variance (WLSMV) correction (DiStefano et al., 2017) or unweighted least squares (ULS; Murphy et al., 2016). However, neither study focused on invariance testing. The scarcity of invariance testing on higher-order factor models using ordinal data opens new ground for future research in addition to the substantive significance. The purpose of the study is therefore twofold: first, to investigate the gender invariance of a higher-order factor model of the PSC-17; second, if invariance can be substantiated, to examine whether a latent mean difference exists by gender.
Methods
The PSC-17 is a brief screener measuring children's behavioral and emotional disorders used by pediatricians and other health professionals (Gardner et al., 1999). The scale is appropriate for children aged from 3 to 16 years. Respondents rate the frequency of the target behavior on a 3-point Likert scale (0 = Never, 1 = Sometimes, or 2 = Often), based on the occurrence of the described behaviors. The screener has been validated and used successfully in related fields (e.g., DiStefano et al., 2017; Gardner et al., 1999; Murphy et al., 2016). Although the validated parent-completed PSC-17 in primary care settings (Gardner et al., 1999; Murphy et al., 2016) included three subscales: Externalizing
Problems (7 items), Internalizing Problems (5 items), and Attention Problems (5 items), a recent factor analysis with the teacher-completed PSC-17 in preschool identified three subscales: Externalizing Problems (7 items), Internalizing Problems (6 items), and Attention Problems (6 items), with a higher-order factor (i.e., Maladaptive Behavior) and high internal consistencies (DiStefano et al., 2017). Two cross-loading items (i.e., "Daydreams too much" and "Does not listen to rules") were identified, with loadings roughly equal on two factors (Figure 1).
Figure 1. The higher-order factor structure of the PSC-17 ("*" indicates the marker variable of a factor).
In the fall of 2012, preschool teachers from 12 public schools or child development centers in South Carolina provided ratings of the PSC-17 for all children in their classrooms. Institutional Review Board approval and informed consent were obtained prior to involving teachers and children. Teachers' participation was voluntary, and those participating received a small monetary stipend for their contribution ($25). Teachers rated 836 preschool children as to their behaviors at school. Most children were 4 years old. Boys (49.8%) and girls (50.2%) were evenly distributed. The sampled children consisted of 37.1% non-Hispanic Whites, 36.8% African Americans, 6.8% Hispanic Americans, and 19.3% of unknown race/ethnicity.
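As a purely structural illustration, the sketch below expresses the higher-order model described above in lavaan-style syntax via the Python package semopy. The item names (ext1–ext7, etc.) and the data file are hypothetical, the two cross-loading items are omitted for brevity, and the study's actual analyses were run in Mplus with WLSMV estimation, which this sketch does not replicate.

```python
# Illustrative higher-order CFA (a sketch only, not the study's Mplus/WLSMV analysis).
# Item names and the data file are hypothetical placeholders; the two cross-loading
# items noted in the text are omitted for brevity.
import pandas as pd
import semopy

MODEL_DESC = """
Externalizing =~ ext1 + ext2 + ext3 + ext4 + ext5 + ext6 + ext7
Internalizing =~ int1 + int2 + int3 + int4 + int5 + int6
Attention =~ att1 + att2 + att3 + att4 + att5 + att6
Maladaptive =~ Externalizing + Internalizing + Attention
"""

ratings = pd.read_csv("psc17_teacher_ratings.csv")  # hypothetical data file
model = semopy.Model(MODEL_DESC)
model.fit(ratings)
print(semopy.calc_stats(model))  # chi-square, CFI, TLI, RMSEA, etc.
```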
Mplus (version 7.31; Muthén & Muthén, 2012) was used for data analysis. WLSMV, the default estimation method for categorical data, was used following the recommendations of Finney and DiStefano (2013) and DiStefano and Morgan (2014). The theta parameterization was used to estimate the residual variances of the latent response variables underlying the categorical factor indicators. First, the higher-order factor structure was tested in the two groups separately to ensure that it could be used for the subsequent invariance tests. Then, a series of models was tested sequentially, with each subsequent model imposing a more constrained level of invariance. Constraints were added to the model in eight steps in total (Table 1). The latent mean difference for the higher-order factor was tested after invariance of the first-order factor intercepts was established. One group was chosen as the reference group and its factor mean was fixed at zero. The other group was chosen as the comparison group and its latent mean estimate represented the difference between the two groups. The difference showed which group endorsed higher scores on the latent variable. A z-test was used to test statistical significance, and the standardized effect size (i.e., Cohen's d) of the latent mean difference (Hancock, 2001) was used to examine practical significance. Cohen's d indicates the standardized mean difference between two groups (Cohen, 1988).
Table 1. Test of invariance across gender

Models | χ² (df) | RMSEA | χ²Δ (df), p-value | CFI | TLI
Step 1: Configural invariance (baseline model) | 543.824 (228) | .058 | – | .980 | .976
Step 2: Metric invariance | 557.309 (241) | .056 | 19.700 (13), .103 | .980 | .978
Step 3: Scalar invariance | 553.873 (258) | .052 | 17.972 (17), .391 | .981 | .980
Step 4a: Residual variance invariance of items | 552.842 (275) | .049 | 31.585 (17), .017* | .982 | .983
Step 4b: Residual variance invariance of items (release one item) | 548.603 (274) | .049 | 26.019 (16), .054 | .983 | .983
Step 5: Higher-order factor loadings invariance | 516.806 (276) | .046 | 2.223 (2), .329 | .985 | .985
Step 6: Intercepts invariance of first-order factors | 522.618 (278) | .046 | 8.195 (2), .017* | .985 | .985
Step 7: Disturbances invariance of first-order factors | 525.299 (281) | .046 | 8.804 (3), .032* | .985 | .985
Step 8: Factor variance invariance of a higher-order factor | 433.644 (282) | .036 | 0.283 (1), .595 | .990 | .991

Note. *p < .05.
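The p-values in the chi-square difference column of Table 1 can be checked against a chi-square reference distribution with the corresponding difference in degrees of freedom. Below is a small illustrative sketch, not the authors' code; the scaled difference statistics themselves come from the Mplus DIFFTEST output.

```python
# Evaluate a chi-square difference statistic against a chi-square distribution
# with the difference in degrees of freedom (values taken from Table 1).
from scipy.stats import chi2

def diff_p(delta_chi2: float, delta_df: int) -> float:
    return float(chi2.sf(delta_chi2, delta_df))

print(round(diff_p(31.585, 17), 3))  # ~.017, Step 4a (significant)
print(round(diff_p(26.019, 16), 3))  # ~.054, Step 4b (non-significant)
```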
The following fit indices were used to compare models (Hu & Bentler, 1999; Schermelleh-Engel, Moosbrugger, & Müller, 2003). A comparative fit index (CFI) value of .95 or higher and a root mean square error of approximation (RMSEA) of .06 or lower were taken to indicate good model fit. The Tucker-Lewis index (TLI) measures relative model fit, and a value of .95 or higher was considered indicative of acceptable fit. In each step, two consecutive models were compared using the scaled chi-square (χ²) difference test between the more restrictive and the less restrictive model to determine whether invariance could be established across groups. A non-significant chi-square difference test (p > .05) suggested invariance, indicating that imposing the constraints did not reveal a gender difference. The DIFFTEST option, a special command available in Mplus, was used to estimate the Satorra-Bentler scaled chi-square difference (Muthén & Muthén, 2012). Researchers have argued that chi-square tests are sensitive to large sample sizes and constitute an unrealistic criterion for invariance (Byrne & Stewart, 2006). The other fit indices, RMSEA, CFI, and TLI, were therefore also examined across models. The behavior of fit indices under WLSMV estimation is not the same as under maximum likelihood estimation, where more restrictive models tend to have lower fit index values (e.g., Chen et al., 2005). Ideally, all models should meet the acceptable cut-off values discussed above. Model modification indices were consulted to allow for partial invariance if full invariance was not supported.
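The decision rules described in this paragraph can be summarized in a small helper. The sketch below is not the authors' code; it simply encodes the cutoffs stated above (CFI/TLI of .95 or higher, RMSEA of .06 or lower) and the use of fit indices alongside the chi-square difference test when the latter is significant but descriptive fit remains acceptable.

```python
# Sketch of the model-comparison logic described above (hypothetical helper,
# not the authors' code). Fit values would be taken from the Mplus output.
from dataclasses import dataclass

@dataclass
class Fit:
    rmsea: float
    cfi: float
    tli: float

def acceptable_fit(fit: Fit) -> bool:
    # Cutoffs used in the study: CFI and TLI >= .95, RMSEA <= .06
    return fit.cfi >= .95 and fit.tli >= .95 and fit.rmsea <= .06

def invariance_retained(restricted: Fit, diff_test_p: float) -> bool:
    # Retained when the scaled chi-square difference is non-significant, or when,
    # despite a significant difference (chi-square being sensitive to sample size),
    # the more restricted model still meets the descriptive cutoffs.
    return diff_test_p > .05 or acceptable_fit(restricted)

# Example: Step 6 (first-order factor intercepts), values from Table 1
print(invariance_retained(Fit(rmsea=.046, cfi=.985, tli=.985), diff_test_p=.017))  # True
```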
Results
The higher-order factor structure, without considering gender differences, was established first (see Figure 1; DiStefano et al., 2017). We fixed one of the factor loadings
to 1 (i.e., the marker variable) in each factor for ease of interpretation (Chen et al., 2005) and to set the measurement scale. This model was estimated separately for boys and girls. Results showed that the higher-order factor model had good fit and similar standardized factor loadings in each group. Therefore, we proceeded with invariance testing; detailed information on each step is included for researchers who are interested in replicating the procedures with other scales. In Step 1, the baseline model was used to test whether the PSC-17 possessed the same factor structure in both groups. This model allowed parameters and model fit to be estimated for both groups simultaneously (Byrne & Stewart, 2006). All items were specified on the same first-order factors and all first-order factors loaded on the higher-order factor; however, all first-order factor (item) loadings and thresholds were freely estimated in each group. The residual variances were fixed at 1 in both groups for identification. The higher-order factor loadings and first-order factor disturbances were estimated separately in each group. The factor means (first- and higher-order) were fixed to zero in each group. The higher-order factor variances were freely estimated across groups because marker variables were used. The baseline model showed good fit (χ²(228) = 543.824, RMSEA = .058, CFI = .980, TLI = .976), indicating a similar factor structure for boys and girls. Next, to assess metric invariance, the first-order factor loadings were constrained to be equal across groups (Finney & Davis, 2003). The first threshold of each item was held equal across groups for identification; the second threshold of each item was freely estimated across groups; and residual variances were constrained to 1 in the reference group (girls) and freely estimated in the comparison group (boys). The factor means at both levels were fixed at zero in one group and freely estimated in the other group. The remaining model settings were similar to Step 1. The model in Step 2 still showed adequate fit (χ²(241) = 557.309, RMSEA = .056, CFI = .980, TLI = .978). The chi-square difference test (χ²Δ = 19.70, df = 13, p = .103) indicated that
the strength of the first-order factor loadings did not vary across groups. Scalar invariance in Step 3 imposed additional constraints to determine whether the item thresholds were equivalent across groups. This level of invariance indicates that ratings from both groups have the same unit of measurement and origin (Chen et al., 2005). The only difference compared with the previous step was that all item thresholds were constrained to be equal between the two groups. Again, the chi-square difference test was non-significant (χ²Δ = 17.972, df = 17, p = .391). There was no difference between boys and girls in the threshold estimates of the observed indicators. In other words, differences between observed group means could be attributed to latent mean differences. In typical confirmatory factor models, latent mean differences of factors may be examined once scalar invariance holds. However, due to the nature of higher-order factor models, latent mean differences were not examined for the first-order factors; instead, the latent means of the first-order factors were tested for invariance in Step 6. The final step of first-order factor invariance was to add constraints on the residuals; if invariance held in this step, measurement error invariance would be established. This level of invariance indicates that the items have the same amount of measurement error across gender (Finney & Davis, 2003). Residual invariance is a strict invariance requirement (Byrne & Stewart, 2006; Cheung & Rensvold, 1999) and is considered stringent and of little substantive value (Byrne & Stewart, 2006). Constraining the residual variances to be equal (Step 4a) yielded worse fit than Step 3, scalar invariance (χ²Δ = 31.585, df = 17, p = .017), meaning that not all item residuals could be considered invariant across groups. Modification indices were examined to identify the source of non-invariance. The indices suggested that the residual variance of one item ("Feels hopeless") contributed the largest amount of model misfit. After allowing this item's residual variance to be freely estimated across groups, partial residual invariance (Step 4b) was supported (χ²Δ = 26.019, df = 16, p = .054). Similar procedures were followed to test higher-order factor invariance. First, the higher-order factor loadings were constrained to be equal across groups, except for the marker variables used for identification. This form of invariance (Step 5) was nested within Model 4b, partial residual invariance. The chi-square difference test indicated that the two models were not significantly different from each other (χ²Δ = 2.223, df = 2, p = .329). This indicated that the strength of the relationship between the higher-order factor and the first-order factors was invariant. Through Step 5, the CFI, TLI, and RMSEA values were above the cut-off values and the additional constraints did not worsen model fit. Next, constraints on the first-order factor intercepts/means (Step 6) were added to the model. This
step examined whether the intercepts of the first-order factors were equal across groups, which is the prerequisite for testing the latent mean difference of the higher-order factor. Although the model in Step 6 fit slightly worse than the model in Step 5 (χ²Δ = 8.195, df = 2, p = .017), we concluded that invariance of the first-order factor intercepts held for boys and girls, based on the similar values of the other fit indices (i.e., RMSEA, CFI, and TLI). To compare the latent mean difference, girls were selected as the reference group and the latent mean for boys was estimated to show the difference between groups. The factor mean for boys was statistically significant (z = 5.429, p < .001), and the standardized effect size was large (Cohen's d = .870). The results indicated that preschool-aged boys demonstrated more maladaptive behaviors than girls, with the difference reflecting a large effect size (Cohen, 1988). The factor error (i.e., disturbance) constraints (Step 7) were added to the model to examine whether the disturbances of the first-order factors were equal across groups. Although the result of the chi-square difference test was significant (χ²Δ = 8.804, df = 3, p = .032), the other fit indices were the same as for the model tested in Step 6 (RMSEA = .046, CFI = .985, TLI = .985), indicating that the model with constrained factor disturbances did not differ meaningfully from the model with constrained first-order factor intercepts, supporting invariance. In Step 8, the variance of the higher-order factor was constrained to be equal between groups. The chi-square difference test and the other fit indices both indicated that this variance did not differ across groups (χ²Δ = 0.283, df = 1, p = .595, RMSEA = .036, CFI = .990, TLI = .991). It was noted that the other fit indices showed better fit in Step 8 than in Step 7. The results indicated that the variance of the higher-order factor did not vary between boys and girls.
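As a numerical illustration of the effect size used here, the sketch below computes Hancock's (2001) standardized difference by dividing the latent mean difference by the square root of the factor variance (held equal across groups, as in Step 8). The input values are toy placeholders chosen only to show the arithmetic at roughly the reported magnitude; they are not the estimates from the Mplus output.

```python
# Hancock's (2001) standardized effect size for a latent mean difference:
# the difference in latent means divided by the square root of the (common)
# factor variance. The numbers below are hypothetical placeholders only.
import math

def latent_d(latent_mean_diff: float, factor_variance: float) -> float:
    return latent_mean_diff / math.sqrt(factor_variance)

print(round(latent_d(0.30, 0.12), 2))  # 0.87 for these toy values
```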
Discussion
Following the higher-order factor structure, the measurement invariance of the PSC-17 between boys and girls was supported by testing a series of models: configural invariance, metric invariance, scalar invariance, residual variance invariance of observed variables, higher-order factor loadings invariance, intercepts invariance of first-order factors, disturbances invariance of first-order factors, and factor variance invariance of a higher-order factor. Only one item (i.e., "Feels hopeless") with different error variance across groups was identified, which indicated that the unexplained variance of this item was not the same for boys and girls. Although the first-order factor
intercepts and disturbance constraints showed significant chi-square differences, invariance was supported by comparing the performance of the other fit indices. Boys exhibited higher levels of behavioral and emotional problems than girls. This finding is in line with previous studies. For instance, boys were rated as showing more externalizing behavioral issues (hyperactivity, opposition, and physical aggression) than girls (Vancraeyveldt et al., 2014). Girls had higher social competence and lower anxiety-withdrawal levels compared with boys (Sette et al., 2015). This suggests that gender differences should be considered when educators interpret PSC-17 results. This study has limitations and implications for future research. For example, data were collected at one time point and all teacher ratings came from one state. While the results suggested gender invariance for the PSC-17, future studies may replicate the procedure to investigate whether the findings generalize across different settings. Given that measurement invariance has been established for the PSC-17 in the preschool environment and gender differences were detected, the next task is to determine appropriate cut-off scores (i.e., Tier 1 of the RtI framework) for boys and girls to correctly identify children with behavioral and emotional problems. After at-risk students are identified, school personnel may consider gender differences when designing targeted and intensive interventions (i.e., Tiers 2 and 3 of the RtI framework). Finally, measurement invariance across other demographic variables, such as ethnicity, can be tested in future studies. Access to a high-quality screening scale ensures that young children can be identified for additional comprehensive testing and intervention, as proposed in the RtI framework. Measurement invariance is a key characteristic of high-quality instruments. Measurement invariance between boys and girls indicates that the PSC-17 works acceptably regardless of child gender. Considering that it is succinct and free, the PSC-17 may be an ideal screener for behavioral and emotional intervention programs with young children.
Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000495
ESM 1. Syntax and Output (.pdf). This file includes the syntax and output of the baseline model (i.e., Step 1) from Mplus.
ESM 2. Syntax and Output (.pdf). This file includes the syntax and output of the metric invariance model (i.e., Step 2) from Mplus.
ESM 3. Syntax and Output (.pdf).
This file includes the syntax and output of the higher-order factor variance invariance model (i.e., Step 8) from Mplus. The syntax and output of the other steps can be obtained by contacting the corresponding author.
References Barnett, W. S., Friedman-Krauss, A. H., Gomez, R. E., Horowitz, M., Weisenfeld, G. G., & Squires, J. H. (2016). The state of preschool 2015: State preschool yearbook. New Brunswick, NJ: National Institute for Early Education Research. Byrne, B. M., & Stewart, S. M. (2006). The MACS approach to testing for multigroup invariance of a second-order structure: A walk through the process. Structural Equation Modeling, 13, 287–321. https://doi.org/10.1207/s15328007sem1302_7 Byrne, B. M., & Watkins, D. (2003). The issue of measurement invariance revisited. Journal of Cross-Cultural Psychology, 34, 155–175. https://doi.org/10.1177/0022022102250225 Blucker, R. T., Jackson, D., Gillaspy, J. A., Hale, J., Wolraich, M., & Gillaspy, S. R. (2014). Pediatric behavioral health screening in primary care: A preliminary analysis of the Pediatric Symptom Checklist-17 with functional impairment items. Clinical Pediatrics, 53, 449–455. https://doi.org/10.1177/ 0009922814527498 Carta, J. J., & Greenwood, C. R. (2013). Promising future research directions in response to intervention in early childhood. In V. Buysse & E. Peisner-Feinberg (Eds.), Handbook of Response to Intervention (RTI) in early childhood (pp. 421–431). Baltimore, MD: Paul H. Brookes. Chen, F. F., Sousa, K. H., & West, S. G. (2005). Testing measurement invariance of second-order factor models. Structural Equating Modeling, 12, 471–492. https://doi.org/10.1207/ s15328007sem1203_7 Chen, H., Keith, T. Z., Weiss, L., Zhu, J., & Li, Y. Q. (2010). Testing for multigroup invariance of second-order WIS-IV structure across China, Hong Kong, Macau, and Taiwan. Personality and Individual Differences, 49, 677–682. https://doi.org/10.1016/ j.paid.2010.06.004 Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1–27. https://doi.org/ 10.1016/s0149-2063(99)80001-4 Cohen, J. (1988). Statistical power analysis for the behavioral science (2nd ed.). Hillsdale, NJ: Erlbaum. Conroy, M. A., & Brown, W. H. (2004). Early identification, prevention, and early intervention with young children at-risk for emotional or behavioral disorders: Issues, trends, and a call for action. Behavioral Disorders, 29, 224–237. https://doi.org/ 10.1177/019874290402900303 DiStefano, C. A., & Kamphaus, R. W. (2007). Development and validation of a behavioral screener for preschool-age children. Journal of Emotional and Behavioral Disorders, 15, 93–102. https://doi.org/10.1177/10634266070150020401 DiStefano, C., Liu, J., & Burgess, Y. (2017). Investigating the structure of the Pediatric Symptoms Checklist in the preschool setting: A comparison of factor analytic techniques. Journal of Psychoeducational Assessment, 35, 494–505. https://doi.org/ 10.1177/0734282916647648 DiStefano, C., & Morgan, G. B. (2014). A comparison of diagonal weighted least squares robust estimation techniques for ordinal data. Structural Equation Modeling: A Multidisciplinary Journal, 21, 425–438. https://doi.org/10.1080/ 10705511.2014.915373
Duda, M. A., Fixsen, D. L., & Blasé, K. A. (2013). Setting the stage for sustainability: Building the infrastructure for implementation capacity. In V. Buysse & E. Peisner-Feinberg (Eds.), Handbook of Response to Intervention (RTI) in early childhood (pp. 397–414). Baltimore, MD: Paul H. Brookes. Finney, S. J., & Davis, S. L. (2003, April). Examining the invariance of the achievement goal questionnaire across gender. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL. Finney, S. J., & DiStefano, C. (2013). Nonnormal and categorical data in structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Quantitative methods in education and the behavioral sciences: Issues, research, and teaching. Structural equation modeling: A second course (pp. 439–492). Charlotte, NC: IAP Information Age Publishing. Gardner, W., Murphy, J. M., Childs, G., Kelleher, K., Pagano, M., Jellinek, M., . . . Chiappetta, L. (1999). The PSC-17: A brief Pediatric Symptom Checklist with psychosocial problem subscales. A report from PROS and ASPN. Ambulatory Child Health, 5, 225–236. Gignac, G. E. (2008). Higher-order models versus direct hierarchical models: g as superordinate or breadth factor? Psychology Science, 50, 21–43. Goodman, R. (2001). Psychometric properties of the Strengths and Difficulties Questionnaire. Journal of the American Academy of Child and Adolescent Psychiatry, 40, 1337–1345. https://doi. org/10.1097/00004583-200111000-00015 Hancock, G. R. (2001). Effect size, power, and sample size determination for structured means modeling and MIMIC approaches to between-groups hypothesis testing of means on a single latent construct. Psychometrika, 66, 373–388. https://doi.org/10.1007/bf02294440 Hu, L. T., & Bentler, P. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. https://doi. org/10.1080/10705519909540118 Kamphaus, R. W., & Reynolds, C. R. (2015). The Behavioral and Emotional Screening System (BESS). Austin, TX: Pearson. Levitt, J. M., Saka, M., Romanelli, L. H., & Hoagwood, K. (2007). Early identification of mental health problems in schools: the status of instrumentation. Journal of School Psychology, 45, 163–191. https://doi.org/10.1016/j.jsp.2006.11.005 Murphy, J. M., Bergmann, P., Chiang, C., Sturner, R., Howard, B., Abel, M. R., & Jellinek, M. (2016). The PSC-17: Subscale scores, reliability, and factor structure in a new national sample. Pediatrics, 138(3). https://doi.org/10.1542/peds.2016-0038 Muthén, L. K., & Muthén, B. O. (2012). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén. Pagano, M. E., Cassidy, L. J., Little, M., Murphy, J. M., & Jellinek, M. S. (2000). Identifying psychosocial dysfunction in school-age
children: The Pediatric Symptom Checklist as a self-report measure. Psychology in the Schools, 37, 91. https://doi.org/ 10.1002/(SICI)1520-6807(200003)37:2%3C91::AID-PITS1% 3E3.0.CO;2-3 Schermelleh-Engel, K., Moosbrugger, H., & Müller, H. (2003). Evaluating the fit of structural equation models: Test of significance and descriptive goodness-of-fit measures. Methods of Psychological Research – Online, 8, 23–74. Sette, S., Baumgartner, E., & MacKinnon, D. P. (2015). Assessing social competence and behavior problems in a sample of Italian preschoolers using the social competence and behavior evaluation scale. Early Education and Development, 26, 46–65. https://doi.org/10.1080/10409289.2014.941259 Stoppelbein, L., Greening, L., Moll, G., Jordan, S., & Suozzi, A. (2012). Factor analyses of the Pediatric Symptom Checklist-17 with African-American and Caucasian pediatric populations. Journal of Pediatric Psychology, 37, 348–357. https://doi.org/ 10.1093/jpepsy/jsr103 Wiesner, M., & Schanding, G. T. (2013). Exploratory structural equation modeling, bifactor models, and standard confirmatory factor analysis models: Application to the BASC-2 behavioral and emotional screening system teacher form. Journal of School Psychology, 51, 751–763. https://doi.org/10.1016/j. jsp.2013.09.001 Van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9, 486–492. https://doi.org/10.1080/ 17405629.2012.686740 Vancraeyveldt, C., Verschueren, K., Wouters, S., Van Craeyevelt, S., & Colpin, H. (2014). A multidimensional screening tool for preschoolers with externalizing behavior factor structure and factorial invariance. Journal of Psychoeducational Assessment, 32, 699–709. https://doi.org/10.1177/0734282914539898 Received April 10, 2017 Revision received March 27, 2018 Accepted April 5, 2018 Published online September 25, 2018 EJPA Section/Category Short scales Jin Liu Department of Educational Studies University of South Carolina 145 Wardlaw Columbia, SC 29208 USA liu99@mailbox.sc.edu
Original Article
Development and Validation of the Multicontextual Interpersonal Relations Scale (MIRS)
Melissa Simone, Christian Geiser, and Ginger Lockhart
Department of Psychology, Utah State University, Logan, UT, USA
Abstract: Interpersonal relationships provide insight into a wide range of adult psychological health behaviors and well-being. Modern advancements in relational contexts (e.g., social media and phone use) have caused debate about the implications of technology use on overall interpersonal relationships and psychological health. Thus, the Multicontextual Interpersonal Relations Scale (MIRS) was developed to measure three unique processes of interpersonal relations and four unique contexts in which these activities take place. In total, N = 962 adult participants (aged 18–78 years) were recruited from the United States through Amazon Mechanical Turk, an online recruitment tool. Confirmatory factor analyses (CFAs) were conducted to examine the hypothesized factor structure, and bivariate correlations were computed to assess concurrent validity. CFA results supported a model with three process and three context (specific) factors, where face-to-face relations served as the reference context factor. Bivariate correlations revealed that the interpersonal relations factors correlated with the related constructs in the hypothesized ways. Overall, strong standardized factor loadings, item-level reliability, concurrent validity, and internal consistency support the structure and use of the MIRS. Findings suggest that participation in interpersonal relations is a multicontextual construct, requiring measurement of all unique processes and relational contexts. Keywords: interpersonal relations, instrument development, confirmatory factor analysis, social interactions
Interpersonal relationships are a critical factor in a wide range of markers of adult psychological health and well-being (e.g., Measelle, Stice, & Hogansen, 2006; Nelson, 2013; Rubin, Bukowski, & Parker, 2006), including depressive symptoms (Santini, Koyanagi, Tyrovolas, Mason, & Haro, 2015), disordered eating (Simone & Lockhart, 2016), and dementia (Kuiper et al., 2015). Interpersonal relationships may be formed between two individuals or among a group, and may take various forms across different contexts. For example, an individual may initiate an interaction (e.g., make plans) or respond to solicited interactions from a friend. Further, interpersonal relationships can be maintained through face-to-face activities, text messages, phone calls, or social media platforms (Banjo, Hu, & Sundar, 2008). Due to the multifaceted nature of participating in interpersonal relationships, researchers are interested in examining how participating, or not participating, in unique facets of interpersonal relationships influences psychological health outcomes (e.g., Nelson, Coyne, Howard, & Clifford, 2016). In recent years, the use of technology to maintain interpersonal relations has steadily increased. Between the years 2007 and 2012, the rate of text messaging among American adults increased from 58% to 80% (Pew Research Center, 2012). Further, as of 2016, 68% of all Americans report
using Facebook, the social media platform, with 76% of those reporting daily use of the social media outlet (Pew Research Center, 2016). Little is known about the overall impact of technology on overall interpersonal relationships and psychological health. Currently, there are two competing theories as to whether interpersonal relations through the use of technology are helpful or harmful for overall well-being. One line of research suggests that technology-based interactions are helpful for socially anxious people (Sheldon, 2008) and help to develop social skills (Birnie & Horvath, 2002). The second line of work suggests that the use of technology may only be helpful for those who are already engaged and push those who withdraw to rely on the use of technology to maintain interpersonal relationships (Kraut et al., 2002). When individuals rely on technology, their interpersonal relationships are ultimately less rich than those that incorporate face-to-face relations (Kraut et al., 2002) and are related to poor psychological health outcomes (Nelson, Coyne, Howard, & Clifford, 2016). The inconsistent findings on the overall impact of technology may be the result of examining technology as a single facet. Specifically, interpersonal relations can be either accepted or initiated (e.g., responding or initiating a text message), each of which may differentially impact psychological Ó 2018 Hogrefe Publishing
health. For example, an individual who participates in solicited face-to-face relations but only initiates relations through the use of technology may experience more depressive symptoms than an individual who initiates relations in more than one context (Nie et al., 2002). Dimensions of interpersonal relational processes are important indicators of psychological health. For example, reciprocity, or both accepted and initiated relations, within interpersonal relationships has been related to lower depressive symptomology and greater life satisfaction (see Santini et al., 2015 for a review). More specifically, individuals who not only accept invitations to engage with friends but also initiate relations themselves report fewer depressive symptoms and higher life satisfaction than those who do not have reciprocal relationship (Santini et al., 2015). Similarly, previous research has found engagement in group activities to be predictive of later mental health (Jacobson & Newman, 2016). Taken together, it can be said that initiated, accepted, and group relational processes each play a unique role in the association between interpersonal relationships and mental health. In order to further assess the impact of interpersonal relations on psychological health, valid and reliable measures that capture the multicontextual nature of participating in interpersonal relations are needed. Existing measures that capture interpersonal relationships have been designed to assess social competence (Coroiu et al., 2015), the quality of a child’s relationship with their parents, peers, and teachers (Brooke, 1999), and to examine dominance and nurturance within interpersonal behavior (Wiggins, Trapnell, & Phillips, 1988). Thus, to date there is no known measure to capture participation in various facets of interpersonal relations, across the known relational contexts (e.g., faceto-face or through social media).
interpersonal relations (or lack thereof) that serve as best indicators of depression, anxiety, or other psychological health issues. Given the unique nature of each interpersonal relational process, it was hypothesized that each of the three relational processes would be represented by their own respective factors. Further, the four relational contexts included in the study capture unique methods of social interactions that are likely to result in shared variance among items within separate relational process domains. For example, a person who does not have a cell phone will probably not engage in accepted or initiated text messaging. Thus, it was hypothesized that process factors would not explain all of the variance related to specific shared contextual items, warranting the inclusion of method factors to explain the shared variance among items across different process factors. Because initiated, accepted, and group relational processes have a unique relationship to mental health (Jacobson & Newman, 2016; Santini et al., 2015), it was hypothesized that all relational processes would be negatively correlated with depression and positively correlated with life satisfaction where higher relational scores within each facet would be related with lower depressive symptomology and higher life satisfaction. Moreover, depressive symptomology and life satisfaction are influenced by many other constructs (e.g., family environment; Taylor et al., 2006) thus it was hypothesized that these correlations would be of moderate magnitude ( .30) based on Cohen’s (1988) guidelines. Given the inconsistent findings among the relation between technology-based relations and psychological health (e.g., Kraut et al., 2002; Sheldon, 2008), only relationships between the interpersonal relational processes and associated psychological health constructs were predicted whereas relations between relational contexts and psychological health were explored.
The Current Study The purpose of the current study was to develop and test the psychometric properties of a new scale designed to measure both the processes of interpersonal relations and the contexts in which these activities take place through the use of confirmatory factor analysis (CFA) and bivariate correlations between scale factors and related mental health constructs. The scale was designed to capture various interpersonal relational processes: (1) accepted interactions; (2) initiated interactions; and (3) group interactions, across four relational contexts: (1) face-to-face; (2) social media (Facebook); (3) text messaging; and (4) phone calls. By incorporating these components, researchers can examine how specific interpersonal relational processes and contexts predict, and are predicted by, various psychological health outcomes (e.g., depression), and can ultimately be used to inform clinical psychologists of the forms of Ó 2018 Hogrefe Publishing
Methods
Sample
Nine hundred sixty-two participants were recruited from the United States through Amazon's Mechanical Turk (MTurk) online research recruitment tool. All surveys included in the study were written in English. The sample demographics, means and standard deviations for the study variables are provided in Table 1, alongside the demographic information from the US Census Bureau data (2015). Although quota sampling was used to closely match the racial and ethnic make-up of the sample to the US Census data, chi-square tests highlight significant discrepancies between the racial and ethnic composition of the current sample and the US Census data (p < .05). Specifically, the
Table 1. Participant demographics

Characteristic                                 Sample^a          US Census
Age [M (SD)]                                   35.54 (10.88)     37.60^b
Gender
  Male                                         48.40             49.00
  Female                                       51.60             51.00
Race^c
  Native American or Alaskan Native            0.40              0.08
  Asian                                        6.10              5.10
  African American or Black                    7.40              12.60
  Native Hawaiian or other Pacific Islander    0.10              0.20
  European American                            83.60             73.60
  Other                                        2.40              3.00
  Biracial or Multiracial                      4.80              4.70
Ethnicity
  Latino/Hispanic                              6.80              17.10
  European American/Non-Latino                 93.20             82.90
Education
  High School/GED or less                      28.40             41.10
  Associate's degree                           16.10             8.10
  Bachelor's degree                            29.40             18.50
  Graduate or professional degree              12.50             11.20
  Trade or vocational degree                   2.40              NA
  Other                                        1.00              21.10
Study variables [M (SD)]
  MIRS                                         2.12 (0.50)
  Beck Depression Inventory                    12.43 (15.36)
  Satisfaction with Life Scale                 23.00 (7.53)
Notes. N = 962. GED = general education diploma; NA = not available. a Percentages unless otherwise noted. bOnly mean age was presented from the US Census data, as it was the only value available. cPercentage accounting for race among the current sample exceeds 100 as the question regarding whether participants were biracial or multiracial was included as a separate item from the six singular categorical options, thus biracial and multiracial participants are accounted for twice.
sample contains significantly more European Americans, fewer African Americans, and fewer people who identify as Latino than the US Census reports. However, the results from the current study are still generalizable to a large racially and ethnically diverse sample.
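The census comparison reported above can be reproduced, in spirit, with a one-sample chi-square goodness-of-fit test. The sketch below is illustrative only: it is not the authors' analysis script, it uses the race percentages as printed in Table 1 (renormalized so that observed and expected counts both sum to N = 962), and the category list is assumed from the race rows of that table.

```python
import numpy as np
from scipy.stats import chisquare

# Race percentages as printed in Table 1 (sample vs. US Census); categories:
# Native American/Alaskan Native, Asian, African American/Black,
# Native Hawaiian/Pacific Islander, European American, Other, Biracial/Multiracial.
n = 962
sample_pct = np.array([0.40, 6.10, 7.40, 0.10, 83.60, 2.40, 4.80])
census_pct = np.array([0.08, 5.10, 12.60, 0.20, 73.60, 3.00, 4.70])

# Renormalize because multiracial respondents were counted twice,
# so the printed columns do not sum to exactly 100.
observed = sample_pct / sample_pct.sum() * n
expected = census_pct / census_pct.sum() * n

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2({len(observed) - 1}) = {chi2:.2f}, p = {p:.4f}")
```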
The Multicontextual Interpersonal Relations Scale Development The 16 items included in the Multicontextual Interpersonal Relations Scale (MIRS) were designed to capture all facets of modern interpersonal relational processes (accepted, initiated, and group) as well as various relational contexts (face-to-face, social media, text messaging, and phone calls) among individuals emerging into adulthood and adults (Banjo et al., 2008; Sheldon, 2008). The scale items were European Journal of Psychological Assessment (2020), 36(1), 84–95
developed to capture respondents’ likelihood of engaging the measured interpersonal relations, thus the scale was created to include a 4-point scale, ranging from 0 (= very unlikely) to 3 (= very likely). Thus, higher scores indicate more participation within interpersonal relations and lower scores indicate less interpersonal relation participation. To establish face validity of the MIRS, an initial bank of questions was created by a small group of experts in quantitative psychology and social epidemiology. The items were then sent to a separate small group of experts with similar focus areas, including clinical psychology, who were asked to either create additional items to reflect the hypothesized relational processes and contexts, or suggest the removal of any items viewed as unrelated to the hypothesized domains. As a result, four items were dropped from the MIRS as they were viewed to capture a separate domain unrelated to the goals of the MIRS (e.g., social anxiety). The four dropped items included: (1) “Spend more time at home than usual”; (2) “Spend more time playing video games than usual”; (3) “Spend more time watching movies, Netflix or other videos than usual”; and (4) “Purposefully avoid friends when you encounter them in public more than usual.” The final 16-item MIRS was developed as a result of this process. All items were carefully worded to ensure that they reflected their respective domain. Thus the items within the accepted relational context domain were worded so that participants would be exclusively endorsing the acceptance of interpersonal relations that were initiated by others (e.g., “Hang out with a friend if they invite me over”), rather than simultaneously endorsing participation across other domains. Specifically, items that capture accepted relational processes were created with key phrases such as “respond to...” or “if they invite me...” Similarly, items within the initiated relational process domain were worded with terms to ensure that participants exclusively endorse initiated relational contexts (e.g., “Initiate plans with friends”). Key phases among items within the initiated relational context domain include “plans you made. . .” or “initiate...” The items that were written to reflect the group relational process domain were worded with key terms such as “party,” “social outing,” or “friends” to highlight the focus on interactions with more than one friend (e.g., “Hang out with friends at a social outing”). Thus all items were carefully created to avoid a double bind, in which participants endorse multiple domains within a single item. The items written to reflect each of the relational contexts were created slightly differently from the method used to create the relational process items. Specifically, items across each relational context domain were created to reflect paired items across both accepted and initiated relational processes. For example, within the phone call contextual domain, a set of paired items were created across the Ó 2018 Hogrefe Publishing
accepted relational process domain (“Answer a call from a friend”) and the initiated relational process domain (“Call a friend to talk”). This pairing process was consistent across all relational contexts. Further, because Facebook is the most commonly used social media platform among emerging adults and adults in the United States (Pew Research Center, 2016), only questions regarding Facebook were included in the social media domain. The decision to include a single social media outlet was made because the focus of the MIRS is to capture a multidimensional set of interpersonal relational contexts without favoring any specific domain. Further, due to the nature of group relations, no items were created to capture group relations within varying relational contexts.
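As a small illustration of how responses to the final items can be scored, the sketch below computes unit-weighted subscale means for the three relational processes. The item-to-process assignment follows the Results section and the Appendix; the data frame and its column names (item1–item14) are hypothetical, and the article itself models the items with latent factors rather than composite scores.

```python
import pandas as pd

# Hypothetical responses: columns item1..item14 scored 0-3, one row per respondent.
df = pd.DataFrame({f"item{i}": [0, 1, 2, 3] for i in range(1, 15)})

# Item-to-process assignment taken from the Results section and the Appendix.
subscales = {
    "accepted":  ["item1", "item2", "item3", "item4", "item5", "item7"],
    "initiated": ["item6", "item8", "item9", "item10", "item11"],
    "group":     ["item12", "item13", "item14"],
}

# Unit-weighted subscale means; higher values indicate more participation.
scores = pd.DataFrame({name: df[items].mean(axis=1) for name, items in subscales.items()})
print(scores)
```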
Measures
Satisfaction With Life Scale
Participants responded to the 5-item Satisfaction with Life Scale (SWLS; Diener, Emmons, Larsen, & Griffin, 1985), which was designed to measure global life satisfaction with a 7-point Likert scale ranging from 1 (= strongly disagree) to 7 (= strongly agree). The SWLS showed good internal consistency in the current sample (α = .93).
Depressive Symptoms
The Beck Depression Inventory (BDI; Beck, Ward, Mendelson, Mock, & Erbaugh, 1961) was used to measure participants' attitudes and symptoms of depression. Each BDI item is rated on a 0–3 scale describing increasing symptom severity. Due to a technical error, the BDI as used in the present study contained 20 items rather than 21; the missing item measures recent changes in the participant's interest in sex. The BDI as used in the present study maintained acceptable internal consistency (α = .85).
Demographics
Participants completed a demographics questionnaire that collected participants' age, gender identity, race, ethnicity, and educational attainment.
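The internal consistencies reported for the SWLS and the BDI (α = .93 and α = .85) are Cronbach's alpha coefficients. A generic implementation is sketched below; the simulated ratings are only a placeholder, since the study data are not reproduced here.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items matrix of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Toy example with five SWLS-like items rated 1-7.
rng = np.random.default_rng(0)
true_score = rng.normal(size=(200, 1))
toy = np.clip(np.round(4 + 1.2 * true_score + rng.normal(scale=0.8, size=(200, 5))), 1, 7)
print(round(cronbach_alpha(toy), 2))
```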
Statistical Analysis The factor structure of the MIRS was examined first using CFA using weighted least square mean and variance (WLSMV) adjusted estimation for categorical outcomes based on polychoric correlations (Brown, 2006) in the
Mplus software (Muthén & Muthén, 1998–2011; all input and output files are available in the Electronic Supplementary Material, ESM 1). The Mplus default setting for handling missing data with WLSMV estimation, pairwise deletion, was used (Muthén & Muthén, 1998–2011; Muthén, Muthén, & Asparouhov, 2015) which appears to work better than some other methods of handling missing data (e.g., listwise deletion; Enders, 2011). Rates of missingness within the sample were quite low, with only 0.1–1.7% missing across the MIRS items and 4.3–5.1% missingness among the mental health variables, suggesting that missing data was not an important factor in determining the MIRS scale factor structure. Several goodness-of-fit indices were examined to determine the adequacy of the model, including the chi-square test for model fit (w2), comparative fit index (CFI), Tucker-Lewis Index (TLI), and the root mean square error of approximation (RMSEA). Smaller w2 values indicate better fit. Ideally, models should have a nonsignificant p-value (p > .05). However, this is difficult to achieve with large samples as the present one. The CFI and TLI are larger than .95 in well-fitting models, whereas the RMSEA should be equal to or less than .06, however values of .08 indicate adequate fit (Hu & Bentler, 1999).1 In modelling the faceted nature of the scale (three processes and four contexts), we followed Eid, Geiser, Koch, and Heene’s (2017) confirmatory factor analytic approach, which is illustrated in Figure 1 for the present application. In Eid et al.’s (2017) approach, latent factors are included for each process and additional method (specific) factors are included for all contexts except one, which serves as reference context (see also Eid, Lischetzke, Nussbeck, & Trierweiler, 2003). In line with this approach, we predicted that the MIRS scale would measure three latent factors representing the three hypothesized forms of interpersonal relations or processes (accepted, initiated, and group), and three latent method (specific) factors representing the four relational contexts (face-to-face, social media [Facebook], text messaging, and phone calls). In this study, face-to-face relations served as the reference context, as these interactions are associated with the strongest interpersonal relations (Kraut et al., 2002). Thus, three method (specific) factors were included in the present study to account for shared residual item variance across items measuring Facebook, text messaging, and phone calls, which were contrasted against the reference facet (face-to-face relations). All method (specific) factors were uncorrelated with the interpersonal relations process factors.
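Written out with generic symbols (not the article's notation), this CT-C(M-1)-type measurement model for an item i of process k takes the following form, with the face-to-face items carrying no method factor:

```latex
Y_{ik} =
\begin{cases}
\lambda_{ik}\, T_k + \varepsilon_{ik}, & \text{item presented in the reference context (face-to-face)},\\
\lambda_{ik}\, T_k + \gamma_{im}\, M_m + \varepsilon_{ik}, & \text{item presented in non-reference context } m \in \{\text{Facebook, text, phone}\},
\end{cases}
\qquad \operatorname{Cov}(T_k, M_m) = 0.
```

Here T_k denotes the accepted, initiated, or group process factor and M_m the context-specific method factor; the zero covariance restates the constraint that method factors are uncorrelated with the process factors.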
Although WLSMV estimation has been shown to yield large, and therefore more desirable CFI, TLI, as well as smaller RMSEA estimates when compared to maximum likelihood (ML) estimates (Beauducel & Herzberg, 2006; Nye & Drasgow, 2011), the opposite is true when factor indicators include 4–6 response categories (Beauducel & Herzberg, 2006) in which model fit estimates closely match those received with ML estimation. Because the MIRS items include 4 response categories, the suggested cutoff values specified by Hu and Bentler (1999) were retained.
Figure 1. Standardized loadings, standard errors (in brackets), and factor correlations of the three factor CFA with three method factors.
Though it was hypothesized that a three factor-three method factor model would fit the data best, additional models were examined to test whether a more parsimonious factor structure fit the MIRS data better, which resulted in a total of six tested models. The six models included: (1) a one factor-no method factor model; (2) a one factor-three method factor model; (3) a two factor-no method factor model, in which the process factors represented initiated and accepted relations; (4) a two factor-three method factor model; (5) a three factor-no method factor model; and (6) a three factor-three method factor model. After determining the best factor structure for the MIRS data, the final model was tested with gender, race, and age as covariates. Specifically, gender, race, and age were modeled to predict process factors and correlate with method factors. In a final step, bivariate correlations of the scale factors with depression and life satisfaction were examined to test the concurrent validity of the scale. It was hypothesized that the scale factors related to relational processes would be positively correlated with life satisfaction (Peterson, Park,
& Seligman, 2005) and negatively correlated with depressive symptoms (Measelle et al., 2006).
Results
Based on the results of the factor analysis, two items, “Answer the front door when someone is there” and “Go out in public alone,” were dropped because they did not properly load onto their hypothesized factors, likely because the two items may represent a separate construct (e.g., social anxiety). Because social anxiety represents a separate construct from multicontextual interpersonal relations, it was determined that the removal of the two items would not compromise content coverage and as a result offers a more efficient, shorter scale. The final 14-item scale can be found in the Appendix. The factor structure of the remaining 14 items was examined through the application of CFA. The CFA models examined:
(1) a one factor-no method factor model; (2) a one factor-three method factor model; (3) a two factor-no method factor model, in which the process factors represented initiated and accepted relations; (4) a two factor-three method factor model; (5) a three factor-no method factor model; and (6) a three factor-three method factor model. The one factor model examined whether all MIRS items were accounted for by a single relational process factor, the two factor model examined whether relational processes are explained by initiated and accepted relational processes, and the three factor model examined whether relational processes are explained by initiated, accepted, and group relations. The Accepted Relations factor included items that measure the extent to which participants respond to solicited interpersonal relations that were initiated by their friends, such as answering a call from a friend or going to a friend’s house when invited over. The Initiated Relations factor included items that measure the extent to which participants reach out to friends, initiate plans with friends, and initiate relations through the use of technology. Finally, the Group Relations factor included items that measure the extent to which participants were likely to go to a party or a group outing. The three residual method factors accounted for the three unique forms of technology-based relations (social media use [Facebook], text messaging, and phone calls) as described above and shown in Figure 1. The Facebook method factor accounted for the shared residual variance in three items related to Facebook relations with friends, across the Accepted and Initiated Relations factors. The Texting method factor accounted for the shared residual variance in two items related to text message relations with friends, across the Accepted and Initiated Relations factors. Finally, the Phone method factor accounted for the shared residual variance in two items related to calling, or receiving calls from friends, across the Accepted and Initiated Relations factors. In line with Eid et al. (2003, in press), method factors were not allowed to correlate with content factors, but were allowed to correlate with other method factors. It was expected that the process factors included in the models (e.g., initiated and accepted) would be correlated, and as such models with more than one process factor were specified to allow for latent factor correlations. Model fit estimates for all tested models are provided in Table 2. As hypothesized, the three factor-three method factor model fits the MIRS data better than all other factor structures. More specifically the CFA results indicate that the one, two, and three factor models with three method factors show much better model fit than factor structures Ó 2018 Hogrefe Publishing
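The fit indices used to compare these models can be computed from the model and baseline chi-square statistics with standard formulas. The functions below are a generic sketch (they follow the usual ML-based definitions and will not exactly match WLSMV-adjusted output); the baseline-model statistic needed for CFI and TLI is not reported in the article, so only the RMSEA call uses published numbers.

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation from a model chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """Comparative fit index; chi2_b/df_b belong to the baseline (independence) model."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 - d_m / d_b

def tli(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """Tucker-Lewis index from model and baseline chi-squares."""
    return (chi2_b / df_b - chi2_m / df_m) / (chi2_b / df_b - 1.0)

# Three factor-three method factor model from Table 2: chi2 = 321.45, df = 61, N = 962.
print(round(rmsea(321.45, 61, 962), 3))  # ~0.067, in line with the table
# CFI and TLI additionally require the baseline-model chi-square, which is not
# reported in the article, so no published values are plugged in for them here.
```

Plugging in the three factor-three method factor model values from Table 2 reproduces the RMSEA of about .067 reported there.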
without method factors. This suggests that relational contexts account for significant variability in engagement across contexts. Further, the one factor-three method factor and two factor-three method factor models did not fit the data well according to several indices of model fit. However, three factor-three method factor model showed significant improvements among several indices of fit, which suggests that interpersonal relational processes tend to be multidimensional in nature and that group relational processes represent a separate dimension from initiated and accepted relations. According to indices of model fit, it was determined that the three factor-three method factor model was a most appropriate model for the present data. As shown in Figure 1, the three interpersonal relations factors were highly correlated with one another, ranging from .79 to .89, indicating that higher scores on one form of interpersonal relation was associated with higher scores on another. The three method specific factors were also positively correlated (range: .21 to .56). To examine relations between the interpersonal relational process and context factors and demographic characteristics, gender, race, and age were entered as covariates into the final three factor-three method factor model. This analysis allowed covariates to predict individual interpersonal relational processes and examined correlations between the relational contexts (method factors) and covariates. The three factor-three method factor model with covariates fit the data well. The results suggested that race, gender, and age were significantly associated with the factors in the model, which highlights the differential relations between the process and method factors and important demographic variables. Specifically, the results from the covariate analysis revealed that men and non-European American individuals reported higher initiated relations scores than women and European Americans. Further, identifying as a woman was correlated with higher Facebook and text messaging interpersonal relational context scores in comparison to men. The results also indicate that identifying as a non-European American was correlated with greater engagement in text message-based relations when compared to European Americans. In contrast, identifying as a European American was correlated to higher engagement in phone call-based relational contexts when compared to non-European Americans. Further, there was a positive correlation between age and the phone call-based relational context in which older age was correlated with greater phone call-based engagement. These results suggest that engagement in interpersonal relational processes and contexts vary by race, gender, and age. Taken together, the findings indicate that the three unique process factors and three method factors are the result of differential relations between process and context factors and other European Journal of Psychological Assessment (2020), 36(1), 84–95
Table 2. Fit statistics for different CFA models

Model                                          df     χ²          CFI    TLI    RMSEA (CI^a)
Single process factor
  No method factors                            77     2,590.73    .84    .81    .184 (.178–.190)
  Three method factors                         67     764.31      .95    .94    .104 (.097–.111)
Two process factors
  No method factors                            76     2,374.74    .85    .82    .177 (.171–.183)
  Three method factors                         66     614.49      .97    .95    .093 (.086–.100)
Three process factors
  No method factors                            74     2,252.43    .86    .83    .175 (.169–.181)
  Three method factors                         61     321.45      .98    .98    .067 (.060–.074)
  Three method factors and three covariates    88     427.73      .98    .97    .063 (.057–.069)

Notes. ^a 90% confidence interval limits for testing RMSEA. All estimates were significant at p < .01. df = degrees of freedom; χ² = chi-square fit statistic; CFI = comparative fit index; TLI = Tucker-Lewis index; RMSEA = root mean square error of approximation.
important factors rather than the results of factor fractionation. Standardized factor loadings, factor correlations, means, and standard deviations for the three factor-three method factor model with covariates are shown in Table 3. Item-level R2, consistency, and method specificity values for this model are shown in Table 4. The R2 values in Table 4 refer to the total amount of variance in each item that can be explained by the latent factors (including method factors) and can be interpreted as item-level reliability coefficients. The item-level R2 values ranged from .38 to .85, with only one value below .62, indicating good itemlevel reliabilities. The consistency coefficients indicate the amount of variance in each item that can be explained by the respective factor of interpersonal relations only (excluding method factors). The consistencies ranged from .17 to .81. In contrast, the method specificity coefficients indicate the amount of variance in each item that can be explained by the specific relational context only (e.g., Facebook method factor, excluding factors of interpersonal relations). The method specificities ranged from .14 to .66. The results of the consistency and method-specificity coefficients indicate that the items of the MIRS reflect both facets, interpersonal relations and relational contexts, although to varying degrees. To test the concurrent validity of the scale, bivariate correlations among the three factor-three method factor model with covariates and theoretically related constructs were examined. The bivariate correlations included depressive symptoms and life satisfaction (see Table 3). As hypothesized, depressive symptom reports moderately negatively correlated (r range: .30 to .32) with accepted, initiated, and group relations. Specifically, lower depressive symptom scores were associated with higher accepted, initiated, and group relations. Further, life satisfaction was moderately positively correlated (r range: .27 to .33) with accepted, initiated, and group relations, where higher relational context scores were associated with higher life satisfaction scores. European Journal of Psychological Assessment (2020), 36(1), 84–95
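The relationship between these coefficients and the standardized loadings is simple arithmetic when trait and method factors are orthogonal, as in this model: the squared trait loading gives the consistency, the squared method loading the method specificity, and their sum the item-level reliability. The helper below is a sketch of that arithmetic, not the authors' code; small discrepancies from Table 4 reflect rounding of the published loadings.

```python
def variance_components(trait_loading: float, method_loading: float = 0.0):
    """Reliability, consistency, and method specificity implied by standardized
    loadings when the trait and method factors are orthogonal."""
    consistency = trait_loading ** 2
    specificity = method_loading ** 2
    reliability = consistency + specificity
    return (reliability, consistency, specificity,
            consistency / reliability,   # true-score consistency
            specificity / reliability)   # true-score method specificity

# Item 4 ("Respond to a Facebook message"): trait loading .44, Facebook method loading .81.
print([round(v, 2) for v in variance_components(0.44, 0.81)])
# -> [0.85, 0.19, 0.66, 0.23, 0.77]; compare the item 4 row of Table 4.
```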
These correlations suggest that there is a moderate relationship between participating in all interpersonal relational processes and both depressive symptomology and life satisfaction. Depressive symptoms were positively correlated with the text message method factor, where greater text message-based relational engagement was related to higher depressive symptomology. Further, life satisfaction was negatively correlated with the phone call method factor, where greater phone call-based relational engagement was related to lower satisfaction with life scores. In contrast, the Facebook method factor was positively correlated with life satisfaction, where greater Facebook use was associated with greater life satisfaction. Taken together, these findings suggest that relations between technology-based relational contexts and psychological health vary across technology platforms. Moreover, the relation between life satisfaction and phone call-based relational contexts should be interpreted with caution given the large sample size and small correlation estimate (r = –.03).
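In the article these validity coefficients are correlations between latent factors estimated within the CFA. A rough observed-score analogue, shown here only to make the computation concrete, correlates composite or factor scores with the two criteria; the variable names and simulated data are placeholders.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Toy data standing in for relational process scores and the two criterion measures.
rng = np.random.default_rng(1)
scores = pd.DataFrame(rng.normal(size=(300, 5)),
                      columns=["accepted", "initiated", "group", "bdi", "swls"])

# Correlate each relational process with depression and life satisfaction.
for process in ["accepted", "initiated", "group"]:
    for criterion in ["bdi", "swls"]:
        r, p = pearsonr(scores[process], scores[criterion])
        print(f"{process} vs {criterion}: r = {r:.2f}, p = {p:.3f}")
```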
Discussion
The purpose of the current study was to develop and test the psychometric properties of a new scale designed to measure both the processes of interpersonal relations and the contexts in which these activities take place, through the use of CFA and bivariate correlations between scale factors and related mental health constructs. In general, the results from the CFA supported the hypothesis that the MIRS contained three process factors and three context factors, in which face-to-face contexts served as the reference context. The results indicate that interpersonal relationships are multidimensional in nature and that relational contexts are important to consider when evaluating
Table 3. Standardized loadings, factor correlations, and descriptive information for the three factor-three method factor model

Standardized loadings
Item    F1       F2       F3       M1       M2       M3
1       .90**    –        –        –        –        –
2       .69**    –        –        –        .54**    –
3       .74**    –        –        –        –        .43**
4       .44**    –        –        .81**    –        –
5       .41**    –        –        .81**    –        –
6       –        .88**    –        –        –        –
7       .62**    –        –        –        –        –
8       –        .74**    –        –        .37**    –
9       –        .59**    –        .67**    –        –
10      –        .68**    –        –        –        .45**
11      –        .80**    –        –        –        –
12      –        –        .91**    –        –        –
13      –        –        .79**    –        –        –
14      –        –        .88**    –        –        –

Factor correlations
        F1        F2        F3        M1       M2       M3        BDI       SWLS
F1      –
F2      .79**     –
F3      .89**     .85**     –
M1                                    –
M2                                    .56**    –
M3                                    .21**    .21**    –
BDI     –.32**    –.31**    –.30**    .01      .12*     .10       –
SWLS    .27**     .33**     .30**     .17**    .01      –.03*     –.46**    –
Notes. The excluded correlations are not estimated because method factors cannot correlate with other factors in the model with which they share items. F1 = Accepted Relations; F2 = Initiated Relations; F3 = Group Relations; M1 = Facebook method; M2 = Text message method; M3 = Phone method; BDI = Beck Depression Inventory; SWLS = Satisfaction with Life Scale. *p < .05; **p < .001.
interpersonal relational engagement. In general, the current findings demonstrate the need to measure all facets of interpersonal relations (accepted, initiated, and group) and relational contexts in order to fully assess the multicontextual nature of interpersonal relations. Overall, the good model fit, strong standardized factor loadings, high item-level reliability, concurrent validity, and internal consistency support the use of the MIRS. More specifically, item-level reliability values from the current study indicate that the total amount of true score variance across process and contexts factors is adequate. Moreover, estimates of internal consistency and method specificity indicate that the items included in the MIRS reflect both interpersonal relational processes and contexts to varying degrees. Taken together, the findings suggest that the items included in the MIRS are multidimensional and that the items adequately capture the various dimensions of interpersonal relations of interest. Covariate analyses revealed that participation in multidimensional interpersonal relations vary by race and gender. Perhaps of greatest interest, men and non-European Americans were more likely to initiate relations when compared Ó 2018 Hogrefe Publishing
to women and European Americans. Although little research has examined gender or racial differences in initiated relations, previous research has found that friendships with women tend to be more reciprocal than friendships with men, (Parker & de Vries, 1993), women tend to offer more in a friendship than men (Parker & de Vries, 1993), and also tend to put forward more effort to maintain friendships than men (Hall, Larson, & Watts, 2011). Yet the current findings would suggest that men are putting forth more with interpersonal relations, as covariate analyses indicate that men initiate more relations than women. Perhaps the discrepancy in findings can be explained by differences in the measures of initiation. Specifically, the current study examined the extent to which individuals are likely to initiate relational activities, whereas the previous research included initiated relations among many other facets of relationships, such as openness to sharing personal emotional experiences. Thus, the MIRS contributes to the literature by offering an alternative way to measure differences in relational processes by gender. Finally, little research has examined racial differences across specific relational processes, however previous research suggests non-European European Journal of Psychological Assessment (2020), 36(1), 84–95
Table 4. Coefficients in the three factor-three method factor model

                                Observed variables                                        True-score variables
Items                           Reliability (R²)   Consistency   Method specificity      Consistency   Method specificity
Accepted relations
  Face-to-face    1             .81                .81           –                       1.00          –
                  7             .38                .38           –                       1.00          –
  Facebook        4             .85                .19           .66                     0.22          .78
                  5             .83                .17           .66                     0.20          .80
  Text            2             .77                .48           .29                     0.62          .38
  Phone           3             .71                .53           .18                     0.75          .25
Initiated relations
  Face-to-face    6             .77                .77           –                       1.00          –
                  11            .64                .64           –                       1.00          –
  Facebook        9             .80                .35           .45                     0.44          .56
  Text            8             .69                .55           .14                     0.80          .20
  Phone           10            .66                .46           .20                     0.70          .30
Group relations
  Face-to-face    12            .83                .83           –                       1.00          –
                  13            .62                .62           –                       1.00          –
                  14            .77                .77           –                       1.00          –
Notes. Consistency = proportion of variability explained by interpersonal relations factor. Method specificity = proportion of variance explained by relational context factor. The reliability coefficient (R2) represents the sum of consistency and method specificity coefficients for a given item. Face-to-face interactions served as the reference facet. Therefore, method specificities are not computed for this facet.
Americans tend to have more fulfilling relationships than those of European Americans (Nguyen, 2017). Thus, the current study adds to the measurement of interpersonal relationships by offering new ways to capture and model unique relational processes. The findings from the covariate analysis also suggest that future research should examine differences in relational contexts across race and ethnicity, as there were correlations of small magnitude between technology-based relational contexts and the demographic constructs of interest. Tests of concurrent validity found that the MIRS relational process factors were related to other variables in predicted ways. Specifically, consistent with the study hypotheses, bivariate correlations revealed a moderate negative correlation between accepted, initiated, and group relational processes and depressive symptomology, as well as a moderate positive correlation between the relational processes and life satisfaction. Consistent with previous findings (e.g., Jacobson & Newman, 2016;
Santini et al., 2015), the tests of concurrent validity highlight that each relational process is uniquely related to psychological health. Above and beyond the expected correlations, life satisfaction was positively correlated with the Facebook use method factor, where greater Facebook use was associated with greater life satisfaction. Further, depressive symptoms were positively correlated with the text message method factor, where more text messaging was associated with greater depressive symptomology. Thus, individuals who text more may experience more depressive symptoms than those who use text messagebased relational engagement less often (Ferraro, Holfeld, Frankl, Frye, & Halvorson, 2015). The correlations among interpersonal relational contexts and psychological health are small, and thus highlight the need for future research to more thoroughly examine these relations. Altogether, the findings from the current studies support the reliability, validity, and utility of the MIRS, and its various interpersonal relations and relational contexts. Ó 2018 Hogrefe Publishing
By accounting for all facets of interpersonal relations, researchers may further assess how unique relations and contexts influence psychological health. Specifically, by examining specific relational contexts, researchers can determine which contexts are most important for potential interventions. Future researchers interested in using the MIRS should use the three factor-three method factor model with covariates to capture all relational processes and contexts, while capturing both the variances related to both interpersonal processes and contexts captured in the scale items. Moreover, because interpersonal relations vary by race and gender, future researchers should consider including these demographic characteristics as covariates when using the MIRS. Although the study provides support for the hypothesized factor structure of the scale, there are several limitations. First, tests of reliability and validity are limited to the cross-sectional and single sample analyses included in the current study. Future studies should seek to expand upon the current study by examining the longitudinal psychometric properties of the MIRS (e.g., predictive validity). Further, tests of reliability and validity should be examined among additional samples. Moreover, while the current study examined a sample that was diverse in age, race, and ethnicity, results may not generalize well to underserved populations. Thus, future research should extend this research by examining the factor structure of the scale in different communities. Additionally, while the MIRS measures Facebook as a common interpersonal relational context among individuals within the United States, Facebook is not a common interpersonal context across all cultures. Thus, the utility of the MIRS in other cultures still needs to be examined. To address this limitation, researchers outside of the United States may consider modifying the MIRS to include common social media outlets in their region or removing the questions associated with Facebook. Despite these limitations, the current studies provide a valuable contribution to the field of psychology. The addition of a new, multicontextual scale will allow researchers to answer novel questions about modern interpersonal relations and differences in risk across methods. Electronic Supplementary Material The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/ 1015-5759/a000497 ESM 1. Final CFA input and output files (.pdf) This document includes the input information for Mplus statistical software for the final CFA model. The document also includes the output information, just below the corresponding input information.
References Banjo, O., Hu, Y., & Sundar, S. (2008). Cell phone usage and social interaction with proximate others: Ringing in a theoretical model. The Open Communication Journal, 2, 127–135. Beauducel, A., & Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Structural Equation Modeling, 13, 186–203. Beck, A. T., Ward, C. H., Mendelson, M., Mock, J., & Erbaugh, J. (1961). An inventory for measuring depression. Archives of General Psychiatry, 4, 561–571. https://doi.org/10.1001/ archpsyc.1961.01710120031004 Birnie, S. A., & Horvath, P. (2002). Psychological predictors of Internet social communication. Journal of Computer-Mediated Communication, 7, 13–27. https://doi.org/10.1111/j.10836101.2002.tb00154.x Brooke, S. L. (1999). Assessment of interpersonal relations: A test review. Measurement and Evaluation in Counseling and Development, 32, 105–110. Brown, T. (2006). Confirmatory factor analysis for applied research. New York, NY: Guilford Press. Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum. Coroiu, A., Meyer, A., Gomez-Garibello, C. A., Brähler, E., Hessel, A., & Körner, A. (2015). Brief form of the Interpersonal Competence Questionnaire (ICQ-15). European Journal of Psychological Assessment, 31, 272–279. https://doi.org/10.1027/ 1015-5759/a000234 Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction with Life Scale. Journal of Personality Assessment, 49, 71–75. https://doi.org/10.1207/s15327752jpa4901_13 Eid, M., Geiser, C., Koch, T., & Heene, M. (2017). Anomalous results in g-factor models: Explanations and alternatives. Psychological Methods, 22(3), 541–562. https://doi.org/10.1037/ met0000083 Eid, M., Lischetzke, T., Nussbeck, F. W., & Trierweiler, L. I. (2003). Separating trait effects from trait-specific method effects in multitrait-multimethod models: A multiple-indicator CT-C(M-1) model. Psychological Methods, 8, 38–60. Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychological Methods, 16, 1–16. https://doi.org/10.1037/a0022640 Ferraro, F. R., Holfeld, B., Frankl, S., Frye, N., & Halvorson, N. (2015). Texting/iPod dependence, executive function and sleep quality in college students. Computers and Human Behavior, 49, 44–49. https://doi.org/10.1016/j.chb.2015.02.043 Hall, J. A., Larson, K. A., & Watts, A. (2011). Satisfying friendship maintenance expectations: The role of friendship standards and biological sex. Human Communication Research, 37, 529– 552. Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6, 1–55. https://doi.org/10.1080/10705519909540118 Jacobson, N. C., & Newman, M. G. (2016). Perceptions of close and group relationships mediate the relationship between anxiety and depression over a decade later. Depression and Anxiety, 33, 66–74. https://doi.org/10.1002/da.22402 Kraut, R., Kiesler, S., Boneva, B., Cummings, J., Helgeson, V., & Crawford, A. (2002). Internet paradox revisited. Journal of Social Issues, 58, 49–74. https://doi.org/10.1111/1540=4560.00248 Kuiper, J. S., Zuidersma, M., Voshaar, R. C. O., Zuidema, S. U., van den Heuvel, E. R., Stolk, R. P., & Smidt, N. (2015). Social relationships and risk of dementia: A systematic review and
meta-analysis of longitudinal cohort studies. Ageing Research Reviews, 22, 39–57. Measelle, J. R., Stice, E., & Hogansen, J. M. (2006). Developmental trajectories of co-occurring depressive, eating, antisocial, and substance abuse problems in female adolescents. Journal of Abnormal Psychology, 115, 524–538. https://doi.org/10.1037/ 0021-843X.115.3.524 Muthén, L. K., & Muthén, B. O. (1998–2011). Mplus user’s guide (6th ed.). Los Angeles, CA: Muthén & Muthén. Muthén, B. O., Muthén, L. K., & Asparouhov, T. (2015). Estimator choices with categorical outcomes. Retrieved from http://www. statmodel.com/download/EstimatorChoices.pdf Nelson, L. J. (2013). Going it alone: Comparing subtypes of withdrawal on indices of adjustment and maladjustment in emerging adulthood. Social Development, 22, 533–538. Nelson, L. J., Coyne, S. M., Howard, E., & Clifford, B. N. (2016). Withdrawing to a virtual world: Associations between subtypes of withdrawal, media use, and maladjustment in emerging adults. Developmental Psychology, 52, 933–942. https://doi. org/10.1037/dev0000128 Nguyen, A. W. (2017). Variations in social network type membership among older African Americans, Caribbean blacks, and non-Hispanic whites. The Journals of Gerontology: Series B, 72, 716–726. https://doi.org/10.1093/geronb/gbx016 Nie, N. H., Hillygus, D. S., & Erbring, L. (2002). Internet use, interpersonal relations, and sociability: A time diary study. In B. Wellman & C. Haythornthwaite (Eds.), The Internet in everyday life (pp. 215–243). Malden, MA: Blackwell. Nye, C. D., & Drasgow, F. (2011). Assessing goodness of fit: Simple rules of thumb simply do not work. Organizational Research Methods, 14, 548–570. https://doi.org/10.1177/ 1094428110368562 Parker, S., & de Vries, B. (1993). Patterns of friendship for women and men in same and cross-sex relationships. Journal of Social and Personal Relationships, 10, 617–623. Peterson, C., Park, N., & Seligman, M. E. P. (2005). Orientations to happiness and life satisfaction: The full life versus the empty life. Journal of Happiness Studies, 6, 25–41. Pew Research Center. (2012). Cell phone activities 2012: The Internet & American life project. Washington, DC: Duggan, M. & Rainie, L. Pew Research Center. (2016). Social media update 2016. Washington, DC: Greenwood, S., Perrin, A. & Duggan, M.
Rubin, K. H., Bukowski, W., & Parker, J. G. (2006). Peer interactions, relationships, and groups. In N. Eisenberg (Ed.), Handbook of child psychology: Social, emotional, and personality development (pp. 571–645). New York, NY: Wiley. Santini, Z. I., Koyanagi, A., Tyrovolas, S., Mason, C., & Haro, J. M. (2015). The association between social relationships and depression: A systematic review. Journal of Affective Disorders, 175, 53–65. https://doi.org/10.1016/j.jad.2014.12.049 Sheldon, P. (2008). The relationship between unwillingness-tocommunicate and students’ Facebook use. Journal of Media Psychology, 20, 67–75. https://doi.org/10.1027/18641105.20.2.67 Simone, M., & Lockhart, G. (2016). Two distinct mediated pathways to disordered eating in response to weight stigmatization and their application to prevention programs. Journal of American College Health, 64, 520–526. https://doi.org/ 10.1080/07448481.2016.1188106 Taylor, S. E., Way, B. M., Welch, W. T., Hilmert, C. J., Lehman, B. J., & Eisenberger, N. I. (2006). Early family environment, current adversity, and serotonin transporter promoter polymorphism, and depressive symptomology. Biological Psychiatry, 60, 671– 676. https://doi.org/10/1016/j.biopsych.2006.04.019 US Census Bureau. (2015). Selected housing characteristics, 2011–2015 American community survey 5-year estimates. Retrieved from https://factfinder.census.gov/faces/nav/ jsf/pages/searchresults.xhtml?refresh=t Wiggins, J. S., Trapnell, P., & Phillips, N. (1988). Psychometric and geometric characteristics of the revised Interpersonal Adjective Scales (IAS-R). Multivariate Behavioral Research, 23, 517–530. Received June 15, 2017 Revision received May 23, 2018 Accepted May 23, 2018 Published online September 18, 2018 EJPA Section/Category Clinical Psychology Melissa Simone Department of Psychology Utah State University 2810 Old Main Hill Logan, UT, 84322 USA m.simone@aggiemail.usu.edu
Appendix

Table A1. 14-Item Multicontextual Interpersonal Relations Scale

Prompt: Please indicate how likely it is that you would do the following.
Response categories: Very Unlikely (0), Unlikely (1), Likely (2), Very Likely (3)

Items
1. Hang out with a friend if they invite me over
2. Respond to text messages from friends
3. Answer a phone call from a friend
4. Respond to a Facebook message
5. Respond to Facebook comments from friends
6. Initiate plans with friends
7. Follow through with plans you made
8. Initiate text message conversations with friends
9. Initiate Facebook conversations with friends
10. Call a friend to talk
11. Hang out with a friend in your own home
12. Hang out with friends at a social outing
13. Go to a party you were invited to
14. Hang out with friends in a public space
Original Article
Does Speededness in Collecting Reasoning Data Lead to a Speed Factor? Florian Zeller, Siegbert Reiß, and Karl Schweizer Department of Psychology, Goethe University Frankfurt, Frankfurt am Main, Germany
Abstract: The consequences of speeded testing for the structure and validity of a numerical reasoning scale (NRS) were investigated. Confirmatory factor models including an additional factor for representing working speed and models without such a representation were employed for investigating reasoning data collected in speeded paper-and-pencil testing and in only slightly speeded testing. For achieving a complete account of the data, the models also accounted for the item-position effect. The results revealed the factor representing working speed as essential for achieving a good fit in data originating from speeded testing. The reasoning factors based on data due to speeded and slightly speeded testing showed a high correlation among each other. The factor representing working speed was independent of the other factors derived from reasoning data but related to an external score representing processing speed. Keywords: confirmatory factor analysis, item-position effect, speeded testing, non-speeded testing, working speed
Speeded testing means that the participants are only allowed a limited time span for completing the items of the test; this time span is insufficient for at least some or even all participants (Lu & Sireci, 2007). The effect of a time limit has been a major topic of assessment research in the past (Gulliksen, 1950). For a time, the research focused on the difference between speed and power testing (Lord & Novick, 1968). Later on, the focus was on speededness in general since it turned out that the effect of the time limit varied as a function of the participants’ working speed and characteristics of the items (Oshima, 1994). The more recent research has concentrated on the consequences for test fairness (van der Linden, 2011), for parameter estimation (Bolt, Cohen, & Wolack, 2002; Goegebeur, De Boeck, Wollack, & Cohen, 2008), for validity (Estrada, Román, Abad, & Colom, 2017; Lu & Sireci, 2007) and the possibilities to neutralize the effect (van der Linden & Xiong, 2013). While the recent research has mostly been conducted within the item response theory (IRT) approach, the other major approach, the factor-analytic (FA) one, almost completely ignored this topic although speededness is affecting a property that has so far been considered as its genuine field of research: the structural validity of scales that is suggested to be impaired (Lu & Sireci, 2007). Using FA, there appears to be only one recent study providing evidence of such impairment by relating latent variables reflecting
speeded and only slightly speeded testing and mental speed to each other (Wilhelm & Schulze, 2002). Such impairment gives rise to the expectation that the underlying structure is no longer one-dimensional but two-dimensional. The major aim of the research work reported in this paper was to find out whether speededness of a scale for the assessment of numerical reasoning (NRS; Horn, 1983) could be identified by means of a latent variable (= factor) as part of a confirmatory factor model that also included the latent variable representing the core of numerical reasoning, using the FA approach. In order to accomplish this aim, the standard model of measurement was replaced by another one that was less likely to neutralize additional effects. Another aim was to investigate the nature of the additional latent variable in order to ensure that this additional latent variable did not represent reasoning but speededness.
On the Representation of the Effects of Latent Sources on Responding The tendency of the standard model of measurement of confirmatory factor analysis (CFA) to neutralize minor effects characterizing data besides the main effect is presumably a major reason for neglecting FA in investigating speededness. Neutralization occurs in estimating the free
factor loadings of the congeneric model (Jöreskog, 1971) since implicitly this model is accommodated to the other effects so that they no longer become apparent as deviations or sources of model misfit. Other models of measurement that can be incorporated in CFA (Graham, 2006) exclude this kind of adaptability to data. This prevents the neutralization of effects due to other latent sources. The replacement of the estimation of factor loadings by their fixation to predefined values implicitly means the fixation of the discriminability of the items (Lucke, 2005). Such fixation also characterizes the oneparameter IRT model, that is, the Rasch model (Rasch, 1980), the tau-equivalent model (Lord & Novick, 1968), the early growth curve model (McArdle, 1986), and the fixed-links model (Schweizer, 2008). Fixations for representing specific effects are to be considered as hypotheses on how the latent source contributes to the individual items. If a hypothesis is correct, the fixed factor loadings lead to the same degree of model fit as the corresponding free factor loadings. However, the fixation of factor loadings to predefined values also includes a downside: Every additional effect that is not represented by the model of measurement is likely to lead to model misfit. As a consequence, it is necessary to explicitly represent each major effect by an own component in order to achieve a good model fit that means an own latent variable and own factor loadings.
On the Representation of Other Effects The necessity to represent each major effect requires the consideration of other possible effects that may be characteristic of the investigated scale or may be expected due to external sources. There is the item-position effect repeatedly observed in reasoning scales (e.g., Birney, Beckmann, Beckmann, & Double, 2017; Debeer, & Janssen, 2013; Verguts & De Boeck, 2000) and also in other scales (Hartig, Hölzel, & Moosbrugger, 2007). The term item-position effect describes an increasing dependency among successively completed items and is probably a misnomer since this effect appears to reflect a kind of learning while completing similar items (Carlstedt, Gustafsson, & Ullstadius, 2000; Embretson, 1991; Verguts & De Boeck, 2000). In order to avoid the association of this effect with method bias, we prefer to refer to it as learning effect. Increasing functions serve well for the representation of the learning effect. However, since the model of measurement including factor loadings is subsequently transformed into the model of the covariance matrix and used for parameter and fit estimation, there is a conflict between the representations of the speed effect and the learning effect: A strong speed effect excludes a strong learning effect. As a consequence, the representation of the learning effect has to be reduced as the representation of the speed effect unfolds.
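One way to make "increasing functions" concrete is to fix the loadings on the learning (item-position) factor to values that grow with item position. The linear version below is a common choice but is only an illustration; the specific function used in the reported models is not given in this excerpt, and the 20-item length is assumed from the analyzed portion of the NRS.

```python
import numpy as np

n_items = 20

# Linearly increasing fixed loadings for a learning / item-position factor,
# scaled to the unit interval; other increasing functions (e.g., square root,
# logarithm) appear in this literature as well.
positions = np.arange(1, n_items + 1)
learning_loadings = (positions - 1) / (n_items - 1)
print(np.round(learning_loadings, 2))
```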
On Scale Adaptation and Estimation On the Representation of Speededness The effect of speediness is apparent in not reached items and also in random responses (Oshima, 1994). Recent research even reveals that there is an increasing tendency to check response options at random before the time limit is reached instead of staying with omissions (Must & Must, 2013). This behavior can lead to a larger score than otherwise because a random response can be correct due to chance. As a consequence, the not reached items may not provide the perfect basis for the fixation of factor loadings. It is convenient to exclude all items that show no omissions from the search for values that are suitable for the fixation of factor loadings on the latent variable representing speededness. This search can make use of a reported attempt to investigate the effect of a time limit that finds its expression in omissions by means of a CFA model (Schweizer & Ren, 2013). This model comprises an additional component for capturing the effect in order to prevent the impairment of model fit due to the effect. The course of the impact of the effect is represented by means of the logistic function. It creates a curve with a turning point of steepness arranged near to the item showing half of the maximum number of omissions of an item. Ó 2018 Hogrefe Publishing
On Scale Adaptation and Estimation

Since CFA requires interval-scaled and normally distributed data, whereas reasoning data are usually dichotomous and follow the binomial distribution, adaptation of the data is necessary and may extend to model estimation. It can be achieved in different ways. For example, tetrachoric correlations can be used as input to CFA, item factor analyses can be conducted, or diagonally weighted least squares (DWLS) estimation can be employed. A further way of conducting CFA, referred to as the threshold-free approach, is available that accomplishes the adaptation without threshold estimation (Schweizer, 2013; Schweizer, Ren, & Wang, 2015). Since the threshold estimation characterizing some of the other approaches demands very large samples to avoid input matrices that are not positive definite when the data originate from very easy or very difficult items, the present study employs the threshold-free approach. It starts with the computation of probability-based covariances (McDonald & Ahlawat, 1974; Schweizer, Ren, & Wang, 2015) and includes a link transformation of the variances and covariances of the model. This link transformation is implemented by computing item-specific weights reflecting the probability of a correct response in completing the corresponding item (see Appendix). These weights serve as
multipliers to the factor loadings. The results section includes a comparison between this approach and other approaches.
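To make the quantities concrete, here is a rough sketch in R of the probability-based covariances and the weight function from the Appendix, as we read Schweizer, Ren, and Wang (2015); the response matrix items (persons × items, coded 0/1) is a placeholder, and this is not the authors' LISREL setup.

```r
# items: hypothetical N x p matrix of dichotomous (0/1) responses
p_correct <- colMeans(items)          # probability of a correct response per item
n <- nrow(items)

# probability-based covariance of two binary items: p_ij - p_i * p_j,
# which equals the ML (divide-by-n) covariance of the 0/1 variables
S <- cov(items) * (n - 1) / n

# weight function (Appendix, E): w_i = sqrt(Pr(X_i = 1) * (1 - Pr(X_i = 1)))
w <- sqrt(p_correct * (1 - p_correct))

# in the model of the covariance matrix, each fixed loading is multiplied
# by its item-specific weight before parameter and fit estimation
```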
The Present Study

It is investigated whether speededness in the administration of the NRS scale transforms the underlying structure by constituting a second latent dimension and, thus, impairs the structural validity of this scale. A CFA model of measurement including a speed latent variable besides the latent variable representing the construct of interest serves this purpose. Furthermore, the convergent validity and discriminant validity of the latent variables of the model are investigated by relating them to corresponding latent variables obtained from only slightly speeded reasoning data and an independent measure of working speed.
Method

Sample

The sample consisted of 287 university students who received either course credit or a financial reward for participating in the study. Thirteen participants were excluded because of missing speed data or insufficient performance (more incorrect than correct speed responses). The average age was 22.8 years (SD = 4.2 years). There were twice as many females as males.
Measures

Numerical Reasoning Scale (NRS)
The fourth scale of Horn's (1983) LPS intelligence test battery was applied as a measure of the participants' reasoning ability in speeded testing. The NRS consists of 40 items. However, the first 20 items were so easy that virtually every participant was able to solve them. Therefore, data originating from these items were excluded from the statistical investigations. The time limit to complete all 40 items was 8 min. Participants were not instructed to be especially fast.

Raven's Advanced Progressive Matrices (APM)
A shortened version of Raven's APM (Raven, Raven, & Court, 1997), including 18 items, was used as a measure of fluid intelligence in slightly speeded testing. These 18 items covered the same range of item difficulties as the original version (first set constructed by Mackintosh & Bennett, 2005). Each item consisted of a 3 × 3 matrix of geometric forms, where one of the nine elements was missing. Participants were asked to select one out of eight alternatives to complete the matrix. The time limit was 20 min.
Sustained Attention Test (SA)
To obtain a measure of working speed, the Sustained Attention Test (SA; Ren, Schweizer, & Xu, 2013) was applied. One hundred digits (0–9) were presented in five rows on a computer screen. Each digit was combined with one to four dashes, either above or underneath the digit. Participants were asked to click on each "9" – using the computer mouse – that was combined with exactly two dashes, but only if it was not preceded by a "5." Whereas the measure of sustained attention was derived from the correct responses, the measure of working speed was obtained by averaging the time spans registered per computer mouse click.
Procedure

Participants were tested in groups of two to four. The tests were part of a testing session that included several other cognitive tasks and took about 2.5–3 hr. The tasks described above were applied in the following order: APM, NRS, and SA.
Models

First, there were the CFA models for the NRS data: Besides two one-factor models representing reasoning (one with freely estimated and one with fixed factor loadings), there were 3 two-factor models additionally including either a linear, quadratic, or logarithmic representation of the learning effect, and 6 three-factor models additionally including the representation of speed and assuming the termination of the learning effect as either interruption or fading out. Second, there were SEM (structural equation modeling) models for investigating convergent and discriminant validity. One type of model (SEM model 1) incorporated the best-fitting CFA models for speeded data plus the CFA model for slightly speeded data without considering learning and the speed latent variable based on SA data. Since there was only one SA working-speed score, the factor loading on the SA speed latent variable was fixed to one and the error component to zero. In the second type of SEM model (SEM model 2), the CFA model for slightly speeded data also included the learning latent variable. The Appendix provides all definitions and functions used for computing factor loadings.
Statistical Analysis

Statistical analyses were conducted by means of LISREL (Jöreskog & Sörbom, 2006), using the maximum likelihood estimation (MLE) method. Several model fit indices were computed in order to enable the evaluation of each model and to compare different models: χ², root mean square error of approximation (RMSEA; ≤ .06), standardized root-mean-square residual (SRMR; ≤ .08), comparative fit index (CFI; ≥ .95), goodness-of-fit index (GFI; ≥ .95), and Akaike information criterion (AIC). The values provided in parentheses served as cutoffs indicating good model fit (see DiStefano, 2016). Additionally, there is the convention to consider a CFI between .90 and .95 as acceptable but not yet good fit (Bentler, 1990; Hu & Bentler, 1999). However, these conventions do not fully apply to binary data. Such data pose a special challenge to the CFI concept since this statistic compares the specified model with the independence model regarding the covariances and variances. In binary data, the covariances of extremely easy and extremely difficult items are virtually zero. Their reproduction using the specified model can also be expected to be close to zero. Therefore, the reproduced covariances are similar to the zero covariances assumed by the independence model, and the correspondence of the results for these two models leads to the indication of bad model fit. Hence, an acceptable CFI statistic is usually considered an agreeable result in investigating binary data. Furthermore, for model comparison, the difference in CFI results was considered, as a difference of more than .01 was found to reflect a substantial difference in model fit (Cheung & Rensvold, 2002), as was the AIC. Besides that, scaled variance parameters of the latent variables were computed (Schweizer, 2011). This enabled comparisons among the estimates of the variance parameters of the latent variables and, therefore, interpretations regarding the importance of a latent variable.
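For readers who want to reproduce this kind of model evaluation outside LISREL, the same indices can be requested, for instance, in lavaan; the model string model and data set mydata are placeholders, and this is merely an illustration of the indices listed above, not the authors' code.

```r
library(lavaan)

# fit a CFA model with maximum likelihood estimation
fit <- cfa(model, data = mydata, estimator = "ML")

# chi-square, df, RMSEA, SRMR, CFI, GFI, and AIC in one call
fitMeasures(fit, c("chisq", "df", "rmsea", "srmr", "cfi", "gfi", "aic"))
```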
Results

The mean of the sum scores of correct responses was 31.3 (SD = 3.3) for the NRS and 11.8 (SD = 2.3) for the APM. The correlation between the sum scores of the two reasoning scales, including omissions as zeros, was .27 (p < .01). The mean of the SA speed scores was 1,299 ms (SD = 274 ms). The SA speed scores showed outliers that were reduced to the one percent boundary of the distribution. Three APM and 15 NRS items were not completed by all participants, and about 90% and 30% of the participants, respectively, attempted the last item.
The first step in analyzing the NRS data was to examine whether there was a learning effect. For this purpose, one-factor models with freely estimated factor loadings and fixed factor loadings were compared with two-factor models including either a linear, a quadratic, or a logarithmic representation of the learning effect, referred to as learning models. The fit results are included in Table 1. The two-factor model including a quadratic representation of the learning effect showed the best model fit. The fit statistics for this model indicated acceptable fit according to RMSEA and SRMR but misfit according to CFI and GFI. Especially because of the CFI results, it was concluded that the one- and two-factor models were insufficient for representing the data. The scaling of the estimates of the variance parameters (Schweizer, 2011) using one as scale-length constant revealed values of 0.35 (t = 6.69, p < .01) for the ability-specific and of 0.47 (t = 8.43, p < .01) for the learning-specific latent variables of the two-factor model with quadratic increase. The replacement of the quadratic increase by the linear increase led to values of 0.27 (t = 4.77, p < .01) and 0.62 (t = 8.17, p < .01) and by the logarithmic increase to 0.09 (t = 1.34, ns) and 0.63 (t = 7.18, p < .01). In the one-factor model with fixed factor loadings, the value of the variance parameter was 0.53 (t = 9.20, p < .01). After the replacement of the fixed by free factor loadings, the value was 0.67 (t = 10.10, p < .01). The summation of the scaled estimates of the latent variances revealed values of .89, .82, and .72 for the two-factor models with linear, quadratic, and logarithmic increases in corresponding order. Next, three-factor models with the two different ways of reducing the learning effect because of the replacement by the speed effect were investigated. We compared the combinations of the linear, quadratic, and logarithmic representations of the learning effect and two ways of reduction (interruption and fading out). These models including the representation of speed are referred to as speed models. The fit results are provided in Table 2. Virtually all fit statistics indicated better model fit for fading out in the models with linear and quadratic representations of the learning effect, whereas in the other models, the two ways of reduction did almost equally well. Three out of the six models showed CFIs that could be accepted for investigating binary data. The best model fit was observed for the three-factor model with the quadratic representation of the learning effect combined with fading out. The CFI differences to the other models identified this model as the substantially better fitting model. After scaling, the total amount of variance according to the variance parameters at the latent level was 0.984 for the model including a quadratic increase and fading out and 0.933 for the model with a linear increase and fading out. All variance parameters were significant according to the Wald test.
Table 1. Fit statistics observed in investigating NRS data by means of one-factor models with free and fixed factor loadings and two-factor models including a learning-specific factor with factor loadings according to three functions besides the ability-specific factor (N = 274)

| Model | χ² | df | RMSEA | SRMR | CFI | GFI | AIC |
|---|---|---|---|---|---|---|---|
| One-factor (free) | 548.5 | 170 | .090 | .082 | .836 | .833 | 628.5 |
| One-factor (fix) | 1,035.0 | 189 | .128 | .142 | .697 | .725 | 1,077.0 |
| Two-factor with linear increase in the second factor | 548.8 | 188 | .084 | .107 | .835 | .833 | 592.8 |
| Two-factor with quadratic increase in the second factor | 495.5 | 188 | .077 | .097 | .853 | .846 | 539.5 |
| Two-factor with logarithmic increase in the second factor | 710.7 | 188 | .101 | .118 | .786 | .793 | 754.7 |
Table 2. Fit statistics observed in investigating NRS data by means of three-factor models with either fading out (i* = 15) or interruption (i* = 14) of the representation of the learning effect (N = 274)

| Second latent variable: Type of increase | Type of ceasing | χ² | df | RMSEA | SRMR | CFI | GFI | AIC |
|---|---|---|---|---|---|---|---|---|
| Linear increase | Fading out | 343.4 | 187 | .055 | .085 | .912 | .888 | 389.4 |
| Linear increase | Interruption | 354.9 | 187 | .057 | .078 | .911 | .885 | 400.9 |
| Quadratic increase | Fading out | 313.8 | 187 | .050 | .080 | .924 | .897 | 359.8 |
| Quadratic increase | Interruption | 431.5 | 187 | .069 | .088 | .884 | .864 | 475.5 |
| Logarithmic increase | Fading out | 386.4 | 187 | .062 | .092 | .896 | .876 | 432.4 |
| Logarithmic increase | Interruption | 386.7 | 187 | .063 | .093 | .897 | .876 | 432.7 |
Furthermore, SEM models with linear and quadratic increases and fading out regarding the NRS (L-SEM and Q-SEM) were investigated to explore the similarities and differences of the latent variables based on speeded and slightly speeded data and to find out about the nature of the speed latent variable. The latent variables based on the speeded NRS data served as endogenous variables and the latent variables based on the slightly speeded APM data and SA data as exogenous variables. The first and second SEM models excluded the learning latent variable, whereas the third and fourth SEM models included it. The latter models were expected to show the better degree of model fit. The fit results are presented in Table 3. Better fit results were indicated for the models including the learning latent variable by virtually all fit statistics. The third and fourth models showed an acceptable degree of model fit (CFIs ≥ .90), whereas the first and second ones did not (CFIs < .90). The model including the quadratic increase (Q-SEM model 2) showed the best fit (e.g., CFI = .906) but fell short of the .01 boundary when compared with the second model. The structures of the links between the latent variables of the two types of models are illustrated in Figure 1. In Q-SEM model 1 (see left-hand side), the ability-specific latent variable derived from the APM data showed a significant link to the ability-specific but not learning-specific
latent variables derived from the NRS data. The speed variable derived from the measure of working speed, that is, the SA task, showed a significant link to the speed-specific latent variable derived from the NRS data. It was negative, as the speed variable reflected a lack of working speed. In Q-SEM model 2, which additionally considered the learning-specific latent variable derived from the APM data (see right-hand side), all expected latent links reached significance, whereas none of the other possible links proved to be substantial. This suggested that the structure assumed for the speeded data was correctly identified: The two ability-specific latent variables were exclusively related to each other; the learning-specific latent variables showed the same characteristic; and the speed-specific latent variable proved to be exclusively related to the speed variable derived from the measure of working speed. Finally, the results based on the threshold-free approach were compared with the results for alternative approaches, concentrating on the three-factor models with fading out (see Table 2). Since tetrachoric correlations have to serve as input to CFA when there are fewer than six response categories (Finney, DiStefano, & Kopp, 2016), such correlations were computed but led to a correlation matrix that was not positive definite. The ridge option modified the matrix by replacing the ones of the main diagonal by twos. Since it was also not possible to obtain asymptotic variances and covariances, DWLS or
Table 3. Fit statistics for the SEM models including representations of the learning effect (linear increase: L-SEM; quadratic increase: Q-SEM) for investigating convergent and discriminant validity using additionally APM and SA data (N = 274)

| Model | χ² | df | RMSEA | SRMR | CFI | GFI | AIC |
|---|---|---|---|---|---|---|---|
| L-SEM model 1¹ | 1,056.3 | 734 | .040 | .075 | .890 | .834 | 1,148.3 |
| Q-SEM model 1¹ | 1,023.6 | 734 | .038 | .074 | .897 | .839 | 1,115.6 |
| L-SEM model 2² | 1,032.6 | 732 | .039 | .070 | .900 | .838 | 1,128.6 |
| Q-SEM model 2² | 1,001.1 | 732 | .037 | .070 | .906 | .842 | 1,097.1 |

Notes. ¹Only a reasoning latent variable is derived from APM data. ²Reasoning and learning latent variables are derived from APM data.
Figure 1. Illustrations of the latent structures of the combined models with one latent variable for the data obtained by slightly speeded testing (left part) and with two latent variables for the data obtained by slightly speeded testing (right part). APM = Raven’s Advanced Progressive Matrices; SA = Sustained Attention Test.
WLSMV could not be applied, and customary ML was used instead. It led to an almost acceptable model fit for the model with linear increase, χ²(187) = 333.9, RMSEA = .054, SRMR = .088, CFI = .889, GFI = .891, and with quadratic increase, χ²(187) = 313.2, RMSEA = .050, SRMR = .086, CFI = .899, GFI = .897. Ignoring the scale correctness and accepting a ridge option-based modification enabled robust estimation leading to somewhat better CFI results (linear increase: CFI = .903; quadratic increase: CFI = .948). The results for the threshold-free approach were comparable to these results and required neither the manipulation of the data by the ridge option nor the violation of assumptions.
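For comparison, the categorical-data route that could not be taken here is typically specified by declaring the items as ordered, so that thresholds and tetrachoric correlations are estimated and a robust DWLS estimator (WLSMV) is used. The following lavaan call is only a sketch of that alternative specification; model, mydata, and the item names are placeholders, not the authors' analysis.

```r
library(lavaan)

# declare all indicators as ordered so that lavaan estimates thresholds and
# tetrachoric correlations and applies robust DWLS (WLSMV)
fit_cat <- cfa(model, data = mydata,
               ordered = names(mydata),
               estimator = "WLSMV")

fitMeasures(fit_cat, c("chisq.scaled", "df.scaled", "rmsea.scaled", "cfi.scaled"))
```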
Discussion

Starting from the question of whether it is possible to identify speededness as an independent source of responding by means of CFA, data originating from speeded and slightly speeded testing were investigated. The impact of the speed-specific latent variable representing speededness on the fit statistics reveals that its modeling is even essential for achieving good model fit. The results replicate the outcome of the first attempt to represent speededness by means of a latent variable in CFA (Schweizer & Ren, 2013) and confirm that reasoning scores observed in speeded testing measure speed besides reasoning. The
results are also in line with the results of a recent study investigating the relationships of reasoning and speed latent variables with working memory (Ren, Wang, Sun, Deng, & Schweizer, 2018). Furthermore, there is agreement with the results by Wilhelm and Schulze (2002), suggesting that speeded scores combine contributions from the construct and also mental speed. The investigation of convergent and discriminant validity (Campbell & Fiske, 1959) confirms this suggestion. The results establish convergent and discriminant validity for the speed-specific latent variable as the representation of speed and make clear that this additional latent variable is not just another representation of reasoning. The small size of the regression weight regarding the speed link presumably underestimates the relationship, since there was only one indicator variable and, therefore, no switch from the manifest to the latent level. Regarding generalization, we think that an effect due to a time limit may be observable in all kinds of achievement data. However, the exact effect may not be constant because the needs of different samples may differ. The elimination of the effect due to a time limit requires the adaptation of the time limit to the participants' needs (van der Linden & Xiong, 2013). Finally, we would like to express our concern that speededness may even have influenced the elaboration of the construct of reasoning and related constructs, since elaboration is usually based on empirical evidence obtained in scale applications. In sum, there is virtually always some chance that a
time limit allows processing speed to influence performance in such a way that the validity of scores, the factorial structure of data, and the possibility of reaching good model fit are impaired.
References Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246. https://doi.org/ 10.1037/0033-2909.107.2.238 Birney, D. P., Beckmann, J. F., Beckmann, N., & Double, K. S. (2017). Beyond the intellect: Complexity and learning trajectories in Raven’s Progressive Matrices depend on self-regulatory processes and conative dispositions. Intelligence, 61, 63–77. https://doi.org/10.1016/j.intell.2017.01.005 Bolt, D. M., Cohen, A. S., & Wolack, J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39, 331–348. https://doi.org/ 10.1111/j.1745-3984.2002.tb01146.x Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. https://doi.org/10.1037/h0046016 Carlstedt, B., Gustafsson, J.-E., & Ullstadius, E. (2000). Item sequencing effects on the measurement of fluid intelligence. Intelligence, 28, 145–160. https://doi.org/10.1016/S0160-2896 (00)00034-9 Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-offit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. https://doi.org/10.1207/ S15328007SEM0902_5 Debeer, D., & Janssen, R. (2013). Modeling item-position effects within an IRT framework. Journal of Educational Measurement, 50, 164–185. https://doi.org/10.1111/jedm.12009 DiStefano, C. (2016). Examining fit with structural equation models. In K. Schweizer & C. DiStefano (Eds.), Principles and methods of test construction. Standards and recent advances (pp. 166–193). Göttingen, Germany: Hogrefe. Embretson, S. E. (1991). A multidimensional latent trait model for measuring learning and change. Psychometrika, 56, 495–515. https://doi.org/10.1007/BF02294487 Estrada, E., Román, F. J., Abad, F. J., & Colom, R. (2017). Separating power and speed components of standardized intelligence measures. Intelligence, 61, 159–168. https://doi. org/10.1016/j.intell.2017.02.002 Finney, S. J., DiStefano, C., & Kopp, J. P. (2016). Overview on estimation methods and preconditions for their application with structural equation modeling. In K. Schweizer & C. DiStefano (Eds.), Principles and methods of test construction. Standards and recent advances (pp. 166–193). Göttingen, Germany: Hogrefe. Goegebeur, Y., De Boeck, P., Wollack, J. A., & Cohen, A. S. (2008). A speeded item response model with gradual process change. Psychometrika, 73, 65–87. https://doi.org/10.1007/s11336007-9031-2 Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability. What they are and how to use them. Educational and Psychological Measurement, 66, 930– 944. https://doi.org/10.1177/0013164406288165 Gulliksen, H. (1950). Speed versus power tests. In H. Gulliksen (Ed.), Theory of mental tests (pp. 230–244). New York, NY: John Wiley & Sons.
Hartig, J., Hölzel, B., & Moosbrugger, H. (2007). A confirmatory analysis of item reliability trends (CAIRT): Differentiating true score and error variance in the analysis of item context effects. Multivariate Behavioral Research, 42, 157–183. https://doi.org/ 10.1080/00273170701341266 Horn, W. (1983). Leistungsprüfsystem (LPS) [Performance testing system] (2nd ed.). Göttingen, Germany: Hogrefe. Hu, L.-T., & Bentler, P. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. https://doi. org/10.1080/10705519909540118 Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109–133. https://doi.org/10.1007/ BF02291393 Jöreskog, K. G., & Sörbom, D. (2006). LISREL 8.80. Lincolnwood, IL: Scientific Software International. Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley. Lu, Y., & Sireci, S. G. (2007). Validity issues in test speededness. Educational Measurement, 26, 29–37. https://doi.org/10.1111/ j.1745-3992.2007.00106.x Lucke, J. F. (2005). The α and the ω of Congeneric Test Theory: An extension of reliability and internal consistency to heterogeneous tests. Applied Psychological Measurement, 29, 65–81. https://doi.org/10.1177/0146621604270882 Mackintosh, N. J., & Bennett, E. S. (2005). What do Raven’s Matrices measure? An analysis in terms of sex differences. Intelligence, 33, 663–674. https://doi.org/10.1016/j.intell.2005. 03.004 McArdle, J. J. (1986). Latent variable growth within behavior genetic models. Behavior Genetics, 16, 163–200. https://doi. org/10.1007/BF01065485 McDonald, R. P., & Ahlawat, K. S. (1974). Difficulty factors in binary data. British Journal of Mathematical and Statistical Psychology, 27, 82–99. https://doi.org/10.1111/j.2044-8317. 1974.tb00530.x Must, O., & Must, A. (2013). Changes in test-taking patterns over time. Intelligence, 41, 780–790. https://doi.org/10.1016/j.intell. 2013.04.005 Oshima, T. C. (1994). The effect of speededness on parameter estimation in item response theory. Journal of Educational Measurement, 31, 200–219. https://doi.org/10.1111/j.17453984.1994.tb00443.x Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (expand. ed.). Chicago, IL: University of Chicago Press. Raven, J. C., Raven, J., & Court, J. H. (1997). Raven’s progressive matrices and vocabulary scales. Edinburgh, UK: J. C. Raven Ltd. Ren, X., Schweizer, K., & Xu, F. (2013). The sources of the relationship between sustained attention and reasoning. Intelligence, 41, 51–58. https://doi.org/10.1016/j.intell.2012.10.006 Ren, X., Wang, T., Sun, S., Deng, M., & Schweizer, K. (2018). Speeded testing in the assessment of intelligence gives rise to a speed factor. Intelligence, 66, 64–71. https://doi.org/10.1016/ j.intell.2017.11004 Schweizer, K. (2008). Investigating experimental effects within the framework of structural equation modeling: an example with effects on both error scores and reaction times. Structural Equation Modeling, 15, 327–345. https://doi.org/10.1080/ 1070551080192262 Schweizer, K. (2011). Scaling variances of latent variables by standardizing loadings: Applications to working memory and the position effect. Multivariate Behavioral Research, 46, 938– 955. https://doi.org/10.1080/00273171.2011.625312
Schweizer, K. (2013). A threshold-free approach to the study of the structure of binary data. International Journal of Statistics and Probability, 2, 67–75. https://doi.org/10.5539/ijsp.v2n2p67 Schweizer, K., & Ren, X. (2013). The position effect in tests with a time limit: The consideration of interruption and working speed. Psychological Test and Assessment Modelling, 55, 62–78. Schweizer, K., Ren, X., & Wang, T. (2015). A comparison of confirmatory factor analysis of binary data on the basis of tetrachoric correlations and of probability-based covariances: A simulation study. In R. E. Millsap, D. M. Bolt, L. A. van der Ark, & W.-C. Wang (Eds.), Springer Proceedings in Mathematics & Statistics. Quantitative Psychology Research (Vol. 89, pp. 273–292). Heidelberg, Germany: Springer International Publishing. van der Linden, W. J. (2011). Test design and speededness. Journal of Educational Measurement, 48, 44–60. https://doi. org/10.1111/j.1745-3984.2010.00130.x van der Linden, W. J., & Xiong, X. (2013). Speededness and adaptive testing. Journal of Educational and Behavioral Statistics, 38, 418–438. https://doi.org/10.3102/1076998612466143 Verguts, T., & De Boeck, P. (2000). A Rasch model for detecting learning while solving an intelligence test. Applied Psychological Measurement, 24, 151–162. https://doi.org/10.1177/ 01466210022031589 Wilhelm, O., & Schulze, R. (2002). The relation of speeded and unspeeded reasoning with mental speed. Intelligence, 30, 537–554. https://doi.org/10.1016/j.intell.2017.11004
History Received September 18, 2017 Revision received May 30, 2018 Accepted May 31, 2018 Published online December 19, 2018 EJPA Section/Category Methodological topics in assessment Funding This work was supported by Deutsche Forschungsgemeinschaft, Kennedyallee 40, 53175 Bonn, Germany [Grant Number SCHW 402/ 20-1]. Siegbert Reiß Goethe University Frankfurt Department of Psychology Theodor-W.-Adorno-Platz 6 60323 Frankfurt am Main Germany reiss@psych.uni-frankfurt.de
Appendix

A. Reasoning Function

The factor loadings $\lambda_{i1}$ of the $i$th item on the reasoning latent variable are set equal to the constant $c_{\mathrm{Ability}} > 0$ (e.g., $c_{\mathrm{Ability}} = 1$):

$\lambda_{i1} = c_{\mathrm{Ability}} \quad \text{for all } i = 1, \ldots, p.$

B. Learning Effect Functions

The factor loadings $\lambda_{i2}$ of the $i$th item on the learning latent variable are set equal to one of the following functions $f$:

$f_{\mathrm{lin}}(i) = (i - 1)/(p - 1)$ (linear increase),
$f_{\mathrm{quad}}(i) = (i - 1)^2/(p - 1)^2$ (quadratic increase),
$f_{\mathrm{log}}(i) = \ln(i)/\ln(p)$ (logarithmic increase).

C. Weakening of the Learning Effect (Functions)

The factor loadings $\lambda_{i2}$ of the $i$th item on the learning latent variable are set equal to a function $g$ that incorporates the function $f$:

$g(i) = f(i)$ for $i \le i^{*}$ and $g(i) = 0$ else (interruption) ($i^{*} = 14$);
$g(i) = f(i)\left[1 - \dfrac{e^{\,i - i^{*}}}{1 + e^{\,i - i^{*}}}\right]$ (fading out) ($i^{*} = 15$).

D. Speed Effect Function

The factor loadings $\lambda_{i3}$ of the $i$th item on the speed latent variable are set equal to the function $g_s$:

$g_s(i) = \dfrac{e^{\,i - i^{*}}}{1 + e^{\,i - i^{*}}}$ ($i^{*} = 15$).

E. Weight Function

$w_i = \sqrt{\Pr(X_i = 1)\,[1 - \Pr(X_i = 1)]}.$
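As a quick numerical illustration of these definitions, the loading patterns can be generated in a few lines of R; the choice of p = 20 analyzed items and the turning points i* = 14/15 follows the values stated above, while everything else is only a sketch.

```r
p <- 20                      # number of analyzed items
i <- 1:p

# B. learning effect functions
f_lin  <- (i - 1) / (p - 1)
f_quad <- (i - 1)^2 / (p - 1)^2
f_log  <- log(i) / log(p)

# C. weakening of the learning effect (shown here for the quadratic increase)
g_interrupt <- ifelse(i <= 14, f_quad, 0)                      # interruption, i* = 14
g_fade      <- f_quad * (1 - exp(i - 15) / (1 + exp(i - 15)))  # fading out,  i* = 15

# D. speed effect function
g_speed <- exp(i - 15) / (1 + exp(i - 15))
```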
Original Article
Degrees of Freedom in Multigroup Confirmatory Factor Analyses: Are Models of Measurement Invariance Testing Correctly Specified?

Ulrich Schroeders¹ and Timo Gnambs²,³

¹ Psychological Assessment, Institute of Psychology, University of Kassel, Germany
² Leibniz Institute for Educational Trajectories, Bamberg, Germany
³ Johannes Kepler University Linz, Austria
Abstract: Measurement invariance is a key concept in psychological assessment and a fundamental prerequisite for meaningful comparisons across groups. In the prevalent approach, multigroup confirmatory factor analysis (MGCFA), specific measurement parameters are constrained to equality across groups. The degrees of freedom (df) for these models readily follow from the hypothesized measurement model and the invariance constraints. In light of research questioning the soundness of statistical reporting in psychology, we examined how often reported df match the df recalculated on the basis of the information given in the publications. More specifically, we reviewed 128 studies from six leading peer-reviewed journals focusing on psychological assessment and recalculated the df for 302 measurement invariance testing procedures. Overall, about a quarter of all articles included at least one discrepancy, with metric and scalar invariance being more frequently affected. We discuss moderators of these discrepancies and identify typical pitfalls in measurement invariance testing. Moreover, we provide example syntax for different methods of scaling latent variables and introduce a tool that allows for the recalculation of df in common MGCFA models to improve the statistical soundness of invariance testing in psychological research. Keywords: measurement invariance, structural equation modeling, MGCFA, degrees of freedom, reporting standards
Psychology as a discipline has adopted a number of strategies to improve the robustness and trustworthiness of its findings (Chambers, 2017; Eich, 2014): emphasizing statistical power (Bakker, van Dijk, & Wicherts, 2012), acknowledging uncertainty in statistical results (Cumming, 2014), and disclosing flexibility in data collection and analysis (Nelson, Simmons, & Simonsohn, 2018; Simmons, Nelson, & Simonsohn, 2011). Especially by making all material of a study – its questionnaires, experimental manipulations, raw data, and analysis scripts – available to others, the replicability of the published findings is expected to increase (Nosek et al., 2015; Simonsohn, 2013). This transparency can be helpful to clarify why many peer-reviewed articles in psychology contain inconsistent statistical results that might impact the interpretation of the reported findings (Bakker & Wicherts, 2011; Cortina, Green, Keeler, & Vandenberg, 2017; Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016). Recent reviews highlighted major weaknesses in the reporting of null-hypothesis significance tests (NHSTs) and structural equation models (SEMs) that seriously undermine the trustworthiness of psychological science. In the present study, we review potential deficits in the modeling of multigroup measurement invariance testing.
Discrepancies in Statistical Results
Statistical results of journal articles are typically vetted by multiple peer reviewers and sometimes additionally by statistical editors. Despite the thorough review process, many published articles contain statistical ambiguities. For example, Bakker and Wicherts (2011) scrutinized 281 articles from 6 randomly selected psychological journals (three with a high impact factor and three with a low impact factor) and found around 18% of the statistical results incorrectly reported. Most recently, Nuijten, Hartgerink, van Assen, Epskamp, and Wicherts (2016) reinvigorated this line of research by introducing the R package statcheck, which automatically scans publications for reporting errors, that is, inconsistencies between a reported test statistic (e.g., t-value, F-value), the degrees of freedom (df), and the corresponding p-value. The sobering result of scanning over 250,000 publications from eight top-tier peer-reviewed journals (Nuijten et al., 2016) was that half of the articles contained at least one inconsistent p-value. Moreover, around 12% of the articles contained a discrepancy that changed the results significantly, often in line with the researchers' expectations. Even though the text recognition and evaluation routine
have been criticized for being too sensitive (Schmidt, 2017), the study points to serious issues in the way researchers report their findings. Considering the comprehensive methodological toolbox of psychologists, test statistics regularly used in NHST are comparatively simple. In applied research, more sophisticated latent variable techniques are often used to test structural hypotheses between several variables of interest. Recently, Cortina and colleagues (2017) reviewed 784 SEMs published in two leading organizational journals to examine whether the reported df matched the information given in the text. In cases where all necessary information was available to recalculate the df, they matched only 62% of the time; the discrepancies were particularly prevalent in structural (rather than measurement) models and were often large in magnitude. Thus, the trustworthiness of model evaluations seems questionable for a significant number of SEMs reported in the literature. In test and questionnaire development, methods used to examine the internal structure, to determine the reliability, and to estimate the validity of measures typically also rely on latent variable modeling. The implementation of such procedures in standard statistical software packages also extends the spectrum of test construction – besides the traditional topics of reliability and validity – to other pressing issues such as test fairness and comparability of test scores across groups.
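To illustrate the kind of automated consistency check mentioned above, statcheck can be run on any APA-style result string; the example string below is made up for demonstration and is not taken from the reviewed studies.

```r
library(statcheck)

# recomputes the p-value from the reported test statistic and df and flags mismatches
statcheck("The effect was significant, t(48) = 2.10, p = .012.")
```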
Measurement Invariance in Multigroup Confirmatory Factor Analysis
Measurement invariance (MI) between two or more groups is given if individual differences in psychological test results can be entirely attributed to differences in the construct in question rather than to membership in a certain group (see AERA, APA, & NCME, 2014). Thus, MI is an essential prerequisite to ensure valid and fair comparisons across cultures, administration modes, language versions, or sociodemographic groups (Borsboom, 2006a). Contemporary psychometric approaches to test for MI include various latent variable modeling techniques (e.g., Raju, Laffitte, & Byrne, 2002). In a SEM framework, MI is often tested with multigroup confirmatory factor analysis (MGCFA). Analogously, in item response theory (IRT), invariance or bias is assessed by studying differential item functioning. Besides different traditions and a focus on either the scale level (SEM) or the item level (IRT), both techniques share the same logic and concepts (Millsap, 2011). In the remainder of this article, we will focus on the SEM approach. Although different sequences can be implemented to test for MI in MGCFA (Cheung & Rensvold, 2002; Wicherts & Dolan, 2010), often a straightforward procedure of four hierarchically nested steps is followed (Millsap, 2011). In case constraining certain types of measurement parameters to equality leads to a considerable deterioration in model fit, the invariance assumption is violated.

In the first step, configural MI, all model parameters except for necessary identification constraints are freely estimated across groups. For metric or weak MI, the factor loadings are constrained to invariance across groups, allowing for comparisons of bivariate relations (i.e., correlations and regressions). In the third step, scalar or strong MI, the intercepts are set to be invariant in addition to the factor loadings. If scalar invariance holds, it is possible to compare the factor means across groups. In the last step, strict MI, the item residuals are additionally constrained to be equal across groups. Depending on the chosen identification scheme for the latent factors (i.e., marker variable method, reference group method, and effects-coding method), different additional constraints have to be introduced (see Table 1): The default setting, the marker variable method, sets the factor loading of a marker variable to 1 and fixes its intercept to 0 in all MI steps outlined above. In the reference group method, the variances of the latent variables are set to 1 in a reference group and the factor loadings are freely estimated. This approach is preferable because the marker variable method presupposes an invariant marker variable across groups in metric MI (and above); a non-invariant marker variable might lead to convergence problems or otherwise affect the results (Millsap, 2001). In practice, researchers frequently adopt a hybrid approach by fixing the factor loading of a marker variable to 1 and the mean of the latent variables in a reference group to 0 because this allows differences in factor means to be interpreted directly. Other identification schemes are possible and equally valid but require different sets of identifying constraints. For example, Little, Slegers, and Card (2006) proposed the effects-coding method, a nonarbitrary way of identifying the mean and covariance structure by constraining the mean of the loadings to 1 and the sum of the intercepts to 0 for each factor. Importantly, the choice of identification constraints does not affect the number of estimated parameters or the results of the MI tests. To facilitate the implementation of MI testing, we provide example syntax for these MI steps for all three methods of identification in lavaan (Rosseel, 2012) and Mplus (Muthén & Muthén, 1998–2017) in the Electronic Supplementary Material, ESM 1.
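A compact way to set up the four steps in lavaan is via the group.equal argument. The snippet below is only a generic illustration (the ESM 1 syntax remains the authors' reference version); the one-factor model, the data set mydata, and the grouping variable country are placeholders.

```r
library(lavaan)

model <- 'F1 =~ x1 + x2 + x3 + x4'   # hypothetical measurement model

configural <- cfa(model, data = mydata, group = "country")
metric     <- cfa(model, data = mydata, group = "country",
                  group.equal = "loadings")
scalar     <- cfa(model, data = mydata, group = "country",
                  group.equal = c("loadings", "intercepts"))
strict     <- cfa(model, data = mydata, group = "country",
                  group.equal = c("loadings", "intercepts", "residuals"))

lavTestLRT(configural, metric, scalar, strict)   # compare the nested models
```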
The Present Study
Table 1. Constraints in MGCFA tests for measurement invariance

Identification by marker variable

| | λ/λm | τ/τm | ε | E(ξ) | Var(ξ) |
|---|---|---|---|---|---|
| (1) Configural invariance | */1 | */0 | * | * | * |
| (2) Metric invariance | c/1 | */0 | * | * | * |
| (3) Scalar invariance | c/1 | c/0 | * | * | * |
| (4) Strict invariance | c/1 | c/0 | c | * | * |

Identification by reference group

| | λ | τ | ε | E(ξ)/E(ξ(r)) | Var(ξ)/Var(ξ(r)) |
|---|---|---|---|---|---|
| (1) Configural invariance | * | * | * | 0/0 | 1/1 |
| (2) Metric invariance | c | * | * | 0/0 | */1 |
| (3) Scalar invariance | c | c | * | */0 | */1 |
| (4) Strict invariance | c | c | c | */0 | */1 |

Identification by hybrid approach

| | λ/λm | τ | ε | E(ξ)/E(ξ(r)) | Var(ξ) |
|---|---|---|---|---|---|
| (1) Configural invariance | */1 | * | * | 0/0 | * |
| (2) Metric invariance | c/1 | * | * | 0/0 | * |
| (3) Scalar invariance | c/1 | c | * | */0 | * |
| (4) Strict invariance | c/1 | c | c | */0 | * |

Identification by effects coding

| | λP | τP | ε | E(ξ) | Var(ξ) |
|---|---|---|---|---|---|
| (1) Configural invariance | * | * | * | * | * |
| (2) Metric invariance | c | * | * | * | * |
| (3) Scalar invariance | c | c | * | * | * |
| (4) Strict invariance | c | c | c | * | * |

Notes. λ = factor loading, λm = factor loading for marker variable, τ = intercept, τm = intercept for marker variable, ε = residual variance, E(ξ) = latent factor mean, E(ξ(r)) = latent factor mean in reference group, Var(ξ) = latent factor variance, Var(ξ(r)) = latent factor variance in reference group, * = parameter is freely estimated in all groups, c = parameter is constrained to equality across groups, 0/1 = parameter is fixed to a value of 0 or 1, λP/τP = sum of factor loadings and sum of intercepts in each group are constrained to a value of 1 and 0, respectively.
Given several critical reviews highlighting inconsistencies in NHST and SEM (Bakker & Wicherts, 2011; Cortina et al., 2017; Nuijten et al., 2016), we pursued two objectives: First, we examined the extent of discrepancies in MI testing with MGCFA. Because the number of df for each MI step is mathematically determined through the hypothesized measurement model, we recalculated the df for the aforementioned MI steps based on the information provided in articles that were published in major peer-reviewed journals focusing on psychological assessment in the last 20 years. Second, we tried to identify potential causes of the misspecifications (e.g., the complexity of the model or the software packages used). Furthermore, we highlight potential pitfalls when specifying the different steps of MI testing. To this end, we also provide example syntax for MI testing and introduce an easy-to-handle application that allows double-checking the df in MI testing. Thus, the overarching aim is to improve the statistical soundness of MI testing in psychological research.
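Because the df follow mechanically from the model, they can be recalculated with a few lines of code. The function below is a rough sketch (not the authors' JavaScript tool) for the simplest case: p indicators, m factors with simple structure, no cross-loadings or residual covariances, g groups, and a mean structure.

```r
# degrees of freedom for the four MI steps under the assumptions stated above
df_mi <- function(p, m, g) {
  moments    <- g * p * (p + 3) / 2                     # covariances + means, summed over groups
  configural <- moments - g * (3 * p + m * (m - 1) / 2) # minus free parameters after identification
  metric     <- configural + (p - m) * (g - 1)          # + equal loadings
  scalar     <- metric     + (p - m) * (g - 1)          # + equal intercepts
  scalar     <- scalar
  strict     <- scalar     + p * (g - 1)                # + equal residual variances
  c(configural = configural, metric = metric, scalar = scalar, strict = strict)
}

df_mi(p = 9, m = 3, g = 2)   # e.g., 9 indicators, 3 factors, 2 groups: 48, 54, 60, 69
```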
Method

Inconsistent df in MI tests of MGCFA were identified in the issues, published over a period of 20 years (1996–2016), of six leading peer-reviewed journals that regularly report on
test development and questionnaire construction: Assessment (ASMNT), European Journal of Psychological Assessment (EJPA), Journal of Cross-Cultural Psychology (JCCP), Journal of Personality Assessment (JPA), Psychological Assessment (PA), and Personality and Individual Differences (PAID). Studies were limited to reports of MGCFA that included one or more of the four MI steps outlined above. Not considered were single-group tests of MI (i.e., longitudinal MI or multi-trait multi-method MI), second-order models, exploratory structural equation models, or MI testing with categorical data. We first recalculated the df for all MI models from the information given in the text, tables, and figures (e.g., regarding the number of indicators, latent factors, and cross-loadings). A configural model was coded as incorrect if the reported and recalculated df did not match. Then, the df for the metric, scalar, and strict MI models were also recalculated and compared to the reported df. In case inconsistent df were identified at a specific step, the df for subsequent models were also recalculated by taking the reported (inconsistent) df of the previous step into account, which adopts a more liberal perspective. For example, if an
Figure 1. Studies reporting measurement invariance tests over time. The thin solid line represents the number of studies reporting MI tests; the dashed line represents the number of studies with at least one discrepancy. The bold black line gives the percentage of discrepancies. [x-axis: year of publication, 1998–2016; left y-axis: frequency; right y-axis: percentage of discrepancies.]
author claimed to have tested metric invariance while also constraining the factor variances across all groups, this step was coded as incorrect. However, if in scalar MI testing the intercepts were additionally set to be invariant, this was coded as correct (despite the constrained factor variances). The coding was limited to the four types of MI as outlined above, and we did not code partial MI. Both authors coded about half of the studies. In case inconsistent df were identified, the other author independently coded the respective study again. Diverging evaluations were discussed until a consensus was reached. We provide our coding sheets and all syntax within the Open Science Framework (Soderberg, 2018) at https://osf.io/6nh9d/. All analyses were conducted with R 3.4.4 (R Development Core Team, 2018).
Results

We identified a total of 302 MI testing sequences that were published in 128 different research articles. Most articles were published in PA (31.3%) and PAID (23.4%), followed by EJPA (16.4%) and ASMNT (13.3%), whereas fewer articles were retrieved from JCCP and JPA (7.8% each). The number of articles reporting MI testing within an MGCFA framework recorded a sharp increase in recent years. Nearly two-thirds of the articles were published within the 5 years between 2012 and 2016 and over 88% within the last 10 years (see Figure 1). Whereas the absolute number of discrepancies exhibited a slight increase in recent years, the percentage of discrepancies between reported and recalculated df remained rather stable at around 5%. Out of 128 articles, 49 (38.3%) used Mplus to conduct MI testing, 24 (18.8%) used LISREL, and 23 (18.0%) used AMOS. The remaining articles relied on specialized software such as EQS (n = 10) or R (n = 4), did not report their software choice (n = 17), or used more than one program (n = 1). On average, each article reported on 2.36 MI testing sequences (SD = 2.29). Further descriptive information on the model specification grouped by journal and publication year is summarized in Table S1 of the ESM 1.
Discrepancies Between Reported and Recalculated Degrees of Freedom
Half of the studies (48.4%) reported multiple MI tests (e.g., for age and sex groups); that is, the identified MI tests were not independent. Since variation of the identified discrepancies (0 = no discrepancy in df and 1 = discrepancy in df ) was found on the study level rather than the MI test level (intraclass correlation = .995), we analyzed discrepancies in df on the level of studies rather than single tests of MI. Therefore, we aggregated the results to the article level and examined for each article whether at least one inconsistent df was identified for the different models in each MI step. The analyses revealed that out of 120 studies reporting configural MI, only 7 studies showed discrepancies (5.8%, see Table 2). In contrast, tests for metric MI and scalar MI exhibited larger discrepancies between the reported and recalculated df (15.9% and 21.1%, respectively). Only two studies reported incorrect df in strict MI.
Table 2. Discrepancies between reported and recalculated degrees of freedom

| | Configural | Metric | Scalar | Strict |
|---|---|---|---|---|
| Number reportedᵃ | 120 (93.8) | 126 (98.4) | 95 (74.2) | 40 (31.3) |
| Number inconsistentᵇ | 7 (5.8) | 20 (15.9) | 20 (21.1) | 2 (0.1) |

Note. Absolute numbers with percentages in parentheses. ᵃPercentages refer to 128 studies. ᵇPercentages refer to the number of reported studies.
Table 3. Predicting the occurrence of discrepancies based on study characteristics

| Predictors | B | SE | z | OR | 95% CI | AME |
|---|---|---|---|---|---|---|
| (Intercept) | 0.66* | 0.80 | 4.57 | 0.03 | [0.00, 0.10] | |
| (1) Complexity of modelᵃ | < 0.01+ | < 0.01 | 1.76 | 1.00 | [1.00, 1.00] | .00+ |
| (2) Year of publication | 0.22* | 0.08 | 2.68 | 1.25 | [1.07, 1.49] | .03* |
| (3) Journal (ref.: Psychological Assessment, n = 40) | | | | | | |
| Assessment (n = 21) | 1.07 | 0.85 | 1.25 | 2.91 | [0.54, 6.14] | .13 |
| European Journal of Psychological Assessment (n = 21) | 0.21 | 0.85 | 0.24 | 1.23 | [0.21, 6.53] | .02 |
| Journal of Cross-Cultural Psychology (n = 10) | 1.69+ | 1.00 | 1.68 | 5.40 | [0.72, 0.65] | .23 |
| Journal of Personality Assessment (n = 10) | 3.11* | 1.03 | 3.01 | 22.41 | [3.19, 196.88] | .48* |
| Personality and Individual Differences (n = 30) | 1.38+ | 0.72 | 1.91 | 3.97 | [1.01, 7.77] | .17+ |
| (4) Software (ref.: Mplus, n = 49) | | | | | | |
| AMOS (n = 23) | 1.85* | 0.79 | 2.33 | 6.33 | [1.43, 4.72] | .23* |
| EQS (n = 10) | 3.70* | 0.99 | 3.73 | 20.54 | [6.64, 347.88] | .57* |
| LISREL (n = 24) | 2.07* | 0.83 | 2.51 | 7.93 | [1.68, 46.27] | .26* |
| Remaining (n = 22) | 1.42+ | 0.86 | 1.66 | 4.14 | [0.78, 4.78] | .16 |

Notes. n = 128 studies. ᵃNumber of free parameters in the configural model. Logistic regression analysis with at least one discrepancy found (1) on the study level versus not found (0) as an outcome. Predictors (1) and (2) were centered prior to analysis; predictors (3) and (4) were dummy-coded variables. Nagelkerke's R² = .37. AME = average marginal effects. *p < .05; +p < .10.
To shed further light on potential predictors of the discrepancies, we conducted a logistic regression analysis (0 = no discrepancy, 1 = at least one discrepancy). We added the (1) complexity of the model, (2) publication year, (3) journal, and (4) software package as predictors (see Table 3). The complexity of the model did not predict the occurrence of reporting errors. In contrast, the year of publication influenced the error rate with more recent publications exhibiting slightly more discrepancies. Given that most of the studies have been reported in recent years, the average marginal effect (AME; Williams, 2012) for an article including a discrepancy was about 3.0% (p = .003) per year. Across all journals, a quarter of all published articles on MI included at least one df that we were unable to replicate (see Figure 2). A comparison of the journals demonstrates subtle differences: In comparison with PA, the outlet that published most MI tests, JCCP (AME = 22.5%, p = .13) and PAID (AME = 17.4%, p = .05) reported slightly more inconsistent df. The highest rate of discrepancies between reported and recalculated df was found for JPA (AME = 48.3%, p = .001) – with 5 out of 10 studies. The most important predictor in the logistic analysis was the software package used in MI testing. In comparison with Mplus, studies using other software packages were more likely to have discrepancies: AMOS (AME = 22.3%, p = .02), LISREL (AME =
26.2%, p = .01), and most severely EQS (AME = 57.0%, p < .001).
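The reported analysis is a standard binomial logistic regression; a schematic version in R might look like the following, where the data frame studies and its variable names are invented for illustration and AMEs would require an additional package (e.g., margins).

```r
# studies: one row per article; 'discrepancy' is coded 0/1
fit <- glm(discrepancy ~ complexity + year + journal + software,
           family = binomial, data = studies)

summary(fit)                               # coefficients (B), SEs, z-values
exp(cbind(OR = coef(fit), confint(fit)))   # odds ratios with 95% CIs
```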
Pitfalls in Testing Measurement Invariance

Without inspecting the analysis syntax of the reported studies, we can only speculate about the reasons for the discrepancies. However, in our attempts to replicate the df, we spotted two likely sources of model misspecification: In testing metric MI, discrepancies seem to have resulted in many cases (13 of 20 flagged publications) from a misspecified model using the reference group approach for factor identification. As a reminder, the configural model then involves fixing the variances of the latent variables to 1 in all groups, while freely estimating all factor loadings. The metric model, however, requires equality constraints on the factor loadings across groups, while relaxing the constraints on the variances of the latent variables except for the reference group. It seems that some authors neglected to free the factor variances and, thus, instead of testing a metric MI model, evaluated a model with invariant loadings and variances. This is important because the reference group method is sometimes preferred over the marker variable method, which presupposes an invariant marker variable. Fixing the factor loading of a non-invariant marker variable in metric MI might lead to convergence problems or
Figure 2. Reporting inconsistencies across journals. n = 128. The dashed line indicates the average of the discrepancies across journals. The number below the journal abbreviation represents the number of studies (ASMNT = 17, EJPA = 21, JCCP = 10, JPA = 10, PA = 40, PAID = 30). ASMNT = Assessment; EJPA = European Journal of Psychological Assessment; JCCP = Journal of Cross-Cultural Psychology; JPA = Journal of Personality Assessment; PA = Psychological Assessment; PAID = Personality and Individual Differences. [y-axis: percentage of studies with vs. without a discrepancy, 0–100.]

Figure 3. Consequences of fixing the means to 0 in scalar measurement invariance testing on model fit. [x-axis: latent mean difference between groups, 0.0–1.2; y-axis: comparative fit index (CFI), 0.90–0.96; lines: scalar MI vs. scalar MI + all means fixed to 0.]
otherwise biased estimates (Millsap, 2001). To identify invariant indicators, several methods have been proposed (Rensvold & Cheung, 2001). For instance, Yoon and Millsap (2007) prefer the reference group method (i.e., fixing the variance of the latent variables to one in the first group only and constraining all factor loadings to equality across groups) and then – in case full metric invariance is lacking – systematically freeing loading constraints based on modification indices to identify non-invariant factor loadings and to establish partial metric invariance. Issues in reporting scalar MI can in many instances (12 out of 20 flagged studies) be traced back to a misspecified mean structure. SEM is a variance–covariance-based modeling approach, and in the single-group case, researchers are usually not interested in the mean structure. Therefore, scalar MI tests, in which the mean structure plays a vital role, seem to present particular difficulties. Again, we suspect that researchers adopting the reference group or hybrid approach for factor identification neglected to free previously constrained latent factor means (see Table 1).
As a result, instead of testing for scalar MI, these models in fact evaluated a model with invariant intercepts and means fixed to 0 across groups. Such model misspecifications are not trivial and have severe consequences for model fit evaluations: In a simulated MGCFA MI example, we compared a correctly specified scalar MI model with freely estimated latent factor means (except for the necessary identifying constraint) to a model in which all factor means were fixed to zero. Figure 3 demonstrates that even moderate differences in the latent means (d ≥ .50) result in a drop in the comparative fit index (CFI) from an initially good fitting model (CFI = .98) to values below what is usually considered acceptable (CFI < .95). Thus, if the means are constrained to zero, any differences in the latent means are passed on to the intercepts; if these are also constrained to equality, the unmodeled mean differences can result in a substantial model deterioration. As a consequence, misspecified scalar MI models can lead to an erroneous rejection of the scalar MI model.
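In lavaan terms, the difference between the two models in this simulation can be expressed as follows; this is a schematic contrast (model, mydata, and country are placeholders), not the authors' simulation code.

```r
library(lavaan)

# correctly specified scalar MI: intercepts equal, latent means free
# (identified by fixing them to 0 in the first group only)
scalar_ok  <- cfa(model, data = mydata, group = "country",
                  group.equal = c("loadings", "intercepts"))

# misspecified variant: latent means additionally constrained across groups,
# so any true mean differences are pushed into the (equal) intercepts
scalar_bad <- cfa(model, data = mydata, group = "country",
                  group.equal = c("loadings", "intercepts", "means"))

fitMeasures(scalar_ok,  "cfi")
fitMeasures(scalar_bad, "cfi")
```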
Discussion
The concept of measurement equivalence is pivotal for psychological research and practice. To address substantive research questions, researchers depend on information about the psychometric functioning of their instruments across sex and ethnic groups, clinical populations, etc. Accordingly, reporting issues in MI testing are not restricted to a specific field but affect different disciplines such as clinical (Wicherts, 2016) and I/O psychology (Vandenberg & Lance, 2000). The extent of the discrepancies we found in the psychological assessment literature was rather surprising: One out of four studies reporting MI tests included an incorrectly specified or, at least, incorrectly described model. Thus, a substantial body of literature on the measurement equivalence of psychological instruments seems to be questionable or inaccurate. This percentage is probably a lower boundary of the true error rate due to the way we coded the MI tests (i.e., no subsequent errors). Since our analysis was limited to discrepancies in the df, it is possible that additional errors may have occurred (e.g., in the handling of missing data or the incorporation of nested structures). To identify these and similar flaws, both the raw data and the analysis scripts would be necessary to reanalyze the data. As outlined above, we also did not consider single-group tests of MI (i.e., longitudinal MI or multi-trait multi-method MI), second-order models, exploratory structural equation models, or MI testing with different estimators that are more appropriate for categorical data. In our assessment, it is likely that these statistically often more complex scenarios of MI testing offer additional potential for misspecification. Regarding the causes of the inconsistencies, the results of the logistic regression provide us with some valuable clues: The increased popularity of MGCFA MI testing in psychological research was accompanied by an increase in discrepancies. This is not an unusual pattern in the dissemination of psychological methods: After the formal (and often formalized) introduction of a new method by psychometricians, more and more users adopt and apply the method – sometimes without a deeper understanding of the underlying statistics. However, the strongest effect was observed for the software package used to conduct the MI tests. In comparison with Mplus, other software packages performed worse, which might be due to the extensive documentation and training materials available for Mplus. Or, more likely, it can be attributed to a selection effect, because more advanced users prefer scripting languages. Taken together, we think that the results point to a general problem with the formal methodological and statistical training of psychologists (Borsboom, 2006b). The two issues that most predominantly cause discrepancies – inadvertently keeping the factor variances fixed across groups in the metric MI model and fixing the factor means to 0 in the scalar MI model – presumably point to a conceptual misunderstanding: Measurement invariance is, technically speaking, only concerned with the relationship between indicator variables and latent variables. Variances, covariances, and means of latent variables, however, deal with a different aspect of invariance, called structural invariance (Vandenberg & Lance, 2000), because they are concerned with the properties of the latent variables themselves (see also Beaujean, 2014). Often researchers are especially interested in how these structural parameters vary across groups. Confusing measurement and structural parameters and specifying more restrictive models than necessary can result in failing to establish MI even though the difference is in the structural parameters. Accordingly, meaningful differences might be wrongly attributed to a measurement artifact.
Recommendations for Conducting and Reporting MGCFA MI Testing

In the following, some recommendations are given to improve the accuracy of conducting and reporting statistical results in the framework of MI testing. These recommendations apply to all parties involved in the publication process – authors, reviewers, editors, and publishers. First, familiarize yourself with the constraints of MI testing under different identification strategies (see Table 1) and pay attention to the aforementioned pitfalls. Furthermore, we encourage researchers to use the effects-coding method (Little et al., 2006), which allows the factor loadings, variances, and latent means to be estimated and tested simultaneously. In contrast to other scaling methods, the effects-coding method does not rely on fixing single measurement parameters to identify the scale, which might lead to problems in MI testing if these parameters function differently across groups but are constrained to be equal. This method might be helpful in finding a measurement model that is only partially invariant (Rensvold & Cheung, 2001; Yoon & Millsap, 2007). Second, describe the measurement model in full detail (i.e., number of indicators, factors, cross-loadings, residual covariances, and groups) and explicitly state which parameters are constrained at the different MI steps, so that it is clear which models are nested within each other. In addition, use unambiguous terminology when referring to specific steps in MI testing. In our literature review, we found several cases in which the description in the method section did not match the restrictions given in the respective table. One way to clarify which model constraints have been introduced is to label the invariance step by the parameters that have been fixed (e.g., “invariance of factor loadings” instead of “metric invariance”). Third, in line with the recommendations of the Association for Psychological Science (Eich, 2014) and the efforts of
the Open Science Framework (Nosek et al., 2015) to make scientific research more transparent, open, and reproducible, we strongly advocate making the raw data and the analysis syntax available in freely accessible data repositories. As a pleasant side effect, there is also evidence that sharing detailed research data is associated with an increased citation rate (Piwowar, Day, & Fridsma, 2007). If legal restrictions or ethical considerations prevent the sharing of raw data, it is possible to create synthesized data sets (Nowok, Raab, & Dibben, 2016). Fourth, we encourage authors and reviewers to routinely double-check the df of the reported models. In this context, we welcome the recent effort of journals in psychology to include soundness checks on manuscript submission by default to improve the accuracy of statistical reporting. To this end, one may refer to ESM 1, which includes example syntax for all steps of MI testing in lavaan and Mplus for different ways of scaling latent variables, or use our JavaScript tool to double-check the df in MI testing (https://ulrich-schroeders.de/fixed/df-mgcfa/). Fifth, statistical and methodological courses need to be taught more rigorously in university teaching, especially in structured PhD programs. Rigorous training should include both conceptual (Borsboom, 2006a; Markus & Borsboom, 2013) and statistical work (Millsap, 2011). To bridge the gap between psychometric researchers and applied psychologists, a variety of teaching resources can be recommended that introduce invariance testing in general (Cheung & Rensvold, 2002; Wicherts & Dolan, 2010) or cover specific aspects of MI such as longitudinal MI (Geiser, 2013) and MI with categorical data (Pendergast, von der Embse, Kilgus, & Eklund, 2017).
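As an illustration of the model sequence discussed above, a minimal sketch in R using lavaan (Rosseel, 2012) is given below. The one-factor model, the indicator names x1–x4, the grouping variable group, and the data frame dat are hypothetical placeholders; the fully worked syntax for different scaling methods is provided in ESM 1.

library(lavaan)

model <- 'F =~ x1 + x2 + x3 + x4'

# Configural invariance: same factor structure, all parameters free across groups
fit_configural <- cfa(model, data = dat, group = "group", meanstructure = TRUE)

# Metric invariance: factor loadings constrained equal across groups
fit_metric <- cfa(model, data = dat, group = "group", meanstructure = TRUE,
                  group.equal = "loadings")

# Scalar invariance: loadings and intercepts constrained equal across groups
fit_scalar <- cfa(model, data = dat, group = "group", meanstructure = TRUE,
                  group.equal = c("loadings", "intercepts"))

# Likelihood ratio tests of the nested models; the df column should match the
# expected values (e.g., from the counting sketch given earlier)
lavTestLRT(fit_configural, fit_metric, fit_scalar)

Note that, in current lavaan versions, constraining the intercepts automatically frees the latent means in all but the first group – precisely the step that is often mishandled when such constraints are set by hand.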
Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000500
ESM 1. Table, Figure, References, Syntax (.docx)
Descriptives of Samples and MI Testing Sequences, Screenshot of the JavaScript tool, References of publications included in the analysis, Syntax for Measurement Invariance Testing.
References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543–554. https://doi.org/10.1177/ 1745691612459060 Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43, 666–678. https://doi.org/10.3758/s13428-0110089-5 Beaujean, A. A. (2014). Latent variable modeling using R: A step by step guide. New York, NY: Routledge/Taylor & Francis Group. Borsboom, D. (2006a). When does measurement invariance matter? Medical Care, 44, 176–181. https://doi.org/10.1097/ 01.mlr.0000245143.08679.cc Borsboom, D. (2006b). The attack of the psychometricians. Psychometrika, 71, 425–440. https://doi.org/10.1007/s11336006-1447-6 Chambers, C. (2017). The seven deadly sins of Psychology: A manifesto for reforming the culture of scientific practice. Princeton, NJ: Princeton University Press. Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-offit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. https://doi.org/10.1207/ S15328007SEM0902_5 Cortina, J. M., Green, J. P., Keeler, K. R., & Vandenberg, R. J. (2017). Degrees of freedom in SEM: Are we testing the models that we claim to test? Organizational Research Methods, 20, 350–378. https://doi.org/10.1177/1094428116676345 Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7–29. https://doi.org/10.1177/ 0956797613504966 Eich, E. (2014). Business not as usual. Psychological Science, 25, 3–6. https://doi.org/10.1177/0956797613512465 Geiser, C. (2013). Data analysis with Mplus. New York, NY: Guilford Press. Little, T. D., Slegers, D. W., & Card, N. A. (2006). A non-arbitrary method of identifying and scaling latent variables in SEM and MACS models. Structural Equation Modeling, 13, 59–72. https://doi.org/10.1207/s15328007sem1301_3 Markus, K. A., & Borsboom, D. (2013). Frontiers of test validity theory: Measurement, causation, and meaning. New York, NY: Routledge. Millsap, R. E. (2001). When trivial constraints are not trivial: The choice of uniqueness constraints in confirmatory factor analysis. Structural Equation Modeling, 8, 1–17. https://doi.org/ 10.1207/S15328007SEM0801_1 Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York, NY: Routledge. Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user’s guide (8th ed). Los Angeles, CA: Muthén & Muthén. Nelson, L. D., Simmons, J., & Simonsohn, U. (2018). Psychology’s renaissance. Annual Review of Psychology, 69, 511–534. https://doi.org/10.1146/annurev-psych-122216-011836 Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., . . . Yarkoni, T. (2015). Promoting an open research culture. Science, 348, 1422–1425. https://doi.org/ 10.1126/science.aab2374 Nowok, B., Raab, G. M., & Dibben, C. (2016). Synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74, 1–26. https://doi.org/10.18637/jss.v074.i11 Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48, 1205–1226. https://doi.org/10.3758/ s13428-015-0664-2 Pendergast, L., von der Embse, N., Kilgus, S., & Eklund, K. (2017). Measurement equivalence: A non-technical primer on
categorical multi-group confirmatory factor analysis in school psychology. Journal of School Psychology, 60, 65–82. https:// doi.org/10.1016/j.jsp.2016.11.002 Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. PLoS One, 2, e308. https://doi.org/10.1371/journal.pone. 0000308 R Development Core Team. (2018). R: A language and environment for statistical computing. [Computer software]. Retrieved from https://www.r-project.org/ Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517–529. https://doi.org/10.1037/0021-9010. 87.3.517 Rensvold, R. B., & Cheung, G. W. (2001). Testing for metric invariance using structural equation models: Solving the standardization problem. In C. A. Schriesheim & L. L. Neider (Eds.), Equivalence in measurement Research in management (pp. 21–50). Greenwich, CT: Information Age. Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36. https://doi. org/10.18637/jss.v048.i02 Schmidt, T. (2017). Statcheck does not work: All the numbers. Reply to Nuijten et al. (2017). https://doi.org/10.31234/osf.io/ hr6qy Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). Falsepositive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. https://doi.org/10.1177/ 0956797611417632 Simonsohn, U. (2013). Just post it: The lesson from two cases of fabricated data detected by statistics alone. Psychological Science, 24, 1875–1888. https://doi.org/10.1177/ 0956797613480366 Soderberg, C. K. (2018). Using OSF to share data: A step-by-step guide. Advances in Methods and Practices in Psychological Science, 1, 115–120. https://doi.org/10.1177/2515245918757689
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. https://doi.org/10.1177/ 109442810031002 Wicherts, J. M. (2016). The importance of measurement invariance in neurocognitive ability testing. The Clinical Neuropsychologist, 30, 1006–1016. https://doi.org/10.1080/13854046.2016. 1205136 Wicherts, J. M., & Dolan, C. V. (2010). Measurement invariance in confirmatory factor analysis: An illustration using IQ test performance of minorities. Educational Measurement: Issues and Practice, 29, 39–47. https://doi.org/10.1111/j.1745-3992. 2010.00182.x Williams, R. (2012). Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata Journal, 12, 308–331. Yoon, M., & Millsap, R. E. (2007). Detecting violations of factorial invariance using data-based specification searches: A Monte Carlo study. Structural Equation Modeling, 14, 435–463. https://doi.org/10.1080/10705510701301677
History
Received January 20, 2018
Revision received June 4, 2018
Accepted June 5, 2018
Published online December 19, 2018

EJPA Section/Category
Methodological topics in Assessment

Ulrich Schroeders
Psychological Assessment, Institute of Psychology
University of Kassel
Holländische Str. 36-38
34127 Kassel
Germany
schroeders@psychologie.uni-kassel.de
Original Article
Measuring Anxiety-Related Avoidance With the Driving and Riding Avoidance Scale (DRAS)

Joanne E. Taylor¹, Mark J. M. Sullman², and Amanda N. Stephens³

¹ School of Psychology, Massey University, Palmerston North, New Zealand
² Department of Social Sciences, University of Nicosia, Nicosia, Cyprus
³ Monash University Accident Research Centre, Monash University, Clayton, VIC, Australia
Abstract: Driving anxiety is a common experience that, for those with high levels of driving anxiety, can markedly interfere with functioning, particularly because of avoidance behavior. The Driving and Riding Avoidance Scale (DRAS; Stewart & St. Peter, 2004) is a promising measure of self-reported avoidance, but its psychometric properties have been questioned as the instructions do not specifically ask respondents to report avoidance that is due to driving anxiety. The present study investigated the psychometric properties of the DRAS using revised instructions in 437 participants from the general population of New Zealand. Internal consistency for the DRAS was 0.94 and ranged from 0.79 to 0.90 for the four subscales. A two-factor solution was supported, in line with previous research using the revised instructions, supporting the distinction between general and traffic avoidance compared with weather and riding avoidance. Further work on the psychometric properties of this measure with clinical samples is needed to clarify the subscale structure. Keywords: Driving and Riding Avoidance Scale (DRAS), driving anxiety, avoidance, measurement, assessment
The research to date on anxiety and fear that is related to driving has greatly assisted understanding of this often complex and heterogeneous experience (for reviews, see Taylor, Deane, & Podd, 2002, 2008). For example, we know that driving anxiety ranges in severity from mild anxiety to driving reluctance, where essential journeys for work or other important daily activities are made but nonessential journeys are avoided or tolerated with anxiety, to the more severe levels of anxiety and avoidance that characterize phobia (Blanchard & Hickling, 2004; Ehlers, Hofmann, Herda, & Roth, 1994; Taylor & Deane, 1999, 2000; Taylor, Deane, & Podd, 2007). It is also now apparent that driving anxiety is a relatively common experience, with 52% of a population-based sample endorsing mild driving anxiety and 16% moderate to severe driving anxiety (Taylor, 2018). Various efforts to measure driving anxiety and avoidance have been put forward, some of which are specific to vehicle crash survivors (e.g., Accident Fear Questionnaire: Kuch, Cox, & Direnfeld, 1995; Travel Phobia Questionnaire and Safety Behaviors Questionnaire: Ehring, Ehlers, & Glucksman, 2006), while others have been developed to assess driving phobia symptoms, irrespective of vehicle crash involvement (e.g., Driving Situations Questionnaire: Ehlers et al., 1994; Driving and Riding Avoidance Scale: Stewart & St. Peter, 2004; Fear of Driving
Inventory: Walshe, Lewis, Kim, O’Sullivan, & Wiederhold, 2003). While there has been relatively little research on any of these measures, the Driving and Riding Avoidance Scale (DRAS: Stewart & St. Peter, 2004) has received the most attention, probably because of its potential as a relatively brief measure of driving-related avoidance that is not tied to vehicle crash involvement. The 20-item DRAS provides a total avoidance score as well as subscale scores for general avoidance of driving and riding (along with pursuit of transport alternatives; e.g., “I chose to walk or ride a bicycle someplace to avoid driving in the car”), avoidance of traffic and busy roads (e.g., “I rescheduled making a drive in the car to avoid traffic”), avoidance of bad weather conditions or darkness (e.g., “I avoided driving the car because the weather was bad”), and avoidance of riding in a car (e.g., “I avoided riding in a car if I could”). The DRAS was developed in three US studies using different samples of undergraduate students who were vehicle crash survivors. The studies demonstrated its internal consistency (α = 0.92, n = 386), 4-week test–retest reliability (r = 0.83, n = 67), convergent and discriminant validity (n = 118), and that a four-factor model, consistent with the subscales, provided the best fit to the data (Stewart & St. Peter, 2004). Taylor and Sullman (2009) investigated
the psychometric properties of the DRAS with 301 university students. Internal consistency was 0.89, temporal stability over two months was 0.71, and there was discriminant validity shown by the lack of correlation with driving-related lapses, errors, and aggressive violations measured with the Driving Behaviour Questionnaire (Reason, Manstead, Stradling, Baxter, & Campbell, 1990). The factor structure of the DRAS supported a three-factor solution, although there was significant overlap among the factors, and Taylor and Sullman (2009) raised the question of whether these findings were due to lack of clarity in the instructions for the DRAS, in that the instructions do not specifically ask for ratings of avoidance that is due to anxiety about driving. Instead, there was evidence from Taylor and Sullman’s study that the instructions “Please read the following statements and circle the response that best describes how often you behaved in this way during the past 7 days including today” were interpreted more broadly by their participants, who were not vehicle crash survivors as per Stewart and St. Peter’s samples, in that some made comments about avoidance that was motivated by factors other than anxiety, such as practical reasons of cost or living location. They recommended further research using modified instructions to ensure that the DRAS measures anxiety-related avoidance behavior. These modified instructions were used in a study with 210 Polish university students, with two factors apparent in the factor analysis, one for driving avoidance and the other for riding avoidance (Blachnio, Przepiórka, Sullman, & Taylor, 2013), in contrast to the four factors found by Stewart and St. Peter (2004). However, this study used a translated Polish version of the DRAS (Blachnio et al., 2013). The present study aimed to evaluate the psychometric properties of the English language DRAS using the modified instructions. It was hypothesized that the psychometric properties would be more similar to those reported by Stewart and St. Peter, particularly in terms of reproducing the four-factor model. It was also hypothesized that the DRAS would show evidence of convergent and discriminant validity by having stronger correlations with other measures of driving anxiety and avoidance, and weak correlations with divergent variables such as aberrant driving behavior (e.g., lapses and errors). The study was part of a larger project on the extent and characteristics of driving anxiety (Taylor, 2016).
Materials and Methods

Participants
In 2012, a random sample of 1,500 adults was recruited for the study from the New Zealand electoral roll, a compulsory voting register of adults over 18 years of age that currently represents 98% of the population. Participants were invited to complete a postal survey which used a two-stage posting schedule where the survey was mailed out, and then another survey was mailed to nonresponders. Of the 1,500 participants randomly selected, 441 (29.4%) responded to the survey. Participants ranged from 18 to 87 years of age, the average age was 54 years (SD = 17), and 44% were men. This group was similar to the original random sample of 1,500, which comprised 47.5% men ranging in age from 18 to 96, although the average age in the original sample was slightly younger (49 years). Most (88%) participants were of European descent and 5% were Māori (the indigenous people of New Zealand). Over half (51%) of the sample had a post-secondary or tertiary qualification.

Measures and Procedure
Ethical approval was obtained from the Massey University Human Ethics Committee (HEC: Southern B 10/75). Participants completed a 14-page postal survey comprising demographic and driving information [age, gender, years since obtaining a driver's license, driving accidents and incidents in the past year, and level of driving anxiety on a scale from 0 (= not at all anxious) to 10 (= extremely anxious), with 5 = moderately anxious], the DRAS, and several other psychometric instruments that were included to explore the psychometric properties of the DRAS and are briefly described below (see Taylor, 2018, for more information).

Driving and Riding Avoidance Scale (DRAS)
As described above, the DRAS (Stewart & St. Peter, 2004) assesses avoidance behavior for various driving and riding situations. The 20 items are rated for frequency of avoidance over the past week on a 4-point Likert scale which ranges from 0 (= avoid rarely or none of the time) to 3 (= avoid most or all of the time). The modified instructions were used, as recommended by Taylor and Sullman (2009), to ensure the DRAS was measuring anxiety-related avoidance behavior. The phrase "because of anxiety" was added, so the instructions were "Please read the following statements and circle the response that best describes how often you behaved this way because of anxiety during the past 7 days, including today." The total score is the sum of all 20 item ratings (range 0–60), with higher scores representing greater avoidance. There are also subscale scores for general avoidance (items 1–3, 12, 18–20), avoidance of traffic and busy roads (items 5–10, 15), avoidance of weather or darkness (items 11–14, 17), and riding avoidance (items 4, 13–16).
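To make the scoring concrete, a minimal sketch in R is given below. The data frame dras with item columns dras1–dras20 is hypothetical; the item-to-subscale assignment simply follows the listing above (note that items 12–15 each contribute to two subscales).

# Hypothetical data frame: one row per respondent, items dras1-dras20 scored 0-3
item <- function(i) paste0("dras", i)

dras$total   <- rowSums(dras[, item(1:20)])                # total score, range 0-60
dras$general <- rowSums(dras[, item(c(1:3, 12, 18:20))])   # general avoidance, 0-21
dras$traffic <- rowSums(dras[, item(c(5:10, 15))])         # traffic/busy roads, 0-21
dras$weather <- rowSums(dras[, item(c(11:14, 17))])        # weather or darkness, 0-15
dras$riding  <- rowSums(dras[, item(c(4, 13:16))])         # riding avoidance, 0-15

In the present study, respondents with 1–3 missing item responses were scored after means imputation, as described in the Data Analysis section.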
Driving Situations Questionnaire (DSQ) The DSQ (Ehlers et al., 1994) is a 211-item measure that asks participants to rate their degree of anxiety and avoidance in response to a range of driving situations and circumstances with respect to driving alone, driving accompanied, and riding in a vehicle when someone else is driving. The modified version of the DSQ, as described by Taylor and Deane (2000), was used. Participants rated the degree to which they felt anxious in each of 39 driving situations from 0 (= no anxiety) to 4 (= extreme anxiety) and then rated their level of avoidance of the same situations from 0 (= never avoid) to 4 (= always avoid) (e.g., being tailgated by another car; driving in the fog, rain, at night; driving in an unfamiliar area). Total scores range from 0 to 156 for each scale, with higher scores indicating greater anxiety and avoidance, respectively. The DSQ has demonstrated convergent validity (Ehlers et al., 2007). Driving Cognitions Questionnaire (DCQ) The DCQ (Ehlers et al., 2007) is a 20-item scale that measures the frequency of three areas of negative cognitions in driving fear: panic-related (e.g., “I will be unable to catch my breath,” “My heart will stop beating,” “I will tremble and not be able to steer”), accident-related (e.g., “I will cause an accident,” “I will injure someone,” “I will die in an accident”), and social concerns (e.g., “Other people will notice that I am anxious,” “People will think I am a bad driver,” “I will hold up traffic and people will be angry”). Each item was rated according to how often each thought occurs while driving, using a 5-point Likert scale from 0 (= never) to 4 (= always). Scores range from 0 to 80, with higher scores reflecting more frequent negative driving-related cognitions. Psychometric properties of the DCQ have been demonstrated in three separate samples from different countries (Ehlers et al., 2007), including good internal consistency, convergent validity, and an ability to discriminate between people with and without driving phobia.
Driving Behavior Survey (DBS)
The DBS (Clapp, Olsen, Beck, et al., 2011) is a 21-item measure of anxious driving behavior that includes subscales for anxiety-based performance deficits, exaggerated safety/caution behavior, and hostile/aggressive behavior. Items are rated on a 7-point Likert scale from 1 (= never) to 7 (= always) in terms of how often each behavior is performed, in general, when a stressful driving situation occurs that makes the person feel nervous, anxious, tense, or uncomfortable. Scores are calculated as the mean of endorsed scale items, and higher scores indicate greater frequency of anxious behavior. The DBS has shown evidence of factorial and convergent validity, as well as good to excellent internal consistency (Clapp, Olsen, Beck, et al., 2011; Clapp, Olsen, Danoff-Burg, et al., 2011; Clapp, Baker, Litwack, Sloan, & Beck, 2014).

Driver Behavior Questionnaire (DBQ)
The 28-item version of the DBQ (Reason et al., 1990; as used by Sullman & Taylor, 2010) was used to measure self-reported aberrant driving behavior. Respondents indicate how often they committed each of the 8 lapses (problems with attention and memory), 8 errors (problems with observation and misjudgment), 6 ordinary violations (deliberate deviations from safe driving practices), and 6 aggressive violations (expressing hostility to other road users or driving aggressively) over the past year using a 6-point Likert scale from 0 (= never) to 5 (= all the time). The DBQ has been studied extensively since it was developed nearly three decades ago (Davey, Wishart, Freeman, & Watson, 2007; Ozkan, Lajunen, Chliaoutakis, Parker, & Summala, 2006; Parker, Reason, Manstead, & Stradling, 1995; Reason et al., 1990; Sullman, Meadows, & Pajo, 2002).

Driver Social Desirability Scale (DSDS)
The DSDS (Lajunen, Corry, Summala, & Hartley, 1997) was used to measure traffic-related socially desirable responding. The scale consists of 12 items which make up two subscales, one measuring faking of one's own driving behaviors (driver impression management; DIM) and the other measuring overly positive beliefs about one's driving (driver self-deception; DSD). The items are stated as propositions, and respondents rate their agreement with each statement on a 7-point Likert scale from 1 (= not true) to 7 (= very true). The total score ranges from 12 to 84 and was used in the present study. The DSDS has demonstrated construct validity and reliability (Lajunen et al., 1997).

Data Analysis
Data were analyzed using SPSS Version 23.0 (IBM Corp, 2015). Four participants had missing data on most items, so their data were excluded, leaving a sample of 437 participants. There were 21 participants with missing data on 1–3 items of the DRAS, and means imputation was used in these cases. One-way ANOVA addressed questions about DRAS scores for those with and without driving anxiety. Reliability, correlational, and factor analyses were used to examine the primary research hypotheses about the psychometric properties of the English language DRAS using modified instructions. A confirmatory factor analysis was conducted to determine the most appropriate factor structure of the DRAS. The CFA was conducted in AMOS v.22 using maximum likelihood estimation, and the Bollen–Stine p value bootstrapped on 2,000 samples was used due to skewed data. Goodness of fit was determined by chi-square (χ²), the comparative fit index (CFI) > .90, the root-mean-square error of approximation (RMSEA) < .06 with 90% CI, and a nonsignificant pclose value (Hu & Bentler, 1995). As the CFA showed poor fit to the data (see Results), subsequent exploratory factor analysis (EFA) was conducted in SPSS. The EFA used principal axis factoring (PAF) with direct oblimin rotation, which accounts for correlated factors. The suitability of the data for factor analysis was assessed with the Kaiser–Meyer–Olkin value > .60 (Kaiser, 1970, 1974) and Bartlett's Test of Sphericity (Bartlett, 1954), with statistical significance (p < .001) supporting the factorability of the correlation matrix. The original inputs and outputs of the analyses are provided in Electronic Supplementary Material (ESM 1).
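The authors ran the CFA in AMOS and the EFA in SPSS. Purely as a hedged illustration of the same analysis steps, the sketch below uses R with the lavaan and psych packages; the model syntax, the data frame dras_items (20 item columns), and the factor assignment are illustrative stand-ins and do not reproduce the published models.

library(lavaan)
library(psych)

# Four-factor model following the Stewart and St. Peter (2004) subscales (illustrative syntax)
model4 <- '
  general =~ dras1 + dras2 + dras3 + dras12 + dras18 + dras19 + dras20
  traffic =~ dras5 + dras6 + dras7 + dras8 + dras9 + dras10 + dras15
  weather =~ dras11 + dras12 + dras13 + dras14 + dras17
  riding  =~ dras4  + dras13 + dras14 + dras15 + dras16
'
fit4 <- cfa(model4, data = dras_items, estimator = "ML",
            se = "bootstrap", test = "bollen.stine", bootstrap = 2000)
fitMeasures(fit4, c("chisq", "df", "pvalue", "cfi", "rmsea"))

# Parallel analysis, then EFA with principal axis factoring and direct oblimin rotation
fa.parallel(dras_items, fm = "pa", fa = "fa")
efa2 <- fa(dras_items, nfactors = 2, fm = "pa", rotate = "oblimin")
print(efa2$loadings, cutoff = 0.25)   # suppress loadings below the .25 criterion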
Table 1. DRAS mean total and subscale scores (and SDs) for the present sample compared with previous New Zealand (2009) and US (2004) data

Scale | Present study, N = 437, NZ general population | Taylor and Sullman (2009), N = 301, NZ students | Stewart and St. Peter (2004), N = 386, US students
Total DRAS | 3.05 (6.64) | 13.49 (10.38) | 7.64 (8.88)
General avoidance | 0.85 (2.50) | 5.96 (4.75) | 2.91 (3.78)
Avoidance of traffic | 1.33 (3.00) | 5.50 (4.61) | 3.24 (3.88)
Avoidance of weather or darkness | 1.05 (2.43) | 1.64 (2.75) | 1.85 (2.87)
Avoidance of riding | 0.58 (1.62) | 1.67 (2.46) | 1.15 (2.33)

Note. Total score range 0–60. Subscale score range 0–21 (general and traffic avoidance), 0–15 (weather and riding avoidance).
Results

Participants had held their driver's license for an average of 35.41 years (SD = 16.69), and the average weekly mileage was 244.11 km (SD = 752.73). Most (94.51%) of the sample held a full driver's license, while 2.75% had a restricted and 0.92% a learner's license. Most (93.40%) of the sample reported no major vehicle accident in the last year, while 1.37% reported one such accident. Minor accidents were slightly more common, with 5.72% reporting one minor accident in the last year. Table 1 shows the DRAS total and subscale scores for the present study compared with previous studies. The present sample obtained lower scores than the student samples from the two previous studies, including Stewart and St. Peter's sample of students who were motor vehicle crash survivors. Participants were grouped according to the severity of driving anxiety they endorsed on the 0–10 scale (0 = not at all anxious, 5 = moderately anxious, 10 = extremely anxious). Using one-way ANOVA, those in the moderate to severe driving anxiety group (ratings of 5–10) had higher mean DRAS total and subscale scores than those in both the mild (ratings of 1–4) and no anxiety (rating of 0) groups (see Table 2). As shown by the eta-squared values, these effect sizes were in the medium range. Post hoc comparisons using the Games–Howell test indicated that those with moderate to severe driving anxiety had higher scores than both of the other two groups. Similar to previous studies, internal consistency was in the good to excellent range, with α = 0.94 for the total DRAS score, 0.82 for general avoidance, 0.87 for avoidance of traffic, 0.90 for avoidance of weather or darkness, and 0.79 for avoidance of riding. The DRAS was correlated with the validity measures to test the hypothesis regarding convergent and discriminant validity (see Table 3). The total score had high correlations with the subscale scores, and correlations between the subscales were moderate to high. As expected, there were moderate and significant correlations with other measures of driving avoidance (DSQ-Avoidance), driving anxiety (DSQ-Anxiety), and negative driving-related cognitions (DCQ). There were small to moderate correlations with the DBS subscales for anxiety-based performance deficits and exaggerated safety/caution behavior. As also expected, there were weak to no correlations with divergent variables such as hostile/aggressive behavior on the DBS, aberrant driving behavior on the DBQ, weekly mileage, and years licensed as a driver. There was no relationship between the DRAS and traffic-related socially desirable responding on the DSDS. CFA was used to examine whether the data fitted the 20-item four-factor model suggested by Stewart and St. Peter (2004), the three-factor model suggested by Taylor and Sullman (2009), and the two-factor model suggested by Blachnio et al. (2013). All three models produced poor fit indices, which were as follows: χ²(164) = 1,053.70, p < .001, CFI = .83, RMSEA = .11 (90% CI: .11–.12); χ²(167) = 1,156.12, p < .001, CFI = .82, RMSEA = .12 (.11–.12); and χ²(169) = 1,692.00, p < .001, CFI = .72, RMSEA = .15 (.14–.15), respectively. As modifications to the models would violate the confirmatory nature of a CFA, exploratory factor analysis was instead conducted on the 20 items to determine the appropriate factor structure.
Table 2. DRAS mean total and subscale scores (and SDs) for the present sample according to severity of driving anxiety on the 0–10 scale

Variable | No driving anxiety (n = 131–132) | Mild driving anxiety (n = 228) | Moderate to severe driving anxiety (n = 65–67) | F | η²
Total DRAS score | 1.57 (5.10) | 2.63 (5.25) | 7.26 (10.70) | F(2, 110.50) = 11.76*** | 0.08
General avoidance | 0.37 (1.45) | 0.66 (1.94) | 2.37 (4.49) | F(2, 92.66) = 9.59*** | 0.07
Avoidance of traffic | 0.68 (2.44) | 1.05 (1.92) | 3.45 (5.14) | F(2, 103.07) = 13.35*** | 0.10
Avoidance of weather or darkness | 0.59 (1.71) | 1.01 (2.39) | 2.15 (3.40) | F(2, 133.90) = 7.33*** | 0.04
Avoidance of riding | 0.29 (1.23) | 0.55 (1.58) | 1.30 (2.18) | F(2, 147.35) = 7.24*** | 0.04

Notes. Ns range due to missing data. Scores on 0–10 measure of driving anxiety: 0 = no driving anxiety; 1–4 = mild driving anxiety; 5–10 = moderate to severe driving anxiety. ***p < .001.
Table 3. Correlations between the DRAS and validity measures

Scale | Total DRAS score | General avoidance | Traffic avoidance | Weather avoidance | Riding avoidance
Total DRAS score | – | 0.90*** | 0.92*** | 0.85*** | 0.81***
General avoidance | – | – | 0.77*** | 0.69*** | 0.65***
Traffic avoidance | – | – | – | 0.65*** | 0.64***
Weather avoidance | – | – | – | – | 0.85***
DSQ-Anxiety | 0.53*** | 0.51*** | 0.56*** | 0.39*** | 0.37***
DSQ-Avoidance | 0.67*** | 0.61*** | 0.70*** | 0.51*** | 0.47***
DCQ Total score | 0.56*** | 0.50*** | 0.55*** | 0.44*** | 0.49***
DBS Anxiety | 0.41*** | 0.42*** | 0.40*** | 0.27*** | 0.29***
DBS Safety | 0.30*** | 0.29*** | 0.28*** | 0.24*** | 0.24***
DBS Hostile | 0.19*** | 0.15** | 0.17*** | 0.11* | 0.19***
DBQ Lapses | 0.19*** | 0.13*** | 0.19*** | 0.13* | 0.18***
DBQ Errors | 0.23*** | 0.17*** | 0.22*** | 0.15*** | 0.23***
DBQ Ordinary violations | 0.04 | 0.02 | 0.03 | 0.01 | 0.10*
DBQ Aggressive violations | 0.10* | 0.06 | 0.08 | 0.05 | 0.14**
Years held driver's license | 0.01 | 0.04 | 0.08 | 0.09 | 0.002
Km traveled per week | 0.03 | 0.02 | 0.06 | 0.07 | 0.01
DSDS | 0.06 | 0.07 | 0.04 | 0.09 | 0.09

Notes. DBQ = Driver Behaviour Questionnaire. DBS = Driving Behavior Survey. DCQ = Driving Cognitions Questionnaire. DRAS = Driving and Riding Avoidance Scale. DSDS = Driver Social Desirability Scale. DSQ = Driving Situations Questionnaire. *p < .05; **p < .005; ***p < .001.
The DRAS items were entered into an exploratory factor analysis using principal axis factoring with direct oblimin rotation to determine the factor structure in the present sample, given that the subscales are unified by the construct of avoidance. As with the factor analyses by Stewart and St. Peter (2004) and Taylor and Sullman (2009), multivariate tests of normality revealed evidence of positive skewness, where responses clustered at lower levels of avoidance. The proportion of participants rating 0 (= avoid rarely or none of the time) ranged from 76% (item 9) to 95% (item 5). As also noted by Stewart and St. Peter, this was expected given the nonclinical nature of the sample. The departures from normality did not diminish the observed correlations (see Table 3; Hair, Anderson, Tatham, & Black, 1998). The suitability of the data for factor analysis was initially assessed. The Kaiser–Meyer–Olkin value was .89, and Bartlett’s Test of Sphericity (Bartlett, 1954) reached statisti-
cal significance (p < .001), supporting the factorability of the correlation matrix. The initial EFA indicated the presence of four components with eigenvalues greater than 1. However, the scree plot and subsequent Monte-Carlo parallel analysis (O’Connor, 2000) suggested a two-factor solution. The two-factor solution explained 56.16% of the variance. Component 1 contributed 47.55%, and Component 2 contributed 8.61%. The correlation between the two components was r = 0.68. The pattern matrix of factor loadings for each variable in the rotated solution is shown in Table 4 (loadings adjusted for the correlation among the factors). The criterion for interpreting factor loadings developed by Stevens (2002) was used, because it incorporates the effect of sample size and determines a minimum value of factor loadings that accounts for stringent statistical significance (p < .01) as well as the amount of contribution (Spicer, 2005). Using
Table 4. DRAS factor structure

Item | Pattern Factor 1 | Pattern Factor 2 | Structure Factor 1 | Structure Factor 2 | Communality
3. I avoided driving a car if I could | 0.91 | – | 0.84 | 0.47 | 0.71
6. I avoided driving on busy city streets | 0.76 | – | 0.82 | 0.59 | 0.68
5. I avoided driving on residential streets | 0.75 | – | 0.73 | 0.45 | 0.54
7. I avoided driving on the motorway | 0.75 | – | 0.67 | 0.36 | 0.45
8. I avoided driving through busy intersections | 0.72 | – | 0.73 | 0.57 | 0.63
2. I chose to walk or ride a bicycle someplace to avoid driving in the car | 0.71 | – | 0.68 | 0.40 | 0.46
1. I put off a brief trip or errand that required driving the car | 0.68 | – | 0.74 | 0.43 | 0.48
9. I traveled a longer distance to avoid driving through heavy traffic or busy streets | 0.67 | – | 0.58 | 0.43 | 0.46
20. I avoided activities that required using a car | 0.56 | – | 0.71 | 0.55 | 0.54
18. I put off a brief trip or errand that required riding in a car | 0.54 | – | 0.64 | 0.50 | 0.42
10. I rescheduled making a drive in the car to avoid traffic | 0.52 | – | 0.64 | 0.52 | 0.43
19. I chose to ride a bus someplace to avoid driving in the car | 0.36 | – | 0.55 | 0.53 | 0.36
4. I avoided riding in a car if I could | 0.35 | – | 0.48 | 0.42 | 0.25
13. I avoided riding in a car because the weather was bad (e.g., fog, rain, or ice) | – | 0.99 | 0.49 | 0.89 | 0.81
17. I rescheduled making a drive in the car to avoid bad weather (e.g., fog, rain, or ice) | – | 0.83 | 0.59 | 0.87 | 0.75
11. I avoided driving the car because the weather was bad (e.g., fog, rain, or ice) | – | 0.78 | 0.62 | 0.78 | 0.63
14. I avoided riding in a car after dark | – | 0.65 | 0.45 | 0.75 | 0.56
12. I avoided driving the car after dark | – | 0.58 | 0.61 | 0.73 | 0.57
15. I avoided riding in a car if I knew the traffic was heavy | – | 0.47 | 0.53 | 0.62 | 0.41
16. I avoided riding in a car on motorway | – | 0.34 | 0.45 | 0.49 | 0.27

Note. – indicates a pattern loading < 0.25.
this method, the minimum value required to interpret a factor loading was .25 (obtained by doubling the .129 critical value for N = 400). As can be seen in Table 4, all items met this requirement and loaded onto one factor with a loading > .25 (represented in bold). The pattern matrix showed a reasonably clear two-factor solution, and item 18 loaded on Factor 1 instead of overlapping onto two factors, as was the case in our previous study (Taylor & Sullman, 2009). Factor 1 contained all of the general and traffic avoidance items and one of the riding avoidance items (item 4). Factor 2 was made up of all of the weather and the remaining riding avoidance items and included all four items that appear on more than one subscale of the DRAS using the Stewart and St. Peter (2004) structure. This two-factor solution largely supports the solution proposed by Blachnio et al. (2013), with the exceptions that item 18 “I put off a brief trip or errand that required riding in a car” loaded onto Factor 1 and item 17 “I rescheduled making a drive in the car to avoid bad weather (e.g., fog, rain, or ice)” loaded onto Factor 2. The two factors were moderately correlated (r = .68, p < .001), demonstrating they were related yet still measuring separate constructs. The factors also showed good internal consistency. Cronbach’s α were .92 for Factor 1 (general and traffic avoidance) and .89 for Factor 2 (weather and riding avoidance). Overall, the factor analysis supported
the distinction between two types of driving anxiety-related avoidance, general and traffic avoidance versus weather and riding avoidance, which are measured by the DRAS. Table 5 shows the correlations of the two resultant factors with the Driver Behaviour Questionnaire, Driving Behavior Survey, Driving Cognitions Questionnaire, Driving Situations Questionnaire, and the Driver Social Desirability Scale. Similar relationships were observed for the two factors as reported above (see Table 3) for the four-factor solution. In particular, general and traffic avoidance and weather and riding avoidance were both moderately positively related to increased anxiety and avoidance of driving situations measured with the Driving Situations Questionnaire (Clapp, Olsen, Beck, et al., 2011). Scores on general and traffic avoidance and weather and riding avoidance were also related to increased frequencies of self-reported errors and lapses measured with the Driver Behaviour Questionnaire (Reason et al., 1990).
Discussion

The present study investigated the psychometric properties of the DRAS in a sample from the general population using modified instructions to ensure that responses reflected
Table 5. Correlations between the DRAS and its subscales as well as validity measures

Scale | General and traffic avoidance, M = 1.84 (SD = 4.45) | Weather and riding avoidance, M = 1.24 (SD = 2.84)
Total DRAS score | 0.94*** | 0.88***
DSQ-Anxiety | 0.56*** | 0.42***
DSQ-Avoidance | 0.67*** | 0.53***
DCQ Total score | 0.56*** | 0.46***
DBS Anxiety | 0.45*** | 0.27***
DBS Safety | 0.29*** | 0.25***
DBS Hostile | 0.19*** | 0.14**
DBQ Lapses | 0.19*** | 0.14**
DBQ Errors | 0.22*** | 0.17***
DBQ Ordinary violations | 0.04 | 0.02
DBQ Aggressive violations | 0.09 | 0.07
Years held driver's license | 0.08 | 0.06
Km traveled per week | 0.06 | 0.05
DSDS | 0.04 | 0.09

Notes. DBQ = Driver Behaviour Questionnaire. DBS = Driving Behavior Survey. DCQ = Driving Cognitions Questionnaire. DRAS = Driving and Riding Avoidance Scale. DSDS = Driver Social Desirability Scale. DSQ = Driving Situations Questionnaire. **p < .005; ***p < .001.
avoidance due to driving anxiety. The DRAS scores in the present study were much lower than those in previous research using student samples, despite all of the data originating from nonclinical samples. It is unclear why the scores were so different in this general population sample compared to the undergraduate student samples used in previous studies. However, the DRAS total and subscale scores distinguished between levels of driving anxiety in the expected direction. In terms of the study hypotheses, the psychometric properties were very similar to those reported by Stewart and St. Peter. Internal consistency for the overall DRAS was very good at 0.94, which is similar to the coefficients in previous studies (0.89–0.92). The four subscales also had good internal consistency, ranging from 0.79 to 0.90, in line with previous studies. The DRAS demonstrated convergent validity with other measures of driving avoidance. Correlations with the DSQ-Avoidance were the highest among all of the other measures used (0.47–0.70), and correlations with measures of driving anxiety were also good (0.37– 0.56), as would be expected given the close relationship between anxiety and avoidance. There was also evidence of discriminant validity. The DRAS had weak to nil correlations with the subscales of the DBQ, as well as weekly mileage, years licensed as a driver, and traffic-related socially desirable responding. The two-factor model reported by Blachnio et al. (2013) was also largely reproduced in this general population sample, compared with the three-factor model found in other previous research with undergraduate students (Taylor & Sullman, 2009). The only differences were that two items loaded onto the opposite factor in the present study com-
pared to in Blachnio et al., items 18 (“I put off a brief trip or errand that required riding in a car”) and 17 (“I rescheduled making a drive in the car to avoid bad weather (e.g., fog, rain, or ice)”). While the analysis supported the factor structure of the DRAS and the distinction between different types of driving anxiety-related avoidance, it also suggested that a simpler and more parsimonious factor structure may be appropriate, in terms of items representing general and traffic avoidance compared to those reflecting weather and riding avoidance. These small differences in findings do not call into question the construct validity of the DRAS, although further assessment is warranted to clarify how the subscales can best represent the construct of driving anxiety-related avoidance, and in particular whether there is more extensive support for small changes to the DRAS subscales. Further research is needed to examine the construct validity of the DRAS, especially when used with clinical populations. Such research should also include measures of more subtle expressions of driving anxietyrelated avoidance, such as safety behaviors (e.g., using Ehring, Ehlers, & Glucksman’s, 2006; Safety Behaviours Questionnaire, SBQ; Koch & Taylor, 1995; Taylor & Koch, 1995). The present study is subject to limitations, perhaps most important of which is the fact that 29.4% of the random sample took part in the study. While the final group of participants was demographically similar to the larger random sample except for being slightly older in average age, it is not possible to know whether there were systematic differences on the key variable of interest between the random and final samples. For example, the final sample may have had overall lower levels of driving-related anxiety and avoidance than might occur in the general population,
although it is not possible to know this with certainty given that there are no representative population studies of driving anxiety. Future studies should consider alternative data collection methods to improve response rates, such as the Internet, which has been shown to provide responses equivalent to traditional paper-and-pencil administration in traffic psychology research (Sullman, Stephens, & Taylor, 2016) and other research areas (e.g., Davidov & Depner, 2011; De Beuckelaer & Lievens, 2009; Sawhney & Cigularov, 2014; Wolf, Hattrup, & Mueller, 2011). In summary, the DRAS is a promising measure of driving anxiety-related avoidance behavior that can be used irrespective of vehicle crash involvement, which is important given that driving anxiety does not always occur in the context of crash history (Taylor et al., 2002). The use of clearer instructions in the present study seems to have provided evidence of construct validity that is more consistent with Blachnio et al.'s (2013) research using the revised instructions than with Taylor and Sullman's (2009) study. Given the low levels of driving-related avoidance in the present general population and previous student samples, further research using clinical samples would provide more robust evidence of the psychometric properties of the DRAS.
Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000502
ESM 1. Data (.pdf)
The original inputs and outputs of the analyses.

References
Bartlett, M. S. (1954). A note on the multiplying factors for various chi square approximations. Journal of the Royal Statistical Society, 16, 296–298. Blachnio, A., Przepiórka, A., Sullman, M., & Taylor, J. (2013). Polish adaptation of the Driving and Riding Avoidance Scale. Polish Psychological Bulletin, 44, 59–66. https://doi.org/10.2478/ppb2013-0021 Blanchard, E. B., & Hickling, E. J. (2004). After the crash: Psychological assessment and treatment of survivors of motor vehicle accidents (2nd ed.). Washington, DC: American Psychological Association. Clapp, J. D., Olsen, S. A., Beck, J. G., Palyo, S. A., Grant, D. M., Gudmundsdottir, B., & Marques, L. (2011). The Driving Behavior Survey: Scale construction and validation. Journal of Anxiety Disorders, 25, 96–105. https://doi.org/10.1016/j.janxdis.2010. 08.008 Clapp, J. D., Olsen, S. A., Danoff-Burg, S., Hagewood, J. H., Hickling, E. J., Hwang, V. S., & Beck, J. G. (2011). Factors contributing to anxious driving behaviour: The role of stress history and accident severity. Journal of Anxiety Disorders, 25, 592–598. https://doi.org/10.1016/j.janxdis.2011.01.008
Clapp, J. D., Baker, A. S., Litwack, S. D., Sloan, D. M., & Beck, J. G. (2014). Properties of the Driving Behavior Survey among individuals with motor vehicle accident-related posttraumatic stress disorder. Journal of Anxiety Disorders, 28, 1–7. https:// doi.org/10.1016/j.janxdis.2013.10.008 Davey, J., Wishart, D., Freeman, J., & Watson, B. (2007). An application of the driver behaviour questionnaire in an Australian organisational fleet setting. Transportation Research Part F, 10, 11–21. https://doi.org/10.1016/j.trf.2006.03.001 Davidov, E., & Depner, F. (2011). Testing for measurement equivalence of human values across online and paper-andpencil surveys. Quality & Quantity, 45, 375–390. https://doi.org/ 10.1177/0020715210363534 De Beuckelaer, A., & Lievens, F. (2009). Measurement equivalence of paper-and-pencil and internet organisational surveys: A large scale examination in 16 countries. Applied Psychology, 58, 336–361. https://doi.org/10.1111/j.1464-0597.2008.00350.x Ehlers, A., Hofmann, S. G., Herda, C. A., & Roth, W. T. (1994). Clinical characteristics of driving phobia. Journal of Anxiety Disorders, 8, 323–339. https://doi.org/10.1016/0887-6185(94) 00021-2 Ehlers, A., Taylor, J. E., Ehring, T., Hofmann, S. G., Deane, F. P., Roth, W. T., & Podd, J. V. (2007). The Driving Cognitions Questionnaire: Development and preliminary psychometric properties. Journal of Anxiety Disorders, 21(4), 493–509. Ehring, T., Ehlers, A., & Glucksman, E. (2006). Contribution of cognitive factors to the prediction of posttraumatic stress disorder, phobia and depression after motor vehicle accidents. Behaviour Research and Therapy, 44, 1699–1716. https://doi. org/10.1016/j.brat.2005.11.013 Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998). Multivariate data analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall. Hu, L.-T., & Bentler, P. M. (1995). Evaluating model fit. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 76–99). Thousand Oaks, CA: Sage Publications. IBM Corp. Released. (2015). IBM SPSS statistics for Windows, Version 23.0. Armonk, NY: IBM Corp. Kaiser, H. (1970). A second generation little jiffy. Psychometrika, 35, 401–415. Kaiser, H. (1974). An index of factorial simplicity. Psychometrika, 39, 31–36. https://doi.org/10.1007/BF02291575 Koch, W. J., & Taylor, S. (1995). Assessment and treatment of motor vehicle accident victims. Cognitive and Behavioral Practice, 2, 327–342. https://doi.org/10.1016/S1077-7229(95)80016-6 Kuch, K., Cox, B. J., & Direnfeld, D. M. (1995). A brief self-rating scale for PTSD after road vehicle accident. Journal of Anxiety Disorders, 9, 503–514. https://doi.org/10.1016/0887-6185(95) 00029-N Lajunen, T., Corry, A., Summala, H., & Hartley, L. (1997). Impression management and self-deception in traffic behaviour inventories. Personality and Individual Differences, 2, 341– 353. https://doi.org/10.1016/S0191-8869(96)00221-8 O’Connor, B. (2000). SPSS and SAS programs for determining the number of components using parallel analysis and Velicer’s MAP test. Behavior Research: Methods Instruments, and Computers, 32, 396–402. https://doi.org/10.3758/BF03200807 Ozkan, T., Lajunen, T., Chliaoutakis, J. E., Parker, D., & Summala, H. (2006). Cross-cultural differences in driving behaviours: A comparison of six countries. Transportation Research Part F, 9, 227–242. https://doi.org/10.1016/j.trf.2006.01.002 Parker, D., Reason, J. T., Manstead, A. S. R., & Stradling, S. G. (1995). 
Driving errors, driving violations and accident involvement. Ergonomics, 38, 1036–1048. https://doi.org/10.1080/ 00140139508925170
Reason, J., Manstead, A., Stradling, S., Baxter, J., & Campbell, K. (1990). Errors and violations on the roads: A real distinction? Ergonomics, 33, 1315–1332. https://doi.org/10.1080/ 00140139008925335 Sawhney, G., & Cigularov, K. P. (2014). Measurement equivalence and latent mean differences of personality scores across different media and proctoring administration conditions. Computer in Human Behavior, 36, 412–421. https://doi.org/ 10.1016/j.chb.2014.04.010 Spicer, J. (2005). Making sense of multivariate data analysis. Thousand Oaks, CA: Sage Publications. Stevens, J. P. (2002). Applied multivariate statistics for the social sciences (4th ed.). Mahwah, NJ: Erlbaum. Stewart, A. E., & St. Peter, C. C. (2004). Driving and riding avoidance following motor vehicle crashes in a non-clinical sample: Psychometric properties of a new measure. Behaviour Research and Therapy, 42, 859–879. https://doi.org/10.1016/ S0005-7967(03)00203-1 Sullman, M. J. M., Meadows, M. L., & Pajo, K. (2002). Aberrant driving behaviours amongst New Zealand truck drivers. Transportation Research Part F, 5, 217–232. https://doi.org/10.1016/ S1369-8478(02)00019-0 Sullman, M. J. M., & Taylor, J. E. (2010). Social desirability and self-reported driving avoidance: Should we be worried? Transportation Research Part F, 13, 215–221. https://doi.org/ 10.1016/j.trf.2010.04.004 Taylor, J. E. (2018). The extent and characteristics of driving anxiety. In Transportation Research Part F: Psychology and Behaviour (58, pp. 70–79). https://doi.org/10.1016/j.trf.2018. 05.031 Taylor, J. E., & Deane, F. P. (1999). Acquisition and severity of driving-related fears. Behaviour Research and Therapy, 37, 435–449. https://doi.org/10.1016/S0005-7967(98)00065-5 Taylor, J. E., & Deane, F. P. (2000). Comparison and characteristics of motor vehicle accident (MVA) and non-MVA driving fears. Journal of Anxiety Disorders, 14, 281–298. https://doi.org/ 10.1016/S0887-6185(99)00040-7 Taylor, J., Deane, F., & Podd, J. (2002). Driving-related fear: A review. Clinical Psychology Review, 22, 631–645. https://doi. org/10.1016/S0272-7358(01)00114-3 Taylor, J. E., Deane, F. P., & Podd, J. V. (2007). Diagnostic features, symptom severity, and help-seeking in a mediarecruited sample of women with driving fear. Journal of Psychopathology and Behavioural Assessment, 29, 81–91. https://doi.org/10.1007/s10862-006-9032-y
Taylor, J. E., Deane, F. P., & Podd, J. V. (2008). The relationship between driving anxiety and driving skill: A review of human factors and anxiety-performance theories to clarify future research needs. New Zealand Journal of Psychology, 37, 28–37. Taylor, S., & Koch, W. J. (1995). Anxiety disorders due to motor vehicle accidents: Nature and treatment. Clinical Psychology Review, 15, 721–738. https://doi.org/10.1016/0272-7358(95) 00043-7 Taylor, J. E., & Sullman, M. J. M. (2009). What does the Driving and Riding Avoidance Scale (DRAS) measure? Journal of Anxiety Disorders, 23, 504–510. https://doi.org/10.1016/j.janxdis.2008. 10.006 Walshe, D. G., Lewis, E. J., Kim, S. I., O’Sullivan, K., & Wiederhold, B. K. (2003). Exploring the use of computer games and virtual reality in exposure therapy for fear of driving following a motor vehicle accident. Cyberpsychology and Behavior, 6, 329–334. https://doi.org/10.1089/109493103322011641 Wolf, T. R., Hattrup, K., & Mueller, K. (2011). A cross-national investigation of the measurement equivalence of computerized organizational attitude surveys: A two-study design in multiple nations. Journal of Organizational Computing and Electronic Commerce, 21, 246–263. https://doi.org/10.1080/10919392. 2011.590112
History
Received July 2, 2017
Revision received May 23, 2018
Accepted June 6, 2018
Published online December 19, 2018

Acknowledgments
Special thanks to those who participated in this study.

Funding
This work was supported by the Massey University Research Fund [Grant Number 10/0084].

Joanne E. Taylor
School of Psychology
Massey University
Private Bag 11-222
Palmerston North
New Zealand
j.e.taylor@massey.ac.nz
Multistudy Report
The Multidimensional Structure of Math Anxiety Revisited
Incorporating Psychological Dimensions and Setting Factors

Sofie Henschel¹ and Thorsten Roick²

¹ Institute for Educational Quality Improvement (IQB), Berlin, Germany
² Senate Department for Education, Youth and Family, SIBUZ Pankow, Berlin, Germany
Abstract: The study introduces a math anxiety scale that systematically addresses psychological components, including cognitive (worry) and affective (nervousness) math anxiety when dealing with mathematical problems in mathematics-related settings (concerning tests, teachers, learning in class, working with mathematics textbooks, mathematics homework, and applying mathematics in everyday life). Our results indicate a hierarchical structure of math anxiety. Specifically, cognitive and affective math anxiety at the second-order level each determined three setting factors at the first-order level concerning evaluation (tests, teachers), learning (in class, with mathematics books, and during homework), and application (applying mathematics in everyday life). Furthermore, girls reported higher math anxiety than boys, which was particularly pronounced in the affective scale and in high-stakes academic settings, such as those involving evaluation and learning. After controlling for mathematics performance, gender effects decreased in all sub-dimensions but remained significant in affective math evaluation anxiety. Practical implications and directions for further research on cognitive and affective math anxiety are discussed. Keywords: math anxiety, achievement emotion, worry, nervousness, mathematics performance
Math anxiety is an achievement-related emotion that can be described as a feeling of tension, apprehension, or fear in the processing of mathematical problems in daily life and in school settings (Richardson & Suinn, 1972). The control-value theory of achievement emotions (Pekrun, 2006) proposes that appraisals of control (e.g., competence beliefs) and value appraisals (e.g., perceived value of an activity or outcome) serve as critical antecedents of emotional experiences. Specifically, low perceived controllability along with high values individuals ascribe to learning outcomes (e.g., achieving a good grade in order to meet the expectations of significant others) are assumed to evoke math anxiety. Accordingly, students with higher levels of math anxiety have been found to report lower self-concept and self-esteem as well as high achievement values compared to their peers (Frenzel, Pekrun, & Goetz, 2007a; Hembree, 1990). Furthermore, math anxiety has been consistently found to be negatively related to mathematics performance (Hembree, 1990; Vukovic, Kieffer, Bailey, & Harari, 2013). Despite the vast body of research on the antecedents and consequences of math anxiety (Beilock, Gunderson, Ramirez, & Levine, 2010; Frenzel et al., 2007a; Maloney, Ramirez, Gunderson, Levine, & Beilock, 2015; Vukovic
et al., 2013), the internal structure of the construct itself is not yet fully understood. Specifically, the conceptualizations and operationalizations that have been developed over the last 30 years primarily address psychological components (e.g., worry and nervousness; Hembree, 1990; Wigfield & Meece, 1988) and different settings in which individuals may exhibit math anxiety (e.g., test taking, learning, or everyday application; Baloğlu & Balgalmisß, 2010; Chiu & Henry, 1990; Richardson & Suinn, 1972; Roick, Gölitz, & Hasselhorn, 2013; Satake & Amato, 1995). However, studies that systematically incorporate both psychological components and different settings are scarce and often report a total (sum or mean) score of math anxiety, thus implying a unidimensional structure (Lichtenfeld, Pekrun, Stupnisky, Reiss, & Murayama, 2012; Vukovic et al., 2013). The aim of the present study was to develop a math anxiety scale for 4th grade elementary students which systematically integrates both psychological factors and settings and to explore its internal structure and external validity. In the following, we describe two lines of research which conceptualize math anxiety primarily regarding psychological dimensions and regarding specific settings relevant to mathematical problem solving.
© 2018 Hogrefe Publishing
European Journal of Psychological Assessment (2020), 36(1), 123–135 https://doi.org/10.1027/1015-5759/a000477
The Psychological Structure of Math Anxiety

Math anxiety has overlaps with general test anxiety but both constructs are theoretically and empirically distinct from each other (Hembree, 1990). The first line of research detailed in this paper, however, builds on the conceptualization of general test anxiety and differentiates between two main psychological components (Liebert & Morris, 1967). Conscious worry or concern represents the cognitive component comprising self-deprecatory thoughts about one's performance, negative expectations, and preoccupation with anxiety-causing situations (Wigfield & Meece, 1988). Emotionality represents the affective component and involves feelings of nervousness, fear, and tension along with unpleasant physiological reactions. In a meta-analysis, Hembree (1988) reported moderate to high correlations (.67 ≤ r ≤ .78) between cognitive and affective components of general test anxiety and showed that the cognitive component correlated more strongly and negatively with test performance than the affective component. This result supports the attentional control theory (Derakshan & Eysenck, 2009), according to which it is in particular the cognitive (worry) component of anxiety that exerts a negative effect on performance by co-opting the limited resources of the working memory system (Baddeley, 2001), which are otherwise used for task processing. However, research on math anxiety which differentiates between these two psychological dimensions is rare and points to different findings. For example, Wigfield and Meece (1988) as well as Ho et al. (2000) used the Mathematics Anxiety Questionnaire (MAQ), which consists of 11 items referring to worrisome thoughts about positive evaluations (e.g., "In general, how much do you worry about how well you are doing in math?") and emotionality (e.g., "When I am taking math tests, I usually feel. . ."). The items address different academic settings to varying degrees, including test taking (2 items), being evaluated by the teacher (2 items), learning in mathematics (6 items), and learning in school (1 item). However, some settings are confounded with psychological dimensions. For example, items about the mathematics teacher and learning in school are only captured in the worry scale, whereas test taking is only addressed in the emotionality scale. Exploratory factor analyses indicated a two-factor solution for 6th–12th graders with moderate correlations between cognitive and affective math anxiety (.25 ≤ r ≤ .38; Wigfield & Meece, 1988). In contrast to general test anxiety research, the affective component of the MAQ correlated more strongly and negatively with students' mathematics performance than the cognitive scale, which was either positively or not at all related to mathematics performance (Ho et al., 2000; Wigfield & Meece, 1988).
The differences between the research results on general test anxiety and math anxiety are probably due to different conceptualizations of the cognitive (worry) scale. In general test anxiety, the cognitive scale addresses concerns about self-deprecatory thoughts and negative evaluations. Thus, the focus is on failure (Liebert & Morris, 1967). By contrast, the cognitive scale of the MAQ reflects concerns about positive evaluations and thus focuses on success (Wigfield & Meece, 1988). As concerns about success relate to task completion instead of interferences to it, they may function as a motivator and even affect mathematics performance positively because individuals invest more effort in order to avoid failure (Ho et al., 2000; Wigfield & Meece, 1988). More recent studies that capture cognitive concerns about success or failure and affective aspects (along with negative physiological reactions) to varying degrees typically report a total (sum or mean) score of math anxiety (Lichtenfeld et al., 2012; Vukovic et al., 2013). Therefore, the underlying structure of cognitive and affective math anxiety is still somewhat unclear.
The Setting Structure of Math Anxiety According to the control-value theory of achievement emotions (Pekrun, 2006), features of the social environment (e.g., interactions with and expectations of teachers and parents) vary between settings (e.g., test vs. classroom vs. everyday life) and may thus differentially influence the interplay between motivational antecedents (e.g., selfconcept) and emotional experiences. Lichtenfeld et al. (2012) investigated 2nd and 3rd graders’ achievement emotions in mathematics. The authors showed that students experience enjoyment, boredom, and anxiety (involving cognitive and affective components) differently in situations related to tests, learning in class, and homework (boredom was only assessed in relation to homework and learning in class). Moreover, a hierarchical second-order structure, which integrates the achievement emotions enjoyment, boredom, and anxiety at the second-order level and different settings (test, learning in class, and homework) at the first-order level, provided the best data fit. Several studies that have explored the settings in which individuals may exhibit math anxiety are based on the Mathematics Anxiety Rating Scale (MARS; Richardson & Suinn, 1972). The original MARS consists of 98 items that describe mathematical settings which may arouse anxiety in academic situations (e.g., test taking, learning) and in everyday situations (e.g., checking receipts). Respondents are asked to imagine each of these situations and to rate their anxiety and nervousness. Thus, the scale captures in particular the affective math anxiety component. As the MARS was shown to be psychometrically sound across Ó 2018 Hogrefe Publishing
gender, grade, and ethnicity (Hembree, 1990), several short versions (e.g., MARS-R; Plake & Parker, 1982) and adaptations for adolescents (Abbreviated Math Anxiety Scale, AMAS; Hopko, Mahadevan, Bare, & Hunt, 2003) and children (Mathematics Anxiety Scale for Children, MASC; Chiu & Henry, 1990; MARS elementary form, MARS-E; Suinn, Taylor, & Edwards, 1988) have been developed and translated into different languages. For example, translations of the MARS-E are available in several languages, including German (Roick et al., 2013), Japanese (Satake & Amato, 1995), and Turkish (Baloğlu & Balgalmış, 2010). A two-factor structure of the MARS and its different versions was consistently identified across children, adolescents, and adults. The first factor involves negative affect associated with being tested in mathematics (math test anxiety) and the second factor involves negative feelings towards activities and processes of learning mathematics (math learning and numerical anxiety; Hopko et al., 2003; Plake & Parker, 1982; Suinn et al., 1988). However, for the shortened 26-item MARS-E, which is appropriate for upper elementary students, the two-dimensional structure only explained 28% of the variance in an exploratory factor analysis among 4th, 5th, and 6th graders (Suinn et al., 1988). By contrast, Lukowski et al. (2016) identified a three-dimensional factor structure of the MARS-E among 12-year-olds by means of exploratory factor analysis. In addition to math test anxiety, they differentiated between math anxiety in the classroom that captures being evaluated by the teacher and doing homework, and math calculation anxiety that involves calculating in academic settings and in everyday life situations. The three setting factors were substantially correlated with each other after controlling for general anxiety (.56 ≤ r ≤ .81). Furthermore, only math calculation anxiety was significantly associated with mathematics performance over and above math anxiety in test and classroom situations and general anxiety. This solution supports the three-dimensional setting structure Lichtenfeld et al. (2012) showed for several achievement emotions in mathematics (e.g., anxiety, enjoyment), which may indicate similar setting structures across different emotions. Satake and Amato (1995) explored the structure of the Japanese translation of the MARS-E by means of exploratory factor analysis. During the analysis procedure, they excluded the first two items and identified a four-dimensional structure for the final 24-item scale that explained 65% of variance in a sample of 5th and 6th graders. In this study, school-related math anxiety concerning test taking, calculating, learning in the classroom, and applying mathematical concepts in everyday life constituted four separate factors. Based on the revised and shortened MARS for adolescents (MARS-R; Plake & Parker, 1982), Chiu and Henry
(1990) developed the 22-item MASC. This scale is appropriate for students from upper elementary to early secondary school (until 8th grade). In contrast to the MARS-E, the MASC focuses in particular on school-related settings. Using exploratory factor analysis, Chiu and Henry (1990) determined four factors of school-related math anxiety (evaluation, learning, problem solving, and being evaluated by the teacher) among students between the 4th and 8th grade which explained 56% of the variance. More recently, Baloğlu and Balgalmış (2010) and Roick et al. (2013) identified the same five-factor structure (test, teacher, calculating in class, learning individually, and applying mathematics concepts in everyday life) by means of confirmatory factor analyses. Moderate to satisfactory reliabilities of the five dimensions were obtained among Turkish 2nd to 9th graders (.77 ≤ α ≤ .86; Baloğlu & Balgalmış, 2010) and among German 4th graders (.67 ≤ α ≤ .93; Roick et al., 2013). Furthermore, Roick et al. (2013) observed strong correlations among the latent factors that ranged between .78 and .92, which may indicate a common (affective) math anxiety factor (cf. Lichtenfeld et al., 2012).
Gender Differences in Math Anxiety Components
Several studies have reported higher math anxiety for girls than boys (Frenzel et al., 2007a; Hembree, 1990). Roick et al. (2013) showed that girls' higher nervousness in mathematics may develop in particular in academic settings because no gender differences were found in less-academic learning settings, such as doing mathematics homework and applying mathematical concepts in everyday life. It has been discussed whether girls' higher math anxiety is attributable to their lower competence and value beliefs (Frenzel et al., 2007a). The more negative motivation pattern of girls may result from social and environmental influences, such as interactions with teachers and parents as well as instructional practices, gender stereotypes, and associated stereotype threat effects which propose mathematics to be a male domain. For example, Beilock et al. (2010) showed that female elementary school teachers' math anxiety was related to higher levels of math anxiety in female students. Specifically, throughout the school year teachers transmitted their gender stereotypes ("Mathematics is a male domain.") to female students, who internalized the teachers' beliefs ("Boys are better in mathematics than girls.") and adjusted their own competence beliefs downwards. At the end of the school year, girls reported higher math anxiety and performed worse than boys in mathematics. Thus, anxiety components may be important factors that mediate stereotype threat effects on performance.
The Present Study
The literature review indicated that individuals exhibit worrisome thoughts and nervousness in several academic and everyday life settings (Ho et al., 2000; Lukowski et al., 2016). Empirical studies have consistently identified settings related to evaluations (e.g., by tests or teachers) and to activities and processes of learning and performing mathematics (Suinn et al., 1988), which, however, seem to provide further differentiations. Specifically, collective learning settings in the classroom (Lichtenfeld et al., 2012), studying individually (e.g., with mathematics textbooks and during homework; Baloğlu & Balgalmisß, 2010; Roick et al., 2013), and applying mathematical concepts in everyday life (Roick et al., 2013; Satake & Amato, 1995) have been identified across several studies. Although previous research has extensively informed research and practice about the structure of math anxiety, there are at least two limitations to the conceptualizations that primarily address psychological dimensions of math anxiety (e.g., MAQ; Wigfield & Meece, 1988) or specific mathematicsrelated settings (e.g., MARS; Richardson & Suinn, 1972). First, in both approaches the psychological dimensions (worry and nervousness) and related settings (e.g., test, homework, or learning) are often confounded (Chiu & Henry, 1990; Hopko et al., 2003). Moreover, measures that are based on the MARS predominantly capture the affective component. The cognitive component is not explicitly included although it is assumed that performance impairments in mathematics are particularly associated with worrisome thoughts (Derakshan & Eysenck, 2009). Second, many studies report a total (sum or mean) score of math anxiety, implying a unidimensional structure, which may mask whether effects and consequences of math anxiety are related to specific mathematics domains and areas (e.g., Beilock et al., 2010; Frenzel et al., 2007a; Maloney et al., 2015). To provide further insights into the structure and measurement properties of math anxiety, we developed a balanced scale that systematically integrates psychological dimensions and specific settings. In order to explore the internal structure and the external validity of the scale, we conducted two studies among 4th grade elementary students. Study I served to develop the math anxiety scale. We expected students to exhibit both worry and nervousness in at least three main situations that have been identified across previous studies (Chiu & Henry, 1990; Lukowski et al., 2016; Roick et al., 2013; Suinn et al., 1988), including evaluation (by test and teacher), learning (in class, individually during homework, and with mathematics textbooks), and when applying mathematics concepts in everyday life. It can be assumed that these settings differ in terms of
instructional practices, social interactions (e.g., with teachers, parents, and peers), and expected consequences of failure, which may influence motivational antecedents, performance expectations, and thus emotional experiences (Pekrun, 2006). Given that emotional experiences seem to be highly correlated across different situations (Lukowski et al., 2016; Roick et al., 2013), we also predicted that math anxiety would be hierarchically organized (Lichtenfeld et al., 2012), with cognitive and affective math anxiety determining the experience of worry and nervousness in specific settings. In Study II, we developed a short version of our measure that was administered in another sample of 4th graders. We aimed to replicate the internal structure that was identified in Study I and we explored gender differences across math anxiety components in order to provide first indications of external validity. We expected girls to report higher math anxiety than boys, particularly in academic settings (Roick et al., 2013) when controlling for mathematics performance. In the following, we report the methods and results of the two studies.
Method

Participants and Procedures
In both studies, we used a non-probability sampling procedure. Specifically, we contacted schools in the city districts of Berlin with at least two 4th grade classes and asked them to participate in the project. If the schools and teachers agreed to participate in the study, students and their parents received an information letter about the aim and procedure of the study and parents were asked to consent in writing to their child's participation. Ethical clearance for the study was obtained from the state education administration (processing number: VI D 1; 3.6.2014) and it was ensured that the nature and the content of the study did not affect the rights of students, teachers, parents, and other persons. In Study I, we used a cross-sectional correlational approach to analyze the psychometric properties of the math anxiety items. To this end, we developed a multi-matrix design that takes into account that each individual student only completed a subset of all items (30 out of 120) but that 80–100 students in total provided data for each item. In order to minimize the standard errors, 429 German 4th grade elementary students (52% female, Mage = 10.13 years, SD = 0.46) from 23 classrooms were finally recruited. Study I was conducted at the end of the academic year. In Study II, we focused on gender differences and expected girls to report slightly higher math anxiety than
boys. A total of 368 German 4th grade students (52% female, Mage = 9.40 years, SD = 0.52) from 24 classrooms participated in Study II, which was conducted at the beginning of the new academic year. In both studies, students answered questions about math anxiety and reported their gender and age. In Study II, participants completed a mathematics test after they had finished the math anxiety questionnaire. On both occasions, the data were assessed in a classroom setting and collected by trained research assistants. To standardize administration and reduce reading demands, questions were read aloud to the children.
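To illustrate the multi-matrix logic described above, the sketch below distributes 120 items into rotated 30-item booklets; the cyclic scheme and every name in it are hypothetical and are not taken from the actual booklet design used in Study I.

import numpy as np

# Hypothetical multi-matrix (booklet) assignment: 120 items, 30 per student,
# rotated in consecutive blocks; purely illustrative.
N_ITEMS, BOOKLET_SIZE, N_STUDENTS = 120, 30, 429
N_BOOKLETS = N_ITEMS // BOOKLET_SIZE  # 4 cyclic booklets

def booklet(index: int) -> np.ndarray:
    """Items contained in booklet `index` (a consecutive block of 30 items)."""
    start = index * BOOKLET_SIZE
    return np.arange(start, start + BOOKLET_SIZE) % N_ITEMS

# Each student receives one booklet in rotation.
assignments = [booklet(s % N_BOOKLETS) for s in range(N_STUDENTS)]

# Responses available per item under this fully balanced scheme.
counts = np.bincount(np.concatenate(assignments), minlength=N_ITEMS)
print(counts.min(), counts.max())  # 107 108

Under such a fully balanced rotation each item would be answered by roughly 107 of the 429 students; the 80–100 responses per item reported above indicate that the actual design used a different, less strictly balanced rotation.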
Measure

Math Anxiety
In Study I, we developed 120 items, which systematically incorporated two psychological dimensions (cognitive and affective) and six mathematics-related settings (test, teacher, learning in class, studying with mathematics textbooks, homework, and applying mathematical concepts in everyday life) which have been identified across previous studies (Baloğlu & Balgalmış, 2010; Chiu & Henry, 1990; Lichtenfeld et al., 2012; Lukowski et al., 2016; Roick et al., 2013; Satake & Amato, 1995). The items systematically referred to four mathematical content areas (arithmetic, geometry, word problems and magnitudes, and stochastic) and to mathematics in general. Math anxiety was assessed in a multi-matrix design; thus, each student answered a subset of 30 items. Cognitive math anxiety (60 items, M = 2.10, SD = 0.67, α = .92) addressed worrisome thoughts and concerns about negative expectations and failure when dealing with mathematical problems (Liebert & Morris, 1967). The children indicated the degree to which they agreed with each statement (e.g., "I worry that I cannot solve an arithmetical problem in mathematics class.") on a 4-point Likert scale (1 = does not apply at all, 2 = does rather not apply, 3 = partially applies, 4 = fully applies). Affective math anxiety (60 items, M = 1.80, α = .98, SD = 0.53, α = .89) related to nervousness when dealing with mathematical problems. The children ranked their nervousness (e.g., "How nervous are you when you have to add 976 + 777 + 558 on paper in mathematics class?") on a 4-point Likert scale (1 = not at all nervous, 2 = a little nervous, 3 = somewhat nervous, 4 = very nervous). The six mathematical settings were measured with 20 items of which 10 items pertained to the cognitive and affective component respectively (see Appendix, Tables A1 and A2 for item examples): Math test anxiety relates to testing situations and receiving test results. Math teacher anxiety involves being evaluated by the mathematics teacher. Math learning in class anxiety relates to the process of learning mathematics and calculating in the classroom. Math
textbook anxiety involves studying mathematics problems from mathematics textbooks. Math homework anxiety relates to studying mathematics problems individually during homework. Math application anxiety refers to the application of mathematical concepts in everyday life (see Table 1 for item and scale properties). Based on the results of Study I, we generated a meaningful scale suitable for brief group administration that was tested in Study II. The item selection process was guided by psychometric criteria (item discrimination and difficulty) and theoretical considerations (coverage of mathematical settings and content areas). First, the items of the setting factors which were determined in Study I (including evaluation, learning, and application; see Results section) were sorted by their discriminations. Then, we selected items with highest discrimination values that showed moderate difficulty without floor effects and that covered the intended mathematical content areas. Following this procedure, we selected 36 items from the original 120-item scale, of which 18 items pertained to cognitive (α = .92) and affective math anxiety (α = .89) respectively. In the final scale, the affective and cognitive items were equally distributed across the three mathematical settings. Item examples and item and scale properties are presented in the Appendix.

Mathematics Performance
In Study II, we assessed mathematics performance as a control variable using the German Mathematical Achievement Test (DEMAT 3+; Roick, Gölitz, & Hasselhorn, 2004), which can be applied between the end of the 3rd grade and the beginning of the 4th grade. The DEMAT 3+ is a curriculum-oriented achievement test that contains subscales with arithmetic problems, word problems and magnitudes, and geometry problems. Cronbach's α for the 31-item scale was .86.
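For a complete persons-by-items matrix, coefficient alpha of the kind reported for the DEMAT 3+ can be computed from item and total-score variances as in the generic sketch below; this is not the routine used for the multi-matrix math anxiety data, for which the standardized alphas had to be derived from the available inter-item correlations (see the note to Table 1).

import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Coefficient alpha for a persons x items matrix without missing data."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1).sum()
    total_variance = responses.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Toy example: six students rating four items on the 4-point scale.
x = np.array([[1, 2, 1, 2],
              [2, 2, 3, 2],
              [3, 4, 3, 3],
              [1, 1, 2, 1],
              [4, 4, 3, 4],
              [2, 3, 2, 2]])
print(round(cronbach_alpha(x), 2))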
Statistical Analysis In both studies, the statistical analyses were conducted with SPSS and Mplus 7.4 (Muthén & Muthén, 1998–2012). In Study I, the item discriminations of the cognitive and affective math anxiety items varied between .11 and .91. Thus, it did not seem appropriate to apply the restrictive 1-PL (Rasch) model to explore the internal structure of math anxiety because the 1-PL model proposes the same discrimination for all items. Therefore, we specified a two-parameter logistic model (2-PL; see Yen & Fitzpatrick, 2006), which is less restrictive than the 1-PL model. Specifically, it takes into account the variation across items in the relationship between students’ item response and the latent trait, as indicated by the discrimination value. Therefore, the 2-PL model predicts the expected math anxiety score European Journal of Psychological Assessment (2020), 36(1), 123–135
Table 1. Descriptive statistics, discriminations, reliabilities, and correlations among setting factors of cognitive and affective math anxiety M (SD)a
r(it)min–r(it)maxa αa
1
2
4
1 Math test anxiety
2.47 (0.88)
.33–.83
.89
2 Math teacher anxiety
2.17 (0.88)
.26–.85
.90
.87
3 Math learning in class 2.00 (0.82) anxiety 4 Math homework anxiety 2.09 (0.85)
.11–.78
.85
.89
.83
.35–.84
.83
.85
.88
.91
5 Math textbook anxiety
2.00 (0.82)
.46–.77
.88
.86
.84
.87
.91
6 Math application anxiety 1.97 (0.83)
.50–.69
.83
.74
.77
.84
.77
M (SD)b
.82
3
5
6
.70
.67
.80
.50
.58
.65
.70
.53
.81
.81
.68
.83
.65 .67
.78
2.18 (0.82)
2.24 (0.86)
1.67 (0.69)
1.63 (0.70)
1.64 (0.54)
1.61 (0.70)
r(it)min–r(it)maxb
.40–.91
.34–.85
.35–.81
.50–.78
.29–.80
.48–.85
αb
.88
.87
.84
.85
.83
.84
Notes. Means (M), standard deviations (SD), discrimination range [r(it)min–r(it)max], and standardized Cronbach's alpha refer to (a) cognitive and (b) affective math anxiety. Due to the multi-matrix design, the standardized Cronbach's alphas were calculated from available intercorrelations. Correlations among cognitive setting factors are illustrated below the diagonal and correlations among affective setting factors are illustrated above the diagonal. All correlations are significant at p < .01, N = 429.
as a function of the latent ability and the two item parameters difficulty and discrimination. Moreover, we used Bayesian estimation because maximum or weighted likelihood estimation becomes computationally unwieldy with data for 120 categorical items and up to 12 different sub-dimensions (e.g., 2 × 6 settings) of cognitive and affective math anxiety. Another advantage of Bayesian estimation is that it can effectively address drift and unusual response patterns (e.g., floor or ceiling effects), and it is appropriate for smaller sample sizes (Rupp, Dey, & Zumbo, 2004). In Study I, we specified all dimensions as latent variables and used the Bayesian Information Criterion (BIC) as well as the Posterior predictive p-value (PPP) for model evaluation. Models with small BIC values should be preferred. According to Raftery (1995), differences in BIC scores of more than five units indicate strong evidence for differences in model appropriateness. The PPP reflects the discrepancy between the model-generated data and the observed data. Zyphur and Oswald (2015) suggest regarding PPP values around .50 as an indicator of good model fit because, on average, the observed data are just as probable as the generated data. In Study II, the number of math anxiety items was reduced to 36. All students answered all items and it was possible to apply weighted maximum likelihood estimation to investigate gender differences in the math anxiety sub-dimensions. We used categorical items as indicators to specify the math anxiety sub-dimensions as latent variables. The estimator WLSMV (Weighted Least Square Means and Variance adjusted) was applied in conjunction with the option "type = complex" to obtain test statistics and robust standard errors which account for the nested data structure (students in classes). For hypothesis testing, we conducted multi-group comparisons (girls vs. boys) for each sub-dimension of math anxiety based on latent means with a reasonable significance level (one-sided α, Bonferroni corrected to 1% for multiple comparisons) and fair statistical power (95%). Taking into account the sample size (N = 368) and significance level (α = 1%), a posteriori test planning was used to calculate the critical effect size of Cohen's d = 0.42. In accordance with Cohen's d, 0.2 ≤ d < 0.5 indicates small, 0.5 ≤ d < 0.8 indicates moderate, and 0.8 ≤ d indicates large practical relevance of the group differences (Cohen, 1992).
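For reference, the two-parameter logistic model referred to here has the standard form below for a dichotomous response; the 4-point items require a graded extension with several thresholds per item, and the exact Mplus parameterization may differ, so this is an orienting sketch rather than the fitted model itself.

\[
P(X_{ij} = 1 \mid \theta_i) = \frac{\exp\big(a_j(\theta_i - b_j)\big)}{1 + \exp\big(a_j(\theta_i - b_j)\big)},
\]

where \theta_i is student i's latent (cognitive or affective) math anxiety, b_j the difficulty, and a_j the discrimination of item j; the 1-PL (Rasch) model is the special case in which all a_j are equal.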
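The critical effect size of Cohen's d = 0.42 reported above can be approximated with a standard sensitivity analysis for a two-group comparison of means; the sketch below assumes roughly equal group sizes (about 184 per group) and uses the statsmodels power module, which is an illustration and not part of the authors' SPSS/Mplus workflow.

from statsmodels.stats.power import TTestIndPower

# Smallest detectable effect for N = 368 (about 184 per group),
# one-sided alpha = .01 and statistical power = .95.
critical_d = TTestIndPower().solve_power(
    nobs1=184, ratio=1.0, alpha=0.01, power=0.95, alternative="larger"
)
print(round(critical_d, 2))  # approximately 0.41, close to the reported d = 0.42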
Results

Study I
In a first step, we examined the psychological structure of math anxiety and compared a unidimensional general factor model (BIC = 32,050) with a two-dimensional model of cognitive and affective math anxiety (BIC = 31,692). In accordance with previous research (Wigfield & Meece, 1988), the two-dimensional model provided a better fit to the data (ΔBIC = 358). The correlation between cognitive and affective math anxiety (r = .64, p < .01) was considerably stronger than in previous math anxiety studies, presumably because we addressed concerns about failure instead of success in the cognitive scale (Wigfield & Meece, 1988). At the same time, the strength of this relationship is consistent with results from studies on general test anxiety (Hembree, 1988). Furthermore, the proportion of common variance between cognitive and affective math anxiety might be relatively high because mathematics-related settings and content areas were balanced across the psychological dimensions. As it seemed appropriate to differentiate between cognitive and affective math anxiety, we subsequently analyzed the settings in which worry and nervousness may occur. First, the descriptive statistics in Table 1 (see also Electronic Supplementary Material, ESM 1) show that students
experience worries (cognitive math anxiety) and nervousness (affective math anxiety) primarily in school-related settings that involve being evaluated by tests and teachers. Furthermore, the weakest zero-order correlations were obtained between (cognitive and affective) math application anxiety and all other setting factors. On average, the correlations among the cognitive setting factors are slightly stronger (mean r = .61, SD = 0.16) and the range between the correlations is smaller (.74 ≤ r ≤ .91) as opposed to the correlations among the affective sub-dimensions (mean r = .56, SD = 0.16), which cover a wider range (.50 ≤ r ≤ .83). In line with our expectations, the relationships among cognitive and among affective setting factors respectively (Table 1) are considerably stronger than the correlations between cognitive and affective settings (Table 2, see also ESM 1), indicating that setting-specific worries and nervousness may be related to cognitive and affective math anxiety at a higher-order level (cf. Lichtenfeld et al., 2012). Therefore, we regarded the two-dimensional model of math anxiety as our baseline model (model A) and then gradually expanded the setting structure of cognitive and affective math anxiety at the first-order level (Table 3). The following considerations guided the process of model extension. Model B differentiates between math evaluation anxiety (involves test and teacher) and a residual factor that is comparable with math learning and performance anxiety (Suinn et al., 1988). In model C, we differentiated the residual factor into math learning anxiety (involves learning in class and studying mathematics individually with mathematics textbooks and during homework) and math application anxiety (Roick et al., 2013). In model D, math learning anxiety was further divided into learning collaboratively in class and studying individually (involving homework and textbooks; Baloğlu & Balgalmış, 2010; Lichtenfeld et al., 2012). Finally, in model E studying individually was divided into math textbook and math homework anxiety. Also, math evaluation anxiety was divided into math test and teacher anxiety (Chiu & Henry, 1990; Roick et al., 2013). The model fit statistics are illustrated in Table 3 (see also ESM 2). The PPP values of all second-order models (B–E) from Study I vary between .37 and .38, indicating that all models are equally appropriate and provide acceptable model fit. According to the BIC, all second-order models fit the data better than the two-dimensional first-order model A. Comparing models B to E reveals an increasing improvement in the BIC up to model C. In this 2 × 3-dimensional model, cognitive and affective math anxiety each determine three first-order factors: Math evaluation anxiety (including test and teacher), math learning anxiety (including learning in class, studying individually during homework and with mathematics textbooks), and math application anxiety (applying mathematical concepts in everyday life).
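Schematically, the hierarchical models compared here have the structure of a second-order factor model; the notation below is generic and does not reproduce the actual model syntax.

\[
y_{ij} = \lambda_j\,\eta_{s(j)} + \varepsilon_{ij}, \qquad \eta_s = \gamma_s\,\xi + \zeta_s,
\]

where y_{ij} is student i's response to item j, \eta_{s(j)} the first-order setting factor (e.g., evaluation, learning, application) to which item j belongs, and \xi the second-order factor (cognitive or affective math anxiety); the affective part of the model is specified analogously, and the two second-order factors are allowed to correlate (Ψ in Table 4).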
Table 2. Correlations between setting factors of cognitive and affective math anxiety 1
2
3
4
5
6
.71
.52
.62
.60
.60
.48
.61
.54
.38
.44
.50
.38
.55
.43
1
Math test anxiety
2
Math teacher anxiety
.65
.50
.59
3
Math learning in class anxiety
.59
.39
.49
4
Math homework anxiety
.55
.37
.53
.50
5
Math textbook anxiety
.62
.48
.58
.55
.61
.57
6
Math application anxiety
.44
.36
.39
.46
.41
.32
Notes. Setting factors of cognitive math anxiety are illustrated in columns and setting factors of affective math anxiety are displayed in rows. All correlations are significant at p < .01, N = 429.
Table 3. Fit statistics of the estimated models

Model     First-order factors   Second-order factors   BIC      ΔBICa        PPPb
Model A   2                     0                      31,692   –            .34
Model B   2                     2                      31,598   94           .37
Model C   3                     2                      31,577   116 (21)     .37
Model D   4                     2                      31,580   112 (−3)     .38
Model E   6                     2                      31,592   100 (−12)    .38
Notes. All models used Bayesian estimation. BIC = Bayesian Information Criterion. aDifference scores relative to the two-dimensional first-order model A and to the previous model (values in brackets); models with positive difference scores indicate a better approximation and should be preferred. bPPP = Posterior Predictive p-value; values substantially above zero indicate a well-fitting model. Models B–E include the following settings at the first-order level, equally determined by cognitive and affective math anxiety at the second-order level. Model B: Evaluation (test, teacher), Learning and performance (learning in class, textbook, homework, application). Model C: Evaluation (test, teacher), Learning (learning in class, textbook, homework), Application. Model D: Evaluation (test, teacher), Learning in class, Studying individually (textbook, homework), Application. Model E: Test, Teacher, Learning in class, Textbook, Homework, Application. N = 429.
Table 4 (see also ESM 2) illustrates that all setting factors of the hierarchical models have substantial loadings on the second-order factors. However, the loadings on affective math anxiety are consistently lower than on cognitive math anxiety. Although the correlation coefficients between the two second-order factors are smaller in the 2 × 4-dimensional model D and in the 2 × 6-dimensional model E, these models did not provide substantially better PPP or BIC values (see Table 3) than the more parsimonious 2 × 3-dimensional model C in which the second-order factors correlate slightly more strongly. The less favorable BIC values of the more complex models (D and E) may be due to the strong correlations between the cognitive setting factors learning in class and studying individually during homework (r = .91, p < .01, Table 1) and between test and teacher (r = .87, p < .01, Table 1). Thus, differentiating between these aspects does not seem to provide additional information gain. Taking into account the BIC and the correlations
Table 4. Parameters of the estimated models

Sub-dimensions                                                                       Model A   Model B   Model C   Model D   Model E
Cognitive and affective math anxiety (Ψ)                                             .64       .81       .73       .70       .68
Evaluation (test, teacher) (β)                                                       –         .91/.84   .91/.83   .89/.81   –
Test (β)                                                                             –         –         –         –         .91/.84
Teacher (β)                                                                          –         –         –         –         .88/.79
Learning and performance (learning in class, homework, textbook, application) (β)   –         .87/.73   –         –         –
Learning (learning in class, homework, textbook) (β)                                 –         –         .91/.82   –         –
Learning in class (β)                                                                –         –         –         .91/.93   .90/.90
Studying (homework and textbook) (β)                                                 –         –         –         .93/.83   –
Homework (β)                                                                         –         –         –         –         .93/.83
Textbook (β)                                                                         –         –         –         –         .93/.82
Application (β)                                                                      –         –         .82/.68   .83/.68   .82/.67
Notes. Further information about Models A–E is provided in Table 3. Ψ = Intercorrelations between cognitive and affective math anxiety; β = Factor loadings of first-order setting factors on cognitive and affective math anxiety. Numbers before the slash indicate loadings on cognitive math anxiety and numbers after the slash indicate loadings on affective math anxiety. All parameters (Ψ, β) are significant at p < .01, N = 429.
between the sub-dimensions (and the less informative PPP values) altogether, the parsimonious 2 × 3-dimensional model seems to describe the internal structure of math anxiety best. Finally, we compared the 2 × 3-dimensional second-order model C with a 3-dimensional first-order setting model (BIC = 32,070) which integrates the same cognitive and affective settings of model C into three composite factors (evaluation, learning, and application). The setting factors were highly correlated (.82 ≤ r ≤ .92) and the model did not provide a better fit to the data than the 2 × 3-dimensional second-order model (ΔBIC = 493). Thus, our results support the multidimensional hierarchical structure of math anxiety with 2 × 3 setting factors (evaluation, learning, and application) at the first-order level and cognitive and affective math anxiety at the second-order level. Furthermore, integrating several settings that have been identified in previous math anxiety research (test vs. teacher; Baloğlu & Balgalmış, 2010; Roick et al., 2013) into composite factors (e.g., evaluation) seems appropriate in order to describe both cognitive and affective math anxiety equally well.
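For completeness, the BIC underlying these comparisons has the usual form

\[
\mathrm{BIC} = -2\ln\hat{L} + k\ln N,
\]

where \hat{L} is the maximized likelihood, k the number of free parameters, and N the sample size; smaller values are preferred, and, following Raftery (1995) as cited above, differences of more than about five units are read as strong evidence in favor of the model with the smaller BIC.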
Table 5. Tests of measurement invariance for the 2 × 3-dimensional model

Invariance model   χ²         df      RMSEA   TLI   CFI   Δχ²     df    p
Configural         1,406.34   1,175   .03     .96   .96   –       –     –
Metric             1,411.84   1,204   .03     .97   .96   38.66   29    .11
Scalar             1,536.87   1,276   .03     .96   .96   83.37   72    .17
Notes. All models used WLSMV estimation with categorical variables. Configural invariance: unrestricted baseline model in which each group had the same structure. Metric invariance: all first- and second-order factor loadings were constrained to be equal across groups. Scalar invariance: The intercepts of the measured variables and the intercepts of the firstorder latent factors were constrained to be equal across groups. RMSEA = Root Mean Square Error of Approximation, CFI = Comparative Fit Index, TLI = Tucker-Lewis Index. N = 368.
Study II

In order to replicate the multidimensional second-order structure of math anxiety from Study I, we compared the two-dimensional first-order baseline model of cognitive and affective math anxiety (χ² = 805.38, df = 593, p < .01, TLI = .96, CFI = .96, r = .74) with the 2 × 3-dimensional second-order model (χ² = 781.96, df = 588, p < .01, TLI = .97, CFI = .97, r = .77). In support of Study I, the model test provided a better fit (Δχ² = 17.73, df = 5, p < .01) for the second-order model, in which cognitive and affective math anxiety each determine worries and
nervousness in settings related to evaluation, learning, and application at the first-order level (see also ESM 3). As scalar measurement invariance can be assumed between male and female students (Table 5, see also ESM 4), the latent means can be reasonably compared and interpreted across groups. As hypothesized, Table 6 (see also ESM 5) shows that girls reported higher levels of cognitive and affective math anxiety than boys. Regarding the total scores, gender differences were more strongly pronounced in affective math anxiety than in cognitive math anxiety. Examining the setting factors revealed a medium to large sized gender gap in affective math evaluation anxiety, followed by moderate effects in cognitive math evaluation and learning anxieties. By contrast, only small gender differences were obtained in everyday life situations (cognitive and affective math application anxiety) and in affective math learning anxiety, which were not significant at the adjusted significance level (p < .01) for multiple comparisons. After controlling for differences in mathematics performance, the moderate gender effects in the cognitive (worry) sub-dimensions decreased and were no longer significant. By contrast, the moderate gender effect in affective math evaluation anxiety only decreased slightly and remained significant. Thus, girls' higher levels of worrisome thoughts – especially in evaluation- and learning-related settings – seem to be more strongly associated with mathematics performance than girls' higher nervousness in evaluation-related settings.
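Expressed in terms of the model parameters, the invariance models compared in Table 5 impose increasingly strict cross-group equality constraints; schematically, with Λ denoting the first- and second-order loadings and τ the item thresholds (the intercept-like parameters of the categorical indicators),

\[
\text{configural: same pattern, parameters free;}\qquad
\text{metric: } \Lambda^{(\text{girls})} = \Lambda^{(\text{boys})};\qquad
\text{scalar: } \Lambda^{(\text{girls})} = \Lambda^{(\text{boys})},\; \tau^{(\text{girls})} = \tau^{(\text{boys})}.
\]

Only under the scalar model are the latent mean comparisons in Table 6 directly interpretable.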
Table 6. Girls' and boys' cognitive and affective math anxiety

                           Girls M (SD)   Boys M (SD)   Cohen's d (p)a   Cohen's d (p)b
Cognitive math anxiety     2.15 (0.66)    1.89 (0.65)   0.49 (< .01)     0.29 (.05)
  Evaluation               2.24 (0.71)    1.96 (0.74)   0.50 (< .01)     0.31 (.05)
  Learning                 2.06 (0.74)    1.80 (0.72)   0.49 (< .01)     0.31 (.05)
  Application              2.10 (0.74)    1.89 (0.70)   0.45 (.03)       0.30 (.10)
Affective math anxiety     2.37 (0.59)    2.11 (0.65)   0.58 (< .01)     0.42 (.05)
  Evaluation               2.78 (0.67)    2.39 (0.78)   0.64 (< .01)     0.51 (< .01)
  Learning                 2.00 (0.69)    1.84 (0.68)   0.41 (.02)       0.27 (.09)
  Application              2.09 (0.72)    1.95 (0.71)   0.32 (.14)       0.26 (.15)

Notes. Means and standard deviations refer to observed scores, which correspond to the response format of the scale and can be directly interpreted. Effect sizes were determined using Cohen's d (Cohen, 1992) and refer to multigroup comparisons of latent means, which are not biased by measurement error as opposed to observed scores. Cohen's d: 0.2 ≤ d < 0.5 indicates a small, 0.5 ≤ d < 0.8 indicates a moderate, and 0.8 ≤ d indicates a large effect. aUncorrected. bAdjusted for mathematics performance. N = 368.

Discussion
The present research comprised two studies designed to examine the multidimensional structure of cognitive and affective math anxiety in different mathematics-related settings. The data from both studies support a hierarchical structure of math anxiety and provide first indications of external validity as indicated by differential gender effects. As for the hypothesized multidimensional structure, we found that students experience worries and nervousness in settings related to evaluation (e.g., by test or teacher; Chiu & Henry, 1990), learning (in class and studying individually with the mathematics textbook and during homework; Lukowski et al., 2016), and when applying mathematical concepts in everyday life situations (Roick et al., 2013). On the one hand, our findings corroborate the assumption that achievement emotions are hierarchically organized (Lichtenfeld et al., 2012) and expand upon these findings by integrating cognitive and affective components at a higher-order level. The question remains, however, whether a third level would be needed in order to take psychological components of other achievement emotions into account. In the present study, nervousness and worry were particularly pronounced in high-stakes situations that may have direct and important consequences, such as getting a bad grade in tests or being negatively evaluated by the teacher. By contrast, low-stakes everyday mathematics activities
(e.g., calculating change) do not seem to cue worries and nervousness in the same way. Our results thus add to previous research that indicated higher nervousness in test situations compared to learning in class, doing homework, and in everyday situations (Suinn et al., 1988). It is possible that the interplay between individual characteristics of the learner and environmental features may shape the experience of math anxiety in specific settings (Beilock et al., 2010). For example, students may experience lower levels of math anxiety in extracurricular settings because the perceived environment and, especially, social interactions (e.g., with parents vs. teachers) differ considerably from academic settings and consequences of failure may be less important. By contrast, aspects of the classroom environment may account in particular for variance in math anxiety related to academic settings. For example, Frenzel, Pekrun, and Goetz (2007b) showed that high perceived peer esteem and quality of instruction in mathematics related to reduced levels of math anxiety, whereas higher perceived punishment by the teacher and competition in class related to higher math anxiety. Further studies should investigate the developmental mechanisms of worry and nervousness in specific settings because this may also be helpful in order to better prevent girls’ higher math anxiety. Specifically, our results add to previously reported gender effects for a total score of math anxiety (Frenzel et al., 2007a) by showing that girls’ higher nervousness and worries seem to be particularly pronounced in high-stakes settings (Roick et al., 2013). Interestingly, although girls seem to be more nervous than boys in academic settings related to evaluations, boys seem to differ much more in the extent to which they feel nervous in such situations compared to girls because we observed considerably more variance in boys (also compared to any other sub-dimension of math anxiety; Table 6). Furthermore, the gender differences in the cognitive math anxiety components evaluation and learning were strongly
related to differences in mathematics performance. Considering the cross-sectional design of our study, this finding may be interpreted in two directions: According to the deficits model, our results may suggest that girls’ lower mathematics performance evokes their higher level of worries in academic situations related to evaluations and learning. On the other hand, the debilitating model would suggest that girls’ higher worries in the situations described above impede their mathematics performance. Several longitudinal and experimental studies provide evidence for both directions (Ma & Xu, 2004; Vukovic et al., 2013), which may suggest a reciprocal relationship over time (Carey, Hill, Devine, & Szücs, 2016). However, girls’ higher nervousness in evaluation-related settings seems to be less strongly associated with gender differences in mathematics performance. Frenzel et al. (2007a) showed in a longitudinal study that girls’ lower control and value beliefs completely accounted for the differences in a composite measure of math anxiety. Thus, future research should investigate whether perceived controllability and values can explain in particular girls’ higher nervousness in evaluation-related settings.
Limitations and Directions for Future Research The present study has some limitations that should be noted when interpreting the results. First, as we only included 4th grade elementary students, future research should investigate the generalizability of our findings across different age groups. Second, our scales are only one out of several possibilities to systematically measure cognitive and affective components of math anxiety. In order to improve the scales for use in future studies, it could be worthwhile to revise the wording of some items as well as the response format. For example, 6 out of 18 items in the cognitive scale (items f, j, k, l, m, and n; see also Appendix, Table A1) define no specific reason for worrisome thoughts and could be revised to ensure the same level of specificity in all items. Furthermore, to improve the comparability between both scales, it could be useful to adjust the current response format of the cognitive scale (does not apply at all to fully applies) to the more specific format that is used in the affective scale (not at all nervous/worried to very nervous/worried). Furthermore, the gender effects and their differential associations with mathematics performance are only very first indications of the construct validity of the scale. To provide sufficient evidence for the multidimensional structure, subsequent studies should investigate the specificity and usefulness of the underlying setting factors more closely. For example, by examining differential relationships with other measures of cognitive (e.g., MAQ; Wigfield & Meece,
1988) and affective math anxiety (e.g., MARS-E; Richardson & Suinn, 1972). Moreover, stereotypical beliefs and instructional practices of teachers may affect in particular math anxiety in academic settings (Beilock et al., 2010), whereas beliefs, expectations, and practices of parents may primarily affect math application anxiety (Maloney et al., 2015). Given that students seem to experience feelings of nervousness and worrisome thoughts in different settings, teachers and parents should be encouraged to reflect on their (potentially stereotypical) behavior as this may shape students' motivational beliefs and subsequently their emotional experiences. Taken together, the findings of the present study suggest that further research on math anxiety should consider the distinction not only between worry and nervousness but also between different settings in which math anxiety may occur in order to broaden our knowledge about affective processes in learning situations and how they interact with learning outcomes. We are hopeful that the theoretical math anxiety framework and the empirical scale introduced in this study will prove to be useful for further investigating the antecedents and consequences of cognitive and affective math anxiety components in different achievement-related settings more closely.

Electronic Supplementary Materials
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000477
ESM 1. Tables (.txt). Input and output information of the latent correlations.
ESM 2. Tables (.txt). Input and output information for Study I (Models A–E).
ESM 3. Tables (.txt). Input and output information for Study II.
ESM 4. Tables (.txt). Invariance Models of Study II.
ESM 5. Tables (.txt). Gender differences in 1st- and 2nd-order factors, Study II.
References Baddeley, A. D. (2001). Is working memory still working? The American Psychologist, 56, 851–864. https://doi.org/10.1027/ 1016-9040.7.2.85 Baloğlu, M., & Balgalmisß, E. (2010). The adaptation of the mathematics anxiety rating scale-elementary form into Turkish, language validity, and preliminary psychometric investigation. Educational Sciences: Theory & Practice, 10, 101–110. Beilock, S. L., Gunderson, E. A., Ramirez, G., & Levine, S. C. (2010). Female teachers’ math anxiety affects girls’ math achievement. PNAS Proceedings of the National Academy of Sciences of the United States of America, 107, 1860–1863. https://doi.org/ 10.1073/pnas.0910967107
Carey, E., Hill, F., Devine, A., & Szücs, D. (2016). The chicken or the egg? The direction of the relationship between mathematics anxiety and mathematics performance. Frontiers in Psychology, 6, 1–6. https://doi.org/10.3389/fpsyg.2015.01987 Chiu, L.-H., & Henry, L. L. (1990). Development and validation of the Mathematics Anxiety Scale for Children. Measurement and Evaluation in Counseling and Development, 23, 121–128. Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. https://doi.org/10.1037/0033-2909.112.1.155 Derakshan, N., & Eysenck, M. W. (2009). Anxiety, processing efficiency, and cognitive performance: New developments from attentional control theory. European Psychologist, 14, 168–176. https://doi.org/10.1027/1016-9040.14.2.168 Frenzel, A. C., Pekrun, R., & Goetz, T. (2007a). Girls and mathematics – A hopeless issue? A control-value approach to gender differences in emotions towards mathematics. European Journal of Psychology of Education, 22, 497–514. https://doi.org/ 10.1007/BF03173468 Frenzel, A. C., Pekrun, R., & Goetz, T. (2007b). Perceived learning environment and students’ emotional experiences: A multilevel analysis of mathematics classrooms. Learning and Instruction, 17, 478–493. https://doi.org/10.1016/j.learninstruc.2007.09.001 Hembree, R. (1988). Correlates, causes, effects, and treatment of test anxiety. Review of Educational Research, 58, 47–77. https://doi.org/10.3102/00346543058001047 Hembree, R. (1990). The nature, effects, and relief of mathematics anxiety. Journal for Research in Mathematics Education, 21, 33–46. https://doi.org/10.2307/749455 Ho, H.-Z., Senturk, D., Lam, A. G., Zimmer, J. M., Hong, S., & Okamoto, Y. (2000). The affective and cognitive dimensions of math anxiety: A cross-national study. Journal for Research in Mathematics Education, 31, 362–379. https://doi.org/10.2307/ 749811 Hopko, D. R., Mahadevan, R., Bare, R. L., & Hunt, M. K. (2003). The Abbreviated Math Anxiety Scale (AMAS): Construction, validity, and reliability. Assessment, 10, 178–182. https://doi.org/ 10.1177/1073191103010002008 Lichtenfeld, S., Pekrun, R., Stupnisky, R. H., Reiss, K., & Murayama, K. (2012). Measuring students’ emotions in the early years: The Achievement Emotions Questionnaire-Elementary School (AEQ-ES). Learning and Individual Differences, 22, 190–201. https://doi.org/10.1016/j.lindif.2011.04.009 Liebert, R. M., & Morris, L. W. (1967). Cognitive and emotional components of test anxiety: A distinction and some initial data. Psychological Reports, 20, 975–978. https://doi.org/10.2466/ pr0.1967.20.3.975 Lukowski, S. L., DiTrapani, J., Jeon, M., Wang, Z., Schenker, V. J., . . . Petrill, S. A. (2016). Multidimensionality in the measurement of math-specific anxiety and its relationship with mathematical performance. Learning and Individual Differences. Advance online publication. https://doi.org/10.1016/j.lindif.2016.07.007 Ma, X., & Xu, J. (2004). The causal ordering of mathematics anxiety and mathematics achievement: A longitudinal panel analysis. Journal of Adolescence, 27, 165–179. https://doi.org/10.1016/ j.adolescence.2003.11.003 Maloney, E. A., Ramirez, G., Gunderson, E. A., Levine, S. C., & Beilock, S. L. (2015). Intergenerational effects of parents’ math anxiety on children’s math achievement and anxiety. Psychological Science, 26, 1480–1488. https://doi.org/10.1177/ 0956797615592630 Muthén, L. K., & Muthén, B. O. (1998–2012). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén. Pekrun, R. (2006). 
The control-value theory of achievement emotions: Assumptions, corollaries, and implications for educational research and practice. Educational Psychology
Review, 18, 315–341. https://doi.org/10.1007/s10648-0069029-9 Plake, B. S., & Parker, C. S. (1982). The development and validation of a revised version of the Mathematics Anxiety Rating Scale. Educational and Psychological Measurement, 42, 551–557. Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163. https://doi.org/ 10.2307/271063 Richardson, F. C., & Suinn, R. M. (1972). The Mathematics Anxiety Rating Scale: Psychometric data. Journal of Counseling Psychology, 19, 551–554. https://doi.org/10.1037/h0033456 Roick, T., Gölitz, D., & Hasselhorn, M. (2004). Deutscher Mathematiktest für dritte Klassen (DEMAT 3+) [German Mathematical Achievement Test for Third Graders]. Göttingen, Germany: Beltz Test. Roick, T., Gölitz, D., & Hasselhorn, M. (2013). Affektive Komponenten der Mathematikkompetenz: Die Mathematikangst-Ratingskala für vierte bis sechste Klassen (MARS 4–6) [Affective components of mathematical competence: The Mathematics Anxiety Rating Scale for fourth to sixth graders (MARS 4–6)]. In M. Hasselhorn, A. Heinze, W. Schneider, & U. Trautwein (Eds.), Diagnostik mathematischer Kompetenzen (pp. 205–224). Göttingen, Germany: Hogrefe. Rupp, A. A., Dey, D. K., & Zumbo, B. D. (2004). To bayes or not to bayes, from whether to when: Applications of Bayesian methodology to modeling. Structural Equation Modeling: A Multidisciplinary Journal, 11, 424–451. https://doi.org/10.1207/ s15328007sem1103_7 Satake, E., & Amato, P. P. (1995). Mathematics anxiety and achievement among Japanese elementary school students. Educational and Psychological Measurement, 55, 1000–1007. Suinn, R. M., Taylor, S., & Edwards, R. W. (1988). Suinn Mathematics Anxiety Rating Scale for elementary school students (MARS-E): Psychometric and normative data. Educational and Psychological Measurement, 48, 979–985. https://doi.org/ 10.1177/0013164488484013 Vukovic, R. K., Kieffer, M. J., Bailey, S. P., & Harari, R. R. (2013). Mathematics anxiety in young children: Concurrent and longitudinal associations with mathematical performance. Contemporary Educational Psychology, 38, 1–10. https://doi.org/ 10.1016/j.cedpsych.2012.09.001 Wigfield, A., & Meece, J. L. (1988). Math anxiety in elementary and secondary school students. Journal of Educational Psychology, 80, 210–216. https://doi.org/10.1037/0022-0663.80.2.210 Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (pp. 111–153). Westport, CT: Praeger. Zyphur, M. J., & Oswald, F. L. (2015). Bayesian estimation and inference. Journal of Management, 41, 390–420. https://doi. org/10.1177/0149206313501200
Received February 27, 2017; revision received November 27, 2017; accepted December 12, 2017; published online August 3, 2018
EJPA Section/Category: Educational Psychology
Sofie Henschel, Institute for Educational Quality Improvement (IQB), Unter den Linden 6, 10099 Berlin, Germany; sofie.henschel@iqb.hu-berlin.de
Appendix

Table A1. Cognitive math anxiety items (item stem: “I worry . . .”)

Item | Setting | Mathematics content area | M (SD) | λ
(a) that I measure the wrong quantities when I follow a cooking recipe. | Application | Word problems and magnitudes | 2.05 (1.06) | 0.52
(b) that I do not notice getting shortchanged when shopping. | Application | Arithmetic | 2.06 (1.12) | 0.56
(c) that I will have to deal a lot with numbers when I have a job someday. | Application | Mathematics in general | 1.86 (1.03) | 0.68
(d) that I wrongly estimate the size of a box. | Application | Geometry | 2.03 (0.98) | 0.73
(e) that the problems in my mathematics book are too difficult for me. | Mathematics book | Mathematics in general | 1.93 (0.98) | 0.82
(f) when I have to open a page in my mathematics book that has many tasks with lengths and weights. | Mathematics book | Word problems and magnitudes | 2.00 (0.99) | 0.75
(g) that I cannot complete my homework on geometric bodies. | Mathematics homework | Geometry | 1.90 (0.98) | 0.73
(h) that I cannot complete my mathematics homework. | Mathematics homework | Mathematics in general | 1.78 (0.95) | 0.79
(i) that I cannot solve an arithmetical problem in mathematics class. | Mathematics class | Arithmetic | 1.95 (1.00) | 0.78
(j) that I have to read out numbers from a diagram in mathematics class. | Mathematics class | Stochastic | 2.01 (1.02) | 0.71
(k) when my teacher asks me to explain a diagram. | Teacher | Stochastic | 2.14 (1.04) | 0.62
(l) when my teacher asks me something in mathematics class. | Teacher | Mathematics in general | 1.94 (1.05) | 0.80
(m) when my teacher is watching me while I make a mirror drawing in my notebook. | Teacher | Geometry | 1.88 (1.01) | 0.58
(n) when my teacher will ask me to give the solution to an addition problem. | Teacher | Arithmetic | 1.82 (1.03) | 0.77
(o) that I do not have enough time in a mathematics test. | Test | Mathematics in general | 2.45 (1.12) | 0.74
(p) that I will not find a way to solve a written multiplication task in a mathematics test. | Test | Arithmetic | 2.26 (1.01) | 0.71
(q) that I will not find the solution for a problem about length units in a mathematics test. | Test | Word problems and magnitudes | 2.25 (0.99) | 0.81
(r) that I will not understand a problem in a mathematics test that contains many tables. | Test | Stochastic | 2.11 (0.98) | 0.79
Notes. Items were presented in German. Response labels: 1 = does not apply at all, 2 = does rather not apply, 3 = partially applies, 4 = fully applies. All standardized factor loadings (λ) are significant at p < .01 and relate to the intended setting factors: Application (4 items a–d, M = 2.00, SD = 0.72, α = .83), Learning (6 items e–j, M = 1.93, SD = 0.74, α = .92), and Evaluation (8 items k–r, M = 2.11, SD = 0.74, α = .89).
Table A2. Affective math anxiety items (item stem: “How nervous are you when . . .”)

Item | Setting | Mathematics content area | M (SD) | λ
(a) you are trying to figure out how big your chances are of throwing a six twice in a row in a dice game? | Application | Stochastic | 1.90 (1.01) | 0.57
(b) you want to cook pudding for four people but the quantities in the recipe are only for three people? | Application | Mathematics in general | 2.32 (1.12) | 0.46
(c) you get your change in many 1-, 2- and 5-cent coins when you do shopping? | Application | Word problems and magnitudes | 1.82 (1.07) | 0.57
(d) you want to estimate the size of a box? | Application | Geometry | 2.05 (0.97) | 0.77
(e) you read this sentence in your mathematics book: “Convert the following quantities”? | Mathematics book | Word problems and magnitudes | 2.30 (1.01) | 0.76
(f) you see a whole page of addition problems in your mathematics book? | Mathematics book | Arithmetic | 1.76 (1.01) | 0.69
(g) you get a homework assignment at the end of mathematics class? | Mathematics homework | Mathematics in general | 1.64 (0.96) | 0.57
(h) you get a homework assignment using geometric bodies? | Mathematics homework | Geometry | 1.93 (1.00) | 0.64
(i) you are on your way to mathematics class? | Mathematics class | Mathematics in general | 1.61 (0.90) | 0.68
(j) you have to add on paper in mathematics class: 976 + 777 + 558? | Mathematics class | Arithmetic | 2.27 (1.20) | 0.68
(k) your teacher asks you for the solution to the following problem: 32 multiplied by 584? | Teacher | Arithmetic | 3.31 (1.00) | 0.64
(l) your teacher wants you to convert 423 decimeters into millimeters? | Teacher | Word problems and magnitudes | 3.12 (1.02) | 0.59
(m) your teacher asks you something in mathematics class? | Teacher | Mathematics in general | 2.05 (1.07) | 0.82
(n) your teacher wants you to explain how to draw a bar chart? | Teacher | Stochastic | 2.52 (1.17) | 0.77
(o) you think about a mathematics test the evening before? | Test | Mathematics in general | 2.51 (1.02) | 0.63
(p) you think about a mathematics test that is about geometric shapes? | Test | Geometry | 2.25 (1.09) | 0.71
(q) you think about a mathematics test that requires you to convert different lengths and weights? | Test | Word problems and magnitudes | 2.51 (1.05) | 0.73
(r) you think about a mathematics test that requires you to explain a graph? | Test | Stochastic | 2.51 (1.14) | 0.74
Notes. Items were presented in German. Response labels: 1 = not at all nervous, 2 = a little nervous, 3 = somewhat nervous, 4 = very nervous. All standardized factor loadings (λ) are significant at p < .01 and relate to the intended setting factors: Application (4 items a–d, M = 2.02, SD = 0.72, α = .81), Learning (6 items e–j, M = 1.92, SD = 0.69, α = .85), and Evaluation (8 items k–r, M = 2.59, SD = 0.75, α = .88).
Multistudy Report
“Sweet Little Lies”: An In-Depth Analysis of Faking Behavior on Situational Judgment Tests Compared to Personality Questionnaires

Nadine Kasten,1 Philipp Alexander Freund,2 and Thomas Staufenbiel3

1 Department of Psychology, University of Trier, Germany
2 Institute of Psychology, Leuphana University at Lüneburg, Germany
3 Institute of Psychology, Osnabrück University, Germany
Abstract: Two laboratory studies examined the potential differences in the susceptibility to faking between a construct-oriented Situational Judgment Test (SJT) that measured conscientiousness and a traditional self-report measure of personality (NEO-FFI). In both studies, the mean differences between the honest and faked conscientiousness scores indicated that the NEO-FFI was more susceptible to faking than the SJT. In Study 1, we applied a within-subjects design (N = 137) and analyzed these differences in light of selected predictor variables derived from models of faking behavior. As a result, faking on the SJT was explained by cognitive ability alone, whereas faking on the NEO-FFI was also dependent on other personality traits that are associated with the ability to fake. In Study 2 (N = 602), the susceptibility to faking was predicted by differences in faking styles. The results of the mixed Rasch model analyses indicated profound differences in the measures in terms of the way the response scale was used. Keywords: Situational Judgment Test, SJTs, faking, response distortion, personality
Although meta-analytic results have provided evidence that personality questionnaires are valid predictors of job performance and that they possess incremental prediction over and above measures of cognitive ability (Barrick & Mount, 1991; Bratko, Chamorro-Premuzic, & Saks, 2006; Rothstein & Goffin, 2006), there are some issues regarding the most commonly used methods for assessing personality. Specifically, self-report questionnaires of personality are associated with several disadvantages. For instance, they tend to elicit negative applicant reactions (Hausknecht, Day, & Thomas, 2004), the items are generally not embedded in the work context, and they are considered susceptible to the practice of faking. The latter point, in particular, presents a fundamental threat to the usefulness of personality inventories in applied selection settings. Faking is defined as the tendency of people to present a favorable image of themselves. Unlike other forms of socially desirable responding (e.g., self-deception), faking is considered a deliberate act where people intentionally endorse response categories that interfere with accurate self-reports (Paulhus, 2002). Regarding the use of self-report personality inventories, meta-analytic results have indicated that applicants are able to, and actually do, deviate from honest responses in order to make a good impression (Alliger & Dwight, 2000; Viswesvaran & Ones, 1999). Furthermore, faking
modifies the rank ordering of applicants, thus directly affecting selection decisions and potentially compromising the fairness of selection procedures (cf. Sackett, 2012). Over the years, researchers have followed different strategies to address the potential threat of distorted responses. Some of these focus on the identification of fakers, for example, by means of social desirability scales (Crowne & Marlowe, 1960) or through the examination of response time latencies (Holden, 1998). Other strategies primarily aim to prevent or at least reduce faking, commonly by means of specific item formats, such as the application of forced-choice formats (Brown, 2016; Jackson, Wroblewski, & Ashton, 2000). Though some of these approaches are more promising than others, each one individually leads to new problems (Zickar & Gibby, 2006). Accordingly, the question of how to handle the susceptibility of noncognitive measures to faking is still under debate. In recent years, it has been suggested that one of the ways to overcome the disadvantages of traditional personality assessment could be through the use of situational judgment tests (SJTs; Campion & Ployhart, 2013; Mussel, Gatzka, & Hewig, 2016). SJTs are simulation-based procedures wherein a person is presented with hypothetical scenarios that are followed by different response alternatives. Similar to traditional self-report measures of personality,
SJTs achieve appropriate levels of predictive validity and show incremental usefulness over measures of cognitive ability (Christian, Edwards, & Bradley, 2010; Clevenger, Pereira, Wiechmann, Schmitt, & Schmidt, 2001; McDaniel, Morgeson, Finnegan, Campion, & Braverman, 2001; McDaniel, Whetzel, Hartman, Nguyen, & Grubb, 2006). In contrast to typical personality items, SJT items are job related since the item content is explicitly linked to the target work environment, which commonly leads to an increase in user acceptance (Anderson, Salgado, & Hülsheger, 2010; Bauer & Truxillo, 2006). Regarding their susceptibility to faking, empirical studies comparing personality questionnaires and SJTs are scarce, but the few studies that are available suggest that SJTs might be susceptible to faking to some degree but not to the same extent as classical personality measures (Kanning & Kuhne, 2006; Nguyen, Biderman, & McDaniel, 2005). For example, Nguyen et al. (2005) found only small differences between scores obtained under standard and faking instructions for an SJT (mean d = 0.15) but moderate to large differences on a Big Five personality inventory (mean ds ranging from 0.36 for agreeableness to 0.76 for emotional stability). Unfortunately, the problem with previous studies comparing personality measures and SJTs regarding their susceptibility to faking is that they differ not only in their methods but also in the constructs being measured. Since the development of SJTs is often largely atheoretical, the focus on the measured constructs is often neglected (Christian et al., 2010). Accordingly, many SJTs either do not assess homogeneous constructs (Ployhart, 2012) or are applied without knowing which constructs are actually being measured. For a particular SJT, the susceptibility to faking depends on the constructs being measured. Some constructs are easier to fake than others, and some are hardly able to be faked at all, such as ability-related traits (Nguyen et al., 2005). Accordingly, when previous studies have found differences in faking susceptibility between personality questionnaires and SJTs, it is unclear whether they were due to differences in the method or in the traits being measured. To disentangle these explanations, the present study compared the levels of faking between a traditional self-report measure of personality and an SJT, with the construct being held constant between the two instruments. Though the results from previous studies might be flawed to some degree, there are good reasons to expect differences in faking susceptibility between SJTs and traditional personality measures. SJTs are often designed to be less transparent to test takers (Hooper, Cullen, & Sackett, 2006; Nguyen et al., 2005), and the structure of SJT items is more complex than that of most self-report personality tests. Since traditional personality measures typically consist of a series of statements to which the respondent can indicate the extent of their agreement using a rating scale, the deliberate inflation of test scores simply requires the
identification of the favorable end of the rating scale (Hooper et al., 2006). Given the nested structure of SJT items (i.e., response options are nested within situations), faking SJT responses requires the evaluation of different response alternatives per item, which is likely to increase the complexity and the cognitive effort (Hooper et al., 2006; Snell, Sydell, & Lueke, 1999). A similar explanation was tested comparing response inflation in an employment interview to that in a classical personality inventory (Van Iddekinge, Raymark, & Roth, 2005). Based on these theoretical considerations and on the empirical support from previous studies, SJTs were expected to be less susceptible to faking than a traditional measure of personality. Hypothesis 1 (H1): Faking effects will be smaller with the SJT than with a traditional self-report measure of personality. Furthermore, it is expected that SJTs and personality questionnaires differ not only in the extent of faking but also in the factors that are associated with individual differences in faking. Two different research lines have emerged that seek to explain variability in faking behavior within and between different measures. First, there have been researchers who have tried to find antecedents of faking and aggregate them into models of faking behavior. A second line of research is concerned with the question of whether differences in faking are due to different faking styles or to different response sets. As these two research lines offer meaningful contributions to explain the potential differences in the faking vulnerability of SJTs and personality questionnaires, both of them are taken into account.
Faking Models

There are numerous models that describe the psychological process underlying faking behavior (Marcus, 2009; McFarland & Ryan, 2000, 2006; Mueller-Hanson, Heggestad, & Thornton, 2006; Snell et al., 1999). Most of them define faking as a function of dispositional variables, such as the applicant’s personality and ability, as well as contextual features, such as the attractiveness of the job. Additionally, these dispositional variables are separated into the following two broad categories: (1) characteristics contributing to individual differences in the ability to fake and (2) traits referring to the motivation or willingness to fake. So far, many attempts have been made to explain faking behavior by means of these models. Within the group of ability correlates, cognitive ability has been shown to be positively related to faking in many empirical studies (e.g., Griffith, Malm, English, Yoshita, & Gujar, 2006; Nguyen et al., 2005; Pauls & Crost, 2005; Van Iddekinge et al., 2005; Ziegler, 2006). This positive relation could be because
more intelligent people may be more successful in identifying what is expected in the testing situation and in recognizing the meaning of the items and the appropriate response. Previous research has also highlighted the importance of the perceived ability to present oneself in a favorable light (Ellingson, 2012; Marcus, 2009; Pauls & Crost, 2005). In this vein, we will also take into account the self-reported efficacy of self-presentation, which can be defined as the self-estimated capability to present oneself as a smart, capable, and likable person (Mielke, 1990). Additionally, self-monitoring is expected to affect faking behavior. Self-monitoring is defined as “self-observation and self-control guided by situational cues to social appropriateness” (Snyder, 1974, p. 526). Accordingly, individuals who closely monitor themselves are expected to be successful fakers, as they are skilled in reading social cues and in adapting their behavior in an appropriate way. To identify characteristics that potentially contribute to individual differences in faking motivation, previous studies typically examined traits linked to general deceptive behavior (Snell et al., 1999). In this study, we investigated the influence of Machiavellianism, as it has been repeatedly found to relate to the extent of faking on personality questionnaires (Hogue, Levashina, & Hang, 2013; Levashina & Campion, 2007). Machiavellianism refers to the personality trait of behaving in a cold and manipulative fashion in order to further one’s own interests (Paulhus & Williams, 2002). Thus, people who score high on Machiavellianism are expected to show no moral objections and to view faking as an appropriate, and even necessary, behavior for accomplishing their goals (Ellingson, 2012).

Hypothesis 2 (H2): (a) Cognitive ability, (b) efficacy of self-presentation, (c) self-monitoring, and (d) Machiavellianism will be positively related to faking behavior on the traditional measure of personality.

So far, no previous research has applied faking models to SJTs. Since the models introduced above are models of general faking, we hypothesized that the same variables would impact faking on SJT items.

Hypothesis 3 (H3): (a) Cognitive ability, (b) efficacy of self-presentation, (c) self-monitoring, and (d) Machiavellianism will show a positive relationship to the extent of faking on the SJT.
Previous studies have suggested that cognitive ability has a greater impact on faking as the complexity of the response process increases (Vasilopoulos, Cucina, Dyomina, Morewitz, & Reilly, 2006). As discussed earlier, the SJT format is commonly more demanding regarding the test takers’ cognitive effort. Accordingly, we expected cognitive ability to be more pronounced in the prediction of faking for the SJT.

Hypothesis 4 (H4): Cognitive ability will be more strongly related to faking with the SJT compared to the traditional personality questionnaire.

Response Sets

There is evidence that the amount of faking is not constant across individuals. Instead, there are individual differences in the extent to which people resort to faking (McFarland & Ryan, 2000; Zickar, Gibby, & Robie, 2004). By means of mixed Rasch modeling (MRM; Rost, 1990), which is essentially a combination of item response theory and latent class analysis (Rost, Carstensen, & Von Davier, 1997), it is possible to identify subgroups that exhibit different response sets (Zickar et al., 2004). Thus, within a faking scenario, it is possible to differentiate between different response biases. Applying MRM analysis, previous studies identified different response sets associated with faking (Eid & Zickar, 2007; Zickar et al., 2004; Ziegler, 2006; Ziegler & Kemper, 2013). These different classes usually represent a group of respondents who apply a slight faking style versus a group of respondents who distort their responses in an extreme way. Additionally, some authors have provided evidence that some respondents seem to provide honest responses even when they are instructed to fake.

Hypothesis 5 (H5): MRM analyses will reveal different response sets for the traditional personality measure.

Until now, no empirical studies applying MRM analysis to SJT data have been published. Since there is no evidence on potential differences in faking styles between the two conscientiousness measures, we analyzed this question in an exploratory manner.

Organization of Studies

To investigate our hypotheses, we conducted two studies. Both studies examined the potential differences in the extent of faking between the SJT and a traditional personality questionnaire (Hypothesis 1). Additionally, in Study 1, we investigated the effects of motivational and ability-related variables on faking in a traditional personality questionnaire (Hypotheses 2a–2d) and in the SJT (Hypotheses 3a–3d), as well as the differences between the measures regarding the impact of cognitive ability on faking (Hypothesis 4). Study 2 examined the interindividual differences in response sets for a traditional personality questionnaire (Hypothesis 5) and an SJT (exploratory).
Study 1

Participants and Procedure

The sample of Study 1 consisted of N = 137 university students (79.60% female; Mage = 22.66, SD = 3.95). To compare the susceptibility to faking of the NEO Five-Factor Inventory (NEO-FFI) and the SJT, we applied a within-subjects 2 (faking good vs. honest) × 2 (NEO-FFI vs. SJT) factorial design. Accordingly, the participants completed both measures twice: once under an honest instruction and once under a “faking good” instruction. In the latter condition, we used an instruction comparable to those applied in other studies (e.g., Zickar & Robie, 1999). Specifically, the participants were asked to imagine themselves as being part of an ongoing selection procedure for a job they really wanted. They were asked to present themselves in a favorable light in order to maximize their chances of being hired. Within the honest condition, no specific instruction was presented. The order of the instruction conditions was counterbalanced, and they were administered 1 week apart. Within the second examination, participants additionally completed a test battery including measures of cognitive ability, efficacy of self-presentation, self-monitoring, and Machiavellianism as potential predictors of faking behavior. Faking behavior was operationalized for each participant by examining the mean difference between responses in the faking and the honest conditions. Positive scores indicate a higher score in the faking condition. This approach is comparable to those of other authors who investigated faking behavior (e.g., McFarland & Ryan, 2000).
Measures

Situational Judgment Test
The SJT used in this study was designed to measure three Big Five dimensions, namely, extraversion, agreeableness, and conscientiousness (Kasten & Staufenbiel, 2015). In this study, we exclusively focused on conscientiousness since it is the best predictor of job performance compared to the other Big Five dimensions (Barrick & Mount, 1991; Ones, Viswesvaran, & Judge, 2007). Furthermore, conscientiousness typically exhibits high levels of susceptibility to faking (Birkeland, Manson, Kisamore, Brannick, & Smith, 2006). Conscientiousness is measured with 13 items. All the items include a short description of a job-related scenario that is highly relevant for the test takers’ conscientiousness (e.g., situations that require a structured working style). Additionally, the item stems are followed by three different response
options that differ in their level of trait expression. One response option presents highly conscientious behavior, a second option represents a highly negative trait expression, and the third a medium level of trait expression. The participants are asked to rate their likelihood to show the described behavior in the given situation for every response option on a 5-point Likert scale ranging from 1 (= strongly disagree) to 5 (= strongly agree). An example SJT item is reported in the Appendix. By default, the item scores for this particular SJT format represent a weighted mean of the ratings (Kasten & Staufenbiel, 2015). To avoid the extent of faking being confounded with the trait expression of the different response options, we exclusively focused on the response options that represented high conscientiousness. In the present study, the scores exhibited an appropriate level of reliability in the honest condition (α = .80) and in the faking condition (α = .83).

Personality Measure
To compare the susceptibility to faking of the SJT with that of a traditional self-report measure of personality, the participants also completed the 12 conscientiousness items from the German version of the NEO Five-Factor Inventory (NEO-FFI, Borkenau & Ostendorf, 2008). The responses are made on a 5-point Likert scale ranging from 1 (= strongly disagree) to 5 (= strongly agree). In this study, Cronbach’s α was .89 for both the honest and the faking conditions.

Cognitive Ability
We used the basic module from the Intelligence Structure Test (IST-2000 R; Amthauer, Brocke, Liepmann, & Beauducel, 2001) to measure cognitive ability. This test measures verbal, numerical, and spatial intelligence using three subtests. Cronbach’s α of the composite score was .94.

Efficacy of Self-Presentation
The Efficacy of Self-Presentation Questionnaire (ESP; Mielke, 1990) was employed. This 33-item measure consists of three different subscales, but since they frequently show high intercorrelations, a composite score was computed (Pauls & Crost, 2005). The following is a sample item from this measure: “In courses I’m able to appear in a way that others perceive me as a capable person.” Cronbach’s α of this measure was .90.

Self-Monitoring
Self-monitoring was measured by the 18 items of the German version of the self-monitoring scale (Graf, 2004). The following is a sample item from this measure: “In different situations and with different people, I often act as a very different person.” Cronbach’s α of this measure barely missed the recommended rules of thumb for its use in research settings (Nunnally & Bernstein, 1994) at α = .68.
Table 1. Descriptive statistics and intercorrelations of study variables for Study 1 M (1) Age
SD
(1)
(2)
(3)
(4)
(5)
(6)
22.66
3.95
(2) Sex (women = 0, men = 1)
0.20
0.41
.18
(3) NEO (honest)
3.62
0.64
.01
.14
(4) NEO (faking)
4.56
0.41
.19
.09
.29
(5) SJT (honest)
3.41
0.56
.01
.01
.66
(6) SJT (faking)
3.99
0.53
.15
.20
.19
.62
.36
(7) NEO (difference)
0.94
0.65
.13
.08
.80
.34
.47
.20
(8) SJT (difference)
0.58
0.61
.13
.17
.44
.28
.60
.53
(9) G
(7)
(8)
(9)
(10)
(11)
.28
.60
110.65
9.97
.30
.08
.08
.25
.04
.24
.23
.25
(10) ESP
2.67
0.46
.05
.19
.32
.10
.17
.03
.26
.13
.01
(11) SM
2.97
0.42
.08
.18
.04
.02
.04
.10
.06
.12
.06
.53
(12) MACH
2.76
0.44
.08
.21
.14
.02
.02
.03
.13
.01
.04
.14
.21
Notes. NEO = Neuroticism, Extraversion, Openness; SJT = Situational Judgment Test; G = cognitive ability; ESP = efficacy of self-presentation; SM = selfmonitoring; MACH = machiavellianism; difference scores were computed by subtracting each individual’s honest score from the faked score. Correlations above .16 are significant.
Machiavellianism
Machiavellianism was measured by the Mach IV (Christie & Geis, 1970), a 20-item scale that has shown evidence of adequate reliability and validity in a variety of studies. The following is an example item from this measure: “The best way to handle people is to tell them what they want to hear.” Cronbach’s α of this measure was .80.
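The internal consistencies reported for these measures are Cronbach's alpha coefficients. As a rough illustration of how such a coefficient is computed from a persons-by-items score matrix, the following Python sketch implements the standard alpha formula; the example ratings are invented and the function is not taken from the authors' analysis code.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a persons x items matrix of Likert ratings."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                              # number of items
    item_variances = item_scores.var(axis=0, ddof=1)      # per-item variance
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of sum scores
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Hypothetical example: 5 respondents rating 3 items on a 1-5 scale
ratings = np.array([[4, 5, 4],
                    [2, 3, 2],
                    [5, 5, 4],
                    [3, 3, 3],
                    [1, 2, 2]])
print(round(cronbach_alpha(ratings), 2))
```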
Results

We calculated difference scores for all the participants by taking each individual’s honest conscientiousness score and subtracting it from the score obtained in the faking condition. Although difference scores are commonly computed to analyze faking, some researchers have criticized their use (e.g., Peter, Churchill, & Brown, 1993), particularly because they often lack reliability. However, when the reliability of the scores is high, it is possible to have appropriate reliabilities for the difference scores. Using the formula presented by McFarland and Ryan (2000), we calculated the reliability of each of the difference scores in the present study. They were both above .70 and, thus, were sufficiently high. Table 1 displays the descriptive statistics and intercorrelations for all the study variables and for the difference scores. We first examined whether our faking instruction was effective. As indicated in Table 1, the mean comparisons yielded strong increases from the honest to the faking instruction. The subsequent t-tests for the dependent samples were highly significant for both the NEO-FFI, t(136) = 16.75, p < .001, dc = 1.71,1 and the SJT, t(136) = 11.00, p < .001, dc = 0.78.
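Faking was thus quantified as a within-person difference score, and the reliability of these difference scores was checked before interpreting them. The sketch below shows both steps in Python using the classical formula for the reliability of a difference score (a function of the two scale reliabilities, their standard deviations, and their intercorrelation); this may differ in presentation from the exact formula used by McFarland and Ryan (2000), and the plugged-in numbers are merely illustrative values of the same order of magnitude as those reported for the SJT.

```python
import numpy as np

def difference_score_reliability(rel_x, rel_y, sd_x, sd_y, r_xy):
    """Classical reliability of a difference score D = X - Y."""
    var_x, var_y = sd_x ** 2, sd_y ** 2
    numerator = rel_x * var_x + rel_y * var_y - 2 * r_xy * sd_x * sd_y
    denominator = var_x + var_y - 2 * r_xy * sd_x * sd_y
    return numerator / denominator

# Faking score per participant: faked minus honest scale score (invented data)
honest = np.array([3.4, 3.8, 3.1])
faked = np.array([4.2, 4.5, 3.9])
faking = faked - honest

# Illustrative reliability check; with values in this range the result is above .70
print(round(difference_score_reliability(rel_x=.83, rel_y=.80,
                                          sd_x=0.53, sd_y=0.56, r_xy=.36), 2))
```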
Hypothesis 1 predicted that the SJT would be associated with smaller faking effects than the NEO-FFI. To test for this predicted difference, we conducted a 2 × 2 ANOVA with the following two repeated measures factors: instruction (honest vs. faking) and conscientiousness measure (SJT vs. NEO). A significant interaction between instruction and measure, F(1, 136) = 54.82, p < .001, η² = .29, indicated that the participants inflated their conscientiousness scores from the honest to the faking instruction to a greater extent in the NEO than in the SJT. Additionally, the comparison between the difference scores provided further support for Hypothesis 1, as the difference scores were significantly higher for the NEO-FFI than for the SJT, t(136) = 7.40, p < .001, dc = 0.56. To test Hypotheses 2 and 3, we performed multiple regression analyses with the difference scores as the dependent variable and the characteristics associated with the ability to fake (i.e., general mental ability, the efficacy of self-presentation, and self-monitoring) and the motivation to fake (i.e., Machiavellianism) as the predictors. We also controlled for sex and age of the participants. The results are summarized in Table 2. Regarding faking on the NEO-FFI, we found a significant effect of cognitive ability, efficacy of self-presentation, and self-monitoring. Only Machiavellianism exhibited a nonsignificant influence on faking. Thus, Hypotheses 2a–2c were confirmed, whereas Hypothesis 2d had to be rejected. Regarding the extent of faking within the SJT measure, a different pattern emerged. Only cognitive ability had a significant influence on faking. Thus, only Hypothesis 3a was supported. In contrast to our prediction from Hypothesis 4, we found no significant differences in the impact of cognitive ability on faking between the NEO-FFI and the SJT. Although the correlation between cognitive ability and the difference scores was stronger for the SJT than for the NEO-FFI, the statistical comparison failed to reach significance (z = 0.13; p = .45).
The effect size dc was calculated according to the formula described in Dunlap, Cortina, Vaslow, and Burke (1996) for a repeated measures design.
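A minimal sketch of how such a repeated-measures effect size can be obtained is given below: a paired t-test followed by the Dunlap et al. (1996) conversion d = t * sqrt(2(1 - r)/n), where r is the correlation between the two conditions; the data are simulated for illustration only.

```python
import numpy as np
from scipy import stats

def repeated_measures_d(honest: np.ndarray, faked: np.ndarray) -> float:
    """Effect size for paired data following Dunlap et al. (1996):
    d = t * sqrt(2 * (1 - r) / n), with r the honest-faked correlation."""
    t, _ = stats.ttest_rel(faked, honest)
    r, _ = stats.pearsonr(honest, faked)
    n = len(honest)
    return t * np.sqrt(2 * (1 - r) / n)

# Simulated paired scores (honest vs. faking instruction), not the study data
rng = np.random.default_rng(0)
honest = rng.normal(3.4, 0.5, size=137)
faked = honest + rng.normal(0.6, 0.4, size=137)

print(round(repeated_measures_d(honest, faked), 2))
```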
Table 2. Multiple regression analyses of faking motivation and ability on mean difference scores, controlling for sex and age

Predictor | NEO-FFI β | NEO-FFI t | SJT β | SJT t
Sex | .11 | 1.25 | .17 | 1.85
Age | .09 | 1.02 | .04 | 0.49
G | .22 | 2.55* | .27 | 2.99**
ESP | .42 | 4.22*** | .06 | 0.59
SM | .26 | 2.59* | .08 | 0.72
MACH | .02 | 0.21 | .04 | 0.39
adj. R² | .16 | | .08 |
F(6, 125) | 5.04*** | | 2.98** |

Notes. NEO = Neuroticism, Extraversion, Openness; SJT = Situational Judgment Test; G = cognitive ability; ESP = efficacy of self-presentation; SM = self-monitoring; MACH = Machiavellianism. *p < .05; **p < .01; ***p < .001.
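For readers who want to see the shape of the analysis summarized in Table 2, the following sketch runs one of the two regressions (difference score on sex, age, and the ability- and motivation-related predictors) with statsmodels. The file name and column names are hypothetical; standardizing the continuous variables first makes the coefficients roughly comparable to the reported betas.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-participant data: difference score and predictors
df = pd.read_csv("study1_scores.csv")   # hypothetical file name

# z-standardize continuous variables so coefficients approximate standardized betas
for col in ["neo_diff", "age", "g", "esp", "sm", "mach"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

model = smf.ols("neo_diff ~ sex + age + g + esp + sm + mach", data=df).fit()
print(model.params)          # approximate standardized regression weights
print(model.rsquared_adj)    # adjusted R-squared
```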
Study 2

Method

Participants and Procedure
Study 2 included 602 participants (68.60% female, Mage = 28.61, SD = 10.81). In this study, we applied a between-subjects design. Accordingly, the participants received instructions to either be honest or to practice faking. The wording of the instruction was the same as in Study 1. The two groups were comparable regarding gender, χ²(1) = 2.31, p = .137, and age (t = 0.13, p = .90), as well as in their educational level, measured as the highest achieved degree, χ²(4) = 4.76, p = .31.
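Comparability checks of this kind operate on simple cross-tabulations and group comparisons; a minimal sketch with invented counts and simulated ages (not the study data) is shown below.

```python
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

# Hypothetical gender (rows) x instruction condition (columns) counts
gender_by_condition = np.array([[210, 203],   # female
                                [ 92,  97]])  # male
chi2, p, dof, _ = chi2_contingency(gender_by_condition)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")

# Hypothetical age comparison between the two instruction groups
age_honest = np.random.default_rng(1).normal(28.6, 10.8, 300)
age_faking = np.random.default_rng(2).normal(28.6, 10.8, 302)
print(ttest_ind(age_honest, age_faking))
```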
Measures
The participants completed the same personality measures as in Study 1. Accordingly, they were asked to respond to the 13 SJT and the 11 NEO-FFI conscientiousness items.2 The reliability estimates indicated an acceptable level of reliability for every measure (see Table 3).
Results

The conscientiousness scores, categorized by instruction type and measure, are displayed in Table 3. The conscientiousness scores for all the measures were higher under the faking condition compared to the honest condition. Additionally, the differences between the conscientiousness scores under the faking and the honest instructions were larger for the NEO-FFI than for the SJT. Since the confidence intervals of the standardized mean differences were not overlapping, the differences in the extent of faking between the NEO-FFI and the SJT scores were statistically significant. These results provided further support for Hypothesis 1.

Table 3. Scale scores, reliability estimates, and effect sizes broken down by instruction type

Measure | Honest α | Honest M | Honest SD | Faking α | Faking M | Faking SD | t | d [95% CI]
NEO-FFI (11 items)a | .85 | 3.73 | 0.60 | .84 | 4.34 | 0.48 | 13.92*** | 1.14 [0.97, 1.32]
SJT (13 items) | .76 | 3.35 | 0.54 | .76 | 3.70 | 0.51 | 8.06*** | 0.66 [0.50, 0.83]

Notes. aDue to zero counts in one response category, one NEO-FFI item was excluded from analyses. ***p < .001. SJT = Situational Judgment Test.

To identify different faking styles, we conducted an MRM analysis using WINMIRA (Von Davier, 2001). The determination of the number of classes was performed in two steps. Using a bootstrap procedure with 300 samples, we tested the Cressie-Read (CR) statistic and the Pearson χ² for statistical significance. If the CR and Pearson χ² indicated model fit (with p-values larger than .05), we used the Consistent Akaike Information Criterion (CAIC) to compare the model fit between the class solutions. Smaller CAIC values indicated better model fit (Preinerstofer & Formann, 2012). Table 4 presents the p-values associated with the CR and Pearson χ² and the CAIC statistics for the NEO-FFI and the SJT scores. Three classes were needed to fit the data for both the SJT and the NEO. To further evaluate the appropriateness of these solutions, we used the standardized Q-index, which indicates the fit of every item for each class (Rost & Von Davier, 1994). Overall, only 1 out of the 72 items (three classes with 11 NEO items and three classes with 13 SJT items), or 1.39%, indicated model misfit, which was below the level that would have been expected by chance. To obtain further insight into the nature of the three classes, we examined the plots of the threshold estimates for the two measures. Since both of the measures exhibited very similar patterns and there were only differences in the class sizes, Figure 1 exclusively displays the threshold estimates for the three classes that were associated with the SJT items. However, the NEO threshold estimates and class sizes are additionally displayed in the Electronic Supplementary Material (ESM 1).
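The CAIC used here penalizes the log-likelihood by the number of estimated parameters weighted by ln N + 1. A generic helper implementing this formula is sketched below; the log-likelihoods and parameter counts are invented and do not reproduce WINMIRA's output.

```python
import numpy as np

def caic(log_likelihood: float, n_parameters: int, n_persons: int) -> float:
    """Consistent AIC: -2*logL + k*(ln N + 1); smaller values indicate better fit."""
    return -2.0 * log_likelihood + n_parameters * (np.log(n_persons) + 1.0)

# Hypothetical comparison of 1- to 4-class solutions for N = 602 respondents
log_likelihoods = [-7050.0, -6980.0, -6930.0, -6905.0]   # invented values
n_params = [44, 89, 134, 179]                            # invented values
for n_classes, (ll, k) in enumerate(zip(log_likelihoods, n_params), start=1):
    print(n_classes, round(caic(ll, k, 602), 2))
```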
Due to zero counts in one response category, one NEO-FFI item had to be excluded from the analyses.
Table 4. Results of the MRM analyses (1–4 classes)

Measure | Class no. | p(CR) | p(χ²) | CAIC
NEO-FFI | 1 | < .001 | < .001 | 14,280.31
NEO-FFI | 2 | < .001 | < .001 | 14,642.56
NEO-FFI | 3 | .17 | .20 | 14,952.76
NEO-FFI | 4 | .14 | .16 | 15,409.71
SJT | 1 | < .001 | < .001 | 21,569.08
SJT | 2 | .01 | .02 | 21,825.16
SJT | 3 | .06 | .16 | 22,319.33
SJT | 4 | .04 | .08 | 22,941.91

Notes. N = 602; p-values below .05 indicate model misfit for both CR and Pearson χ². The best-fitting solution for each measure was the three-class model. NEO-FFI = NEO Five-Factor Inventory; SJT = Situational Judgment Test.
We primarily focused on the order of the thresholds, as a correct ordering would indicate that the participants used the rating scale as intended (i.e., more conscientious people are more likely to choose higher response categories). For both measures, two classes exhibited a correct ordering of the four thresholds for the majority of the items (i.e., class 1 and class 2 in Figure 1). The difference between these classes was primarily established by differences in the threshold distances, with class 1 showing more narrowly spaced thresholds than class 2. Within the last class (i.e., class 3 in Figure 1), the thresholds showed no orderly relationship between conscientiousness and option choice. Examining the class prevalence categorized by measure and instruction type further illuminated what constituted the differences between the classes (see Table 5). For every conscientiousness measure, class 1 comprised participants the majority of whom had received the honest instruction. For class 2, this ratio was reversed. For class 3, the proportion of the participants who received a faking instruction was even more pronounced, with approximately 75% having been instructed to fake. Interestingly, in line with the profound overlaps in the nature of the classes between the measures, there appeared to be intraindividual stability as well. Accordingly, Spearman’s rank correlation coefficient between the faking style on the NEO-FFI and on the SJT was substantial, with rs = .35 (p < .001). This meant that the participants who exhibited a specific faking style in the NEO-FFI applied a similar response set in the SJT. In many MRM studies, the average scores from the faking condition are taken into account to further explore the nature of the different classes (e.g., Zickar et al., 2004). Although this procedure would mainly confirm our present categorization (except for the NEO class 2, which exhibited higher faked conscientiousness values than class 3), it also runs the potential danger of false interpretation.
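The correspondence between the response set a person shows on the two measures can be expressed as a rank correlation between the class assignments; a minimal sketch with hypothetical class memberships follows.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical MRM class assignments (1 = honest/slight ... 3 = extreme faking)
neo_class = np.array([1, 1, 2, 3, 2, 1, 3, 2, 2, 3])
sjt_class = np.array([1, 2, 2, 3, 1, 1, 3, 2, 3, 3])

rho, p = spearmanr(neo_class, sjt_class)
print(f"Spearman's rho = {rho:.2f}, p = {p:.3f}")
```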
Figure 1. Threshold estimates for the 13 SJT conscientiousness items broken down by class.
Table 5. Class prevalence and scale scores for every measure and instruction type

Class | | SJT Honest | SJT Faking | NEO Honest | NEO Faking
Class 1 (SJT n = 222, NEO n = 277) | Size (%) | 58.1 | 41.9 | 61.0 | 39.0
 | M (SD) | 3.26 (0.51) | 3.60 (0.52) | 3.66 (0.51) | 4.02 (0.36)
Class 2 (SJT n = 285, NEO n = 189) | Size (%) | 40.0 | 60.0 | 31.7 | 68.3
 | M (SD) | 3.47 (0.48) | 3.67 (0.40) | 3.59 (0.73) | 4.58 (0.53)
Class 3 (SJT n = 95, NEO n = 136) | Size (%) | 22.1 | 77.9 | 25.7 | 74.3
 | M (SD) | 3.31 (0.83) | 3.87 (0.66) | 4.27 (0.46) | 4.36 (0.29)

Notes. NEO = Neuroticism, Extraversion, Openness; SJT = Situational Judgment Test. n = class size per measure.
Since our definition of faking includes the inflation of test scores from the honest to the faking conditions, it is also important to consider the honest conscientiousness scores. The participants who exhibited high scores in the faking condition did not necessarily engage in a great deal of faking. For example, a person who is actually highly conscientious will exhibit high conscientiousness scores in the faking condition, without a deliberate attempt to inflate his or her scores. To rule out such alternative explanations and to test for the robustness of our results, we applied an MRM analysis a second time using a combined dataset, additionally including the faking scores from Study 1. This analysis yielded the same threshold patterns for every measure. Since the participants from Study 1 provided both faked and honest conscientiousness scores, it was possible to quantify the extent of faking. The differences between the honest and faking scores, as categorized by measure and class, are displayed in Figure 2. The results from this analysis largely correspond to our previous naming of the classes. Regarding the SJT, class 1 exhibited only a small, nonsignificant difference between the scores observed under the different instructional sets, t(18) = 1.84, p = .08, dc = 0.38. For class 2, we found a significant difference, t(66) = 7.76, p < .001, dc = 1.21, and for class 3, the differences between the honest and faking conditions were even more distinct, t(51) = 8.30, p < .001, dc = 1.50. We found the same structure for the NEO, with class 3 showing the largest difference between the honest and faked scores, t(84) = 15.45, p < .001, dc = 2.06, and class 2 exhibiting a smaller but still substantial difference, t(37) = 8.77, p < .001, dc = 0.378. Although the participants assigned to class 1 showed the least deviation from the honest conscientiousness scores in the NEO-FFI, the differences were still large and statistically significant, t(13) = 2.81, p < .05, dc = 0.82.
Figure 2. Conscientiousness scores obtained under honest (white bar) and faking (black bar) condition for every measure and class, respectively. Error bars represent 95% confidence interval.
Discussion
The problems associated with the susceptibility to faking of noncognitive measures challenge test developers and users alike. Since there has not been a consistent trend in previous studies so far, the present study aimed to obtain a more profound understanding of the susceptibility to faking of SJTs. More precisely, the fakability of a construct-oriented SJT that measures conscientiousness relative to a traditional measure of personality was the focal point of the present studies. Regarding the first hypothesis, the present results suggested that both measures were susceptible to faking but that the SJT showed significantly decreased levels of faking susceptibility. This result was true for both the within-subjects and the between-subjects datasets. In Study 1, we intended to explain the extent of faking by means of analyzing interindividual differences in the dispositional variables associated with the ability and motivation to fake. For both the measures, we found no significant influence of Machiavellianism, which might have been attributable to the laboratory setting. The individual predispositions associated with the motivation to fake are all connected to the question of whether individuals see faking as an acceptable behavior or not. Thus, Machiavellianism might be a valid predictor of response distortion in situations where faking presents a morally questionable behavior. This is likely to be true for faking in a real personnel selection setting. However, a laboratory setting is less likely to trigger a participant’s moral stance since they are directly instructed to deviate from honest responses. Thus, the missing effect for Machiavellianism in the present study also emphasizes the influence of contextual variables on faking. In accordance with the current theoretical considerations and our predictions, we found evidence supporting the hypothesized link between faking and cognitive ability for both the NEO-FFI and the SJT. Although the influence was slightly stronger for the SJT, this difference held only at the descriptive level. Since the SJT applied in this study was designed to target specific constructs, it comprised behaviorally uniform response options (Weekley, Ployhart, & Holtz, 2006), that is, response alternatives that differ in their level of trait expression. Other authors have suggested that this response format is highly transparent to test takers and is thus particularly easy to fake (e.g., Muck, 2013; Ployhart & MacKenzie, 2010). For other, less transparent SJT response formats, cognitive ability might account for the greater proportion of variance in faking behavior. However, research on the fakability of SJTs has just recently begun to receive attention. Therefore, more studies are needed to evaluate the influence of test characteristics, such as the response format, on the extent of faking.
Interestingly, both of the predictors associated with the self-estimated ability demonstrated a higher influence on faking behavior in the NEO-FFI than the objective ability did. This might have contributed to the fact that the NEO-FFI items are so easily faked, in that cognitive ability only accounts for a small proportion of the variance in faking behavior. Contrary to our hypothesis, we found no significant effect of these two variables on the prediction of faking within the SJT measure. Since the empirical evidence, at least for self-monitoring, was hitherto mixed (e.g., McFarland & Ryan, 2000; Wrenson & Biderman, 2005), we think that further studies are needed to examine the robustness of our results. Additionally, the consideration of additional predictor variables might further elucidate the process associated with faking personality questionnaires. For example, we considered the efficacy of self-presentation in our study, though some authors have highlighted the importance of actual self-presentation ability (e.g., Marcus, 2009). Thus, future studies should also integrate this variable. Study 2 provided further evidence to explain the differences in the faking susceptibility between the SJT and the NEO-FFI. In accordance with Hypothesis 5, we also found different faking styles for the NEO-FFI items. More precisely, we found the full range of response styles that other authors found, which included a class of honest respondents, a group of participants that exhibited a slight faking style and a third group that extremely deviated from their honest scores. Interestingly, the class solutions and threshold estimates showed profound similarities between the measures, as the SJT items showed the same threshold patterns regardless of whether the items exhibited positive or negative trait expression. For all the measures, the slight faking group represented a group of respondents that used the rating scale in the correct way (i.e., the scale exhibited a correct ordering of the thresholds). This finding likely supports the view that for these respondents, faking represents a constant theta shift (Zickar & Robie, 1999). In accordance with previous studies (Zickar et al., 2004), the extreme faking classes for the three conscientiousness measures showed no correct ordering of the thresholds. This result indicates that there is no relationship between a person’s conscientiousness and their option choice, a pattern that we probably would expect when considering extreme faking. Accordingly, the naming of this class seems to be fairly easy and generalizable over different measures. Regarding the naming of the other two classes, there might be an alternative choice. The designation of class 2 as a class of honest respondents especially seems to be problematic because they indeed show a tendency toward faking (see Figure 2). Although this tendency was much smaller than in the other classes, the amount of faking was not negligible. It is possible that the differences
between classes 1 and 2 did not manifest in the categorization of the slight faking and honest responding but in a set of other response styles. In this vein, Ziegler and Kemper (2013) found, for example, overlaps between faking styles and two response styles, namely, extreme responding and midpoint responding. While the first response style refers to the tendency to prefer the endpoints of the rating scale, the latter refers to a participant’s tendency to opt for the middle categories. Given this explanation, class 1 and class 2 would both represent groups of slight fakers who differ in their use of the rating scale.
Limitations and Implications

One of the most salient limitations of this study is inherent in every laboratory-induced faking study, namely, the generalizability of the results to faking in real-world selection settings. As stated above, there are likely to be differences, for example, in the mean differences, that are typically more pronounced in the laboratory context than in real-life settings. The question of the generalizability of the results to real-life settings is also valid with regard to the faking styles. Although previous studies found profound similarities in the response sets obtained in laboratory and selection settings (Zickar et al., 2004), the class prevalence is likely to differ. Since the motivation to fake is more diverse in applicant settings, the classes of honest responders and slight fakers are likely to be more pronounced. In summary, future studies should analyze the potential effects induced by distinct study designs. Another important limitation relates to the competing modeling approaches applied in the present studies. In Study 1, faking was modeled by means of the difference scores and therefore as a continuous variable. In contrast, Study 2 viewed faking as the manifestation of qualitative, distinct response sets and thus modeled faking as a categorical variable. Although both views are present in the empirical literature on faking, the former more closely resembles the current theoretical ideas on faking behavior as an interaction between person and situation characteristics. However, both views might be helpful in analyzing faking behavior, and future research might benefit from the combination of these approaches. Regardless of this limitation, we believe that the results of our present studies entail important implications for both practitioners and researchers. First, the present work provides evidence that SJTs are viable alternatives for personality measurement. Although, as mentioned above, the mean difference between the honest and faking instructions was not negligible for the SJT, it was significantly smaller than the effect of faking that was measured for the NEO-FFI. The multiple regression analyses provided evidence that faking behavior has a solid base in cognitive
ability. Faking on the SJT should therefore not decrease its predictive power since cognitive ability has been shown to predict job performance in many empirical studies and meta-analyses. Additionally, the present work offers several implications for future research. Previous studies on the susceptibility to faking of SJTs have revealed high variability in the results. The great amount of variance might be attributable to the specific constructs measured with the SJT, but there may also be other relevant characteristics of the SJT. As stated earlier, many characteristics of SJTs likely influence the cognitive loading and the transparency of SJT items. Since SJTs are measurement methods, they exhibit different means of scoring, presentation forms, instructions and other essential characteristics (Kasten & Freund, 2016; Schmitt & Chan, 2006; Weekley et al., 2006). It seems reasonable to assume that some of these characteristics influence the cognitive loading and the transparency of SJT items and could therefore be posited as moderators of faking behavior. There is no doubt that a meta-analysis would be an appropriate procedure to aggregate the inconsistent results and to systematically analyze sources of variability by means of moderator analyses. However, at least at this time, the basis for such analyses in terms of the number of empirical studies is insufficient.
Electronic Supplementary Materials
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000479
ESM 1. Figure (.pdf): Threshold estimates for the three NEO classes.
References Alliger, G. M., & Dwight, S. A. (2000). A meta-analytic investigation of the susceptibility of integrity tests to faking and coaching. Educational and Psychological Measurement, 60, 59–72. https://doi.org/10.1177/00131640021970367 Amthauer, R., Brocke, B., Liepmann, D., & Beauducel, A. (2001). Intelligenz-Struktur-Test 2000 R [Intelligence Structure Test 2000 R]. Göttingen, Germany: Hogrefe. Anderson, N. R., Salgado, J. F., & Hülsheger, U. R. (2010). Applicant reactions in selection: Comprehensive meta-analysis into reaction generalization versus situational specificity. International Journal of Selection and Assessment, 18, 291–304. https://doi.org/10.1111/j.1468-2389.2010.00512.x Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26. https://doi.org/10.1111/j.17446570.1991.tb00688.x Bauer, T. N., & Truxillo, D. M. (2006). Applicant reactions to situational judgment tests: Research and related practical issues. In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests: Theory, measurement and application (pp. 233– 249). Mahwah, NJ: Erlbaum.
Received February 26, 2017
Revision received December 21, 2017
Accepted December 22, 2017
Published online August 3, 2018

EJPA Section/Category: I/O Psychology

Nadine Kasten
University of Trier
Universitätsring 15
54296 Trier
Germany
kasten@uni-trier.de
Appendix
Sample SJT Item

If I have to organize many different work projects simultaneously, . . .
(a) . . . I systematically work through the tasks starting with the most important one.
(b) . . . I try to work on the most important tasks but I tend to be easily distracted.
(c) . . . I start with the tasks I find most interesting.
Detailed therapeutic strategies for personality disorders “This is a very well-structured, informative, and readily accessible book that provides unique and valuable guidelines for therapists treating those with personality disorders, with a clarification- and schema-focused approach.” Elsa Ronningstam, PhD, Harvard Medical School, Harvard University, Cambridge, MA
Rainer Sachse
Personality Disorders A Clarification-Oriented Psychotherapy Treatment Model 2020, x / 254 pp. US $49.80 / € 39.95 ISBN 978-0-88937-552-9 Also available as eBook This practice-oriented guide presents a model of personality disorders (PDs) based on the latest research showing that “pure” PDs are due to relationship disturbances. The reader gains concise and clear information about the dual-action regulation model and the framework for clarification-oriented psychotherapy, which relates the relationship dysfunction to central relationship motives and games. Practical information is given on how to behave with clients and clear therapeutic strategies based on a five-phase model are outlined to help therapists manage interactional problems in therapy and to assist clients in achieving effective change.
www.hogrefe.com
The eight pure personality disorders (narcissistic, histrionic, dependent, avoidant, schizoid, passive-aggressive, obsessive-compulsive, and paranoid) are each explored in detail so the reader learns about the specific features of each disorder and the associated interactional motives, dysfunctional schemas, and relationship games and tests, as well as which therapeutic approaches are appropriate for a particular PD. As the development of a trusting therapeutic relationship is difficult with this client group, detailed strategies and tips are given throughout.
Integrative perspectives on motivation and volition “This is an excellent and valuable volume. It is a wonderful collection of pieces on motivation that serves as an apt tribute to an unusually creative and generous scholar.” Andrew J. Elliot, PhD, Professor of Psychology, Department of Clinical & Social Sciences in Psychology, University of Rochester, NY, USA
Nicola Baumann / Miguel Kazén / Markus R. Quirin / Sander L. Koole (Eds.)
Why People Do the Things They Do Building on Julius Kuhl’s Contributions to the Psychology of Motivation and Volition 2018, xii + 434 pp. US $87.00 / € 69.95 ISBN 978-0-88937-540-6 Also available as eBook How can we motivate students, patients, employees, and athletes? What helps us achieve our goals, improve our well-being, and grow as human beings? These issues, which relate to motivation and volition, are familiar to everyone who faces the challenges of everyday life. This comprehensive book by leading international scholars provides integrative perspectives on motivation and volition that build on the work of German psychologist Julius Kuhl. The first part of the book examines the historical trail of the European and American research traditions of motivation and volition and their integration in Kuhl’s theory of personality systems interactions (PSI). The
second part of the book considers what moves people to action – how needs, goals, and motives lead people to choose a course of action (motivation). The third part of the book explores how people, once they have committed themselves to a course of action, convert their goals and intentions into action (volition). The fourth part shows what an important role personality plays in our motivation and actions. Finally, the fifth part of the book discusses how integrative theories of motivation and volition may be applied in coaching, training, psychotherapy, and education. This book is essential reading for everyone who is interested in the science of motivating people.

www.hogrefe.com
Multistudy Report
Validation of the Short and Extra-Short Forms of the Big Five Inventory-2 (BFI-2) and Their German Adaptations

Beatrice Rammstedt,1 Daniel Danner,2 Christopher J. Soto,3 and Oliver P. John4

1 GESIS – Leibniz Institute for the Social Sciences, Mannheim, Germany
2 University of Applied Labor Studies, Mannheim, Germany
3 Colby College, Waterville, MN, USA
4 University of California, Berkeley, CA, USA
Abstract: The present study investigates the validity and utility of the German adaptations of the two short forms of the Big Five Inventory-2 (BFI-2), the 30-item BFI-2-S, and the 15-item BFI-2-XS, developed by Soto and John (2017b). Both scales assess the Big Five domains. The BFI-2-S allows, in addition, the brief assessment of three facets per domain. Based on a large and heterogeneous sample, we show that the psychometric properties of these adapted short scales are consistent with those of the Anglo-American source versions, and we demonstrate substantial convergence between the adaptations and the source versions. Extending the original scale development study, we demonstrate high retest stability of the scales and their facets. Our results clearly indicate the construct and criterion validity of the two scales: Both show substantial convergence with the NEO-PI-R domain scales. Moreover, the distinctive correlation pattern found between the facets of the BFI-2 and the NEO-PI-R could be replicated for the facets of the BFI-2-S. Furthermore, we show that the domain scales of both instruments are associated in the hypothesized directions with important life outcomes, such as life satisfaction and intelligence, and that the facets of the BFI-2-S have incremental validity for predicting these outcomes. Keywords: Big Five Inventory-2, Big Five, short forms, validation, personality measurement
Over the course of the last half century, consensus has grown among personality researchers that a person’s personality can be described on the most global level in terms of the “Big Five” or “Five-Factor Model” framework (Goldberg, 1981; John, Naumann, & Soto, 2008).1 According to this framework, personality can be summarized along five independent and bipolar dimensions – namely, Extraversion, Agreeableness, Conscientiousness, Negative Emotionality (or Neuroticism), and Open-Mindedness (or Openness to Experience). Although different researchers have sometimes interpreted these trait dimensions in somewhat different ways – such as defining the fifth factor as Openness within the questionnaire-based Five-Factor Model tradition (Costa & McCrae, 1992), versus defining it as Culture or Intellect within the lexical Big Five tradition
(Goldberg, 1990) – the general agreement within the field of personality psychology on a common framework to measure personality has made personality assessment more attractive for a wide range of researchers and contexts outside the narrow field of personality research. Due to this agreement on a common framework, measures of personality are increasingly considered to be useful tools in applied settings, including education, health, industrial psychology, and economics. For example, influential national surveys, such as the German Socio-Economic Panel (SOEP), the German National Educational Panel Study (NEPS), the Household, Income and Labour Dynamics in Australia (HILDA) Survey, the GESIS Panel, and the UK Household Longitudinal Study (UKHLS); and major international surveys, such as the World Values Survey
1 Historically, the term “Big Five” has been associated with psycholexical research examining personality-descriptive terms in natural language, whereas the term “Five-Factor Model” has been associated with research examining the content and structure of traditional personality questionnaires (for a review, see John et al., 2008). In the interests of simplicity and readability, we generally refer to the “Big Five” throughout this paper, while acknowledging that the connotations of the “Big Five” and the “Five-Factor Model” differ somewhat.
(WVS) and the International Social Survey Programme (ISSP), have included measures of personality in their core questionnaires. Moreover, all of these surveys follow the Big Five approach when assessing personality. The growing spread of the Big Five into diverse contexts, and especially the widespread assessment of the Big Five dimensions in large-scale surveys, has increased the need for efficient instruments for their measurement. To meet this need, several short-scale measures have been developed over the last decade. For example, Rammstedt and John developed the BFI-10 (Rammstedt & John, 2007) and the BFI-K (Rammstedt & John, 2005), 10-item and 21-item abbreviated versions, respectively, of the established Big Five Inventory (BFI; John, Donahue, & Kentle, 1991). In the US context, Gosling and colleagues developed a 5-item and a 10-item personality inventory (TIPI; Gosling, Rentfrow, & Swann, 2003; for a critical review of the German adaptation see Herzberg & Brähler, 2006), and Donnellan and colleagues developed a 20-item measure from the International Personality Item Pool (Mini-IPIP; Donnellan, Oswald, Baird, & Lucas, 2006). All of these ultrashort measures are widely used in a variety of contexts that usually suffer from strict time limitations not allowing for a fullscale assessment of personality. However, none of them have directly addressed a key issue: how to ensure that a very brief scale will adequately sample the heterogeneous content of a construct as broad as one of the Big Five trait domains. In fact, some brief Big Five measures have been developed using criteria, such as maximizing the size and simplicity of factor loadings, that promote the construction of scales with a narrow rather than broad range of item content (Smith, McCarthy, & Anderson, 2000). In recent years, researchers have placed increasing emphasis on examining not only the global Big Five dimensions but also the more specific facets of these broad domains. As shown by Paunonen and Ashton (2001), taking facets into account – in addition to the global Big Five domains – can incrementally predict important personal, academic, social, and health outcomes, such as physical activity or the grade point average in college. Moreover, studies have shown that the facets of a domain can have differential predictive power for different outcome variables. For example, Roberts, Chernyshenko, Stark, and Goldberg (2005) showed that, compared to the Conscientiousness domain scale, specific facets of this domain substantially increase the predictive validity for self-reported drug consumption and health prevention. Only very few, and comparatively lengthy, instruments for the assessment of these more fine-grained facets of the Big Five domains have existed to date. The most established and widely used instrument for this purpose – developed in the questionnaire tradition of the Five-Factor Model – is the Revised NEO Personality Inventory
(NEO-PI-R; Costa & McCrae, 1992). To meet the need for a more efficient Big Five measure that also allows the facet structure of the Big Five domains to be assessed, the well-established Big Five Inventory (BFI) was recently revised (Soto & John, 2017a; for a German adaptation, see Danner et al., 2016). The original BFI was developed to combine the strengths of the lexical Big Five and questionnaire-based Five-Factor Model traditions by using short phrases to assess the prototypical content of each Big Five dimension (John et al., 2008). The new BFI-2 incorporates several key advances, such as a robust facet-level structure, and balanced keying to minimize the influence of acquiescent responding, while still maintaining the brevity and accessibility of the original BFI (Soto & John, 2017a). The resulting 60-item BFI-2 allows the assessment of both the domain level of the Big Five and the three most prototypical facets of each of these domains. Based on this 60-item BFI-2, Soto and John (2017b) recently developed and validated two short-form measures for the Anglo-American context. The first measure, the BFI-2-S, consists of 30 items and also allows the 15 facets of the BFI-2 to be investigated. The second measure, the even shorter BFI-2-XS, consists of only 15 items; by including one item from each of the three facets defining each Big Five domain, these 3-item domain scales reflect the full breadth of the Big Five dimensions as defined on the original BFI-2. Due to its brevity, however, the BFI-2-XS can be used only to assess the Big Five domains. These BFI-2 short forms were developed for use in specific research contexts that impose severe constraints on the amount of time that can be devoted to personality assessment, such as largescale surveys or laboratory studies. Soto and John (2017b) validated the two short forms of the BFI-2 using a university student sample and a more extensive and heterogeneous Internet sample. Their results indicate that, at the level of the Big Five domains, the BFI-2-S and the BFI-2-XS capture approximately 91% and 80%, respectively, of the total variance in the full BFI-2 domain scales. Both short-form instruments were shown to have a clear factorial structure, thus indicating appropriate factorial validity. With regard to the facet level, the authors could show that BFI-2-S facets retained approximately 89% of the predictive power of the full BFI-2. Given the promising findings of this initial validation study for the BFI-2-S and BFI-2-XS, and in view of the need for short scales that also incorporate a more fine-grained facet structure of the Big Five, we adapted the BFI-2-S and the BFI-2-XS to the German context. When doing so, we used the item translations carried out when adapting the full BFI-2 to German. These translations were based on the state of the art method, the so-called TRAPD (Translation, Review, Adjudication, Pretesting, and Documentation) approach (Harkness, 2003; see also Harkness, Villar,
& Edwards, 2010): Two translations into German were carried out independently by professional translators experienced in translating questionnaires. In a reconciling step, these two translations were then compared and aggregated to one single optimal solution. The present study aims, first, to investigate the psychometric properties of these German adaptations of the two BFI-2 short forms. Second, and more importantly, we aim to shed further light on the reliability and validity of the two short forms of the BFI-2. Therefore, we will investigate both the stability and the construct validity of the facets by comparing them to the corresponding scales of the NEO-PI-R. In addition, by investigating the criterion validity and incremental validity of the BFI-2-S facets for predicting central life outcomes that have been previously shown to relate with the Big Five (educational attainment, crystallized intelligence, life satisfaction, health, and income; see Rammstedt, Danner, & Lechner, 2017), we will examine the utility of the scales and the merits of assessing the facets in addition to the domain scales.
Method

Samples and Procedure

Data were collected through an online survey. Respondents were part of a regularly recruited online panel survey. The survey was conducted by a commercial online research organization in Germany (Respondi). A monetary incentive was paid to respondents upon participation. The survey included a 1-item attention check (“This is a functional check of the survey. Please choose the response category disagree a little here.”). Participants who failed this check were excluded from the survey. N = 1,338 respondents (50% female) were included in our analyses. The sample was heterogeneous with regard to age (M = 42.77, SD = 13.94) and education (36% lower secondary, 33% intermediate secondary, 16% higher secondary or general higher education entrance qualification, 15% university degree). A subset of the sample (N = 406, 50% female) participated in a retest after an interval of 6 weeks (for a more detailed description of the sample, see Danner et al., 2016).

To investigate the measurement invariance of the German adaptations of the BFI-2-S and BFI-2-XS compared to the original Anglo-American versions, we reanalyzed the Internet sample described and analyzed by Soto and John (2017b, Study 2) in their validation study of the two short-form versions. The 2,000 members of this sample (50% female, Mage = 28.85, SD = 11.82) volunteered to complete
an online version of the BFI-2 in exchange for automatically generated feedback about their personalities.
Measures

All participants in our sample completed the German adaptation of the full BFI-2 (Danner et al., 2016). The BFI-2 consists of 60 short-phrase items, with responses made on a 5-point rating scale ranging from strongly disagree (1) to strongly agree (5). Based on these responses, the scale scores for the BFI-2-S and BFI-2-XS were computed. The items of the two short scales are displayed in the Electronic Supplementary Material, ESM 1.

In addition, all respondents reported (a) their health status based on the single item “How would you describe your health status in general?”, rated on a scale from poor (1) to excellent (5), and (b) their satisfaction with life measured by the well-established single item (see Beierlein, Kovaleva, László, Kemper, & Rammstedt, 2014) “How satisfied are you with your life in general?”, rated on a scale from not satisfied at all (0) to completely satisfied (11). In addition, we assessed the following sociodemographic variables: age (in years); educational attainment (six categories from 1 = no formal education to 6 = university degree); labor force status (working vs. not working); and income based on 17 categories ranging from less than € 300 per month (1) to more than € 10,000 per month (17).

A subset of participants (N = 411) additionally completed the BEFKI-GC-K, a 12-item scale for the measurement of crystallized intelligence (Schipolowski et al., 2014). After a retest interval of 6 weeks, another subset of respondents (N = 406) completed the BFI-2 items again, and part of this subsample (N = 204) also completed the Revised NEO Personality Inventory (NEO-PI-R; Costa & McCrae, 1992; German adaptation: Ostendorf & Angleitner, 2003). The NEO-PI-R comprises 240 items and captures the Big Five personality domains and six facets of each domain (see Table 7). Other subsamples completed additional personality scales that were not analyzed for the present study.
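For readers who want to see the scoring logic spelled out, the following minimal Python sketch (ours, not the authors' code) mirrors the description above: negatively keyed items are reverse-scored, each facet score is the mean of its keyed items, and each domain score is the mean of its facets. The item indices and the facet/domain mapping shown here are placeholders, not the published BFI-2 scoring key.

```python
# Minimal scoring sketch (not the authors' code). Item indices below are
# PLACEHOLDERS, not the published BFI-2 key; substitute the real key before use.
import numpy as np

def reverse(x, low=1, high=5):
    """Reverse-key a rating given the response scale endpoints."""
    return low + high - x

# Hypothetical key: each BFI-2-S facet is the mean of one positively and one
# negatively keyed item (reverse-keying the latter also balances acquiescence).
FACETS = {
    "sociability":   {"pos": [0], "neg": [1]},
    "assertiveness": {"pos": [2], "neg": [3]},
    "energy":        {"pos": [4], "neg": [5]},
}
DOMAINS = {"extraversion": ["sociability", "assertiveness", "energy"]}

def score(responses):
    """responses: (n_persons, n_items) array of ratings from 1 to 5."""
    r = np.asarray(responses, dtype=float)
    facet_scores = {}
    for name, key in FACETS.items():
        keyed = np.concatenate([r[:, key["pos"]], reverse(r[:, key["neg"]])], axis=1)
        facet_scores[name] = keyed.mean(axis=1)
    domain_scores = {d: np.mean([facet_scores[f] for f in fs], axis=0)
                     for d, fs in DOMAINS.items()}
    return facet_scores, domain_scores
```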
Results and Discussion

Psychometric Properties of the German Adaptations of the BFI-2-S and BFI-2-XS

The most central questions are (a) to what extent the German adaptations of the BFI-2-S and BFI-2-XS scales demonstrate adequate psychometric properties and (b)
whether these properties are comparable to those of the Anglo-American source versions.2 Table 1 displays the means, standard deviations, and reliability estimates (retest correlations and Cronbach’s alpha) for the BFI-2-S and BFI-2-XS domain scales and the BFI-2-S facet scales (for interscale correlations, see Table 2). Both the BFI-2-S and BFI-2-XS assess the Big Five by means of extremely short scales consisting of six and three items per dimension, respectively, and these items were selected to cover a maximum bandwidth – namely, three different facets of each dimension. Therefore, the widely used Cronbach’s alpha coefficient will tend to underestimate the reliability of the scales (see Rammstedt & Beierlein, 2014), and retest coefficients will provide more appropriate reliability estimates. As shown in Table 1, 6-week retest reliabilities of the domain scales averaged .83 for the 6-item scales on the BFI-2-S, and still .77 for the 3-item scales on the super-brief BFI-2-XS. For the 2-item facet scales scored from the BFI-2-S, retest reliability still averaged .72. The sizes of the reliability coefficients are in line with, or even exceed, those of similarly brief Big Five measures (see Rammstedt & John, 2005, 2007; Gosling et al., 2003).

For completeness, and to compare reliability indicators for the German adaptations with those of the Anglo-American source versions, Table 1 also includes Cronbach’s alpha coefficients for the domain scales and facet scales. Not surprisingly, given the brevity of the short measures, the internal consistency coefficients are quite low compared to what is expected from full-length measures. The 6-item domain scales of the BFI-2-S showed an average internal consistency of .73; for the 3-item BFI-2-XS scales they averaged .53. As shown in Table 1, these coefficients are similar to those of the original US versions, both in terms of their sizes (although these are, in most cases, somewhat smaller) and their rank order.

Finally, Table 1 shows the part-whole correlations of both short-form measures with the full German BFI-2. Although the number of items is half that of the BFI-2, the BFI-2-S with its 30 items still explains approximately 90% of the variance in the full scale at the domain level and 78% at the facet level. And although the number of items in the BFI-2-XS is just one-quarter of that of the BFI-2, the 15 items of the extra-short form still explain three-quarters of the variance in the BFI-2 domain scales.

We additionally investigated the between-domain discrimination of the BFI-2-S and the BFI-2-XS by means of their scale intercorrelations (see Table 2). As expected and in line with the literature (Rammstedt & John, 2007; Soto & John, 2017a), there were only small to medium correlations between the domain scores. On average, the intercorrelation (absolute values, Fisher-transformed, averaged,
back-transformed) was .26 for the BFI-2-S and .15 for the BFI-2-XS. The correlations between the BFI-2-S facet scores are shown in ESM 2, Table E1. Also for the facets, there were – as expected – small to medium correlations between facets of different domains (.17 on average) but medium to large correlations between facets of the same domain (.44 on average).
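As a rough illustration of the statistics referred to in the preceding paragraphs, the sketch below (ours, with hypothetical names) computes Cronbach's alpha for a set of items, a retest correlation for a scale score, and the Fisher r-to-z averaging used for the mean correlations reported in the text and table notes.

```python
# Illustrative helpers (ours, not the authors' analysis code).
import numpy as np

def cronbach_alpha(items):
    """items: (n_persons, k_items) array; classical alpha coefficient."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def retest_r(scores_t1, scores_t2):
    """Retest reliability: correlation of the same scale score across occasions."""
    return np.corrcoef(scores_t1, scores_t2)[0, 1]

def mean_r_fisher(rs, use_abs=True):
    """Average correlations via Fisher's z (absolute values, as in the text)."""
    rs = np.abs(np.asarray(rs, dtype=float)) if use_abs else np.asarray(rs, dtype=float)
    return np.tanh(np.arctanh(rs).mean())

# Example: the ten BFI-2-S between-domain correlations of Table 2
# mean_r_fisher([.19, .25, .31, .34, .31, .30, .35, .18, .14, .21])  # ~ .26
```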
Factorial Structure

To examine the factorial structure of the BFI-2-S and BFI-2-XS, we conducted exploratory structural equation modeling (ESEM; Asparouhov & Muthén, 2009) with orthogonal target rotation. In the case of the BFI-2-XS, we modeled the Big Five based on its 15 items. Because the BFI-2-XS incorporates positively and negatively keyed items – thus allowing the measurement of acquiescence – and because acquiescence can bias both the factorial structure and the model fit, we included in the model an acquiescence factor capturing the tendency to agree regardless of item content (Aichholzer, 2014; Soto & John, 2017b). In the case of the BFI-2-S, the model is based on the 15 facet scores. We did not include an acquiescence factor for the BFI-2-S model because the facet scores are the arithmetic mean of one positively and one negatively keyed item, and they have thus already been corrected for acquiescence (Danner & Rammstedt, 2016; Rammstedt & Danner, 2017). The standardized factor loadings and the model fit are reported in Tables 3 and 4, respectively. For both instruments, all indicators (facets and items, respectively) loaded most strongly on their corresponding factors. In both cases, the fit indices (CFI ≥ .979, RMSEA ≤ .033) and the pattern of loadings suggest that – like the US versions of the scales – the German-language adaptations of both the BFI-2-S and the BFI-2-XS clearly reflect the intended five-dimensional structure. In addition, we analyzed the BFI-2-S facet scores and the BFI-2-XS items (centered items, see Soto & John, 2017b) with principal component analyses (five factors, varimax rotated). The respective loadings are also shown in Tables 3 and 4. As can be seen, the patterns of loadings were highly similar to the ESEM results.
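The ESEM models themselves were estimated with target rotation in dedicated SEM software; the principal component cross-check, however, is easy to reproduce. The following self-contained sketch (ours, not the authors' script) runs a five-component PCA with a generic varimax rotation on a persons-by-indicators matrix X (facet scores for the BFI-2-S, centered items for the BFI-2-XS).

```python
# Five-component PCA with varimax rotation (our sketch of the cross-check).
import numpy as np

def varimax(loadings, n_iter=100, tol=1e-6):
    """Generic varimax rotation of an (indicators x components) loading matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(n_iter):
        LR = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (LR ** 3 - LR @ np.diag((LR ** 2).sum(axis=0)) / p))
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return L @ R

def pca_loadings(X, n_components=5):
    """Unrotated component loadings from the correlation matrix of X."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    eigval, eigvec = np.linalg.eigh(corr)
    order = np.argsort(eigval)[::-1][:n_components]
    return eigvec[:, order] * np.sqrt(eigval[order])

# rotated = varimax(pca_loadings(X))   # compare with the PCA columns of Tables 3 and 4
```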
Cross-Cultural Equivalence

To investigate the comparability of the German adaptations of the scales with their Anglo-American source versions, we used the Internet sample from the US validation study and directly examined the measurement invariance for the scales of the BFI-2-S and BFI-2-XS across the two languages and cultures.
2 The original inputs and outputs of the present and all following analyses are available on request.
Table 1. Means, standard deviations (SD), test-retest correlations, and Cronbach’s α for the BFI-2-S and BFI-2-XS, and their correlations with the full BFI-2 domain and facet scales M
SD
Cronbach’s α
Retest S
BFI-2 Domain/Facet XS
S
XS
S
XS
S
XS
G
US
G
US
S
XS
Extraversion
3.13
2.92
0.65
0.71
.88
.82
.71
.77
.47
.63
.94
.84
Sociability
3.01
0.83
.77
.47
.70
.90
Assertiveness
3.07
0.90
.82
.66
.72
.90
Energy
3.32
0.82
.71
.57
.60
.65
.75
3.70
Compassion
3.94
0.74
.69
.51
.48
.90
Respectfulness
3.96
0.73
.63
.39
.48
.89
Trust
3.20
0.74
.66
.21
.53
.75
.78
Conscientiousness
3.64
3.70
3.64
0.57
0.63
0.68
0.73
.79
.72
.86
Agreeableness
.84
.78
.45
.55
.94
.87 .53
.61
.95
Organization
3.67
0.88
.79
.65
.79
.93
Productiveness
3.56
0.80
.76
.55
.58
.90
Responsibility Negative Emotionality Anxiety
3.86
0.68
2.74
2.84
3.01
0.74
.65 0.86
.85
.84
.33
.47
.80
.84 .65
.88 .95
2.61
0.97
.79
.69
.67
0.89
.76
.67
.75
.72
.74
.69
.54
Aesthetic Sensitivity
3.27
3.20
2.94
.96
.41
2.60
Open-Mindedness
.73
.75
Volatility
0.70
0.78
1.10
.79
.66
.64
.57
.95
3.36
0.85
.65
.57
.42
.87
3.50
0.83
.69
.64
.64
.93
.73
.78
.53
.60
3.31 3.31
3.25
0.66 0.84
0.75
.83
.77
.72
.88
.93
Creative Imagination Mean (Facet)
.89
.92 .53
Intellectual Curiosity Mean (Domain)
.88
.85 .67
0.81
Depression
.85
.53
.62
.95
.87
.90
Notes. Retest interval was 6 weeks. Coefficients for the US data were taken from Soto and John (2017b); G = German; XS = BFI-2-XS; S = BFI-2-S; Averages of all the correlational indices were computed using Fisher Z transformation.
Table 2. Correlations between manifest domain scores for the BFI-2-S/BFI-2-XS

                         Extraversion   Agreeableness   Conscientiousness   Negative Emotionality
Agreeableness            .19/.04
Conscientiousness        .25/.11        .31/.22
Negative Emotionality    .34/.20        .31/.18         .30/.22
Open-Mindedness          .35/.30        .18/.12         .14/.00             .21/.13

Note. N = 1,338; all r ≥ .11 are p < .001.
We investigated exact measurement invariance using confirmatory factor analysis (CFA; see Chen, 2007) and approximate measurement invariance using Bayesian structural equation modeling (see Cieciuch, Davidov, Schmidt, Algesheimer, & Schwartz, 2014). Whereas exact measurement invariance tests whether factor loadings and item intercepts are exactly identical across groups, approximate invariance allows a small amount of variation in these parameters across the two solutions (technically, a variance of 0.01 in the present study) and tests whether they differ meaningfully between groups. We evaluated measurement invariance separately for each of the five domains measured by each of the two instruments. This made the analyses more sensitive to non-invariance of single items, and it also avoided the complication of cross-loadings of single items on different domains (see Tables 3 and 4). Once again, the 15 facet scores were used as manifest indicators for the BFI-2-S, and the 15 items were used as manifest indicators for the BFI-2-XS. For both the exact and the approximate approach, we tested three levels of invariance: configural invariance (same factorial structure); metric invariance (same factor loadings); and scalar invariance
Table 3. Standardized Factor Loadings for the 2-item BFI-2S scale scores ESEM BFI-2S facets
E
A
C
PCA N
O
E
A
C
N
O
Sociability
.57
.12
.18
.11
.02
.87
.11
.02
.07
.02
Assertiveness
.38
.16
.10
.18
.35
.62
.23
.07
.22
.32
Energy
.38
.09
.27
.28
.18
.62
.24
.24
.29
.16
Compassion
.30
.47
.16
.02
.05
.27
.76
.18
.11
.17
Respectfulness
.07
.48
.19
.16
.01
.00
.72
.31
.19
.09
Trust
.03
.32
.03
.17
.10
.06
.72
.07
.20
.06
Organization
.09
.13
.51
.04
.10
.03
.06
.80
.09
.01 .07
Productiveness
.07
.10
.53
.16
.25
.11
.04
.77
.26
Responsibility
.01
.17
.45
.09
.15
.15
.21
.76
.14
.07
Anxiety
.03
.01
.00
.61
.19
.09
.01
.00
.87
.05
Depression
.19
.02
.21
.66
.23
.37
.07
.19
.72
.06
Volatility
.16
.39
.06
.61
.19
.12
.36
.18
.76
.13
Aesthetic Sensitivity
.23
.26
.15
.08
.41
.02
.17
.06
.07
.75
Intellectual Curiosity
.18
.18
.09
.00
.57
.07
.01
.07
.15
.80
Creative Imagination
.30
.08
.06
.07
.44
.36
.01
.15
.12
.64
Notes. E = Extraversion, A = Agreeableness, C = Conscientiousness, N = Negative Emotionality, O = Open-Mindedness, highest loadings are bolded, ESEM: RMSEA = .029, CFI = .990, SRMR = .012, N = 1,338.
Table 4. Standardized factor loadings of the BFI-2-XS items ESEM Item
E
A
C
N
PCA (centered items) O
ARS
E
A
C
N
O
Ich bin eher ruhig. (R)
.55
.13
.08
.18
.03 .16
.83
.03
.11
.16
.02
Ich neige dazu, die Führung zu übernehmen.
.43
.10
.04
.12
.30 .15
.61
.16
.13
.25
.26
Ich bin voller Energie und Tatendrang.
.51
.23
.23
.33
.20 .19
.46
.26
.31
.45
.20
Ich bin einfühlsam, warmherzig.
.05
.58
.16
.06
.14 .20
.01
.67
.28
.10
.17
Ich bin manchmal unhöflich und schroff. (R)
.10
.48
.28
.12
.02 .16
.27
.61
.30
.18
.08
Ich schenke anderen leicht Vertrauen, glaube an das Gute im Menschen.
.05
.38
.09
.08
.01 .16
.10
.72
.21
.13
.08 .05
Ich bin eher unordentlich. (R)
.01
.01
.63
.01
.05 .15
.06
.03
.76
.04
Ich neige dazu, Aufgaben vor mir herzuschieben. (R)
.11
.04
.61
.19
.05 .16
.06
.01
.73
.21
.05
Ich bin verlässlich, auf mich kann man zählen.
.10
.30
.34
.06
.06 .22
.05
.31
.57
.05
.05
Ich mache mir oft Sorgen.
.15
.12
.01
.65
.02 .16
.07
.09
.09
.81
.00
Ich bin oft deprimiert, niedergeschlagen.
.31
.13
.16
.70
.01 .14
.16
.14
.18
.78
.04 .11
Ich bin ausgeglichen, nicht leicht aus der Ruhe zu bringen. (R)
.19
.21
.11
.62
.16 .16
.34
.14
.13
.71
Ich kann mich für Kunst, Musik und Literatur begeistern.
.05
.14
.07
.00
.43 .14
.05
.18
.10
.02
.69
Mich interessieren abstrakte Überlegungen wenig. (R)
.13
.02
.03
.03
.54 .16
.05
.04
.03
.02
.78
Ich bin originell, entwickle neue Ideen
.30
.01
.07
.12
.55 .17
.32
.00
.13
.20
.62
Note. E = Extraversion, A = Agreeableness, C = Conscientiousness, N = Negative Emotionality, O = Open-Mindedness, ARS = acquiescence; (R) = reverse-keyed item; highest loadings are bolded; ESEM: RMSEA = .033, CFI = .979, SRMR = .016; N = 1,338.
(same factor loadings and same item intercepts). Exact measurement invariance was evaluated on the basis of the criteria suggested by Chen (2007), where a difference of CFI < .01 suggests a stricter level of invariance. Approximate invariance was evaluated on the basis of criteria suggested by Cieciuch et al. (2014), where a posterior predictive probability (ppp) greater than zero suggests that a specified level of invariance can be accepted. The results
are shown in Table 5. For the BFI-2-S, results suggest exact scalar invariance for Negative Emotionality, metric invariance for Extraversion and Conscientiousness, and only configural invariance for Agreeableness and OpenMindedness. However, because the ppp was greater than zero for all scalar models, results suggest approximate scalar invariance for all domains. For the BFI-2-XS, results suggest exact metric invariance for Open-Mindedness and only
configural invariance for Extraversion, Agreeableness, Conscientiousness, and Negative Emotionality. However, because the ppp was greater than zero for all scalar models, results again suggest approximate scalar invariance for all domains.

Convergent Validity

As an extension to the initial development and validation study for the BFI-2-S and BFI-2-XS conducted by Soto and John (2017b), we investigated the construct validity of the scales in more detail. First, we compared the BFI-2-XS and BFI-2-S domain scores and the facet scores of the BFI-2-S with those of the NEO-PI-R, as one of the most established full-length measures assessing both the Big Five domains and their facets. Then, we investigated the criterion validity of the domain scales and the differential validity of the facet scales for important life outcomes. The NEO-PI-R was assessed only on the second measurement occasion. Thus, the correlations between the BFI-2-S, BFI-2-XS, and the NEO-PI-R scores were estimated on the basis of the data from the second assessment.

Table 6 displays the correlations of the BFI-2-XS and BFI-2-S domain scales with the corresponding NEO-PI-R domains. As the BFI/BFI-2 and the NEO-PI-R differ somewhat in the construct definitions of the Big Five used (see John et al., 2008; Rammstedt & John, 2007), these differences necessarily limit the absolute size of the correlations. Of more importance here is the extent to which the short-form versions of the BFI-2 can reflect the convergence of the full-length BFI-2 with the NEO-PI-R. Therefore, for comparison purposes, the correlations of the NEO-PI-R domain scales with the German-language full BFI-2 are also reported in Table 6. Not surprisingly, the size of the correlations decreased slightly with scale length. Whereas the convergent correlations averaged .79 for the full BFI-2, they averaged .76 for the BFI-2-S and .69 for the BFI-2-XS. That is, the full BFI-2 and the NEO-PI-R domain scales share, on average, 62% common variance; this value drops only to 58% for the BFI-2-S, and then more substantially to 48% for the BFI-2-XS. However, given that the number of items was halved from version to version (60 compared to 30 and 15, respectively), the decrease in convergence of 3% points for the BFI-2-S can be regarded as minor. For the BFI-2-XS, shared variance with the NEO-PI-R decreases by 9% compared to the full BFI-2. This is due primarily to Extraversion and Open-Mindedness. Thus, it seems that, in these domains, the 3-item scales of the BFI-2-XS cover markedly less of the content contained in the 48-item NEO-PI-R scales.

To investigate in more detail the convergence between the BFI-2-S and the NEO-PI-R, we computed correlations of the BFI-2-S facets with the NEO-PI-R domains and the six NEO-PI-R facets of the corresponding Big Five domains (see Table 7). Overall, each BFI-2-S facet correlated substantially with the relevant NEO-PI-R domain scales, with the average correlation being .60. With regard to the correlations of the BFI-2-S facets with the most relevant NEO-PI-R facets, it must be kept in mind that the BFI-2 facets were selected to reflect the common core across different facet-structural approaches to the Big Five – of which the NEO-PI-R is just one example – and that the specific facets of the BFI-2 do not always match a single facet in the NEO-PI-R. For example, the NEO-PI-R Openness facets do not include a unique Creative Imagination facet. However, in the cases in which a direct conceptual match can be made, these facets also show the highest correlations with this corresponding facet (e.g., .79 for the Depression facets on the BFI-2 and the NEO-PI-R).

Soto and John (2017a) report for the full BFI-2 the correlations of the BFI-2 facets with the NEO-PI-R facets of the corresponding domains, and identified 21 such correlations as distinctive because a NEO-PI-R facet correlated more strongly with a particular BFI-2 facet than with the two other, same-domain BFI-2 facets. Thus, for comparison purposes, distinctive correlations between the facets reported for the Anglo-American full-length version are bolded in Table 7. This table shows that, of the 21 NEO-PI-R facets that Soto and John (2017a) identified as showing a distinctive correlation with a particular BFI-2 facet, 19 also showed their strongest correlation with the same facet in the German adaptation of the BFI-2-S, and all were significant at the .001-level. The only exceptions were that NEO-PI-R Gregariousness correlated more strongly with German BFI-2-S Energy Level than with Sociability and that NEO-PI-R Fantasy correlated slightly more strongly with German BFI-2-S Aesthetic Sensitivity than with Creative Imagination. The former exception may be explained by the inclusion of emotional content (e.g., enjoyment vs. boredom) in some NEO-PI-R Gregariousness items, whereas the latter exception may be due to the exclusion of the BFI-2 item “Has trouble imagining things” from the BFI-2-S, as well as the focus of NEO-PI-R Fantasy on idle daydreaming rather than creativity and originality (cf. Soto & John, 2017a). A complete comparison of the correlation coefficients of the German BFI-2-S facets with the NEO-PI-R facets of the corresponding domains to those of the Anglo-American BFI-2 is visualized in ESM 3.

To more formally compare the similarity of the correlations found in the present study for the German adaptation of the BFI-2-S with those found for the Anglo-American version of the BFI-2 reported by Soto and John (2017a), we computed a column-vector correlation for each German BFI-2-S facet comparing its set of 30 correlations with the German NEO-PI-R facets to the corresponding set of correlations between the Anglo-American BFI-2 facets and the
Table 5. Exact and approximate measurement invariance for the BFI-2-S and the BFI-2-XS

Domain                  Invariance level   BFI-2-S: Exact (CFI)   BFI-2-S: Approximate (ppp)   BFI-2-XS: Exact (CFI)   BFI-2-XS: Approximate (ppp)
Extraversion            Configural         1.000                  .450                         1.000                   .483
                        Metric             1.000                  .468                         0.953                   .390
                        Scalar             0.984                  .516                         0.926                   .315
Agreeableness           Configural         1.000                  .482                         1.000                   .515
                        Metric             0.981                  .485                         0.976                   .404
                        Scalar             0.967                  .475                         0.546                   .355
Conscientiousness       Configural         1.000                  .485                         1.000                   .505
                        Metric             0.991                  .478                         0.989                   .515
                        Scalar             0.973                  .237                         0.955                   .406
Negative Emotionality   Configural         1.000                  .479                         1.000                   .513
                        Metric             0.995                  .472                         0.981                   .462
                        Scalar             0.990                  .278                         0.927                   .140
Open-Mindedness         Configural         1.000                  .500                         1.000                   .489
                        Metric             0.985                  .466                         1.000                   .457
                        Scalar             0.869                  .092                         0.986                   .040

Note. N = 3,338; CFI = comparative fit index; ppp = posterior predictive p value.
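To make the decision rules behind Table 5 explicit, the small sketch below (ours) encodes the exact-invariance criterion of Chen (2007), a CFI drop of less than .01 when moving to a stricter model, and the approximate-invariance criterion of Cieciuch et al. (2014), a posterior predictive p value greater than zero. The model fitting itself (multi-group CFA and Bayesian SEM) is assumed to have been done elsewhere; only the accept/reject logic is shown.

```python
# Decision rules only (our sketch); fit indices come from models estimated elsewhere.
def exact_invariance_level(cfi_configural, cfi_metric, cfi_scalar, delta=.01):
    """Strictest exact-invariance level supported under the Chen (2007) CFI rule."""
    level = "configural"
    if cfi_configural - cfi_metric < delta:
        level = "metric"
        if cfi_metric - cfi_scalar < delta:
            level = "scalar"
    return level

def approximate_invariance_ok(ppp):
    """Approximate invariance is accepted when the posterior predictive p value exceeds zero."""
    return ppp > 0

# Example with the BFI-2-S Extraversion row of Table 5:
# exact_invariance_level(1.000, 1.000, 0.984)  # -> "metric"
# approximate_invariance_ok(.516)              # -> True
```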
Table 6. Convergent validity correlations with the NEO-PI-R domain scales for the BFI-2-S and the BFI-2-XS domain scales as compared to the full BFI-2

NEO-PI-R                             BFI-2   BFI-2-S   BFI-2-XS
Extraversion                         .80     .75       .64
Agreeableness                        .78     .75       .71
Conscientiousness                    .81     .79       .72
Neuroticism/Negative Emotionality    .87     .84       .81
Open-Mindedness                      .66     .65       .52
Mean                                 .79     .76       .69

Notes. N = 204; all ps < .001. Mean correlations were computed using Fisher’s r to Z transformation.
Anglo-American NEO-PI-R facets. Across the 15 BFI-2-S facets, these column-vector correlations ranged from .57 to .97, and averaged .83 (computed via Fisher’s r to Z transformation), thus suggesting a highly similar correlation pattern for the German BFI-2-S compared to the full-length Anglo-American version.
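The column-vector correlations described in the preceding paragraph can be sketched as follows (our code, with variable names of our choosing): for each of the 15 BFI-2-S facets, its 30 correlations with the German NEO-PI-R facets are correlated with the corresponding 30 correlations of the Anglo-American BFI-2 facet, and the 15 resulting coefficients are averaged via Fisher's z.

```python
# Sketch of the column-vector correlation comparison (not the authors' script).
import numpy as np

def column_vector_similarity(german_cols, us_cols):
    """german_cols, us_cols: (30, 15) arrays of facet-by-NEO-PI-R-facet correlations.

    Returns the 15 column-vector correlations and their Fisher-averaged mean.
    """
    sims = np.array([np.corrcoef(german_cols[:, j], us_cols[:, j])[0, 1]
                     for j in range(german_cols.shape[1])])
    return sims, np.tanh(np.arctanh(sims).mean())
```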
Criterion Validity

To investigate the criterion validity of the two short scale measures, we first computed the correlations of the domain scores with the important life outcomes assessed at the first measurement occasion, namely, educational attainment, intelligence, life satisfaction, self-reported health, and income. Numerous previous studies, some of which were based on the original full BFI or its short forms, have demonstrated reliable associations between the Big Five and these outcomes. For example, educational attainment is known to relate positively with Open-Mindedness (e.g., Caspi, Roberts, & Shiner, 2005; George, Helson, & John, 2011; Rammstedt, 2007a). For crystallized intelligence (Gc), a moderate positive correlation with Open-Mindedness and a small negative correlation with Negative Emotionality are typically found (Rammstedt, Danner, & Martin, 2016; Von Stumm & Ackerman, 2013). Life satisfaction and self-perceived health have been found to be negatively associated with Neuroticism or Negative Emotionality and positively associated with Extraversion (DeNeve & Cooper, 1998; Gutierrez, Jimenez, Hernández, & Puente, 2005; Rammstedt, 2007b; Rammstedt, Kemper, Klein, Beierlein, & Kovaleva, 2013). Moreover, some studies have also found a positive association between health and Conscientiousness (e.g., Bogg & Roberts, 2004). And finally, emotionally stable individuals have been found to report higher incomes (e.g., Judge, Higgins, Thoresen, & Barrick, 1999).
Table 7. Correlations of the BFI-2-S facet scales with the corresponding facet scales of the NEO-PI-R BFI-2-S Facets
Domain
Extraversion
NEO-F1
NEO-F2
NEO-F3
NEO-F4
NEO-F5
NEO-F6
Warmth
Gregariousness
Assertiveness
Activity
F1 Sociability
.55
.32
.43
.43
.49
.23
F2 Assertiveness
.49
.13
.28
.74
.43
.25
.17
F3 Energy
.64
.44
.53
.35
.57
.22
.56
Agreeableness
Excitement-Seeking Positive Emotionality .39
Trust
Straightforwardness
Altruism
Compliance
Modesty
Tender-Mindedness
F1 Compassion
.60
.46
.34
.63
.30
.15
.57
F2 Respectfulness
.59
.48
.38
.55
.41
.22
.34
F3 Trust
.61
.61
.30
.44
.40
.28
.37
Self-Discipline
Deliberation
Conscientiousness
Competence
Order
F1 Organization
.60
.38
.66
.51
.41
.53
.42
F2 Productiveness
.71
.47
.55
.54
.52
.78
.51
F3 Responsibility
.66
.56
.48
Negative Emotionality
.46
.49
Anxiety
Angry Hostility
Dutifulness Achievement Striving
Depression Self-Consciousness
.64
.53
Impulsiveness
Vulnerability
F1 Anxiety
.69
.69
.61
.63
.43
.39
.60
F2 Depression
.73
.64
.56
.79
.56
.34
.63
F3 Volatility
.70
Open-Mindedness
.56
.78
.57
.48
.48
.63
Fantasy
Aesthetics
Feelings
Actions
Ideas
Values .08
F1 Aesthetic Sensitivity
.42
.29
.48
.18
.08
.35
F2 Intellectual Curiosity
.51
.17
.34
.29
.20
.73
.07
F3 Creative Imagination
.51
.26
.37
.34
.28
.52
.07
Notes. F = facet; N = 204; correlations of identically labeled facets are bold italicized; distinctive correlations according to Soto and John (2017a) are set in bold. All coefficients > .22 = p < .001.
Table 8 shows the correlations between the German BFI2-S and BFI-2-XS domains and the various indicators of life outcomes. For both short forms, the correlation patterns were highly similar in terms of direction and size. For both scales, the correlation pattern reflects the associations typically found with the particular outcome variables. For example, the Open-Mindedness domain was positively associated with educational attainment, whereas Negative Emotionality was negatively correlated with life satisfaction and health, and Extraversion was positively correlated with life satisfaction and health. As can be seen from the R2 coefficients in Table 8, both scales explained a substantial portion of the variance in the outcomes. Furthermore, the proportion of variance explained by the BFI-2-S and BFI2-XS domains scores did not decrease compared to the domain scores of the full 60-item version of the BFI-2. Also with regard to the facet scales the proportion of variance explained by the BFI-2-S facet scores was only marginally smaller compared to the facet scores of the full BFI-2 (average ΔR2 = .02). As an indicator for the differential validity, and thus for the incremental value, of the facets compared to using only the domain scales, we analyzed the criterion validity for the 15 facets separately. As can be seen from Table 8, in several cases, the facets of a domain show a differential correlation pattern. For example, the positive associations between
Extraversion and life satisfaction and health are primarily due to the facet Energy Level, whereas the other two facets of Extraversion show markedly lower correlations with these outcomes. Similarly, the fact that more neurotic persons report lower incomes is primarily caused by the facets Anxiety and Depression and to a much lesser extent by Emotional Volatility. Moreover, the positive association between Open-Mindedness and educational attainment is due primarily to positive associations with the Intellectual Curiosity and Aesthetic Sensitivity facets and to a much lesser extent to the Creative Imagination facet. For all outcome variables, the proportion of the variance explained by the 15 facets (mean R2 = .19) is greater than that explained by the domain scales (mean R2 = .14), thus indicating that the BFI-2-S facets provide approximately 36% greater predictive validity compared to the domain scales.
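The incremental-validity comparison just reported amounts to contrasting the variance in an outcome explained by the five domain scores with that explained by the fifteen facet scores. A minimal sketch of such a comparison (ours, not the authors' analysis script) is shown below.

```python
# R-squared comparison for domains versus facets (our sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(X, y):
    """Plain OLS R^2 of y regressed on the columns of X."""
    return LinearRegression().fit(X, y).score(X, y)

# domains: (n, 5) array, facets: (n, 15) array, outcome: (n,) vector
# r2_domains = r_squared(domains, outcome)
# r2_facets  = r_squared(facets, outcome)
# relative_gain = (r2_facets - r2_domains) / r2_domains   # ~ .36 in the text
```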
Summary and General Discussion

Results of the present research demonstrate the validity and utility of the German adaptations of the two short forms of the BFI-2, namely, the BFI-2-S and BFI-2-XS. Specifically, the German-language adaptations and the Anglo-American source versions of the BFI-2-S and BFI-2-XS
Table 8. Correlations of the BFI-2-S and BFI-2-XS domain scales and BFI-2-S facet scales with outcome variables (values per row: Education / Intelligence (Gc) / Life Satisfaction / Health / Income)
Extraversion
  Domain scale BFI-2-S: .16 / .01 / .28 / .26 / .16
  Domain scale BFI-2-XS: .15 / .03 / .24 / .24 / .16
  BFI-2-S facet F1 Sociability: .03 / .06 / .14 / .13 / .06
  BFI-2-S facet F2 Assertiveness: .21 / .10 / .17 / .13 / .16
  BFI-2-S facet F3 Energy: .12 / .03 / .34 / .34 / .13
Agreeableness
  Domain scale BFI-2-S: .03 / .03 / .18 / .04 / .02
  Domain scale BFI-2-XS: .00 / .04 / .18 / .04 / .01
  BFI-2-S facet F1 Compassion: .02 / .05 / .13 / .00 / .04
  BFI-2-S facet F2 Respectfulness: .00 / .03 / .18 / .08 / .03
  BFI-2-S facet F3 Trust: .04 / .01 / .11 / .00 / .01
Conscientiousness
  Domain scale BFI-2-S: .03 / .02 / .19 / .12 / .10
  Domain scale BFI-2-XS: .05 / .01 / .16 / .10 / .11
  BFI-2-S facet F1 Organization: .06 / .08 / .11 / .05 / .07
  BFI-2-S facet F2 Productiveness: .03 / .02 / .18 / .14 / .13
  BFI-2-S facet F3 Responsibility: .02 / .08 / .16 / .11 / .05
Negative Emotionality
  Domain scale BFI-2-S: .15 / .15 / .45 / .44 / .20
  Domain scale BFI-2-XS: .15 / .17 / .45 / .47 / .22
  BFI-2-S facet F1 Anxiety: .16 / .20 / .34 / .35 / .21
  BFI-2-S facet F2 Depression: .13 / .10 / .50 / .49 / .19
  BFI-2-S facet F3 Volatility: .09 / .08 / .28 / .25 / .10
Open-Mindedness
  Domain scale BFI-2-S: .29 / .21 / .10 / .05 / .09
  Domain scale BFI-2-XS: .30 / .20 / .07 / .03 / .07
  BFI-2-S facet F1 Aesthetic Sensitivity: .24 / .22 / .04 / .01 / .02
  BFI-2-S facet F2 Intellectual Curiosity: .28 / .12 / .08 / .05 / .09
  BFI-2-S facet F3 Imagination: .13 / .12 / .12 / .05 / .10
R2 (BFI-2 domains): .10 / .11 / .22 / .21 / .06
R2 (BFI-2-S domains): .11 / .08 / .23 / .23 / .06
R2 (BFI-2-XS domains): .11 / .08 / .24 / .25 / .07
R2 (BFI-2 facets): .16 / .21 / .30 / .30 / .09
R2 (BFI-2-S facets): .15 / .16 / .28 / .30 / .08
Notes. All facet correlations with education, life satisfaction, health, and income of at least |.10| are significant at p < .001; all correlations with Gc of at least |.20| are significant at p < .001.
Summary and General Discussion
Results of the present research demonstrate the validity and utility of the German adaptations of the two short forms of the BFI-2, namely, the BFI-2-S and BFI-2-XS. Specifically, the German-language adaptations and the Anglo-American source versions of the BFI-2-S and BFI-2-XS showed quite similar psychometric properties. Moreover, the German adaptations of both short forms proved to be approximately invariant with respect to the Anglo-American source versions. Extending the original scale development and
validation study by Soto and John (2017b), we were able to show high retest reliability of the scales and their facets, with sizes being in line with, or even exceeding, those of other Big Five short scales.
By extending the original validation study for the Anglo-American BFI-2-S and BFI-2-XS versions (Soto & John, 2017b), we have also added to the body of validity evidence for these scales: First, we demonstrated that both short forms show substantial convergence with the NEO-PI-R domain scales, with a pattern of associations very similar to that shown by the full BFI-2 and only modest (for the BFI-2-S) or moderate (for the BFI-2-XS) decreases in correlation strength. Our results further revealed the construct validity of the BFI-2-S facets: The distinctive correlation pattern of the facets with those of the NEO-PI-R that was found for the Anglo-American BFI-2 (see Soto & John, 2017a) could be largely replicated for the German adaptation of the BFI-2-S. Second, we investigated the criterion validity of the domain scales for the two short-form versions for central life outcomes. Our results revealed that, for all outcome variables investigated, the known associations with the Big Five traits established by previous research could be replicated. In addition, our analyses at the facet level further highlighted the incremental validity of Big Five facets, and thus the importance of investigating the facet scales in addition to the domain scales. Several life outcomes showed distinctive correlations with particular BFI-2-S facets, and collectively the 15 facet scales provided substantially greater predictive validity than did the five domain scales. One central limitation of the present study is that our analyses are based on data that extract the BFI-2-S and BFI-2-XS items from an administration of the full BFI-2, rather than data from a separate administration of the short forms (cf. Smith et al., 2000). To address this caveat, we compared our findings to those from a small and comparatively homogeneous sample of academics to which only the German BFI-2 short forms were administered (see Lechner, Génois, Strohmaier, & Rammstedt, 2017). Results indicated that scale means, reliabilities, and factor loadings were all very similar for embedded versus separate administrations of the short forms. For example, congruence coefficients comparing the present factor loadings (Table 3) with those obtained by Lechner et al. (2017) were all at least .91 (M = .95) for the BFI-2-S. However, future studies using larger and more diverse samples should be conducted to replicate these initial findings; such studies could also serve as a source for benchmark scale scores. A second limitation concerns the relatively low Cronbach’s alpha reliability of some scales, especially for the ultra-short BFI-2-XS. However, this finding is not surprising given that Cronbach’s alpha is determined by a scale’s length (i.e., number of items) and mean inter-item correlation (i.e., content redundancy; Gosling et al., 2003; Rammstedt & Beierlein, 2014; Smith et al., 2000). Because the BFI-2-XS scales were developed to maximize content breadth using only three items per Big Five domain, they tend to have
relatively modest alpha reliabilities in both English and German. However, as shown by the present results, as well as by Soto and John (2017b), modest alphas do not prevent the BFI-2 short forms from demonstrating strong retest reliability, structural validity, and – most importantly – predictive power (Tables 1, 3, 4, and 8). In sum, our results suggest that the German adaptations of the BFI-2-S and BFI-2-XS can serve as useful instruments for assessing the Big Five in settings that severely limit assessment time, such as large-scale surveys. In contrast with previously established Big Five short scale measures, the 30-item BFI-2-S also allows investigation of the main Big Five facets in addition to the domains, which proved to be an important additional source of personality information. We caution that, in settings which are less restrictive to assessment time, or which focus on individuals’ scores (e.g., for candidate selection) rather than overall associations between variables, researchers should carefully consider the reduced psychometric properties of these short scales when deciding whether to administer them instead of a longer measure such as the full BFI-2. However, when the administration of a full-length personality measure is not feasible, the German BFI-2-S and BFI-2-XS offer considerable reliability and validity with minimal assessment time. Acknowledgment We gratefully thank Clemens Lechner for providing additional data and results on the BFI-2-S. Electronic Supplementary Materials The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/ 1015-5759/a000481 ESM 1. Tables (.pdf) Items of the short scales BFI-2-S and BFI-2-XS in German. ESM 2. Table (.pdf) Correlations between the manifest facet scores of the BFI-2-S. ESM 3. Figure (.pdf) Correlations of the 15 German BFI-2-S compared to the Anglo-American BFI-2 facet scales with the corresponding facet scales of the NEO-PI-R.
References Aichholzer, J. (2014). Random intercept EFA of personality scales. Journal of Research in Personality, 53, 1–4. https://doi.org/ 10.1016/j.jrp.2014.07.001 Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling, 16, 397–438. https://doi.org/10.1080/10705510903008204 Beierlein, C., Kovaleva, A., László, Z., Kemper, C. J., & Rammstedt, B. (2014). Eine Single-Item-Skala zur Erfassung der
Allgemeinen Lebenszufriedenheit: Die Kurzskala Lebenszufriedenheit-1 (L-1) [A single-item scale measuring general life satisfaction]. Köln, Germany: GESIS. (GESIS Working Papers 2014, 33). https://www.gesis.org/fileadmin/kurzskalen/working_papers/ L1_WorkingPapers_2014-33.pdf Bogg, T., & Roberts, B. W. (2004). Conscientiousness and healthrelated behaviors: A meta-analysis of the leading behavioral contributors to mortality. Psychological Bulletin, 130, 887–919. https://www.gwern.net/docs/conscientiousness/2004-bogg.pdf Caspi, A., Roberts, B. W., & Shiner, R. L. (2005). Personality development: Stability and change. Annual Review of Psychology, 56, 453–484. https://www.annualreviews.org/doi/10.1146/ annurev.psych.55.090902.141913 Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14, 464–504. https://www.tandfonline.com/doi/abs/10.1080/ 10705510701301834 Cieciuch, J., Davidov, E., Schmidt, P., Algesheimer, R., & Schwartz, S. H. (2014). Comparing results of an exact vs. an approximate (Bayesian) measurement invariance test: A cross-country illustration with a scale to measure 19 human values. Frontiers in Psychology, 5, 982. https://www.ncbi.nlm.nih.gov/pmc/ articles/PMC4157555/ Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory and NEO Five Factor Professional Manual. Odessa, FL: Psychological Assessment Resources. Danner, D., & Rammstedt, B. (2016). Facets of acquiescence: Agreeing with negations is not the same as accepting inconsistency. Journal of Research in Personality, 65, 120–129. https://doi.org/10.1016/j.jrp.2016.10.010 Danner, D., Rammstedt, B., Bluemke, M., Treiber, L., Berres, S., Soto, C., & John, O. P. (2016). Die deutsche Version des Big Five Inventory 2 (BFI-2). Zusammenstellung sozialwissenschaftlicher Items und Skalen [German version of the Big Five Inventory 2 (BFI-2)]. Mannheim, Germany: GESIS. https://doi.org/10.6102/ zis247 DeNeve, K. M., & Cooper, H. (1998). The happy personality: A meta-analysis of 137 personality traits and subjective wellbeing. Psychological Bulletin, 124, 197–229. https://www.ncbi. nlm.nih.gov/pubmed/9747186 Donnellan, M. B., Oswald, F. L., Baird, B. M., & Lucas, R. E. (2006). The Mini-IPIP scales: Tiny-yet-effective measures of the Big Five factors of personality. Psychological Assessment, 18, 192–203. http://psycnet.apa.org/doi/10.1037/1040-3590.18.2.192 George, L. G., Helson, R., & John, O. P. (2011). The “CEO” of women’s work lives: How Big Five Conscientiousness, Extraversion, and Openness predict 50 years of work experiences in a changing sociocultural context. Journal of Personality and Social Psychology, 101, 812–830. https://doi.org/10.1037/a0024290 Goldberg, L. R. (1981). Language and individual differences: The search for universals in personality lexicons. In L. Wheeler (Ed.), Review of Personality and Social Psychology (pp. 141–165). Beverly Hills, CA: Sage. Goldberg, L. R. (1990). An alternative “description of personality”: The Big-Five factor structure. Journal of Personality and Social Psychology, 59, 1216–1229. https://www.ncbi.nlm.nih.gov/ pubmed/2283588 Gosling, S. D., Rentfrow, P. J., & Swann, W. B. Jr. (2003). A very brief measure of the Big-Five personality domains. Journal of Research in Personality, 37, 504–528. https://doi.org/10.1016/ S0092-6566(03)00046-1 Gutierrez, J. L. G., Jimenez, B. M., Hernández, E. G., & Puente, C. P. (2005). Personality and subjective well-being: Big Five correlates and demographic variables. 
Personality and Individual Differences, 38, 1561–1569. https://doi.org/10.1016/ j.paid.2004.09.015
Harkness, J. (2003). Questionnaire translation. In J. Harkness, F. van de Vijver, & P. Mohler (Eds.), Cross-cultural survey methods (pp. 35–56). Hoboken, NJ: Wiley. Harkness, J. A., Villar, A., & Edwards, B. (2010). Translation, adaptation, and design. In J. A. Harkness, M. Braun, B. Edwards, T. P. Johnson, L. Lyberg, P. Ph. Mohler, B.-E. Pennell, & T. W. Smith (Eds.), Survey methods in multinational, multiregional, and multicultural contexts (pp. 117–140). Hoboken, NJ: Wiley-Blackwell. Herzberg, P. Y., & Brähler, E. (2006). Assessing the Big-Five personality domains via short forms. A cautionary note and a proposal. European Journal of Psychological Assessment, 22, 139–148. https://doi.org/10.1027/1015-5759.22.3.139 John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory – Versions 4a and 54. Berkeley, CA: University of California, Berkeley, Institute of Personality and Social Research. John, O. P., Naumann, L. P., & Soto, C. J. (2008). Paradigm shift to the integrative Big-Five trait taxonomy: History, measurement, and conceptual issues. In O. P. John, R. W. Robins, & L. A. Pervin (Eds.), Handbook of personality: Theory and research (pp. 114–158). New York, NY: Guilford Press. Judge, T. A., Higgins, C. A., Thoresen, C. J., & Barrick, M. R. (1999). The Big Five personality traits, general mental ability, and career success across the life span. Personnel Psychology, 52, 621–652. https://doi.org/10.1111/j.1744-6570.1999.tb00174.x Lechner, C. M., Génois, M., Strohmaier, M., & Rammstedt, B. (2017). Personality and social behavior on academic conferences: A network approach. Manuscript in preparation. Ostendorf, F., & Angleitner, A. (2003). NEO-Persönlichkeitsinventar nach Costa und McCrae, Revidierte Fassung (NEO-PI-R). Manual [German adaptation of the revised NEO Personality Inventory by Costa and McCrae (NEO-PI-R)]. Göttingen, Germany: Hogrefe. Paunonen, S. V., & Ashton, M. C. (2001). Big Five factors and facets and the prediction of behavior. Journal of Personality and Social Psychology, 81, 524–539. https://www.ncbi.nlm. nih.gov/pubmed/11554651 Rammstedt, B. (2007a). The 10-Item Big Five Inventory (BFI-10). European Journal of Psychological Assessment, 23, 193–201. https://doi.org/10.1027/1015-5759.23.3.193 Rammstedt, B. (2007b). Who worries and who is happy? Explaining individual differences in worries and satisfaction by personality. Personality and Individual Differences, 43, 1626–1634. https://doi.org/10.1016/j.paid.2007.04.031 Rammstedt, B., & Beierlein, C. (2014). Can’t we make it any shorter? The limits of personality assessment and ways to overcome them. Journal of Individual Difference, 35, 212–220. https://doi.org/10.1027/1614-0001/a000141 Rammstedt, B., & Danner, D. (2017). Acquiescent responding. In V. Zeigler-Hill & T. K. Shackelford (Eds.), Encyclopedia of personality and individual differences (pp. 1–3). Cham, Switzerland: Springer International Publishing. https://doi.org/ 10.1007/978-3-319-28099-8_1276-1 Rammstedt, B., Danner, D., & Lechner, C. (2017). The association between personality and life outcomes – Results from the PIAAC longitudinal study in Germany. Large-Scale Assessment in Education, 5, 2. https://doi.org/10.1186/s40536-017-0035-9 Rammstedt, B., Danner, D., & Martin, S. (2016). The association between personality and cognitive ability: Going beyond simple effects. Journal of Research in Personality, 62, 39–44. https:// doi.org/10.1016/j.jrp.2016.03.005 Rammstedt, B., & John, O. P. (2005). 
Kurzversion des Big Five Inventory (BFI-K): Entwicklung und Validierung eines ökonomischen Inventars zur Erfassung der fünf Faktoren der Persönlichkeit [Short version of the Big Five Inventory (BFI-K): Development and validation of an economic inventory
for assessment of the five factors of personality]. Diagnostica, 51, 195–206. http://psycnet.apa.org/doi/10.1026/0012-1924. 51.4.195 Rammstedt, B., & John, O. P. (2007). Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality, 41, 203–212. https://doi.org/10.1016/j.jrp.2006.02.001 Rammstedt, B., Kemper, C. J., Klein, M. C., Beierlein, C., & Kovaleva, A. (2013). Eine kurze Skala zur Messung der fünf Dimensionen der Persönlichkeit: 10 Item Big Five Inventory (BFI-10) [A short-scale measure assessing the Big Five dimensions of personality: the 10 item Big Five Inventory (BFI-10)]. Methoden, Daten, Analysen, 7, 235–251. https://www.gesis. org/fileadmin/upload/forschung/publikationen/zeitschriften/ mda/Vol.7_Heft_2/MDA_Vol7_2013-2_Rammstedt.pdf Roberts, B. W., Chernyshenko, O. S., Stark, S., & Goldberg, L. R. (2005). The structure of Conscientiousness: An empirical investigation based on seven major personality questionnaires. Personnel Psychology, 58, 103–139. http://psycnet.apa.org/doi/ 10.1111/j.1744-6570.2005.00301.x Schipolowski, S., Wilhelm, O., Schroeders, U., Kovaleva, A., Kemper, C. J., & Rammstedt, B. (2014). BEFKI GC-K: Eine Kurzskala zur Messung kristalliner Intelligenz [BEFKI GC-K: A short-scale measure assessing crystallized intelligence]. Methoden, Daten, Analysen, 7, 155–183. https://doi.org/10.12758/ mda.2013.010 Smith, G. T., McCarthy, D. M., & Anderson, K. G. (2000). On the sins of short-form development. Psychological Assessment,
12, 102–111. http://psycnet.apa.org/doi/10.1037/1040-3590.12. 1.102 Soto, C. J., & John, O. P. (2017a). The next Big Five Inventory (BFI2): Developing and assessing a hierarchical model with 15 facets to enhance bandwidth, fidelity, and predictive power. Journal of Personality and Social Psychology, 113, 117–143. https://doi.org/10.1037/pspp0000096 Soto, C. J., & John, O. P. (2017b). Short and extra-short forms of the Big Five Inventory-2: The BFI-2-S and BFI-2-XS. Journal of Research in Personality, 68, 69–81. https://doi.org/10.1016/ j.jrp.2017.02.004 Von Stumm, S., & Ackerman, P. L. (2013). Investment and intellect: A review and meta-analysis. Psychological Bulletin, 139, 841–869. http://psycnet.apa.org/doi/10.1037/a0030746
Received June 28, 2017
Revision received December 27, 2017
Accepted February 2, 2018
Published online August 3, 2018
EJPA Section/Category: Personality

Beatrice Rammstedt
GESIS – Leibniz Institute for the Social Sciences
PO Box 12 21 55
68072 Mannheim
Germany
beatrice.rammstedt@gesis.org
Multistudy Report
Personality Across the Lifespan: Exploring Measurement Invariance of a Short Big Five Inventory From Ages 11 to 84
Naemi D. Brandt,1,2 Michael Becker,1,2 Julia Tetzner,1,2 Martin Brunner,3 Poldi Kuhl,4 and Kai Maaz1
1 Department of Educational Governance, German Institute for International Educational Research, Berlin/Frankfurt a.M., Germany
2 Department of Educational Research, Leibniz Institute for Science and Mathematics Education, Kiel, Germany
3 Quantitative Methods in Educational Sciences, University of Potsdam, Potsdam, Germany
4 Institute of Educational Science, Leuphana University, Lüneburg, Germany
Abstract: Personality is a relevant predictor for important life outcomes across the entire lifespan. Although previous studies have suggested the comparability of the measurement of the Big Five personality traits across adulthood, the generalizability to childhood is largely unknown. The present study investigated the structure of the Big Five personality traits assessed with the Big Five Inventory-SOEP Version (BFI-S; SOEP = Socio-Economic Panel) across a broad age range spanning 11–84 years. We used two samples of N = 1,090 children (52% female, Mage = 11.87) and N = 18,789 adults (53% female, Mage = 51.09), estimating a multigroup CFA analysis across four age groups (late childhood: 11–14 years; early adulthood: 17–30 years; middle adulthood: 31–60 years; late adulthood: 61–84 years). Our results indicated the comparability of the personality trait metric in terms of general factor structure, loading patterns, and the majority of intercepts across all age groups. Therefore, the findings suggest both a reliable assessment of the Big Five personality traits with the BFI-S even in late childhood and a vastly comparable metric across age groups. Keywords: personality traits, measurement invariance, ESEM, lifespan, late childhood
Previous research has frequently shown that personality traits have a substantial influence on different life domains. They are meaningful for academic success, health, and well-being, among other domains (Anglim & Grant, 2016; Poropat, 2009; Sirois & Hirsch, 2015). Moreover, one important finding of recent research is that personality traits do not remain entirely stable throughout life and that they are related to the experience of different life events (Lüdtke, Roberts, Trautwein, & Nagy, 2011; Roberts & DelVecchio, 2000; Specht, Egloff, & Schmukle, 2011). Studying those dynamics over the life course brings new challenges including the measurement of personality traits (Milfont & Fischer, 2010). In order to make assumptions about changes in personality and their impact on relevant life outcomes, it is necessary to investigate whether personality traits can be assessed validly in a similar way across different age groups. So far, most previous studies have either examined the structure of personality for isolated age groups or small age ranges longitudinally (e.g., Asendorpf & van Aken, 2003; John, Caspi, Robins, Moffitt, & Stouthamer-Loeber, 1994; Measelle, John, Ablow, Cowan, & Cowan, 2005) or
excluded childhood from cross-sectional multigroup analyses across the lifespan (e.g., Marsh, Nagengast, & Morin, 2013). The use of different personality inventories for different age groups also limits comparisons. Therefore, the main aim of the present study was to investigate the psychometric properties of a short personality inventory across the lifespan, specifically in terms of measurement invariance across different age groups. We focused on a short personality inventory, which has many advantages, such as test efficiency in large-scale surveys and panel studies. Personality traits assessed with short instruments based on the Big Five Inventory (BFI; John, Donahue, & Kentle, 1991; John & Srivastava, 1999) are useful for many disciplines (such as psychology, educational science, and economics) in order to explain individual differences, for example, in educational outcomes and returns (e.g., Caliendo, Fossen, & Kritikos, 2014; Marsh et al., 2013; Specht et al., 2011). So far, researchers have primarily tested the measurement properties of short BFI in adult samples. Therefore, the present study was – to the best of our knowledge – the first to investigate the psychometric properties of a short BFI from late childhood to late adulthood.
Personality Structure Across the Lifespan Personality traits are individual characteristics of a person that have an impact on his or her experiences and behavior (McCrae & Costa, 2008). Initially, researchers assumed that personality traits were stable and fully developed by the age of 30 with few changes after that (Costa & McCrae, 1997). However, current research suggests that personality develops across the entire lifespan (Roberts, Walton, & Viechtbauer, 2006). On the one hand, personality development follows a normative trajectory, and consistency rises with increasing age, also known as the cumulative continuity principle (Caspi, Roberts, & Shiner, 2005). On the other hand, life events and conditions (e.g., educational transitions, illness, or unemployment) are substantively related to the course of personality development and changes in personality (Lüdtke et al., 2011; Specht et al., 2011). For adulthood, the most commonly used model to describe personality is the five-factor model (Big Five), which uses five broad factors to describe individual differences in experience and behavior (John, Naumann, & Soto, 2008): openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. These five factors build the highest order of a hierarchical personality model that subsumes narrower facets covering the diversity of human beings (John et al., 2008). Personality theory supposes the Big Five to be nearly uncorrelated, while on the facet level, correlations may occur (Costa, & McCrae, 1995). Empirically, researchers have often found significant relations between factors (Ashton, Lee, Goldberg, & DeVries, 2009). Some authors interpret this as evidence of higher-order factors above the Big Five (DeYoung, 2006; van der Linden, te Nijenhuis, & Bakker, 2010). Others attribute identified relations to artificial response tendencies (Biesanz & West, 2004; Chang, Connelly, & Geeza, 2012). While researchers generally use the Big Five to describe personality in adulthood, it is not yet well-known whether this model is also applicable to children’s personality structures. This knowledge gap might be due to a research tradition developed largely independently of the Big Five approach. In this research tradition, researchers describe individual differences in children not as personality but as temperament (Zentner & Bates, 2008). This line of research is exclusively focused on childhood and therefore offers few links between adults’ and children’s personality models (DeFruyt, Mervielde, Hoekstra, & Rolland, 2000). However, more recent studies have shown theoretical and empirical overlap between temperament and the Big Five (Caspi & Shiner, 2006; DePauw, Mervielde, & Van Leeuwen, 2009; Rothbart, Ahadi, & Evans, 2000). This might indicate that the Big Five personality model is applicable
to childhood. Researchers developed Big Five instruments with easy language for children (e.g., HiPIC; Bleidorn & Ostendorf, 2009), which confirmed a five-factor structure as established in adulthood. Theoretically unintended loadings also occurred. There are also further differences in the operationalization of child-specific inventories compared to adult inventories. Openness, for example, is conceptualized as imagination, omitting aspects of aesthetic and artistic interests in childhood inventories (Bleidorn & Ostendorf, 2009). Moreover, the Big Five showed stronger interrelations in childhood than in adulthood. It is still unclear whether these differences in metric and structure between age groups occur due to developmental processes, or if the differences are evoked by the use of inventories with slightly different conceptualizations or wordings (DePauw et al., 2009). In order to separate methodological issues from developmental processes, some studies have used adult Big Five Inventories (NEO-FFI; Costa & McCrae, 1992) with children and compared the structure to other age groups (Allik, Laidra, Realo, & Pullmann, 2004) or investigated age differences in the personality structure longitudinally (McCrae et al., 2002). These studies showed that the Big Five were apparent, and to some extent comparable to the adult Big Five structure, by age 12. Even though differences in factor structures of adults and children remained, Allik et al. (2004) found no evidence for additional factors besides the Big Five using the NEO-FFI. There were similar results when researchers investigated the structure using the Big Five Inventory: In a sample of 10- to 20-year-olds, Soto, John, Gosling, and Potter (2008) found five factors while controlling for acquiescent response style. With increasing age, more items loaded on the theoretically intended domain. When the researchers controlled for acquiescence, the five-factor structure was already very recognizable at age 10. The explained variance of factors, though, was smallest in childhood and increased during adulthood. Thus, the results obtained from using adult inventories for samples of children provide evidence for the reproducibility of the Big Five personality structure in childhood with longer Big Five inventories. However, previous studies could not contrast children’s personality structures with adults’ psychometric properties across the entire lifespan and only contrasted them with the specific and frequently studied young adult age group. Panel survey designs are suitable for investigating personality traits and their relations with relevant life outcomes across a broad age range or even across the entire lifespan. The large samples assessed in these contexts necessitate efficient inventories, so researchers often use short personality inventories. The generalizability from longer tests to shorter ones must be investigated when including new research questions (Ziegler, Poropat, & Mell, 2014). A first
study by Marsh et al. (2013) examined the psychometric properties of a short Big Five Inventory (BFI) with 15 items in a British survey sample of 15- to 99-year olds. The authors found a comparable metric across age groups: The relations of the Big Five, the item loadings, and the vast majority of item intercepts were similar when the authors allowed for correlations between negatively worded items. In Germany, using data from the German Socio-Economic Panel (SOEP), researchers confirmed longitudinal measurement invariance (Specht et al., 2011) and also multigroup invariance (Lucas & Donnellan, 2011) of the SOEP version of the Big Five Inventory (BFI-S) across adulthood when they modeled personality domains separately. So far, no study has investigated whether this pattern of results remains when surveying children using a German short BFI.
The Present Study
To date, there is little systematic research on the comparability of personality traits from late childhood to late adulthood. Although there is evidence for longer Big Five inventories (e.g., Allik et al., 2004; McCrae et al., 2002; Soto et al., 2008), little is known about short inventories. Therefore, the aim of the present study was to investigate the psychometric properties of a short Big Five inventory in terms of three aims. First, we considered the personality structure in childhood and adulthood (Aim 1). Since previous studies based on childhood and adult Big Five inventories supported the existence of a five-factor structure in childhood, we expected to find the general structure of five factors across the entire considered age span, using the short Big Five inventory. Besides establishing the Big Five structure, the second aim was to investigate the comparability of the personality trait metric across different age groups (Aim 2). Previous research has found early indications of similarity for longer inventories in childhood (Allik et al., 2004; McCrae et al., 2002; Soto et al., 2008), pointing to a loading pattern that is somewhat more ambiguous. For short inventories, only one study has investigated comparability from adolescence to old age (Marsh et al., 2013); it indicated comparable loading patterns as well as widely comparable intercepts. We therefore analyzed how far the personality metric is comparable when considering childhood, too. Finally, in a third step, we addressed the comparability of the Big Five interrelations (Aim 3). Previous studies have demonstrated substantial correlations between personality factors in childhood (Allik et al., 2004; McCrae et al., 2002) as well as in adulthood (Ashton et al., 2009). Furthermore, relations between the Big Five factors varied by age from adolescence to late adulthood (Marsh et al., 2013). We likewise expected to find variability in correlational patterns from childhood to late adulthood.
Method
Sample and Participants
We used two samples, one of children and one of adults, covering an age range from 11 to 84 years. The sample of children was based on data from the KEGS project (development of competencies in primary school, Fuchs & Brunner, 2014) and included N = 1,090 sixth graders (52% female, age: M = 11.87, SD = 0.56, Mdn = 12, Range = 11–14 years) from 68 randomly drawn primary schools in the German federal state of Brandenburg in 2011. Trained test administrators administered the survey. The students filled in the questionnaires in their classrooms on their own. The sample of adults was from the Socio-Economic Panel in Germany (Socio-Economic Panel [SOEP], 2016; Wagner, Frick, & Schupp, 2007). A representative survey of households in Germany, the SOEP, includes questions regarding the economic situation of household members, as well as questions about psychosocial life conditions. Households were chosen using a multistage randomized sampling strategy. The sample included N = 18,789 adults (53% female, age: M = 51.09, SD = 17.42; Mdn = 52, Range: 17–84 years). Data were collected in 2013 primarily via online or paper-pencil surveys. Trained interviewers personally surveyed about 15% of the adult respondents.
Instrument
In both samples, we used the Big Five Inventory-SOEP Version (BFI-S; Gerlitz & Schupp, 2005; Lang, 2005) to assess personality. The BFI-S is a self-report inventory, originally based on a German translation of the BFI-44 by John et al. (1991). This short version was developed for the survey design of the SOEP. For information about reliabilities based on evaluations of the BFI-S within SOEP and convergent validities, see Gerlitz and Schupp (2005), Hahn, Gottschling, and Spinath (2012) as well as Lang (2005). The BFI-S assesses four of the personality traits – conscientiousness, extraversion, agreeableness, and neuroticism – with three items each (including one reverse-coded item for each). Due to heterogeneity of openness to experience, there are four items for this fifth trait (Lang, 2005), none of which are reverse-coded. For original item wording, see Table 2. Respondents rate all 16 items on a 7-point Likert scale from 1 (= doesn’t apply at all) to 7 (= applies perfectly). We estimated reliabilities of scale values for the full sample using the model-based reliability index ω (McDonald, 1999). Analyses showed low to satisfactory values for children and adults, respectively: ω (conscientiousness) = .69/.63, ω (agreeableness) = .75/.69, ω (extraversion) = .49/.68, ω (openness to experience) = .76/.68, and ω (neuroticism) = .57/.67.
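As an illustration of the model-based reliability index used here, the following R sketch computes McDonald's ω for a single three-item scale from a one-factor CFA fitted with the lavaan package; the data and loadings are simulated placeholders, not the KEGS or SOEP data.

```r
library(lavaan)

# Sketch: McDonald's omega for one three-item scale from a single-factor CFA (simulated data)
set.seed(42)
n      <- 500
lambda <- c(.7, .6, .5)                              # hypothetical standardized loadings
eta    <- rnorm(n)
items  <- sapply(lambda, function(l) l * eta + rnorm(n, sd = sqrt(1 - l^2)))
dat    <- setNames(as.data.frame(items), c("C1", "C2r", "C3"))

fit <- cfa("consc =~ C1 + C2r + C3", data = dat, std.lv = TRUE)
std <- lavInspect(fit, what = "std")                 # standardized loadings and residual variances
l   <- std$lambda[, 1]
omega <- sum(l)^2 / (sum(l)^2 + sum(diag(std$theta)))  # omega = (sum lambda)^2 / ((sum lambda)^2 + sum theta)
omega
```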
Statistical Approach Previous studies most frequently used exploratory factor analyses or principal component analyses to examine the Big Five factor structure, while confirmatory factor analyses (CFAs) often failed to establish the assumed factor structure (Church & Burke, 1994; Vassend & Skrondal, 1997). Researchers often attributed this to the strong assumption of simple structure within the independence cluster model (ICM) of a confirmatory approach (Marsh et al., 2010). Researchers often modeled Big Five domains separately to avoid problems that arose while investigating the Big Five factor structure by applying an ICM (e.g., Lucas & Donnellan, 2011; Specht et al., 2011). However, Asparouhov and Muthén (2009) proposed a combined approach based on exploratory rotation principles and on structural equation modeling to investigate comparability of psychometric properties across the lifespan (exploratory structural equation modeling, ESEM). Within the measurement model, the exploratory part overcomes restrictions of zero crossloadings, while flexibility of structural equation modeling allows researchers to test models directly (e.g., measurement invariance in multigroup models). In our study, we used a cross-validation strategy combining ESEM (see Figure 1 for a schematic ESEM model) with multigroup mean and covariance structure (MGMCS) measurement invariance testing within the CFA framework. We therefore split our sample into two halves and created a multigroup ESEM model to identify cross-loadings in one half sample (n = 9,968). We then included all statistically significant (p < .01) non-zero cross-loadings in a CFA model1 and investigated measurement invariance using the other half sample (n = 9,911). We constructed four age groups: late childhood (11–14 years, n = 547), early adulthood (17– 30 years, n = 1,507), middle adulthood (31–60 years, n = 4,753), and late adulthood (61–84 years, n = 3,104). For comparability to the study of Marsh et al. (2013), we used a similar age categorization system. As a check of robustness and in order to deal with problems in breaking down a continuous variable such as age into discrete clusters, we reestimated our models using varying age-clusters and additionally applied a continuous modeling approach (local structural equation modeling, LSEM; Hildebrandt, Lüdtke, Robitzsch, Sommer, & Wilhelm, 2016; Hildebrandt, Wilhelm, & Robitzsch, 2009) to describe courses of loadings 1
and intercepts across the age range. Results from all additional analyses are in the Electronic Supplementary Material, ESM 1. To examine the factor structure of personality across the considered age range (Aim 1), we evaluated a configural CFA model using well-established model fit criteria (CFI > .95–.97; RMSEA < .05–.08, SRMR < .05–.10, and AIC; Hu & Bentler, 1999; Schermelleh-Engel, Moosbrugger, & Müller, 2003). To test measurement invariance of personality assessment from late childhood to late adulthood (Aims 2 and 3), we specified increasingly restrictive multigroup CFA models. For this purpose, we tested metric invariance (equal loadings across groups), scalar invariance (adding equal intercepts across groups), and structural invariance (adding equal variances and covariances of latent factors across groups) against each other across all age groups (Widaman & Reise, 1997). As we cannot assume that unsystematic error influences are the same across age groups, we did not test for strict invariance (equal residual variances; Little, 2013). We also allowed residual variances of negatively worded items to correlate between personality domains in all models. Although a priori correlated error variances should only be specified if there is a substantive rationale for doing so (Marsh et al., 2013), we assumed that this was the case in our study. Previous studies have often found systematic response tendencies to reverse-coded or negatively worded items for children (Marsh, 1986) and also for adults (Rammstedt & Farmer, 2013). Therefore, the similarity of responses to negatively worded items of different factors may simply result from their modified phrasing. We identified scales of latent variables by fixing the variance of the first group to one and the mean to zero. We evaluated the specified models using changes in model fit criteria like CFI, RMSEA, and SRMR. With regard to the invariance testing of loadings, according to Chen (2007) and Cheung and Rensvold (2002), a nonsignificant model deterioration is indicated by a decrease of less than .010 in CFI, an increase of less than .015 in RMSEA, or an increase of less than .030 in SRMR. With respect to the invariance of item intercepts, in addition to the same cutoffs for CFI and RMSEA, the SRMR should not increase by more than .010. We tested partial invariance of parameters when the model fit deterioration was significant according to these rules. We then checked modification indices (Lagrange multipliers), freed the equality constraint with the highest value, and fit the model again. We continued doing so until model fit deterioration remained within acceptable ranges according to Chen (2007). We also report scaled w2 difference tests for the sake of
completeness but did not use them for model evaluation mainly because of their well-known sensitivity to trivial differences between specified models and empirical data. We used the full information maximum likelihood (FIML) approach with a robust maximum likelihood estimator (MLR) to account for both missing data and significant skew and kurtosis of item responses in all age groups. To account for possible biases resulting from nested data structure (students nested in schools, adults nested in households), we performed additional analyses. The pattern of results of these analyses was comparable to results derived without considering clustering of data and is therefore not reported in detail. We used Mplus 7.4 (Muthén & Muthén, 1998–2015) for ESEM modeling and R (R Core Team, 2016) for CFA models, LSEM, and descriptive results.
1 Except for the cross-loading of item A3 to conscientiousness because this resulted in a negative residual variance of that item.
Figure 1. Schematic illustration of an exploratory structural equation model (ESEM).
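The following R sketch illustrates the increasingly restrictive multigroup sequence described above (configural, metric, scalar, and partial scalar models) with the lavaan package. It uses lavaan's built-in HolzingerSwineford1939 demonstration data and its school grouping variable as stand-ins for the BFI-S items and the four age groups, so only the general syntax is shown here; this is not the authors' ESM code.

```r
library(lavaan)

# Three-factor demo model on lavaan's built-in data; stands in for the five-factor BFI-S model
model <- "
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
"

fit_configural <- cfa(model, data = HolzingerSwineford1939, group = "school")
fit_metric     <- cfa(model, data = HolzingerSwineford1939, group = "school",
                      group.equal = "loadings")
fit_scalar     <- cfa(model, data = HolzingerSwineford1939, group = "school",
                      group.equal = c("loadings", "intercepts"))

# Compare CFI, RMSEA, and SRMR across the nested models (cf. the cutoffs of Chen, 2007)
sapply(list(configural = fit_configural, metric = fit_metric, scalar = fit_scalar),
       fitMeasures, fit.measures = c("cfi", "rmsea", "srmr"))

# Partial scalar invariance: free a single intercept flagged by modification indices
fit_partial <- cfa(model, data = HolzingerSwineford1939, group = "school",
                   group.equal = c("loadings", "intercepts"),
                   group.partial = "x3~1")
```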
Results
The first aim of our study was to investigate whether personality traits from late childhood to late adulthood can be described with the Big Five model. As Table S1 (see ESM 1) shows, we found first descriptive indications of the correlational pattern of personality items. In both samples (KEGS and SOEP), there were significant relations between items of a theoretically intended personality domain, but we also found significant, albeit mostly smaller, loadings of items to theoretically unintended personality domains. (Tables S2 and S3 in ESM 1 show correlations of personality items in adult subgroups.) As a basis for cross-validation within the CFA framework, we first estimated a configural ESEM model, which fits the data well with χ2(167) = 601.86, p < .001; AIC = 529454.23, CFI = .982, RMSEA = .032, 90% CI [.030, .035], SRMR = .014 (see Table S4 in ESM 1 for loadings of the ESEM model). To gain more insight into the results in the context of our first aim, we evaluated the model fit of the configural CFA model with five factors based on the ESEM model (see measurement section). The configural CFA model fits the data well across the whole age span (see Table 1). We improved the model fit when we allowed residual variances of negatively worded items to correlate (model without residual correlations: Δχ2(Δdf = 36) = 543.89, p < .001; CFI = .960, RMSEA = .047, 90% CI [.044, .049], SRMR = .023). Finally, we checked whether residual correlations between negatively worded items were comparable across the considered age range by fixing them
Table 1. Measurement Invariance of Multigroup CFA Models from Ages 11 to 84
Model 1 (configural): χ2 = 722.50, df = 201, p < .001, CFI = .981, RMSEA = .035 [.033; .038], SRMR = .016, AIC = 524924.57
Model 2 (metric): χ2 = 1006.59, df = 342, p < .001, CFI = .975, ΔCFI = .006, RMSEA = .031 [.029; .033], ΔRMSEA = .004, SRMR = .024, ΔSRMR = .008, AIC = 524995.81, Δχ2(Δdf) = 287.51 (141)*** [vs. Model 1]
Model 3 (scalar): χ2 = 2044.54, df = 375, p < .001, CFI = .939, ΔCFI = .036, RMSEA = .046 [.044; .048], ΔRMSEA = .015, SRMR = .032, ΔSRMR = .008, AIC = 526140.97, Δχ2(Δdf) = 1258.00 (33)*** [vs. Model 2]
Model 3a (item C2r intercept free, age group 17–30): χ2 = 1772.52, df = 374, p < .001, CFI = .948, ΔCFI = .027, RMSEA = .042 [.040; .044], ΔRMSEA = .013, SRMR = .030, ΔSRMR = .006, AIC = 525823.34, Δχ2(Δdf) = 910.78 (32)*** [vs. Model 2]
Model 3b (item O2 intercept free, age group 17–30): χ2 = 1654.48, df = 373, p < .001, CFI = .953, ΔCFI = .022, RMSEA = .041 [.039; .041], ΔRMSEA = .010, SRMR = .029, ΔSRMR = .005, AIC = 525686.56, Δχ2(Δdf) = 764.77 (31)*** [vs. Model 2]
Model 3c (item O4 intercept free, age group 11–14): χ2 = 1575.34, df = 372, p < .001, CFI = .956, ΔCFI = .019, RMSEA = .039 [.037; .041], ΔRMSEA = .008, SRMR = .029, ΔSRMR = .005, AIC = 525595.79, Δχ2(Δdf) = 667.25 (30)*** [vs. Model 2]
Model 3d (item A2 intercept free, age group 11–14): χ2 = 1504.25, df = 371, p < .001, CFI = .958, ΔCFI = .017, RMSEA = .038 [.036; .040], ΔRMSEA = .007, SRMR = .028, ΔSRMR = .004, AIC = 525513.24, Δχ2(Δdf) = 585.05 (29)*** [vs. Model 2]
Model 3e (item C2r intercept free, age group 61–84): χ2 = 1436.98, df = 370, p < .001, CFI = .961, ΔCFI = .014, RMSEA = .037 [.035; .039], ΔRMSEA = .006, SRMR = .028, ΔSRMR = .004, AIC = 525436.77, Δχ2(Δdf) = 501.01 (28)*** [vs. Model 2]
Model 3f (item A3 intercept free), age group 17–30: χ2 = 1380.92, df = 369, p < .001, CFI = .963, ΔCFI = .012, RMSEA = .036 [.034; .038], ΔRMSEA = .005, SRMR = .027, ΔSRMR = .003, AIC = 252373.49, Δχ2(Δdf) = 431.53 (27)*** [vs. Model 2]
Age group 31–60: χ2 = 1297.40, df = 368, p < .001, CFI = .966, ΔCFI = .009, RMSEA = .035 [.033; .037], ΔRMSEA = .004, SRMR = .027, ΔSRMR = .003, AIC = 525277.74, Δχ2(Δdf) = 329.03 (26)*** [vs. Model 2]
Model 4 (structural, with partial scalar intercepts): χ2 = 1582.97, df = 413, p < .001, CFI = .957, ΔCFI = .009, RMSEA = .037 [.035; .039], ΔRMSEA = .002, SRMR = .039, ΔSRMR = .012, AIC = 525531.88, Δχ2(Δdf) = 284.17 (45)*** [vs. Model 3f]
Notes. Age groups: 11–14 (n = 547), 17–30 (n = 1,507), 31–60 (n = 4,753), 61–84 (n = 3,104). In models 3b–3f, item intercept equality constraints are relaxed in addition to previous relaxed intercepts. CI = Confidence Interval (confidence level = .90).
Table 2. Item Labels and Intercepts (Model 3f, Table 1)
Intercepts per age group (11–14 / 17–30 / 31–60 / 61–84); a single value indicates an intercept that is invariant across groups.
C1 . . . does a thorough job: 6.15
C2r . . . tends to be lazy: 5.56 / 4.84 / 5.56 / 5.86
C3 . . . does things efficiently: 5.79
A1r . . . is sometimes rude to others: 5.24
A2 . . . has a forgiving nature: 6.06 / 5.53 / 5.53 / 5.53
A3 . . . is considerate and kind to almost everyone: 5.81 / 6.19 / 6.05 / 5.81
E1 . . . is talkative: 5.52
E2 . . . is outgoing, sociable: 5.13
E3r . . . is reserved: 3.69
N1 . . . worries a lot: 4.24
N2 . . . gets nervous easily: 3.65
N3r . . . is relaxed, handles stress well: 3.36
O1 . . . is original, comes up with new ideas: 4.56
O2 . . . values artistic, aesthetic experiences: 4.37 / 3.83 / 4.37 / 4.37
O3 . . . has an active imagination: 4.80
O4 . . . is curious: 4.76 / 5.49 / 5.49 / 5.49
Note. Group-specific intercepts are reported for freed intercepts. r = reverse-coded item. C = Conscientiousness, A = Agreeableness, E = Extraversion, N = Neuroticism, O = Openness to Experience.
to equality. Model fit suggested comparability (χ2(228) = 666.17, p < .001, CFI = .981, RMSEA = .033, 90% CI [.030, .036], SRMR = .017, range of residual correlations: .31 to .21).
Furthermore, our second aim was to test the comparability of the personality trait metric across the age span. Therefore, we tested metric and scalar invariance. Results demonstrated comparable measurement properties of the
Table 3. Loadings of BFI-S Personality Items to Big Five Factors: Results of the Partial Scalar Invariance Model (Model 3f, Table 1)
Primary loadings (λ) on the intended factor:
Conscientiousness: C1 = .742***, C2r = .680***, C3 = .687***
Agreeableness: A1r = 1.024***, A2 = .439***, A3 = .750***
Extraversion: E1 = .981***, E2 = 1.088***, E3r = .932***
Neuroticism: N1 = .885***, N2 = 1.241***, N3r = .725***
Openness to Experience: O1 = .882***, O2 = .985***, O3 = 1.081***, O4 = .675***
Cross-loadings included in the model (item and unstandardized estimate): A1r .250***, E2 .063**, A2 .245***, C1 .007, E1 .195***, E3r .619***, A3 .185***, C3 .035, C2r .243***, E3r .002, N3r .206***, C2r .082*, O1 .144, C3 .189***, N1 .247***, O1 .340***, O1 .138***, O2 .365***, A1r .304***, O1 .281***, O2 .196***, O3 .315**, E1 .168, O2 .105, O4 .053*, O4 .027, E2 .099, O3 .043, E3r .275, O4 .220***, N3r .382***
Notes. Unstandardized estimates are reported because equality constraints are based on unstandardized parameters. The standardized solution showed no loading greater than .3 of an item to a theoretically unintended factor in any age group except one loading from E3r to agreeableness across all age groups (range of standardized loadings from E3r to A: λ = .404 to .342). C = Conscientiousness, A = Agreeableness, E = Extraversion, N = Neuroticism, O = Openness to Experience; r = reverse-coded item. ***p < .001, **p < .01, *p < .05.
BFI-S across all age groups. In particular, we confirmed partial scalar invariance across age groups (Table 1): The pattern of loadings, their amount, and the majority (11 out of 16) of item intercepts were comparable across all age groups. Considering modification indices, we found five intercepts in total (C2r, O2, O4, A2, A3), which varied between groups (Table 2). Intercepts of items C2r, O2, and A3 were not invariant in early adulthood (17–30 years), A3 was also not invariant in middle adulthood (31–60 years) and C2r was also not invariant in late adulthood (61– 84 years). The intercepts of items O4 and A2 only needed to be freed from equality constraints in childhood (11– 14 years). The factor loadings of the resulting model (Table 3) confirmed the theoretically intended pattern: We found the highest loadings of items to their intended domains. The only exception was one significant crossloading of the reverse-coded item of extraversion (“is reserved”) to agreeableness (λ = -.619). The robustness checks (see Tables S5 and S6 in ESM 1) supported our findings. Results stayed the same across different age categorizations demonstrating partial scalar invariance (except for agreeableness, which showed only one invariant item across age groups). Figures S1–S3 in ESM 1 show patterns of unstandardized loadings, item intercepts, and fit indices as a function of age. Moreover, our third aim was to test the correlational pattern of personality factors for equality (structural measurement invariance). Descriptive results suggested somewhat comparable relations of Big Five factors across age groups, but we also discovered differences (see also Figure 2). Although we found significant correlations between Big Five personality traits in all age groups, they were more pronounced in childhood. We also found mixed results regarding the model fit of the structural invariance model
(Model 4 in Table 1). The reduction of model fit due to invariant factor correlations was acceptable with respect to most model fit criteria (ΔCFI = .009, ΔRMSEA = .002). The SRMR, however, increased more strongly (.012) than is considered acceptable by Chen (2007).
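For completeness, the structural-invariance step (Model 4), in which latent variances and covariances are additionally constrained to equality across groups, can be sketched in the same way; as in the earlier sketch, lavaan's built-in demonstration data stand in for the BFI-S and the age groups.

```r
library(lavaan)

# Structural invariance step: loadings, intercepts, and latent (co)variances equal across groups
model <- "
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
"

fit_structural <- cfa(model, data = HolzingerSwineford1939, group = "school",
                      group.equal = c("loadings", "intercepts",
                                      "lv.variances", "lv.covariances"))
fitMeasures(fit_structural, fit.measures = c("cfi", "rmsea", "srmr"))
```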
Discussion In this article, our central aim was to investigate the psychometric properties of the BFI-S from late childhood to late adulthood. We built on existing knowledge of psychometric properties indicated by short personality inventories across the lifespan and broadened it by adding children. One first aim was to investigate whether the adult Big Five factor structure is also observable in late childhood. Results from a short BFI suggest that the overall structure of five factors is observable from late childhood to late adulthood. These results are in line with evidence from longer personality inventories that show the Big Five are already in place in childhood (Allik et al., 2004; McCrae et al., 2002; Soto et al., 2008). Using ESEM, we found item loadings to their theoretically intended personality factor but also additional cross-loadings to unintended domains. Thus, the finding of an incomplete simple structure is in keeping with the difficulties of representing the Big Five factor structure with confirmatory factor analyses (Church & Burke, 1994; Vassend & Skrondal, 1997). However, the ESEM analyses demonstrated that these results are not specific to childhood and similarly show across the entire considered age range. These results could be cross-validated within the CFA framework. Due to its brevity, the BFI-S is not intended to cover a facet structure of personality. In line
Figure 2. Correlation patterns of Big Five latent factors in all age groups (results of Model 3f, Table 1); separate panels show ages 11 to 14, 17 to 30, 31 to 60, and 61 to 84 years.
with theoretical assumptions of personality, simple structure is first and foremost expected between facets and latent factors (Costa & McCrae, 1995). We therefore recommend using an ESEM-based approach with cross-validation in the CFA framework when assessing personality with short inventories and predicting external criteria or identifying cross-loadings between domains. Especially when evaluating the joint Big Five model, an ESEM-based procedure within the CFA context could take cross-loadings into account and control for them between groups. Regarding the second aim of this study, we tested the equivalence of the BFI-S metric for children and adults.
We found that the psychometric properties of the measurement model were comparable between groups for the loadings and the vast majority of item intercepts. The group of early adulthood (17–30 years) was responsible for the most non-invariant intercepts (with three non-invariant intercepts), followed by the group of late childhood with two non-invariant intercepts whereas for middle and late adulthood, only one intercept differed statistically from the other age groups. From the constructs’ perspective, regarding extraversion and neuroticism, full scalar invariance could be established. Concerning openness and conscientiousness, equality constraints had to be relaxed in two groups
for one item each. In terms of agreeableness, equality constraints had to be relaxed in three groups for one item each. Therefore, partial scalar invariance could be established for openness, conscientiousness, and agreeableness. This implies that the BFI-S allows a comparison of construct means from late childhood to late adulthood with partial scalar invariance. Non-invariance of item intercepts could reflect that some behavior or experiences are less frequent in different age groups or have a different meaning in these groups, leading to different item responses. The valence of personality traits may differ along individuals’ developmental pathways, as demonstrated by the varying valences of self-concepts, goals, and priorities in life, as well as the achievement of developmental tasks and the need for adaptability (Baltes, Lindenberger, & Staudinger, 2006). For example, the importance of being hardworking or diligent may change across the lifespan. In particular, being retired may change the relevance of conscientiousness in older age groups, especially compared with the concepts younger adults hold at the beginning of their professional careers. In line with this assumption, Specht and colleagues (2011) reported evidence that conscientiousness wanes after people retire. Hence, the intercept of the item “tends to be lazy” differed for older and younger adults. On more methodological grounds, non-invariant items of a construct imply the measurement of fewer items on the same scale and therefore a decrease in the reliability of estimated means because personality domains then rely on fewer comparable items (Steenkamp & Baumgartner, 1998). However, recent studies have suggested that, in particular, non-invariance in both metric and scalar parts of the model may significantly bias results (Guenole & Brown, 2014). Nevertheless, future research needs to assess more thoroughly the conditions under which partial scalar invariance has meaningful consequences for mean or variance comparisons (Putnick & Bornstein, 2016). The result of an age-varying intercept is again not specific to childhood and has also appeared in joint investigations of adolescents and adults (e.g., Marsh et al., 2013). To conclude, while using the BFI-S, researchers could investigate relations of personality with other variables and compare between groups. Moreover, this also allows for an examination of personality development across the life course. Our study therefore adds to existing evidence by demonstrating the comparability of psychometric properties across ages from late childhood to late adulthood. To address our third aim, we studied the correlational pattern of the Big Five in different age groups. The global model fit of the structural invariance model was mostly satisfactory. For direct model comparisons, we found mixed results. The SRMR clearly increased, which is in the first instance an indicator for wrongly specified latent factor correlations (Hu & Bentler, 1998). We also found differences in factor relations between groups in the descriptive results. In particular, factor relations were smallest for early adulthood. Previous work has referred to less distinct personality factors in childhood (Allik et al., 2004; Soto et al., 2008). The study by Marsh et al. (2013) likewise failed to establish structural invariance from adolescence to late adulthood. With regard to how researchers might use BFI-S-assessed personality traits to predict and explain external criteria, less distinct factors are not problematic or less predictive per se (Booth & Hughes, 2014). However, depending on the research question of interest, researchers should consider interrelations of Big Five factors. Our study has several strengths: We considered a broad age range, and age groups were immediately comparable in one joint model. Furthermore, we used various modeling procedures to increase confidence in our results. On the other hand, some aspects also limit our approach: Our sample provided responses to the BFI-S using different methods (interview or questionnaire) and in different contexts (in classrooms for KEGS and individually or on a computer for SOEP). Although previous work has shown BFI-S to be invariant across different assessment methods including individual, assisted, or computer-based assessment in adulthood (Lang, John, Lüdtke, Schupp, & Wagner, 2011), we do not know whether the diverse set of methods impacted our results. Therefore, our results could be seen as lower estimates of comparability, and equivalence might be improved if assessment methods did not vary. Conclusively, our study revealed evidence for the comparability of the personality trait metric of the Big Five from late childhood to late adulthood. We demonstrated this with the BFI-S, a short Big Five inventory originally developed for adults. Therefore, the BFI-S is an inventory that encourages analyzing research questions regarding personality development, its antecedents, correlations, and consequences across a very broad age range or even lifespan.
Acknowledgments
We are grateful to the editor and an anonymous reviewer for valuable feedback. We thank Johannes Hartig for helpful comments on this project. Furthermore, we thank Holly Painter for editorial assistance.

Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000490
ESM 1. Descriptive Results and Robustness Checks (.docx). Descriptives of BFI-S items, loadings of the configural ESEM model, and robustness checks of results with different age categorizations and LSEM.
ESM 2. Data (.docx). R code for MGMCS measurement invariance testing.
ESM 3. Data (.out). Mplus code for a configural ESEM model.

References
Allik, J., Laidra, K., Realo, A., & Pullmann, H. (2004). Personality development from 12 to 18 years of age: Changes in mean levels and structure of traits. European Journal of Personality, 18, 445–462. https://doi.org/10.1002/per.524 Anglim, J., & Grant, S. (2016). Predicting psychological and subjective well-being from personality: Incremental prediction from 30 facets over the Big 5. Journal of Happiness Studies, 17, 59–80. https://doi.org/10.1007/s10902-014-9583-7 Asendorpf, J. B., & van Aken, M. A. G. (2003). Validity of Big Five personality judgments in childhood: A 9 year longitudinal study. European Journal of Personality, 17, 1–17. https://doi.org/ 10.1002/per.460 Ashton, M. C., Lee, K., Goldberg, L. R., & DeVries, R. E. (2009). Higher order factors of personality: Do they exist? Personality and Social Psychology Review, 13, 79–91. https://doi.org/ 10.1177/1088868309338467 Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling, 16, 397– 438. https://doi.org/10.1080/10705510903008204 Baltes, P. B., Lindenberger, U., & Staudinger, U. M. (2006). Life span theory in developmental psychology. In W. Damon & R. M. Lerner (Eds.), Handbook of Child psychology (pp. 569–664). New York, NY: Wiley. Biesanz, J. C., & West, S. G. (2004). Towards understanding assessments of the Big Five: Multitrait-multimethod analyses of convergent and discriminant validity across measurement occasion and type of observer. Journal of Personality, 72, 845– 876. https://doi.org/10.1111/j.0022-3506.2004.00282.x Bleidorn, W., & Ostendorf, F. (2009). Ein Big Five-Inventar für Kinder und Jugendliche [A Big Five inventory for children and adolescents: The German version of the Hierarchical Personality Inventory for Children (HiPIC)]. Diagnostica, 55, 160–173. https://doi.org/10.1026/0012-1924.55.3.160 Booth, T., & Hughes, D. J. (2014). Exploratory structural equation modeling of personality data. Assessment, 21, 260–271. https://doi.org/10.1177/1073191114528029 Caliendo, M., Fossen, F., & Kritikos, A. S. (2014). Personality characteristics and the decisions to become and stay selfemployed. Small Business Economist, 42, 787–814. https://doi. org/10.1007/s11187-013-9514-8 Caspi, A., & Shiner, R. L. (2006). Personality development. In W. Damon, R. Lerner, & N. Eisenberg (Eds.), Handbook of child psychology (pp. 300–364). New York, NY: Wiley. Caspi, A., Roberts, B. W., & Shiner, R. L. (2005). Personality development: Stability and change. Annual Review of Psychology, 56, 453–484. https://doi.org/10.1146/annurev.psych.55. 090902.141913 Chang, L., Connelly, B. S., & Geeza, A. A. (2012). Separating method factors and higher order traits of the Big Five: A metaanalytic multitrait-multimethod approach. Journal of Personality and Social Psychology, 102, 408–426. https://doi.org/ 10.1037/a0025559 Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14, 464–504. https://doi.org/10.1080/10705510701301834
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-offit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. https://doi.org/10.1207/ S15328007SEM0902_5 Church, A. T., & Burke, P. J. (1994). Exploratory and confirmatory tests of the Big Five and Tellegen’s three- and four-dimensional models. Journal of Personality and Social Psychology, 66, 93–114. https://doi.org/10.1037/0022-3514.66.1.93 Costa, P. T. Jr., & McCrae, R. R. (1992). NEO PI-R and NEO-FFI professional manual. Odessa, FL: Psychological Assessment Resources. Costa, P. T. Jr., & McCrae, R. R. (1997). Longitudinal stability of adult personality. In R. Hogan, J. A. Johnson, & S. R. Briggs (Eds.), Handbook of personality psychology (pp. 269–290). San Diego, CA: Academic Press. https://doi.org/10.1016/B978012134645-4/50012-3 Costa, P. T. Jr., & McCrae, R. R. (1995). Domains and facets: Hierarchical personality assessment using the Revised NEO Personality Inventory. Journal of Personality Assessment, 64, 21–50. https://doi.org/10.1207/s15327752jpa6401_2 DeFruyt, F., Mervielde, I., Hoekstra, H. A., & Rolland, J.-P. (2000). Assessing adolescent’s personality with the NEO-PI-R. Assessment, 7, 329–345. https://doi.org/10.1177/ 107319110000700403 DePauw, S. S. W., Mervielde, I., & Van Leeuwen, K. G. (2009). How are traits related to problem behavior in preschoolers? Similarities and contrasts between temperament and personality. Journal of Abnormal Child Psychology, 37, 309–325. https://doi. org/10.1007/s10802-008-9290-0 DeYoung, C. G. (2006). Higher-order factors of the Big Five in a multi-informant sample. Journal of Personality and Social Psychology, 91, 1138–1151. https://doi.org/10.1037/00223514.91.6.1138 Fuchs, G., & Brunner, M. (2014). Kompetenzentwicklung und Schulqualität an Brandenburger Grundschulen: Abschlussbericht der KEGS-Studie [Development of competencies in primary school: Final report of the KEGS study]. Berlin, Germany: ISQ. Gerlitz, Y., & Schupp, J. (2005). Assessment of Big Five personality characteristics in the SOEP (DIW Research Notes 4). Berlin, Germany: German Institute of Economic Research. Guenole, N., & Brown, A. (2014). The consequences of ignoring measurement invariance for path coefficients in structural equation models. Frontiers in Psychology, 5, 980. https://doi. org/10.3389/fpsyg.2014.00980 Hahn, E., Gottschling, J., & Spinath, F. M. (2012). Short measurements of personality – Validity and reliability of the GSOEP Big Five Inventory (BFI-S). Journal of Research in Personality, 46, 355–359. https://doi.org/10.1016/j.jrp.2012.03.008 Hildebrandt, A., Lüdtke, O., Robitzsch, A., Sommer, C., & Wilhelm, O. (2016). Exploring factor model parameters across continuous variables with local structural equation models. Multivariate Behavioral Research, 51, 257–258. https://doi.org/10.1080/ 00273171.2016.1142856 Hildebrandt, A., Wilhelm, O., & Robitzsch, A. (2009). Complementary and competing factor analytic approaches for the investigation of measurement invariance. Review of Psychology, 16, 87–102. https://doi.org/10.1037/t27207-000 Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424–453. https://doi.org/ 10.1037/1082-989X.3.4.424 John, O. P., Caspi, A., Robins, R. W., Moffitt, T. E., & StouthamerLoeber, M. (1994). The “little five”: Exploring the nomological network of the five-factor model of personality in adolescent
boys. Child Development, 65, 160–178. https://doi.org/10.2307/ 1131373 John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory–Versions 4a and 54. Berkeley, CA: University of California, Berkeley. John, O. P., Naumann, L. P., & Soto, C. J. (2008). Paradigm shift to the integrative Big Five Taxonomy. History, measurement, and conceptual issues. In O. P. John, R. W. Robins, & L. A. Pervin (Eds.), Handbook of personality (pp. 114–158). New York, NY: Guilford Press. John, O. P., & Srivastava, S. (1999). The Big Five trait taxonomy: History, measurement, and theoretical perspectives. In L. A. Pervin & O. P. John (Eds.), Handbook of personality (pp. 102– 139). New York, NY: Guilford Press. Lang, F. R. (2005). Erfassung des kognitiven Leistungspotenzials und der “Big Five” mit Computer-Assisted-Personal-Interviewing (CAPI): Zur Reliabilität und Validität zweier ultrakurzer Tests und des BFI-S [Assessment of cognitive capabilities and the Big Five with Computer-Assisted Personal Interviewing (CAPI): Reliability and validity] (DIW Research Notes 9). Berlin Germany: German Institute of Economic Research. Lang, F. R., John, D., Lüdtke, O., Schupp, J., & Wagner, G. G. (2011). Short assessment of the Big Five: Robust across survey methods except telephone interviewing. Behavior Research Methods, 43, 548–567. https://doi.org/10.3758/s13428-0110066-z Little, T. D. (2013). Longitudinal structural equation modeling. New York, NY/London, UK: Guilford Press. Lucas, R. E., & Donnellan, M. B. (2011). Personality development across the life span: Longitudinal analyses with a national sample from Germany. Journal of Personality and Social Psychology, 101, 847–861. https://doi.org/10.1037/a0024298 Lüdtke, O., Roberts, B. W., Trautwein, U., & Nagy, G. (2011). A random walk down university avenue: Life paths, life events, and personality trait change at the transition to university life. Journal of Personality and Social Psychology, 101, 620–637. https://doi.org/10.1037/a0023743 Marsh, H. W. (1986). Negative item bias in rating scales for preadolescent children: A cognitive-developmental phenomena. Developmental Psychology, 22, 37–49. https://doi.org/ 10.1037/0012-1649.22.1.37 Marsh, H. W., Lüdtke, O., Muthén, B., Asparouhov, T., Morin, A. J. S., Trautwein, U., & Nagengast, B. (2010). A new look at the big five factor structure through exploratory structural equation modeling. Psychological Assessment, 22, 471–491. https://doi. org/10.1037/a0019227 Marsh, H. W., Nagengast, B., & Morin, A. J. S. (2013). Measurement invariance of big-five factors over the life span: ESEM tests of gender, age, plasticity, maturity, and la dolce vita effects. Developmental Psychology, 49, 1194–1218. https://doi. org/10.1037/a0026913 McCrae, R. R., & Costa, P. T. Jr. (2008). The five-factor theory of personality. In O. P. John, R. W. Robins, & L. A. Pervin (Eds.), Handbook of personality (pp. 159–181). New York, NY: Guilford Press. McCrae, R. R., Costa, P. T. Jr., Terracciano, A., Parker, W. D., Mills, C. J., DeFruyt, F., & Mervielde, I. (2002). Personality trait development from age 12 to age 18: Longitudinal, crosssectional and cross-cultural analyses. Journal of Personality and Social Psychology, 83, 1456–1468. https://doi.org/10.1037/ 0022-3514.83.6.1456 McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum. Measelle, J. R., John, O. P., Ablow, J. C., Cowan, P. A., & Cowan, C. P. (2005). Can children provide coherent, stable, and valid self-reports on the Big Five dimensions? 
A longitudinal study
from ages 5 to 7. Journal of Personality and Social Psychology, 89, 90–106. https://doi.org/10.1037/0022-3514.89.1.90 Milfont, T. L., & Fischer, R. (2010). Testing measurement invariance across groups: Apllications in cross-cultural research. International Journal of Psychological Research, 3, 111–121. https://doi.org/10.21500/20112084.857 Muthén, L. K., & Muthén, B. O. (1998–2015). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén. Poropat, A. E. (2009). A meta-analysis of the five-factor model of personality and academic performance. Psychological Bulletin, 135, 322–338. https://doi.org/10.1037/a0014996 Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41, 71–90. https://doi.org/10.1016/j.dr.2016.06.004 R Core Team. (2016). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Rammstedt, B., & Farmer, R. F. (2013). The impact of acquiescence on the evaluation of personality structure. Psychological Assessment, 25, 1137–1145. https://doi.org/10.1037/a0033323 Roberts, B. W., & DelVecchio, W. F. (2000). The rank-order consistency of personality traits from childhood to old age: A quantitative review of longitudinal studies. Psychological Bulletin, 126, 3–25. https://doi.org/10.1037/0033-2909.126.1.3 Roberts, B. W., Walton, K. E., & Viechtbauer, W. (2006). Patterns of mean-level change in personality traits across the life course: A meta-analysis of longitudinal studies. Psychological Bulletin, 132, 1–25. https://doi.org/10.1037/0033-2909.132.1.1 Rothbart, M. K., Ahadi, S. A., & Evans, D. E. (2000). Temperament and personality: Origins and outcomes. Journal of Personality and Social Psychology, 78, 122–135. https://doi.org/10.1037// 0022-3514.78.1.122 Schermelleh-Engel, K., Moosbrugger, H., & Müller, H. (2003). Evaluating the fit of structural equation models: Tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research Online, 8, 23–74. https://www. dgps.de/fachgruppen/methoden/mpr-online/ Sirois, F. M., & Hirsch, J. K. (2015). Big Five traits, affect balance and health behaviors: A self-regulation resource perspective. Personality and Individual Differences, 87, 59–64. https://doi. org/10.1016/j.paid.2015.07.031 Socio-Economic Panel (SOEP). (2016). Data for years 1984–2014 version 31. SOEP. https://doi.org/10.5684/soep.v31.1 Soto, C. J., John, O. P., Gosling, S. D., & Potter, J. (2008). The developmental psychometrics of Big Five self-reports: Acquiescence, factor structure, coherence, and differentiation from ages 10 to 20. Journal of Personality and Social Psychology, 94, 718–737. https://doi.org/10.1037/0022-3514.94.4.718 Specht, J., Egloff, B., & Schmukle, S. C. (2011). Stability and change of personality across the life course: The impact of age and major life events on mean-level and rank-order stability of the Big Five. Journal of Personality and Social Psychology, 101, 862–882. https://doi.org/10.1037/a0024950 Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–107. https://doi.org/ 10.1086/209528 van der Linden, D., te Nijenhuis, J., & Bakker, A. B. (2010). The general factor of personality: A meta-analysis of Big Five intercorrelations and a criterion-related validity study. Journal of Research in Personality, 44, 315–327. 
https://doi.org/ 10.1016/j.jrp.2010.03.003 Vassend, O., & Skrondal, A. (1997). Validation of the NEO Personality Inventory and the five-factor model. Can findings from exploratory and confirmatory factor analysis be
reconciled? European Journal of Personality, 11, 147–166. https://doi.org/10.1002/(SICI)1099-0984(199706)11:2<147:: AID-PER278>3.0.CO;2-E Wagner, G. G., Frick, J. R., & Schupp, J. (2007). The German SocioEconomic Panel Study (SOEP)–Scope, evolution and enhancements. Schmollers Jahrbuch, 127, 139–169. https://www.diw. de/sixcms/detail.php?id=diw_02.c.233221.de Widaman, K. F., & Reise, S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In K. J. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281–324). Washington, DC: American Psychological Association. Zentner, M., & Bates, J. E. (2008). Child temperament: An integrative review of concepts, research programs, and measures. European Journal of Developmental Science, 2, 7–37. Ziegler, M., Poropat, A., & Mell, J. (2014). Does the length of a questionnaire matter? Journal of Individual Differences, 35, 250–261. https://doi.org/10.1027/1614-0001/a000147
Received March 9, 2017
Revision received March 28, 2018
Accepted March 31, 2018
Published online September 25, 2018

EJPA Section/Category: Personality
Naemi D. Brandt
German Institute for International Educational Research
Warschauer Straße 34-38
10243 Berlin
Germany
brandt@dipf.de
Multistudy Report
A Meta-Analysis of Test Scores in Proctored and Unproctored Ability Assessments

Diana Steger,1,5 Ulrich Schroeders,2 and Timo Gnambs3,4

1 Bamberg Graduate School of Social Sciences, University of Bamberg, Germany
2 Department of Psychological Assessment, University of Kassel, Germany
3 Educational Measurement, Leibniz Institute for Educational Trajectories, Germany
4 Institute for Education and Psychology, Johannes Kepler University Linz, Austria
5 Department for Individual Differences and Psychological Assessment, Ulm University, Germany
Abstract: Unproctored, web-based assessments are frequently compromised by a lack of control over the participants’ test-taking behavior. It is likely that participants cheat if personal consequences are high. This meta-analysis summarizes findings on context effects in unproctored and proctored ability assessments and examines mean score differences and correlations between both assessment contexts. As potential moderators, we consider (a) the perceived consequences of the assessment, (b) countermeasures against cheating, (c) the susceptibility to cheating of the measure itself, and (d) the use of different test media. For standardized mean differences, a three-level random-effects meta-analysis based on 109 effect sizes from 49 studies (total N = 100,434) identified a pooled effect of Δ = 0.20, 95% CI [0.10, 0.31], indicating higher scores in unproctored assessments. Moderator analyses revealed significantly smaller effects for measures that are difficult to research on the Internet. These results demonstrate that unproctored ability assessments are biased by cheating. Unproctored assessments may be most suitable for tasks that are difficult to search on the Internet. Keywords: meta-analysis, unproctored assessment, cognitive ability, cheating
Recent technological developments have changed the way researchers collect psychological data in general (Miller, 2012) and conduct psychological assessments in particular (Harari et al., 2016). Gathering data outside the laboratory in an unproctored setting, for example, using mobile devices or web-based tests, serves as an ecologically valid (Fahrenberg, Myrtek, Pawlik, & Perrez, 2007) and economical method (Buhrmester, Kwang, & Gosling, 2011) to collect psychological data on large, heterogeneous samples (Gosling, Sandy, John, & Potter, 2010). Therefore, unproctored, web-based testing has become the dominant assessment mode in market and public opinion research (Evans & Mathur, 2005) and is similarly popular in the academic realm (Allen & Seaman, 2014) and in personnel selection (Lievens & Harris, 2003; Tippins, 2011). The advantages of unproctored testing, however, come at a cost: the lack of supervision results in less standardized test-taking conditions and less control over test-takers' behavior (Wilhelm & McKnight, 2002). Therefore, the question arises whether the opportunity for dishonest behavior in unproctored assessments leads to biased scores and threatens the usefulness of these tests (Rovai, 2000; Tippins et al., 2006). To this
end, a meta-analysis is presented that compares scores from proctored and unproctored ability tests across assessment contexts and examines potential moderating influences thereon.
Mode Effects in Ability Assessments

While scores of self-report instruments can be considered equivalent for proctored and unproctored testing (Gnambs & Kaspar, 2017), results for tests of maximal performance are rather inconclusive (Do, 2009): Some studies found no systematic differences between self-selected web samples and traditional lab samples (e.g., Ihme et al., 2009), whereas others reported significantly higher scores for unproctored tests (e.g., Carstairs & Myors, 2009) or, occasionally, for proctored tests (e.g., Coyne, Warszta, Beadle, & Sheehan, 2005). Inconsistent results were also reported for the prevalence of cheating: some studies found low cheating rates varying from below 2.5% (Nye, Do, Drasgow, & Fine, 2008) to 7.0% (Tendeiro, Meijer, Schakel, & Maij-de Meij, 2013). Conversely, in an online survey, every fourth participant reported cheating on a knowledge task
without being offered performance-dependent incentives (Jensen & Thomsen, 2014). One reason for the heterogeneous results is the varying settings in which unproctored assessments were administered (Reynolds, Wasko, Sinar, Raymark, & Jones, 2009), such as personnel selection (Bartram, 2006; Tippins, 2009), educational contexts (Allen & Seaman, 2014), and research contexts, in which the feasibility, equivalence, and validity of web-based assessments are examined (e.g., Jensen & Thomsen, 2014; Wilhelm & McKnight, 2002). These settings differ in the perceived consequences of the assessment, the countermeasures that are taken to prevent cheating, and the measured cognitive domain. In industrial and organizational (I/O) psychology, ability testing often takes place in high-stakes settings with hiring decisions linked to the individual test results. Thus, test-takers have a strong motivation to perform well to increase their chances of employment. To maximize the benefits for applicants and employers (Gibby, Ispas, Mccloy, & Biga, 2009), countermeasures against cheating are implemented to discourage participants from faking their test scores in recruitment procedures. In educational assessments, online placement tests or exams are commonly knowledge tests that are tailored to the curriculum. In research settings, however, test-takers' performance in unproctored assessments usually has no severe consequences; thus, participants are expected to cheat less (Do, 2009). In contrast to the applied contexts, a wide range of different measures is examined, such as reasoning tests (e.g., Preckel & Thiemann, 2003), perception tasks (e.g., Williamson, Williamson, & Hinze, 2016), and knowledge tests (e.g., Jensen & Thomsen, 2014). Accordingly, the current meta-analysis investigates whether there are systematic score differences in proctored and unproctored ability assessments depending on the aforementioned differences in the test environment.

Research Questions
The aim of this meta-analysis was to investigate to what extent a lack of supervision undermines psychological assessment of cognitive abilities. Given that unproctored assessment procedures are on the rise (Gosling & Mason, 2015), it is crucial to know whether the mode of test administration influences test scores. Our outcome variables are standardized mean differences and correlations between proctored and unproctored ability assessments. We take into account all test situations without a human supervisor present (Tippins, 2009). Accordingly, a setting is proctored if a human supervisor is present or remotely proctored if the testing is supervised via webcam. Additionally, this meta-analysis considers various moderators to explain the heterogeneous findings reported in the literature.
First, test-takers’ cheating motivation can be influenced by the perceived consequences of a test result. If participants anticipate severe consequences such as hiring or university admission, they are most likely more motivated to cheat. Therefore, proctored assessments are still viewed as the gold standard in high-stakes testing (Rovai, 2000). Do (2009) hypothesized that cheating is not as prevalent in low-stakes contexts, even though previous results point in a different direction (Jensen & Thomsen, 2014). We expect that in case important consequences are directly linked to the participants’ performance, test-takers might be more likely to cheat. Conversely, test-takers are presumably less motivated to cheat if no consequences are linked to the test results. Thus, we expect higher score differences in high-stakes settings (Hypothesis 1, H1). Second, test administrators can implement countermeasures that overcome participants’ motivation to cheat. Especially in high-stakes contexts, administrators are advised to use honesty contracts or follow-up verification tests (International Test Commission, 2006). Honesty contracts include explicit policies and negative consequences of cheating. Usually, such honesty contracts are presented to the test-taker prior to the testing and must be signed to indicate commitment. Verification tests are proctored follow-up tests that help to identify participants with aberrant test scores (Guo & Drasgow, 2010; Tendeiro et al., 2013). To work as a countermeasure designed to lower the test-takers’ motivation to cheat, it is important to inform test-takers about the follow-up tests in advance. These procedures are often used in personnel selection (Lievens & Burke, 2011; Nye et al., 2008). In academic settings, institutions often implement honor codes not only to raise students’ awareness of cheating but also to call attention to the consequences linked to unethical behavior (McCabe & Treviño, 2002; O’Neill & Pfeiffer, 2012). Furthermore, other researchers suggested the use of specific instructions to reduce cheating that can contain the note that test results, or feedback, are only valid if the test-taker does not cheat (e.g., Wilhelm & McKnight, 2002). These precautions are intended to lower participants’ cheating motivation, thus should result in reduced score differences (Hypothesis 2, H2). Third, the measurement instrument itself can affect participants’ opportunity to cheat. Diedenhofen and Musch (2017) investigated cheating in an unproctored assessment, comparing a knowledge quiz and a reasoning task. They found that participants switched between browser tabs more often when answering knowledge questions that can be looked up on the Internet. Moreover, a positive relationship between page switches and test performance was found for the knowledge task, whereas no significant relationship was found for the reasoning test. These findings are in line with other studies reporting that cheating was most effective for subtests that assess abilities such as vocabulary and
numeracy, in which performance can be enhanced through the use of a web search, dictionaries, or calculators (Bloemers, Oud, & van Dam, 2016). In contrast, tasks that assess fluid abilities such as reasoning are less susceptible to cheating. Therefore, score differences should be larger for tests with high searchability (Hypothesis 3, H3). Lastly, a factor that can lead to test score differences is the use of cross-mode comparisons. Unproctored assessments are usually administered over the Internet and, therefore, computer-based. Most studies compared these web-based assessments to proctored, computer-based assessments (e.g., Germine et al., 2012). However, not all studies adopted identical test modes in both contexts: some studies compared unproctored, computerized tests to proctored, paper-and-pencil assessments (e.g., Coyne et al., 2005). Although computer-based and paper-and-pencil ability assessments are considered equivalent for nonspeeded measures (Mead & Drasgow, 1993), Schroeders and Wilhelm (2010) suggested differences in perceptual and motor skills in a smartphone-based assessment as potential confounding factors. These differences, however, might lead to biased scores when proctored and unproctored assessments are compared across test media. If substantial mode differences exist, cross-mode comparisons are expected to result in larger mean differences between proctored and unproctored settings (Hypothesis 4, H4). However, the equivalence of test scores across proctored and unproctored ability assessments should not be solely based on the comparison of mean scores. From a psychometric perspective, it is important to ensure that test scores are only dependent on the trait in question and independent of testing conditions. The comparability of test scores gathered in different settings should be examined using latent variable modeling (Schroeders & Wilhelm, 2010, 2011). However, such strict psychometric procedures require raw data, which are usually not available for meta-analyses. One of the simplest statistics indexing the similarity of the test-takers' ranking across conditions is the correlation coefficient (Mead & Drasgow, 1993). A low correlation indicates differences across conditions in the assessment of test-takers' ability. If examinee ranking is invariant across modes (i.e., high cross-mode correlations are obtained), mean scores can be converted using linear transformations (Green, 1991; Hofer & Green, 1985). Therefore, we additionally examine correlations between ability test scores in proctored and unproctored settings.
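As a toy illustration of the linear conversion mentioned above (not part of the reported analyses), scores from one mode can be rescaled to the metric of the other via a mean-sigma transformation. All numbers and variable names below are hypothetical, and such a conversion is only defensible if the rank order is essentially invariant across modes.

# Hypothetical mean-sigma conversion of unproctored scores to a proctored metric
unproc <- c(22, 30, 27, 35, 25)            # made-up unproctored raw scores
m_p  <- 24; sd_p <- 4                       # assumed proctored reference mean/SD
m_u  <- mean(unproc); sd_u <- sd(unproc)    # unproctored mean/SD

a <- sd_p / sd_u                            # slope of the linear transformation
b <- m_p - a * m_u                          # intercept
converted <- a * unproc + b                 # scores expressed on the proctored metric
round(converted, 1)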
Method
To make the present analyses transparent and reproducible (Nosek et al., 2015), we provide all material (i.e., coding protocol, data, syntax, additional tables and figures) online within the Open Science Framework (Center for Open Science, 2017): https://osf.io/xf8dq/
Literature Search and Study Selection An overview of the literature search is depicted in Figure S1 (https://osf.io/3kaf8/). In total we identified 101 potentially relevant studies, searching in major scientific databases, screening reference lists, and contacting authors. Subsequently, these studies were examined regarding the following criteria to be included in the meta-analysis: The study (a) reported a comparison of test scores obtained in a (remotely) proctored setting versus an unproctored setting, (b) administered cognitive ability measures, (c) was published during the last 25 years (1992–2017), (d) was written in English, and (e) reported appropriate statistical information that allowed the calculation of an effect size. Studies only reporting latent mean scores were excluded from the analyses. Furthermore, studies were excluded from the analyses, if (a) participants were actively instructed to cheat (e.g., Bloemers et al., 2016), (b) participants underwent different training phases prior to the assessments (e.g., online vs. traditional classes), or (c) different tools and aids were allowed across testing conditions (e.g., open vs. closed book exams; Brallier & Palm, 2015; Flesch & Ostler, 2010). After applying these criteria, 50 studies were considered eligible for the meta-analysis (see Table S1 for an overview of all studies included in the analysis; see https://osf.io/3kaf8/). Although we planned to include other assessment modalities that are discussed in the literature (e.g., smartphones; Harari et al., 2016), all but one study included in our analysis only reported paper-and-pencil or computer-based assessments. While assessments that used advanced security checks such as webcams or security biometrics (Khan et al., 2017) were coded as remotely proctored (n = 3), studies that only used specific testing platforms that impeded browser-tab switches or returning back to previous questions were coded as unproctored assessments (see Table S1, column 7 for additional information on the type of proctoring; https://osf.io/3kaf8/).
Coding Process
We developed a standardized coding protocol assessing descriptive information, effect sizes, and the moderator variables. For each study, we coded the type of publication (i.e., peer-reviewed journal, contribution to an edited book,
Table 1. Meta-analysis of mean differences and separate moderator analyses

Moderator        | k1  | k2 | N       | Mdn | g    | SDg  | Δ    | SEΔ  | z     | QM     | σ2(2) | σ2(3) | I2(2) | I2(3)
Overall          | 109 | 65 | 100,434 | 269 | 0.19 | 0.46 | 0.20 | 0.05 | 3.85* |        | .14   | .03   | .80   | .17
Stakes           |     |    |         |     |      |      |      |      | 1.62  | 2.64   |       |       |       |
  High           | 67  | 42 | 79,203  | 312 | 0.31 | 0.47 | 0.27 | 0.07 | 4.02* |        | .15   | .03   | .80   | .18
  Low            | 42  | 23 | 21,231  | 212 | 0.01 | 0.35 | 0.09 | 0.08 | 1.06  |        | .13   | .03   | .75   | .18
Countermeasures  |     |    |         |     |      |      |      |      | 1.64  | 2.68   |       |       |       |
  Yes            | 32  | 19 | 6,518   | 128 | 0.33 | 0.49 | 0.35 | 0.11 | 3.13* |        | .18   | .04   | .80   | .18
  No             | 77  | 46 | 93,916  | 298 | 0.13 | 0.43 | 0.15 | 0.06 | 2.53* |        | .13   | .03   | .79   | .17
Searchability    |     |    |         |     |      |      |      |      | 3.73  | 13.95* |       |       |       |
  High           | 51  | 34 | 21,407  | 187 | 0.40 | 0.50 | 0.38 | 0.08 | 4.76* |        | .14   | .06   | .65   | .28
  Low            | 58  | 34 | 79,863  | 336 | 0.00 | 0.32 | 0.02 | 0.05 | 0.44  |        | .05   | .03   | .62   | .33
Modality         |     |    |         |     |      |      |      |      | 1.73  | 3.01   |       |       |       |
  Cross-mode     | 31  | 18 | 16,409  | 276 | 0.27 | 0.55 | 0.39 | 0.14 | 2.89* |        | .30   | .02   | .90   | .07
  Same mode      | 78  | 50 | 84,428  | 247 | 0.16 | 0.41 | 0.15 | 0.05 | 2.87* |        | .08   | .04   | .64   | .33

Notes. k1 = number of effect sizes; k2 = number of samples; N = total sample size; Mdn = median of studies' sample size; g = observed mean difference; Δ = weighted standardized mean difference; SEΔ = standard error of Δ; z = Δ/SEΔ; QM = test statistic for the omnibus test of coefficients (df = number of moderator categories − 1); σ2(2) = between-cluster variance; σ2(3) = within-cluster variance; I2(2) = proportion of between-cluster heterogeneity; I2(3) = proportion of within-cluster heterogeneity. *p < .05.
master or doctoral thesis, conference presentations, or unpublished manuscripts), year of publication, mean age, percentage of female participants, sample type (i.e., children or adolescents up to 11th grade, college or university students, or mixed/adult samples), the assessment context (i.e., academic research, educational, or I/O context), and research design (i.e., within- or between-subject). We extracted the sample sizes, means, and standard deviations of the ability scores in the unproctored and proctored setting as well as the correlation coefficients between test scores, and any other information that could be used to calculate an effect size (e.g., t-values). Moreover, we recorded whether test-takers expected consequences of the test results (such as a hiring decision or grading). If test performance yielded important consequences for the test-taker, the assessment was coded as high-stakes. To examine the usefulness of countermeasures against cheating, we coded different procedures (i.e., honesty contracts, honor codes, announcement of verification tests, instructions, or a combination of them). We also rated the proneness of the measure to cheating, that is, whether the searchability was high (e.g., for knowledge tests) or low (e.g., for figural matrices tests). Finally, we noted whether identical presentation modes (i.e., computerized or paper-and-pencil) were used in both assessment conditions. All studies were coded twice by three independent raters. To evaluate the coding process, Cohen's (1960) κ was calculated. Intercoder agreement is considered strong for values exceeding .70 and excellent for values greater than .90 (LeBreton & Senter, 2008). The pairwise intercoder reliability ranged from .70 to .92. All discrepancies were discussed until consensus was reached.
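Intercoder agreement of the kind reported here can be computed in a few lines of R; the snippet below is a minimal illustration using the irr package with made-up ratings, not the coding data of this meta-analysis.

# Minimal illustration of Cohen's kappa for two coders (hypothetical ratings)
library(irr)  # provides kappa2()

ratings <- data.frame(
  coder1 = c("high", "low", "high", "high", "low", "low", "high", "low"),
  coder2 = c("high", "low", "high", "low",  "low", "low", "high", "low")
)
kappa2(ratings)  # unweighted Cohen's kappa for the two columns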
Statistical Analyses

Calculation of Effect Sizes
As mean differences between scores assessed in proctored and unproctored settings were the primary topic of interest, the standardized mean difference Hedges' (1981) g was calculated with positive effect sizes indicating higher scores in the unproctored condition. For studies not reporting information necessary to calculate g, we applied transformation formulas to derive g from t values (Morris & DeShon, 2002). Studies that only reported multiple regression weights were excluded from the analysis (Aloe, 2015). For a subsample of studies reporting within-group comparisons, we additionally pooled Pearson correlations between the two test contexts to investigate the effects of mode differences on the rank ordering of test-takers. Extreme effect sizes were identified using internally studentized residuals (Viechtbauer & Cheung, 2010). Two extreme effect sizes with standardized residuals larger than 3 (Tukey, 1977) were removed from the analyses.

Meta-Analytic Model
Effect sizes were pooled using a random-effects model with a restricted maximum likelihood estimator (Viechtbauer, 2005). To account for dependent effect sizes (e.g., if a study reported more than one effect size for a given sample), we conducted a three-level meta-analysis (Cheung, 2014), in which individual effect sizes are nested within samples: Level 1 refers to the individual effect sizes, Level 2 refers to the effect sizes obtained using different instruments within a sample (with random Level 2 variance indicating the heterogeneity of effects due to the use of different tests
of cognitive abilities), and Level 3 refers to the different samples (with the random Level 3 variance indicating the heterogeneity of effect sizes across samples after controlling for the different instruments at Level 2). To account for sampling error, we used different weighting procedures for the analysis of standardized mean differences and the correlational analysis. For the analysis of standardized mean differences, each effect size was weighted by the inverse of its variance, which is superior to other weighting procedures and results in more precise estimates of the mean effect (Marín-Martínez & Sánchez-Meca, 2010). Correlations were weighted using sample-size weights, which is the most accurate procedure (Brannick, Yang, & Cafri, 2011). Heterogeneity in the observed effect sizes was quantified by the I2 statistic (Higgins & Thompson, 2002), which describes the proportion of total variation in study estimates that is due to heterogeneity. Although I2 does not measure heterogeneity on an absolute scale, higher values reflect more inconsistent results (Higgins, Thompson, Deeks, & Altman, 2003). We examined moderating effects on the pooled effect size using mixed-effects regression analyses with the R package metafor version 1.9-9 (Viechtbauer, 2010).
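The modeling approach described here can be sketched in a few lines of metafor code. The block below is an illustrative reconstruction under assumed column names (m_unproc, sd_unproc, n_unproc, sample_id, es_id, searchability, and a hypothetical input file); it is not the authors' archived syntax, which is available in the OSF project.

# Illustrative sketch of a three-level random-effects meta-analysis with metafor
library(metafor)

dat <- read.csv("proctoring_effects.csv")  # hypothetical coded data set

# Hedges' g (positive = higher scores in the unproctored condition)
dat <- escalc(measure = "SMD",
              m1i = m_unproc, sd1i = sd_unproc, n1i = n_unproc,
              m2i = m_proc,   sd2i = sd_proc,   n2i = n_proc,
              data = dat)

# Three-level model: effect sizes nested within samples, estimated with REML
overall <- rma.mv(yi, vi,
                  random = ~ 1 | sample_id / es_id,
                  method = "REML", data = dat)
summary(overall)

# Mixed-effects (moderator) model, e.g., searchability coded 1 = high, 0 = low
mod_search <- rma.mv(yi, vi,
                     mods = ~ searchability,
                     random = ~ 1 | sample_id / es_id,
                     method = "REML", data = dat)
summary(mod_search)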
Results

The meta-analysis of mean differences was based on 49 studies¹ that were published between 2001 and 2017, mainly in peer-reviewed journals (67%). Unpublished work comprised master and doctoral theses (11%), conference proceedings (19%), and unpublished reports (3%). The meta-analytic database included 65 independent samples providing 109 effect sizes, with each sample reporting between 1 and 7 effect sizes. Overall, the meta-analysis covered scores from 100,434 participants (range of samples' ns: 19–24,750). Most studies were conducted in an educational (43%) or research context (41%); fewer studies reported on I/O contexts (16%). Low-stakes settings were reported more often than high-stakes settings (62% vs. 38%). In 29% of the samples, countermeasures against cheating were implemented. Approximately half of the reported effect sizes (48%) were based on highly searchable tasks. In all cases that reported cross-mode comparisons (29%), the proctored assessment was paper-and-pencil, whereas the unproctored assessment was computerized. The subsample reporting rank order stabilities comprised 5 studies published in peer-reviewed journals between 2005 and 2009. The studies included 7 independent samples providing 15 correlations. The total sample size was
1,280 (range of the samples’ ns: 29–856). The subsample covered articles from all settings described above, with three studies being conducted in a research context and one each in educational and I/O context.
Mean Score Differences Between Proctored and Unproctored Assessments The pooled mean difference between proctored and unproctored settings was Δ = 0.20 (SE = 0.05), 95% CI [0.10, 0.31]; thus, on average, test-takers achieved slightly higher scores in unproctored settings (Table 1). The between-cluster heterogeneity was I2 = .80 and the within-cluster heterogeneity was I2 = .17, indicating pronounced variability between samples, but negligible differences within samples. Furthermore, between-cluster variance – an absolute indicator of variability – was σ2(2) = .14, also indicating large heterogeneity according to common rules of thumb (Tett, Hundley, & Christiansen, 2017). To quantify the influence of a potential publication bias, we compared effect sizes from published sources (i.e., journal articles) to effect sizes from unpublished sources (i.e., theses, conference proceedings, and unpublished manuscripts). The respective mixed-effects regression analysis identified no significant difference between effect sizes extracted from both sources, γ = 0.09, SE = 0.11, p = .43. Furthermore, funnel plot analyses (Figure S2, see https://osf.io/3kaf8/) and a rank correlation test (τ = .12, p = .07; Begg & Mazumdar, 1994), which tests the distribution of effect sizes for asymmetry, revealed no evidence of a potential publication bias. Although the funnel plot illustrated pronounced heterogeneity of the effect sizes, this most likely reflects the effects of moderators on score differences in proctored and unproctored settings. To quantify the influence of moderators on the pooled effect, a mixed-effects regression analysis was conducted to examine the effects of test setting, countermeasures, searchability, and test media. The correlations among the moderators varied between rϕ = .18 and rϕ = .44 (Table 2), indicating negligible multicollinearity. Together, the four moderators explained about 18% of the random variance (Table 2). Searchability was the only significant moderator (γ = 0.26, SE = 0.09, p < .01); mean score differences between proctored and unproctored settings were significantly larger for tasks that could be easily solved using the Internet (Δ = 0.38, SE = 0.08, p < .001) as compared to measures for which correct solutions were difficult to identify using ordinary web searches (Δ = 0.02, SE = 0.05, p = .66). Moderator analyses yielded the same results when each moderator was examined individually
¹ One study only reported correlation coefficients and was, therefore, only included in the meta-analysis of rank-order stability.
Table 2. Moderator analyses including all four moderator variables simultaneously

Moderator analysis                            | γ    | SEγ  | z     | (1)  | (2)  | (3)
Intercept                                     | 0.04 | 0.09 | 0.49  |      |      |
(1) Stakes (1 = high, 0 = low)                | 0.08 | 0.11 | 0.69  |      |      |
(2) Countermeasures (1 = yes, 0 = no)         | 0.12 | 0.11 | 1.05  | .43  |      |
(3) Searchability (1 = high, 0 = low)         | 0.26 | 0.09 | 2.87* | .44  | .24  |
(4) Modality (1 = cross mode, 0 = same mode)  | 0.14 | 0.10 | 1.50  | .17  | .18  | .10

QM = 17.62*; σ2(2)/σ2(3) = 0.11/0.03; k1/k2 = 109/65

Note. Columns (1)–(3) display correlations among the moderators; phi coefficients of correlations for dichotomous variables are displayed (n = 109 effect sizes). γ = fixed-effects regression weight; SEγ = standard error of γ; QM = test statistic for the omnibus test of coefficients (df = number of categories of the moderator − 1); σ2(2) = between-cluster variance; σ2(3) = within-cluster variance; k1 = number of effect sizes; k2 = number of samples. *p < .05.
(Table 1). No significant effects were found for the other moderator variables, suggesting that the score differences between proctored and unproctored assessments are not affected by anticipated consequences of test results, the implementation of countermeasures against cheating, or a change of test media.
Rank Order Stability Between Proctored and Unproctored Assessments

We identified a pooled correlation of ρ = .58 (SE = .10), 95% CI [0.38, 0.78] (Figure 1). This result suggested a moderate relationship between test scores obtained in proctored and unproctored assessments, indicating substantial rank order changes for the different testing conditions. The between-cluster heterogeneity was I2 = .80, and the within-cluster heterogeneity was I2 = .12, indicating a large variability of the pooled effect sizes between samples. As the meta-analysis of correlation coefficients was based on a small number of effects, we did not pursue further moderator analyses.

Figure 1. Forest plot of the results of the random-effects model for the analysis of correlation coefficients.

Discussion

Unproctored, web-based assessments are typically faced with highly unstandardized settings that allow limited control over the participants' test-taking behavior. A pressing issue in this regard pertains to the question of whether test scores from unproctored assessments can be readily compared to test scores from proctored lab sessions. Although a growing number of studies addressed score differences between proctored and unproctored settings, they reported rather inconclusive results (see also Do, 2009). Therefore, the current meta-analysis provided a comprehensive
overview of the existing findings and studied various moderators of potential cross-mode differences. Overall, the meta-analysis revealed significantly higher scores on cognitive tests in unproctored settings as compared to proctored test contexts. However, with a standardized mean difference of Δ = 0.20, the respective effect was rather small. Because the comparison of mean scores does not warrant conclusions about the equivalence of two measurements (AERA, APA, & NCME, 2014; Schroeders, 2009), we also analyzed correlations between scores of proctored and unproctored ability assessments for a subset of studies. This analysis showed a relationship of ρ = .58, indicating changes in the rank order of participants. These results suggest that participants’ relative standing within a group does not solely depend on their ability but also on other factors such as their motivation or their ability to cheat. However, since only five studies were included in the analyses, this result should be interpreted with caution. In general, the effect sizes exhibited a large heterogeneity between samples. Therefore, we examined the influence of moderators on the observed score differences between proctored and unproctored ability assessments. Using a meta-regression approach, we found significant effects for the searchability of a task. If correct solutions were not easily identifiable over the Internet, mean score differences were approximately zero. This finding corroborates previous research suggesting that some tasks are more prone to cheating than others (Diedenhofen & Musch, 2017; Karim, Kaminsky, & Behrend, 2014). For instance, Bloemers and colleagues (2016) investigated cheating strategies for various subtests of a web-based cognitive ability test battery. They demonstrated that cheating was most effective for subtests that could be tampered through Internet searches, while cheating did not affect tasks that required complex reasoning. Interestingly, moderator analyses found no significant effect for score differences between proctored and unproctored settings for high- and low-stakes
testing. This finding does not support the prevailing assumption that cheating only corrupts high-stakes settings (Arthur, Glaze, Villado, & Taylor, 2010; Do, 2009), whereas it can be ignored in low-stakes testing. Furthermore, moderator analyses showed no significant effect for the implementation of countermeasures against cheating. Despite the vast body of research that advocates the implementation of countermeasures to improve data quality in unproctored assessments (Bartram, 2009; Bryan, Adams, & Monin, 2013; Dwight & Donovan, 2003; O’Neill & Pfeiffer, 2012), we found no empirical evidence for their effectiveness. Conversely, on a descriptive level, mean score differences appeared to be higher when countermeasures were implemented. Finally, differences in the test modes did not have a significant effect on the mean score differences. This finding is in line with previous results on the equivalence of paper-and-pencil and computerized ability tests (e.g., Mead & Drasgow, 1993; Schroeders & Wilhelm, 2010, 2011).
Recommendations for Unproctored and Proctored Assessments

Unproctored, web-based or mobile assessments promise a low-cost opportunity to reach large, heterogeneous, and geographically scattered samples (Fahrenberg et al., 2007; Gosling et al., 2010) and, thus, increasingly complement or even replace traditional data collection techniques. However, our results demonstrate considerable differences in the mean and variance-covariance structure between proctored and unproctored assessments. Based on our findings,
some words of caution are warranted if results obtained in one specific setting are to be generalized to the other. We also recommend against relying on countermeasures to overcome effects of cheating. To make matters worse, the present data do not support the assumption that cheating is limited to high-stakes testing and can be ignored in low-stakes settings, including research contexts. Taking a pessimistic view, one might conclude that some participants will always cheat if they have the opportunity, regardless of countermeasures or anticipated consequences. On a more positive note, participants will not cheat if they are not given the opportunity. Accordingly, a straightforward recommendation for ability assessments in unproctored settings is the development of test batteries that are limited to measures with low searchability. In any case, administrators of unproctored assessments are encouraged to adopt post hoc strategies to identify potential cheaters, for example, using incidental data (Couper, 2005) such as reaction times or non-reactive behavioral data (Diedenhofen & Musch, 2017), seriousness checks (Aust, Diedenhofen, Ullrich, & Musch, 2012), or data-driven anomaly detection (Karabatsos, 2003). However, these analytical methods are no panacea, since identifying and excluding cheaters results in selective and most likely biased samples.
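One simple post hoc screen of the kind alluded to above uses response-time paradata: respondents who answer items implausibly fast, or who pause long enough to run a web search, can be flagged for closer inspection. The following R sketch is a hypothetical illustration with made-up thresholds and variable names, not a procedure used in this meta-analysis.

# Hypothetical screening of unproctored test-takers via item response times.
# `rt` is an (n persons x k items) matrix of response times in seconds;
# the thresholds are arbitrary illustrations, not validated cut-offs.
set.seed(1)
rt <- matrix(rexp(20 * 10, rate = 1 / 12), nrow = 20, ncol = 10)

median_rt <- apply(rt, 1, median)   # typical speed per person
max_rt    <- apply(rt, 1, max)      # longest single pause per person

flag_rapid  <- median_rt < 3        # implausibly fast overall responding
flag_lookup <- max_rt > 90          # long pause suggestive of an item look-up

suspicious <- which(flag_rapid | flag_lookup)
suspicious                          # row indices to inspect, not to auto-exclude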
Limitations and Implications for Future Research

Some limitations to the present meta-analysis must be noted. First, most research on the comparability of ability
scores in proctored and unproctored assessments focused on mean score differences, which do not allow drawing inferences about the equivalence of a measure. Measurement invariance is best studied with a latent variable approach (Raju, Laffitte, & Byrne, 2002; Schroeders & Wilhelm, 2011). We analyzed correlation coefficients as a proxy indicator for the equivalence of proctored and unproctored settings (Mead & Drasgow, 1993). Despite an extensive literature search, we only identified five studies that reported correlation across conditions. Therefore, we stress that the analysis is tentative and results must be interpreted with caution. Also, the correlations analyzed in the present meta-analysis were highly heterogeneous, ranging from r = .27 to r = .92, leaving open the question of potential moderator variables. Future research should also focus on the covariance structure by meta-analyzing raw data (Kaufmann, Reips, & Merki, 2016). Second, the present research makes no inference about the extent of cheating in unproctored settings. Against the background of the data available for the study, we were able to ascertain that ability scores, on average, are higher in unproctored settings. Although dishonest behavior is one of the major concerns in unproctored settings (Tippins, 2009), the increased test scores might also be the result of reduced test anxiety, since participants might feel more comfortable if they are able to freely choose their testing environment (Stowell & Bennett, 2010). Further research might also address cheating directly by investigating appropriate means for the detection of dishonest behavior in ability tests. These measures include traditional approaches, such as scales measuring personality traits or integrity (McFarland & Ryan, 2000), or over-claiming (Bing, Kluemper, Kristl Davison, Taylor, & Novicevic, 2011), as well as data-driven approaches (Couper, 2005; Diedenhofen & Musch, 2017). Finally, our data does not allow conclusions about groups of people that are more likely to cheat than others. We assume that individual differences in personality, moral beliefs, and social norms are predictive of cheating behavior. For example, some studies suggested culturedependent differences in cheating behavior (Chapman & Lupton, 2004; McCabe, Feghali, & Abdallah, 2008). Future research might focus on test-takers who show large differences between an unproctored and a proctored assessment. For applied contexts, this might exert valuable diagnostic information (e.g., faking ability, Geiger, Sauter, Olderbak, & Wilhelm, 2016).
Conclusion

The presented meta-analysis identified higher mean scores for unproctored ability assessments, independent of the test setting (high- vs. low-stakes) and whether countermeasures were taken. However, mean score differences highly depended on the administered measure itself and its proneness to cheating. Mean differences were more pronounced for tasks that are easy to look up on the Internet, while no mean differences were found for other tasks. These findings, however, do not imply that unproctored ability assessments are not feasible per se. Based on the present meta-analysis, we recommend carefully evaluating task characteristics when developing or choosing test instruments for an unproctored test battery. For example, the measurement of declarative knowledge seems better conducted in a proctored setting, whereas figural reasoning tasks might be comparably administered in unproctored contexts. We also caution researchers against generalizing statements across test conditions and encourage test users to further examine the equivalence of proctored and unproctored ability tests with appropriate statistical methods.
Acknowledgments
This work was supported by the Bamberg Graduate School of Social Sciences, which is funded by the German Research Foundation (DFG) under the German Excellence Initiative (GSC1024).
References
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. Allen, E., & Seaman, J. (2014). Grade change: Tracking online education in the United States. Newburyport, MA: Babson Survey Research Group. Aloe, A. M. (2015). Inaccuracy of regression results in replacing bivariate correlations: Inaccuracy of Regression Results. Research Synthesis Methods, 6, 21–27. https://doi.org/ 10.1002/jrsm.1126 Arthur, W., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2010). The magnitude and extent of cheating and response distortion effects on unproctored Internet-based tests of cognitive ability and personality. International Journal of Selection and Assessment, 18, 1–16. https://doi.org/10.1111/j.14682389.2010.00476.x Aust, F., Diedenhofen, B., Ullrich, S., & Musch, J. (2012). Seriousness checks are useful to improve data validity in online research. Behavior Research Methods, 45, 527–535. https:// doi.org/10.3758/s13428-012-0265-2 Bartram, D. (2006). Testing on the Internet: Issues, challenges and opportunities in the field of occupational assessment. In D. Bartram & R. K. Hambleton (Eds.), Computer-based testing and the Internet: Issues and advances (pp. 13–37). Hoboken, NJ: Wiley. Bartram, D. (2009). The International Test Commission guidelines on computer-based and Internet-delivered testing. Industrial and Organizational Psychology, 2, 11–13. https://doi.org/ 10.1111/j.1754-9434.2008.01098.x Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics, 50, 1088– 1101. https://doi.org/10.2307/2533446
Bing, M. N., Kluemper, D., Kristl Davison, H., Taylor, S., & Novicevic, M. (2011). Overclaiming as a measure of faking. Organizational Behavior and Human Decision Processes, 116, 148–162. https://doi.org/10.1016/j.obhdp.2011.05.006 Bloemers, W., Oud, A., & van Dam, K. (2016). Cheating on unproctored Internet intelligence tests: Strategies and effects. Personnel Assessment and Decisions, 2, 21–29. https://doi.org/ 10.25035/pad.2016.003 Brallier, S., & Palm, L. (2015). Proctored and unproctored test performance. International Journal of Teaching and Learning in Higher Education, 27, 221–226. Brannick, M. T., Yang, L.-Q., & Cafri, G. (2011). Comparison of weights for meta-analysis of r and d under realistic conditions. Organizational Research Methods, 14, 587–607. https://doi.org/ 10.1177/1094428110368725 Bryan, C. J., Adams, G. S., & Monin, B. (2013). When cheating would make you a cheater: Implicating the self prevents unethical behavior. Journal of Experimental Psychology: General, 142, 1001–1005. https://doi.org/10.1037/a0030655 Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6, 3–5. https:// doi.org/10.1177/1745691610393980 Carstairs, J., & Myors, B. (2009). Internet testing: A natural experiment reveals test score inflation on a high-stakes, unproctored cognitive test. Computers in Human Behavior, 25, 738–742. https://doi.org/10.1016/j.chb.2009.01.011 Center for Open Science. (2017, September 29). Retrieved from https://cos.io/ Chapman, K. J., & Lupton, R. A. (2004). Academic dishonesty in a global educational market: A comparison of Hong Kong and American university business students. International Journal of Educational Management, 18, 425–435. https://doi.org/ 10.1108/09513540410563130 Cheung, M. W.-L. (2014). Modeling dependent effect sizes with three-level meta-analyses: A structural equation modeling approach. Psychological Methods, 19, 211–229. https://doi. org/10.1037/a0032968 Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46. https://doi.org/10.1177/001316446002000104 Couper, M. P. (2005). Technology trends in survey data collection. Social Science Computer Review, 23, 486–501. https://doi.org/ 10.1177/0894439305278972 Coyne, I., Warszta, T., Beadle, S., & Sheehan, N. (2005). The impact of mode of administration on the equivalence of a test battery: A quasi-experimental design. International Journal of Selection and Assessment, 13, 220–224. https://doi.org/10.1111/j.14682389.2005.00318.x Diedenhofen, B., & Musch, J. (2017). PageFocus: Using paradata to detect and prevent cheating on online achievement tests. Behavior Research Methods, 49, 1444–1459. https://doi.org/ 10.3758/s13428-016-0800-7 Do, B.-R. (2009). Research on unproctored Internet testing. Industrial and Organizational Psychology, 2, 49–51. https:// doi.org/10.1111/j.1754-9434.2008.01107.x Dwight, S. A., & Donovan, J. J. (2003). Do warnings not to fake reduce faking? Human Performance, 16, 1–23. https://doi.org/ 10.1207/S15327043HUP1601_1 Evans, J. R., & Mathur, A. (2005). The value of online surveys. Internet Research, 15, 195–219. https://doi.org/10.1108/ 10662240510590360 Fahrenberg, J., Myrtek, M., Pawlik, K., & Perrez, M. (2007). Ambulatory assessment – monitoring behavior in daily life settings. European Journal of Psychological Assessment, 23, 206–213. https://doi.org/10.1027/1015-5759.23.4.206
Flesch, M., & Ostler, E. (2010). Analysis of proctored versus nonproctored tests in online algebra courses. MathAMATYC Educator, 2, 8–14. Geiger, M., Sauter, R., Olderbak, S., & Wilhelm, O. (2016). Faking ability: Measurement and validity. Personality and Individual Differences, 101, 480. https://doi.org/10.1016/ j.paid.2016.05.147 Germine, L., Nakayama, K., Duchaine, B. C., Chabris, C. F., Chatterjee, G., & Wilmer, J. B. (2012). Is the web as good as the lab? Comparable performance from web and lab in cognitive/ perceptual experiments. Psychonomic Bulletin & Review, 19, 847–857. https://doi.org/10.3758/s13423-012-0296-9 Gibby, R. E., Ispas, D., Mccloy, R. A., & Biga, A. (2009). Moving beyond the challenges to make unproctored Internet testing a reality. Industrial and Organizational Psychology, 2, 64–68. https://doi.org/10.1111/j.1754-9434.2008.01110.x Gnambs, T., & Kaspar, K. (2017). Socially desirable responding in web-based questionnaires: A meta-analytic review of the candor hypothesis. Assessment, 24, 746–762. https://doi.org/ 10.1177/1073191115624547 Gosling, S. D., & Mason, W. (2015). Internet research in psychology. Annual Review of Psychology, 66, 877–902. https://doi.org/ 10.1146/annurev-psych-010814-015321 Gosling, S. D., Sandy, C. J., John, O. P., & Potter, J. (2010). Wired but not WEIRD: The promise of the Internet in reaching more diverse samples. Behavioral and Brain Sciences, 33, 94–95. https://doi.org/10.1017/S0140525X10000300 Green, B. F. (1991). Guidelines for computer testing. In T. B. Gutkin & S. L. Wise (Eds.), The computer and the decision-making process (pp. 245–273). Hillsdale, NJ: Erlbaum. Guo, J., & Drasgow, F. (2010). Identifying cheating on unproctored Internet tests: The Z-test and the likelihood ratio test. International Journal of Selection and Assessment, 18, 351–364. https://doi.org/10.1111/j.1468-2389.2010.00518.x Harari, G. M., Lane, N. D., Wang, R., Crosier, B. S., Campbell, A. T., & Gosling, S. D. (2016). Using smartphones to collect behavioral data in psychological science: Opportunities, practical considerations, and challenges. Perspectives on Psychological Science, 11, 838–854. https://doi.org/10.1177/ 1745691616650285 Haworth, C. M. A., Harlaar, N., Kovas, Y., Davis, O. S. P., & Oliver, B. R., Hayiou-Thomas, M. E. . . ., Plomin, R. (2007). Internet cognitive testing of large samples needed in genetic research. Twin Research and Human Genetics, 10, 554–563. https://doi. org/10.1375/twin.10.4.554 Hedges, L. V. (1981). Distribution theory for glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107. https://doi.org/10.2307/1164588 Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21, 1539– 1558. https://doi.org/10.1002/sim.1186 Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. British Medical Journal, 327, 557–560. https://doi.org/10.1136/ bmj.327.7414.557 Hofer, P. J., & Green, B. F. (1985). The challenge of competence and creativity in computerized psychological testing. Journal of Consulting and Clinical Psychology, 53, 826–838. https://doi. org/10.1037/0022-006X.53.6.826 Ihme, J. M., Lemke, F., Lieder, K., Martin, F., Müller, J. C., & Schmidt, S. (2009). Comparison of ability tests administered online and in the laboratory. Behavior Research Methods, 41, 1183–1189. https://doi.org/10.3758/BRM.41.4.1183 International Test Commission. (2006). 
International guidelines on computer-based and Internet-delivered testing. International
European Journal of Psychological Assessment (2020), 36(1), 174–184
Ó 2018 Hogrefe Publishing
D. Steger et al., Meta-Analysis of Unproctored Ability Assessments
183
Journal of Testing, 6, 143–171. https://doi.org/10.1207/ s15327574ijt0602_4 Jensen, C., & Thomsen, J. P. F. (2014). Self-reported cheating in web surveys on political knowledge. Quality & Quantity, 48, 3343–3354. https://doi.org/10.1007/s11135-013-9960-z Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277–298. https://doi.org/10.1207/ S15324818AME1604_2 Karim, M. N., Kaminsky, S. E., & Behrend, T. S. (2014). Cheating, reactions, and performance in remotely proctored testing: An exploratory experimental study. Journal of Business and Psychology, 29, 555–572. https://doi.org/10.1007/s10869-014-9343-z Kaufmann, E., Reips, U.-D., & Merki, K. M. (2016). Avoiding methodological biases in meta-analysis: Use of online versus offline individual participant data (IPD) in psychology. Zeitschrift für Psychologie, 224, 157–167. https://doi.org/10.1027/ 2151-2604/a000251 Khan, S. M., Suendermann-Oeft, D., Evanini, K., Williamson, D. M., Paris, S., Qian, Y., . . . Davis, L. (2017). MAP: Multimodal assessment platform for interactive communication competency. In S. Shehata & J. P.-L. Tan (Eds.), Practitioner Track Proceedings of the 7th International Learning Analytics & Knowledge Conference. Vancouver, CA: SoLAR. LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11, 815–852. https://doi.org/ 10.1177/1094428106296642 Lievens, F., & Burke, E. (2011). Dealing with the threats inherent in unproctored Internet testing of cognitive ability: Results from a large-scale operational test program. Journal of Occupational and Organizational Psychology, 84, 817–824. https://doi.org/ 10.1348/096317910X522672 Lievens, F., & Harris, M. M. (2003). Research on Internet recruiting and testing: Current status and future directions. In C. L. Cooper & I. T. Robertson (Eds.), International Review of Industrial and Organizational Psychology (pp. 131–165). Chichester, UK: Wiley. Marín-Martínez, F., & Sánchez-Meca, J. (2010). Weighting by inverse variance or by sample size in random-effects metaanalysis. Educational and Psychological Measurement, 70, 56–73. https://doi.org/10.1177/0013164409344534 McCabe, D. L., Feghali, T., & Abdallah, H. (2008). Academic dishonesty in the Middle East: Individual and contextual factors. Research in Higher Education, 49, 451–467. https:// doi.org/10.1007/s11162-008-9092-9 McCabe, D. L., & Treviño, L. K. (2002). Honesty and honor codes. Academe, 88, 37. https://doi.org/10.2307/40252118 McFarland, L. A., & Ryan, A. M. (2000). Variance in faking across noncognitive measures. Journal of Applied Psychology, 85, 812– 821. https://doi.org/10.1037/0021-9010.85.5.812 Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114, 449–458. https://doi.org/10.1037/ 0033-2909.114.3.449 Miller, G. (2012). The smartphone psychology manifesto. Perspectives on Psychological Science, 7, 221–237. https://doi.org/ 10.1177/1745691612441215 Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7, 105– 125. https://doi.org/10.1037/1082-989X.7.1.105 Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., . . . Yarkoni, T. (2015). Promoting an open research culture. 
Science, 348, 1420–1422. https://doi.org/ 10.1126/science.aab2374
Nye, C. D., Do, B.-R., Drasgow, F., & Fine, S. (2008). Two-step testing in employee selection: Is score inflation a problem? International Journal of Selection and Assessment, 16, 112– 120. https://doi.org/10.1080/09639284.2011.590012 O’Neill, H. M., & Pfeiffer, C. A. (2012). The impact of honour codes and perceptions of cheating on academic cheating behaviours, especially for MBA bound undergraduates. Accounting Education, 21, 231–245. https://doi.org/10.1080/09639284. 2011.590012 Preckel, F., & Thiemann, H. (2003). Online- versus paper-pencilversion of a high potential intelligence test. Swiss Journal of Psychology, 62, 131–138. https://doi.org/10.1024/14210185.62.2.131 Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517–529. https://doi.org/10.1037/00219010.87.3.517 Reynolds, D. H., Wasko, L. E., Sinar, E. F., Raymark, P. H., & Jones, J. A. (2009). UIT or not UIT? That is not the only question. Industrial and Organizational Psychology, 2, 52–57. https://doi. org/10.1111/j.1754-9434.2008.01108.x Rovai, A. P. (2000). Online and traditional assessments: What is the difference? The Internet and Higher Education, 3, 141–151. https://doi.org/10.1016/S1096-7516(01)00028-8 Schroeders, U. (2009). Testing for equivalence of test data across media. In F. Scheuermann & J. Björnsson (Eds.), The transition to computer-based assessment. Lesson learned from the PISA 2006 computer-based assessment of science (CBAS) and implications for large scale testing (pp. 164–170). JRC Scientific and Technical Report EUR 23679 EN. Luxembourg: Publications Office of the European Union. Schroeders, U., & Wilhelm, O. (2010). Testing reasoning ability with handheld computers, notebooks, and paper and pencil. European Journal of Psychological Assessment, 26, 284–292. https://doi.org/10.1027/1015-5759/a000038 Schroeders, U., & Wilhelm, O. (2011). Equivalence of reading and listening comprehension across test media. Educational and Psychological Measurement, 71, 849–869. https://doi.org/ 10.1177/0013164410391468 Stowell, J. R., & Bennett, D. (2010). Effects of online testing on student exam performance and test anxiety. Journal of Educational Computing Research, 42, 161–171. https://doi.org/ 10.2190/EC.42.2.b Templer, K. J., & Lange, S. R. (2008). Internet testing: Equivalence between proctored lab and unproctored field conditions. Computers in Human Behavior, 24, 1216–1228. https://doi. org/10.1016/j.chb.2007.04.006 Tendeiro, J. N., Meijer, R. R., Schakel, L., & Maij-de Meij, A. M. (2013). Using cumulative sum statistics to detect inconsistencies in unproctored Internet testing. Educational and Psychological Measurement, 73, 143–161. https://doi.org/10.1177/ 0013164412444787 Tett, R. P., Hundley, N. A., & Christiansen, N. D. (2017). Metaanalysis and the myth of generalizability. Industrial and Organizational Psychology, 10(03), 421–456. https://doi.org/ 10.1017/iop.2017.26 Tippins, N. T. (2009). Internet alternatives to traditional proctored testing: Where are we now? Industrial and Organizational Psychology, 2, 2–10. https://doi.org/10.1111/j.17549434.2008.01097.x Tippins, N. T. (2011). Overview of technology-enhanced assessments. In N. T. Tippins & S. Adler (Eds.), Technology-enhanced assessment of talent (pp. 1–19). San Francisco, CA: Wiley. Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. (2006). Unproctored Internet
Ó 2018 Hogrefe Publishing
European Journal of Psychological Assessment (2020), 36(1), 174–184
184
testing in employment settings. Personnel Psychology, 59, 189–225. https://doi.org/10.1111/j.1744-6570.2006.00909.x Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley. Viechtbauer, W. (2005). Bias and efficiency of meta-analytic variance estimators in the random-effects model,. (2005). Journal of Educational and Behavioral Statistics, 30, 261–293. https://doi.org/10.3102/10769986030003261 Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36, 1–48. https://doi.org/10.18637/jss.v036.i03 Viechtbauer, W., & Cheung, M. W.-L. (2010). Outlier and influence diagnostics for meta-analysis. Research Synthesis Methods, 1, 112–125. https://doi.org/10.1002/jrsm.11 Wilhelm, O., & McKnight, P. E. (2002). Ability and achievement testing on the World Wide Web. In B. Batinic, U.-D. Reips, & M. Bosnjak (Eds.), Online social sciences (pp. 167–193). Seattle, WA: Hogrefe & Huber. Williamson, K. C., Williamson, V. M., & Hinze, S. R. (2016). Administering spatial and cognitive instruments in-class and
European Journal of Psychological Assessment (2020), 36(1), 174–184
D. Steger et al., Meta-Analysis of Unproctored Ability Assessments
on-line: Are these equivalent? Journal of Science Education and Technology, 26, 12–23. https://doi.org/10.1007/s10956-0169645-1 Received September 29, 2017 Revision received April 4, 2018 Accepted April 4, 2018 Published online September 18, 2018 EJPA Section/Category Methodological Topics in Assessment Diana Steger Bamberg Graduate School of Social Sciences University of Bamberg Feldkirchenstraße 21 96052 Bamberg Germany diana.steger90@gmail.com
Ó 2018 Hogrefe Publishing
Multistudy Reports
Evaluating the Psychometric Properties of the Short Dark Triad (SD3) in Italian Adults and Adolescents
Antonella Somma1, Delroy L. Paulhus2, Serena Borroni1, and Andrea Fossati1
1 Faculty of Psychology, Vita-Salute San Raffaele University, Milan, Italy
2 Faculty of Psychology, University of British Columbia, Vancouver, Canada
Abstract: The term Dark Triad refers to three socially aversive personality dimensions (i.e., Machiavellianism, narcissism, and psychopathy) that are evident in the normal range of personality. Jones and Paulhus (2014) developed the Short Dark Triad (SD3) as a 27-item measure of the three constructs. To assess the psychometric properties of the Italian translation, 678 adult university students and 442 adolescent high school students were sampled. Cronbach’s α values for the subscales were acceptable in both samples. Subscale intercorrelations ranged from .29 to .55 in adults and .29 to .53 in adolescents. Although subscale means were higher in the adolescent sample, the two item correlation matrices did not differ significantly. A confirmatory factor analysis using multidimensional full-information item response theory showed that a three-correlated-factor model provided the best fit in both adults and adolescents. When controlled for overlap, SD3 subscales showed adequate convergent and discriminant validity coefficients in both samples. The current research contributes to the literature on dark personalities in two ways: (a) It provides detailed psychometric support for the Italian translation of SD3 and (b) it directly compares SD3 performance in younger and older students. Keywords: Short Dark Triad, adults, adolescents, item response theory
Since its introduction by Paulhus and Williams (2002), the multidimensional construct labeled the Dark Triad has gained widespread popularity in social and personality psychology research. The term refers to three socially aversive personality dimensions that are evident in the normal range of personality (for a review, see Furnham, Richards, & Paulhus, 2013). Those three personality dimensions are (a) narcissism, a trait characterized by grandiosity, entitlement, and a need for admiration; (b) psychopathy, a trait characterized by callous-unemotional traits, deceitfulness, impulsivity, and risk-taking; and (c) Machiavellianism, a trait characterized by manipulativeness, cynicism, and strategic exploitation. Paulhus and Williams (2002) argued that, because they are intercorrelated, the three traits should be studied together. Otherwise, any apparent link with an outcome could be misattributed to the wrong predictor. Although sound measures of narcissism (e.g., Narcissistic Personality Inventory, NPI; Raskin & Hall, 1979), psychopathy (e.g., Self-Report Psychopathy Scale; Paulhus, Neumann, & Hare, 2016), and Machiavellianism (e.g., Mach IV; Christie & Geis, 1970) already exist, the combined
length of the available measures (124 items) was off-putting for many researchers; indeed, even with the shortest versions of each construct, the total number of items (i.e., 51) may still be taxing when time and space are at a premium (Jones & Paulhus, 2014). Fortunately, shorter combination measures of all three constructs were soon developed. First to appear was the so-called Dirty Dozen questionnaire (DD; Jonason & Webster, 2010). Notwithstanding the adequate reliabilities reported for the three DD subscales, the DD showed some important limitations (Jones & Paulhus, 2014; Jonason et al., 2011; Rauthmann, 2013). For instance, the Machiavellianism subscale of the DD showed problematic discriminant validity with respect to psychopathy (e.g., Miller et al., 2012). To overcome these limitations, Jones and Paulhus (2014) created the Short Dark Triad (SD3) as a 27-item measure of the three constructs. Based on a review of the relevant literature (Jones & Paulhus, 2011), a large item pool was assembled in order to ensure coverage of the key aspects of each concept. Three theoretical principles motivated
the item selection: (a) Ego-identity goals drive narcissistic behavior, whereas instrumental goals drive Machiavellian and psychopathic behavior, (b) Machiavellianism differs from psychopathy with respect to temporal focus, and (c) all three have a callous core that encourages interpersonal manipulation. Item refinement and structural analyses eventually reduced the original item pool to the final set of 27 items. In a series of four studies that included a total of 1,063 community-dwelling adults and college students, Jones and Paulhus (2014) reported adequate reliabilities (i.e., all Cronbach’s α values > .70) for the subscales. Both exploratory factor analysis and exploratory structural equation modeling revealed that three correlated latent factors could explain the observed correlations among the 27 SD3 items. In contrast to the DD, the SD3 subscales seemed to show adequate convergent and discriminant validity. A final validity study showed that SD3 self-reports converge with corresponding informant reports on all three subscales (Jones & Paulhus, 2014). Support for the SD3 has accumulated from a variety of other researchers (e.g., Book, Visser, & Volk, 2015; Dowgwillo & Pincus, 2017; Goncalves & Campbell, 2014; Schneider, McLarnon, & Carswell, 2017; Veselka, Giammarco, & Vernon, 2014). Although the SD3 has been translated into several other languages, its psychometric properties have not yet been evaluated in the Italian context. Moreover, the Dark Triad has rarely been studied in adolescents – regardless of the language. One exception is the study by Lau and Marsee (2013), which showed unique associations of delinquency with each of the triad components. Kerig and Stellwagen (2010) showed that associations with “theory of mind” differed across the three components of the Dark Triad. Such research on younger samples can shed light on the different pathways leading to antisocial behavior and can inform preventive interventions (van Baardewijk, Vermeiren, Stegge, & Doreleijers, 2011). In the present study, we propose to fill those gaps in the literature.
Factor Structure
Some investigators have questioned the distinctiveness of the Dark Triad – even using the SD3. Jonason and Webster (2010) were among the first to contend that the three traits may best be viewed as facets of a single social orientation, namely short-term exploitation of others. In more recent publications, however, Jonason and colleagues have honored the distinctiveness of the constructs (Jonason, Slomski, & Partyka, 2012). Part of the concern about the SD3 factor structure is shared with recent cautions regarding the application of latent variable modeling techniques to such domains as personality, psychopathology, and health (Reise & Waller, 2009). Nonetheless, IRT measurement models hold much promise for the development, psychometric analysis, refinement, and scoring of dysfunctional personality measures (Reise & Rodriguez, 2016). Based on considerations raised by Reise, Morizot, and Hays (2007) and Reise (2012), we relied on confirmatory full-information maximum likelihood IRT factor analysis to estimate the latent structure of the SD3 items. In particular, we relied on Samejima’s (1997) graded response model to fit the following three models: (a) a unidimensional model, (b) a three-factor model with orthogonal factors, in which SD3 items were forced to load on the factor they were a priori assigned to, and (c) a three-factor model with correlated factors, in which SD3 items were again forced to load on the factor they were a priori assigned to. Moreover, the hypothesis of measurement invariance across the adolescent and adult samples was formally tested in a multigroup full-information maximum likelihood IRT factor analysis framework.
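For readers who want to see what fitting these three competing graded response models looks like in practice, a minimal sketch using the mirt R package (the package named in the Results section) is given below. The data frame sd3 and its column order are hypothetical, and the calls are illustrative rather than the authors' actual analysis code.

```r
library(mirt)  # multidimensional IRT; Chalmers (2012)

# 'sd3' is a hypothetical data frame holding the 27 recoded items in the
# published order: Machiavellianism 1-9, narcissism 10-18, psychopathy 19-27.

# (a) Unidimensional graded response model (Samejima, 1997)
fit_uni <- mirt(sd3, mirt.model("G = 1-27"), itemtype = "graded")

# (b) Three orthogonal factors: items load only on their assigned factor
spec_orth <- mirt.model("
  MACH  = 1-9
  NARC  = 10-18
  PSYCH = 19-27")
fit_orth <- mirt(sd3, spec_orth, itemtype = "graded")

# (c) Three correlated factors: same loading pattern, free factor covariances
spec_corr <- mirt.model("
  MACH  = 1-9
  NARC  = 10-18
  PSYCH = 19-27
  COV   = MACH*NARC, MACH*PSYCH, NARC*PSYCH")
fit_corr <- mirt(sd3, spec_corr, itemtype = "graded")

M2(fit_corr)                      # limited-information fit (RMSEA, SRMSR)
anova(fit_orth, fit_corr)         # AIC/BIC comparison of competing models
coef(fit_corr, simplify = TRUE)   # discrimination (a) and threshold (d) estimates
```

The three specifications mirror the models listed above; which fit indices to emphasize (information criteria vs. limited-information statistics) is a separate analytic choice.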
Nomological Network
In order to extend the construct validity of the SD3 subscales, we evaluated their concurrent associations with several established personality constructs: (a) the traditional measure of Machiavellianism (Mach IV), (b) a measure of grandiose narcissism from the Five-Factor Narcissism Inventory–Short Form (Sherman et al., 2015), and (c) a three-factor measure of psychopathy, the Triarchic Psychopathy Measure (TriPM), which comprises boldness, meanness, and disinhibition (for a review, see Patrick & Drislane, 2015). Although never used in previous Dark Triad research, the latter two measures are sufficiently well established to serve as sound criteria for assessing concurrent validity. Details of these measures may be found in the Materials and Methods section.
Predictions
Based on findings with the English version (Jones & Paulhus, 2014), the Italian SD3 Machiavellianism scale is expected to show a strong correlation with Mach IV; the SD3 narcissism scale is expected to show a strong positive correlation with FFNI-SF grandiose narcissism; and SD3 psychopathy is expected to show a strong positive correlation with the meanness and disinhibition facets of the TriPM. Although no previous study has directly compared adolescent and adult samples, several studies have shown that, within samples, Dark Triad scores diminish with age (e.g., Jonason et al., 2012; Vernon, Villani, Vickers, & Harris, 2008). Hence, we predict that the adolescent sample will score higher than the university sample on all three triad measures.
Materials and Methods

Participants
Sample 1
This sample was composed of 678 adult university students attending courses at a large state university in central Italy. Participants responded to advertisements requesting potential volunteers for psychological studies taking place on campus and on the university website. Three hundred forty-seven (51.2%) participants were male and 331 (48.8%) were female; participants' mean age was 24.1 years (SD = 2.64 years; range: 19–30 years). In terms of civil status, 670 (98.8%) participants were unmarried and 8 (1.2%) were married.

Sample 2
This sample was composed of 442 adolescents attending large public high schools in central Italy; 226 adolescents (51.1%) were female and 215 (48.6%) were male, whereas one adolescent (0.2%) did not report his/her gender. Adolescents' mean age was 16.06 years (SD = 1.47 years). Owing to space considerations, we relegated details of our missing data analysis to the Electronic Supplementary Material, ESM 1. All participants gave their written consent to participate in the study after it had been explained to them; all participants were volunteers and received no financial or academic incentive to take part in the study.
Measures

Short Dark Triad (SD3; Jones & Paulhus, 2014)
The SD3 is a 27-item, Likert-type, self-report measure that was specifically designed to assess three socially aversive personality traits, namely, narcissism (9 items), psychopathy (9 items), and Machiavellianism (9 items). Each item is measured on a 5-point ordinal scale, ranging from 1 (= strongly disagree) to 5 (= strongly agree). The SD3 yields separate scores for the Machiavellianism, narcissism, and psychopathy scales. As noted earlier, its validity has been supported by structural analyses, concurrent correlations and informant ratings, and observable behavior (e.g., Jones & Paulhus, 2017).
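As an illustration of the scoring just described, a minimal sketch follows. The data frame sd3 and its column names are hypothetical; the reverse-keyed items (11, 15, 17, 20, and 25 in the published item order) correspond to the items marked (R) in the original instrument, and mean scoring on the 1–5 metric is assumed.

```r
# Minimal SD3 scoring sketch; 'sd3' with columns sd3_1 ... sd3_27 is hypothetical.
reverse_keyed <- c(11, 15, 17, 20, 25)          # reverse-keyed SD3 items
rev_cols <- paste0("sd3_", reverse_keyed)
sd3[rev_cols] <- 6 - sd3[rev_cols]              # recode 1-5 responses

sd3$machiavellianism <- rowMeans(sd3[paste0("sd3_", 1:9)])    # items 1-9
sd3$narcissism       <- rowMeans(sd3[paste0("sd3_", 10:18)])  # items 10-18
sd3$psychopathy      <- rowMeans(sd3[paste0("sd3_", 19:27)])  # items 19-27
```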
Five-Factor Narcissism Inventory – Short Form Grandiose Narcissism Scale (FFNI-SF GN; Sherman et al., 2015)
The FFNI-SF is a 60-item, self-report measure of 15 traits related to grandiose and vulnerable narcissism: Acclaim-Seeking, Arrogance, Authoritativeness, Distrust, Entitlement, Exhibitionism, Exploitativeness, Grandiose Fantasies, Indifference, Lack of Empathy, Manipulativeness, Need for Admiration, Reactive Anger, Shame, and Thrill-Seeking. Previous studies (e.g., Sherman et al., 2015) showed that the FFNI-SF scales have adequate internal consistency reliability estimates among undergraduates and clinical participants (i.e., all Cronbach’s α values > .70). Grandiose narcissism is the sum of Acclaim-Seeking, Arrogance, Authoritativeness, Entitlement, Exhibitionism, Exploitativeness, and Grandiose Fantasies; vulnerable narcissism is the sum of the remaining scales. Consistent with the 148-item FFNI (Glover, Miller, Lynam, Crego, & Widiger, 2012), each FFNI-SF item is measured on a 5-point ordinal scale (1 = disagree strongly; 2 = disagree a little; 3 = neither agree nor disagree; 4 = agree a little; 5 = agree strongly). The Italian translation of the FFNI-SF showed good psychometric properties among university students (Fossati, Somma, Borroni, & Miller, 2018). Following Jones and Paulhus (2014), we relied on the items tapping into grandiose narcissism (FFNI-SF GN).

Triarchic Psychopathy Measure (TriPM; Patrick & Drislane, 2015)
The TriPM is a 58-item self-report inventory divided into subscales indexing the three phenotypic components of psychopathy: boldness, meanness, and disinhibition. Participants respond to each item on a 4-point Likert scale (3 = true, 2 = mostly true, 1 = mostly false, 0 = false). The boldness scale is composed of 19 items that index tendencies toward social poise and effectiveness, emotional resiliency, and venturesomeness. The disinhibition and meanness scales (20 and 19 items, respectively) index broad disinhibition and callous-aggression factors, respectively. Published research provides support for the reliability (i.e., all Cronbach’s α values > .75) and validity of the TriPM as a measure of psychopathic features (e.g., Drislane, Patrick, & Arsal, 2014). Of particular relevance here is that the TriPM has been validated in Italian samples (Sica et al., 2015; Somma, Borroni, Drislane, & Fossati, 2016). Given concerns about whether the boldness factor is a genuine component of psychopathy (Evans & Tully, 2016; Neumann, Uzieblo, Crombez, & Hare, 2013), we removed boldness from the total score before using the TriPM as a criterion measure.
Machiavellianism (Mach IV; Christie & Geis, 1970)
The Mach IV is a 20-item, Likert-type, self-report measure that captures attitudes and behaviors associated with the Machiavellian personality construct. Each item is measured on a 5-point scale, ranging from 1 (= strongly disagree) to 5 (= strongly agree). The instrument includes items tapping into Machiavellian tactics, cynicism, and morality. The Mach IV has typically yielded adequate internal consistency estimates of reliability (i.e., Cronbach’s α
values > .75), and validation evidence has accumulated to include several hundred observer and behavioral studies (see Jones & Paulhus, 2009). Owing to space considerations, we included a detailed description of the translation procedures and an extensive description of the data analysis in ESM 1.
Results
Sample 1 and Sample 2 did not differ significantly in terms of gender proportion, χ2(1) = 0.63, p > .40, ϕ = .02. As expected, the samples differed significantly in age, separate-variance t(1,094.99) = 64.93, p < .001. Table 1 summarizes descriptive statistics, item-total correlations corrected for overlap (i.e., item convergent validity), and correlations of each item with the total score of the scale to which it was not assigned (i.e., item discriminant validity). On average, SD3 items showed moderate, albeit significant, positive skewness values, Mdn = 0.45 (item 4), p < .001, min. (item 18) = -0.68, p < .001, max. (item 25) = 1.49, p < .001. Four items – namely, 2, 25, 26, and 27 – showed skewness values greater than 1.00. The median item kurtosis value was 0.61 (item 14), p < .005, min. (item 23) = -1.08, p < .001, max. (item 25) = 1.22.
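The corrected item-total correlations reported in Table 1 are simply the correlation of each item with the sum of the remaining items of its own subscale; the sketch below illustrates this, along with the corresponding discriminant coefficient against another subscale's total, reusing the hypothetical column names introduced above.

```r
# Corrected item-total correlation: each item against the sum of the other
# items of its own subscale (i.e., corrected for item-total overlap).
corrected_item_total <- function(items) {
  sapply(seq_len(ncol(items)), function(i)
    cor(items[[i]], rowSums(items[, -i, drop = FALSE]),
        use = "pairwise.complete.obs"))
}

mach_items  <- sd3[paste0("sd3_", 1:9)]     # hypothetical column names
psych_items <- sd3[paste0("sd3_", 19:27)]

corrected_item_total(mach_items)            # convergent (own-scale) coefficients
cor(mach_items, rowSums(psych_items))       # discriminant coefficients vs. psychopathy total
```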
Group Differences
Table 2 summarizes descriptive statistics in the full sample and broken down by gender, Cronbach’s α coefficient values (as well as mean inter-item correlations), and scale intercorrelations for the SD3 scales. Male and female participants do not add up to the total number of participants in Sample 2, because one Sample 2 participant did not disclose his/her gender. A two-way MANOVA revealed a significant omnibus difference between adult university students (i.e., Sample 1) and adolescent high school students (i.e., Sample 2), Pillai V = .16, p < .001, as well as between male and female participants, Pillai V = .07, p < .001, on the SD3 scale average scores. There was no significant sample-by-gender interaction, Pillai V = .00, p > .80. As can be observed in Table 2, male participants scored significantly higher than female participants in both samples, although the effect size estimates (i.e., Cohen’s d values) were in the small-to-moderate range (Cohen, 1988). According to Bonferroni multiple t-tests (Bonferroni-corrected nominal p < .016), adolescent high school students scored significantly higher than adult university students on the SD3 Machiavellianism, t(1,118) = 10.84, p < .001, d = .65, narcissism,
t(1,118) = 5.28, p < .001, d = .32, and psychopathy, t(1,118) = 12.86, p < .001, d = .77, scales. Box M test results suggested that the scale variance–covariance matrices (and hence the correlation matrices) did not differ by gender in either Sample 1, M = 11.86, F(6, 3,291,142.95) = 1.97, p > .05, or Sample 2, M = 4.71, F(6, 1,386,975.92) = 0.78, p > .50. Accordingly, we pooled males and females before calculating the correlation matrices in both Sample 1 and Sample 2. Finally, Box M test results suggested that the scale variance–covariance matrix computed in adults was significantly different from that computed in adolescents, M = 20.89, F(6, 5,976,192.23) = 3.47, p < .01. However, when scale scores were standardized within each sample (i.e., yielding two correlation matrices), the Box M test for the comparison between Sample 1 and Sample 2 on the scale covariance (actually, correlation) matrix became nonsignificant, M = 11.16, F(6, 5,976,192.23) = 1.86, p > .05. Indeed, when we formally compared the bivariate correlations among subscales computed in Sample 1 and in Sample 2, none of the Fisher z tests reached statistical significance, even without Bonferroni correction of the nominal p-value (i.e., p < .05): min. Fisher z value = 0.35 (correlation between the SD3 narcissism and psychopathy scales), max. Fisher z value = 1.94 (correlation between the SD3 Machiavellianism and psychopathy scales), all ps > .05.
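The Fisher z comparison of a correlation estimated in two independent samples can be computed directly from the coefficients and sample sizes. The helper below is a minimal sketch; the function name is illustrative and the r values in the example call are arbitrary, not taken from Table 2.

```r
# Fisher z test for comparing one correlation across two independent samples.
compare_correlations <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)                          # Fisher r-to-z transform
  z2 <- atanh(r2)
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of the difference
  z  <- (z1 - z2) / se
  c(z = z, p = 2 * pnorm(-abs(z)))         # two-sided p value
}

compare_correlations(r1 = .50, n1 = 678, r2 = .45, n2 = 442)  # illustrative r values
```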
In the present study, full-information maximum likelihood IRT factor analysis models and indices were estimated using the "mirt" R package (Chalmers, 2012), whereas rest-score functions were computed using the "mokken" R package (van der Ark, 2014). Goodness-of-fit indices of the multidimensional confirmatory full-information IRT factor models of the SD3 items are summarized in Table 3. Rest-score functions (and confidence bands) for all items in Sample 1 and Sample 2 are displayed in Figures 1 and 2 of ESM 2, respectively. Item discrimination and threshold estimates based on the multidimensional confirmatory full-information IRT three-correlated-factor model in Sample 1 and in Sample 2 are listed in Table 4. Finally, in order to test the invariance of the SD3 IRT factor structure (and detect differential item functioning) across the adolescent and adult samples, we performed a multigroup full-information maximum likelihood IRT factor analysis (Chalmers, 2012). First, we fitted a configural invariance model (i.e., a model postulating the invariance of the number of factors across groups). This model showed adequate values of the fit indices, M2*(480) = 1,647.19, RMSEA2 = .047, 90% CI RMSEA2 = .044, .049, BIC = 84,017.01. Then, we turned to a scalar invariance model to test for measurement invariance (see Footnote 1), that is, a model with invariant factor loadings and thresholds across samples. We observed adequate values of the fit statistics even for this most restrictive model, M2*(612) = 2,307.92, RMSEA2 = .049, 90% CI RMSEA2 = .047, .051, BIC = 83,766.79, supporting the hypothesis of measurement invariance of the IRT three-factor structure across subgroups based on participants' age.
Footnote 1. SD3 items’ factor loadings and thresholds were set as invariant in tandem given that both parameters affect the item characteristic curve simultaneously (e.g., Millsap & Yun-Tein, 2004).
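A minimal sketch of how such a multigroup invariance comparison can be set up in mirt follows. The stacked data frame sd3_all and the grouping factor sample_group are hypothetical names, and the invariance keywords shown are one common way to impose the slopes-and-thresholds (scalar) constraints described above, not necessarily the authors' exact specification.

```r
library(mirt)

# 'sd3_all' stacks adult and adolescent responses; 'sample_group' is a factor
# with levels "adult" and "adolescent" (both names hypothetical).
spec <- mirt.model("
  MACH  = 1-9
  NARC  = 10-18
  PSYCH = 19-27
  COV   = MACH*NARC, MACH*PSYCH, NARC*PSYCH")

# Configural model: same factor structure, all item parameters free by group
fit_config <- multipleGroup(sd3_all, spec, group = sample_group,
                            itemtype = "graded")

# Scalar model: slopes and intercepts (thresholds) constrained equal across
# groups, with latent means and variances freed in the focal group
fit_scalar <- multipleGroup(sd3_all, spec, group = sample_group,
                            itemtype = "graded",
                            invariance = c("slopes", "intercepts",
                                           "free_means", "free_var"))

anova(fit_config, fit_scalar)   # compare the two models (likelihood ratio, AIC/BIC)
```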
Table 1. SD3 item analyses in adult and adolescent samples: descriptive statistics and part-whole correlations (Sample 1: N = 678 adult university students; Sample 2: N = 442 adolescent high school students). For each of the 27 items, the table reports M, SD, and correlations with the Machiavellianism (Mach), narcissism (Narc), and psychopathy (Psych) totals in each sample. [Item-level values not reproduced here.]
Notes. Bold indicates item-total correlations corrected for item-total overlap. Correlation coefficients equal to, or greater than, .13 and .16 in absolute value in Sample 1 and Sample 2, respectively, are significant at the Bonferroni-corrected p-level (i.e., p < .0006). Critical values for a relevant difference between convergent validity (i.e., ri-t) coefficients and discriminant validity coefficients were .08 and .10 in Sample 1 and Sample 2, respectively. M = mean; SD = standard deviation; Mach = Machiavellianism factor; Narc = narcissism factor; Psych = psychopathy factor. Asterisks indicate items whose ri-t values were significantly greater than all discriminant validity coefficients. a p < .0006.
Unique Components
Consistent with previous research on the English version, the SD3 subscales showed substantial positive intercorrelations (see Table 2; median r = .49 and .42 in adults and adolescents, respectively). To isolate their unique components, residualized measures of narcissism, Machiavellianism, and psychopathy were created. Descriptive statistics and internal consistency estimates (i.e., Cronbach’s α values) for the Mach IV total score, FFNI-SF GN, and TriPM total score in Sample 1 and in Sample 2 are listed in Table 5. Correlations (Pearson’s r coefficients) between the residualized SD3 scale scores and their corresponding criteria are also listed in Table 5. In each sample, the nominal significance level (i.e., p < .05) was corrected according to the Bonferroni procedure and set at p < .0016. Bold highlights in Table 5 indicate convergent validity coefficients; within each row, different superscripts indicate that convergent validity coefficients were significantly different from discriminant validity coefficients according to Steiger’s z test. Disattenuating for measurement error further highlights these convergent values: in Sample 1, the convergent validity coefficients become .72, .74, and .40 for the Machiavellianism, narcissism, and psychopathy scales, respectively; in Sample 2, the disattenuated values become .68, .89, and .73 for Machiavellianism, narcissism, and psychopathy, respectively.
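The article does not spell out the residualization and disattenuation steps in code. One common implementation, sketched below with hypothetical object names, takes the residuals of each subscale regressed on the other two and then applies the standard correction for attenuation, r / sqrt(rxx * ryy); this is an approximation of the procedure described above, not the authors' exact computation.

```r
# Hypothetical names throughout: 'dat' holds the SD3 subscale scores, the
# Mach IV total, and reliability estimates computed elsewhere.

# (1) Residualize each SD3 subscale on the other two to isolate its unique part
dat$mach_resid <- resid(lm(machiavellianism ~ narcissism + psychopathy, data = dat))
dat$narc_resid <- resid(lm(narcissism ~ machiavellianism + psychopathy, data = dat))
dat$psyc_resid <- resid(lm(psychopathy ~ machiavellianism + narcissism, data = dat))

# (2) Convergent validity of the residualized Machiavellianism score, and the
#     standard correction for attenuation given the two scale reliabilities
r_conv          <- cor(dat$mach_resid, dat$mach_iv)
r_disattenuated <- r_conv / sqrt(alpha_sd3_mach * alpha_mach_iv)
```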
Table 2. SD3 descriptive statistics by gender in adult and adolescent samples. For each SD3 scale (Machiavellianism, narcissism, psychopathy), the table reports M, SD, and Cronbach's α (with the mean inter-item correlation, MIC) for the whole sample and separately for male and female participants, the effect size (Cohen's d) of the gender difference, and the scale intercorrelations, in Sample 1 (N = 678 adult university students; n = 347 male, n = 331 female) and Sample 2 (N = 442 adolescent high school students; n = 215 male, n = 226 female). [Scale-level values not reproduced here.]
Notes. ES = effect size estimate; male participants and female participants do not add up to the total number of participants in Sample 2, because one participant did not disclose his/her gender. ***p < .001.
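Table 2 reports Cronbach's α alongside the mean inter-item correlation (MIC) for each subscale. For completeness, a small base-R sketch of both quantities is given below, reusing the hypothetical mach_items data frame from the item-analysis sketch above.

```r
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the total)
cronbach_alpha <- function(items) {
  k <- ncol(items)
  item_var  <- sum(apply(items, 2, var, na.rm = TRUE))
  total_var <- var(rowSums(items), na.rm = TRUE)
  k / (k - 1) * (1 - item_var / total_var)
}

# Mean inter-item correlation: average of the off-diagonal correlations
mean_inter_item <- function(items) {
  r <- cor(items, use = "pairwise.complete.obs")
  mean(r[lower.tri(r)])
}

cronbach_alpha(mach_items)
mean_inter_item(mach_items)
```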
Table 3. Multidimensional IRT confirmatory factor analyses of the SD3 items in adult and adolescent samples

Sample 1 (N = 678, adult university students)
(a) One factor: M2* = 1,566.70***, df = 243, RMSEA2 = .090 (90% CI .085–.094), SRMSR = .084, AIC = 48,226.91, BIC = 48,837.00, SABIC = 48,408.36
(b) Three uncorrelated factors: M2* = 1,577.31***, df = 243, RMSEA2 = .090 (90% CI .086–.094), SRMSR = .168, AIC = 48,462.49, BIC = 49,072.57, SABIC = 48,643.93
(c) Three correlated factors: M2* = 1,103.04***, df = 240, RMSEA2 = .073 (90% CI .068–.077), SRMSR = .076, AIC = 47,877.95, BIC = 48,501.59, SABIC = 48,063.42

Sample 2 (N = 442, adolescent high school students)
(a) One factor: M2* = 596.90***, df = 243, RMSEA2 = .057 (90% CI .052–.063), SRMSR = .074, AIC = 34,929.96, BIC = 35,482.29, SABIC = 35,053.86
(b) Three uncorrelated factors: M2* = 519.73***, df = 243, RMSEA2 = .051 (90% CI .045–.057), SRMSR = .117, AIC = 35,001.25, BIC = 35,553.58, SABIC = 35,125.15
(c) Three correlated factors: M2* = 443.03***, df = 240, RMSEA2 = .043 (90% CI .037–.050), SRMSR = .069, AIC = 34,761.09, BIC = 35,325.70, SABIC = 34,887.75

Notes. RMSEA2 = sample bivariate root-mean-square error of approximation; CI = confidence interval; SRMSR = standardized root-mean-square residual; AIC = Akaike information criterion; BIC = Bayesian information criterion; SABIC = sample-size-adjusted Bayesian information criterion. ***p < .001.
Table 4. Multidimensional IRT confirmatory factor analyses: item discrimination and item threshold estimates in university and high school students. [Item-level estimates not reproduced here.]
Notes. G = General dark triad factor; Mach = Machiavellianism factor; Narc = Narcissism factor; Psych = Psychopathy factor; a1–a4 = item discrimination estimates on the general and specific latent dimensions; d1–d4 = item threshold (i.e., difficulty) estimates; – = item parameter fixed at zero. *p < .05; **p < .01; ***p < .001.

Table 5. Correlations of residualized SD3 scores with criterion measures in adult and adolescent samples. [Correlation coefficients not reproduced here. Criterion descriptives: Sample 1 – Mach IV M = 2.78, SD = 0.43, α = .80; FFNI-SF GN M = 101.41, SD = 20.68, α = .90; TriPM total M = 67.11, SD = 27.10, α = .93. Sample 2 – Mach IV M = 2.86, SD = 0.40, α = .63; FFNI-SF GN M = 114.96, SD = 21.95, α = .87; TriPM total M = 69.00, SD = 20.64, α = .89.]
Notes. Mach IV = Machiavellianism Inventory–Version IV; FFNI-SF = Five-Factor Narcissism Inventory–Short Form; GN = Grandiose Narcissism Scale; TriPM = Triarchic Psychopathy Measure. In each sample, the nominal significance level (i.e., p < .05) was corrected according to the Bonferroni procedure and set at p < .0025. Bold highlights convergent validity coefficients. Within each row and sample, different superscripts indicate that convergent validity coefficients were significantly different (p < .05) from discriminant validity coefficients according to Steiger’s z test. *p < .0025.

Note that our comparison of convergent and discriminant validities was hampered to some degree by the overlap of the three criterion variables. In Sample 1, the Mach IV total score correlated .33 and .23, both ps < .001, with the FFNI-SF GN and TriPM total score, respectively, whereas the FFNI-SF GN score correlated .58, p < .001, with the TriPM total score. Among adolescent high school students (i.e., Sample 2 participants), the Mach IV total score correlated .37 and .50, both ps < .001, with the FFNI-SF GN and TriPM total score, respectively; the correlation of the FFNI-SF GN score with the TriPM total score was .58, p < .001.

Discussion
The current study contributes to the literature on dark personalities in two ways. First, it provides support for the Italian translation of the Short Dark Triad (SD3) personality measure. As with the English version, our findings suggested that the Italian items tap three latent dimensions of malevolent personality corresponding to narcissism, Machiavellianism, and psychopathy. Second, ours is the first study to compare the performance of the same Dark Triad instrument in adolescent and adult participants. Although both groups were students, the younger group (Mage = 16 years) scored significantly higher than the older group (Mage = 24 years) on all three subscales. Internal consistency reliability estimates (i.e., Cronbach’s α values) were lower among adolescents (.67–.69) than among adult university students (.77–.84). Most importantly, however, the two variance–covariance matrices did not differ significantly. Hence, the overall pattern of item relationships, while not as sharply defined as for the English SD3 (i.e., Jones & Paulhus, 2014), was similar for adults and adolescents. Item-total correlations corrected for item-total overlap (i.e., item convergent validity coefficients) were significant for all items in our adult university student sample (i.e., Sample 1). However, not all findings were in
accordance with those obtained for the English version. For example, 12 of the convergent validity coefficients failed to exceed the corresponding discriminant validity coefficients (i.e., correlations with a subscale to which the item was not assigned). Nonetheless, the mean convergent correlation (.44) exceeded the mean discriminant correlation (.27). Similar considerations held for our adolescent sample, in which the majority of item convergent validity coefficients failed to exceed the corresponding discriminant validity coefficients; nonetheless, the mean convergent correlation (.32) was higher than the mean discriminant correlation (.19). On the whole, then, the Italian SD3 items appear to capture the dimension to which they were assigned. Because the three SD3 factors were designed to be oblique, cross-loadings are expected to appear in replication samples – even in the same language. In the case of an item set translated into Italian, it is difficult to say whether the higher rate of cross-loadings reflects (a) differences in culture, (b) differences in language connotation, or (c) just a matter of chance.

Factor Structure
Consistent with Jones and Paulhus (2014), our multidimensional IRT confirmatory factor analysis findings suggested the adequacy of the SD3 three-factor structure. Indeed, all fit indices suggested retention of the three-correlated-factor model as the best-fitting model in both adults and adolescents. Although the three subscales represented dissociable latent dimensions, the corresponding IRT factors were substantially intercorrelated in both samples. Although there is consensus that the overlap in the English version results from a common core, interpretations of that core range from callousness (Jones & Figueredo, 2013), honesty–humility (Visser, Book, & Volk, 2017), and antagonism (Jonason & Tost, 2010) to simply psychopathy (Glenn & Sellbom, 2015). In our adult sample (see Table 4), only 7 of 27 SD3 items failed to show substantial item discrimination parameters (i.e., a-parameter values of .90 or greater; Sharp et al., 2014) on their corresponding factors. That number rose to 12 in our adolescent sample. This finding seems to suggest that the Italian SD3 may be more appropriate for use with adults. Notably, the IRT three-factor structure showed measurement invariance across the adolescent and adult samples (thus making mean comparisons legitimate), at least according to the results of our multiple-group IRT analysis.

Nomological Network
Overall, the results of the convergent–discriminant validation supported the differentiation of the SD3 subscales. Extending previous evidence on the convergent validity of the English subscales (Jones & Paulhus, 2014), our data suggested that even the residualized Machiavellianism scale provides at least moderate convergence with the gold-standard measure of Machiavellianism (Mach IV) in both adult university students and adolescent high school students; the disattenuated value was even stronger. Confirming and extending evidence of convergence between SD3 narcissism and the NPI (Jones & Paulhus, 2014), we found that the FFNI-SF GN scale scores showed a robust association with the residualized narcissism scale score. Moreover, the SD3 narcissism scale provided adequate discriminant validity with respect to self-reports of both Machiavellian personality features (i.e., Mach IV total score) and psychopathy traits (i.e., TriPM total score). Similar results held for the two samples. Finally, the residualized psychopathy scale showed adequate convergent validity with respect to the TriPM, and satisfactory discriminant validity with respect to both FFNI-SF GN and Mach IV total scores, in both adults and adolescents. Again, the disattenuated values were even stronger. In summary, although we relied on criterion measures of narcissism and psychopathy that differed from those used to validate the English version (Jones & Paulhus, 2014), we feel that our findings on convergent–discriminant validity provide adequate support for the Italian version. Our criterion
measures – Mach IV, FFNI-SF GN scale, and TriPM – are all well-established measures. Any difficulties we found with discriminant validities may be attributable, in part, to the substantial overlap observed among the criterion measures. Although a replication study is in order, we see utility in the Italian SD3 for research as well as for screening purposes. As with the English version, the measure facilitates quick and valid assessment when time and space are limited.
Limitations
Of course, our findings should be considered in light of several limitations. Although it involved a large number of participants (N = 1,120), our research was based on volunteers, and both the adult and the adolescent samples were convenience samples rather than samples representative of the Italian population. Moreover, we relied only on student participants in both the adult and adolescent samples. This sampling procedure could limit the generalizability of our findings, and completely unknown is the degree to which the findings can be extended to clinical and forensic populations. Finally, we relied on only a limited number of criterion measures to evaluate convergent-discriminant validity. Although the FFNI, Mach IV, and TriPM are well-established measures of narcissism, Machiavellianism, and psychopathy, respectively, they all have their critics. Debates continue over the appropriate constituents of narcissism (e.g., grandiose vs. vulnerable aspects; Miller, Lynam, Hyatt, & Campbell, 2017) and psychopathy (e.g., the relevance of boldness; Carré, Mueller, Schleicher, & Jones, 2018; Lilienfeld et al., 2016). Hence, future research should include alternative criterion measures (e.g., the NPI for narcissism and the Self-Report Psychopathy Scale [Paulhus et al., 2016] for psychopathy). Moreover, our criterion measures were all self-report instruments; hence, we were not able to disentangle the effect of shared construct variance from the effect of shared method variance (Nunnally & Bernstein, 1994). Accordingly, our convergent-discriminant validity data need to be replicated using alternative modes of measurement, for example, peer ratings (Jones & Paulhus, 2014) and overt behavior (Jones & Paulhus, 2017).

Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000499
ESM 1. Text (.pdf). Additional information regarding participants, the procedure, and the data analysis.
ESM 2. Figures (.docx). Rest-score functions for SD3 items in Sample 1 and Sample 2.

References
Book, A., Visser, B. A., & Volk, A. A. (2015). Unpacking “evil”: Claimingthe core of the Dark Triad. Personality and Individual Differences, 73, 29–38. https://doi.org/10.1016/j.paid.2014.09.016 Carré, J. R., Mueller, S. M., Schleicher, K. M., & Jones, D. N. (2018). Psychopathy and deviant workplace behavior: A comparison of two psychopathy models. Journal of Personality Disorders, 32, 242–261. https://doi.org/10.1521/pedi_2017_ 31_296 Chalmers, R. P. (2012). mirt: A Multidimensional Item Response Theory package for the R environment. Journal of Statistical Software, 48, 1–29. https://doi.org/10.18637/jss.v048.i06 Christie, R., & Geis, F. (1970). Studies in Machiavellianism. New York, NY: Academic Press. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum. Dowgwillo, E. A., & Pincus, A. L. (2017). Differentiating dark triad traits within and across interpersonal circumplex surfaces. Assessment, 24, 24–44. https://doi.org/10.1177/ 1073191116643161 Drislane, L. E., Patrick, C. J., & Arsal, G. (2014). Clarifying the content coverage of differing psychopathy inventories through reference to the Triarchic Psychopathy Measure. Psychological Assessment, 26, 350–362. https://doi.org/10.1037/a0035152 Evans, L., & Tully, R. J. (2016). The Triarchic Psychopathy Measure (TriPM): Alternative to the PCL-R? Aggression and Violent Behavior, 27, 79–86. https://doi.org/10.1016/j.avb.2016.03.004 Fossati, A., Somma, A., Borroni, S., & Miller, J. D. (2018). Assessing dimensions of pathological narcissism: Psychometric properties of the short form of the Five-Factor Narcissism Inventory in a sample of Italian University Students. Journal of Personality Assessment, 100, 250–258. https://doi.org/10.1080/00223891. 2017.1324457 Furnham, A., Richards, S. C., & Paulhus, D. L. (2013). The Dark Triad of personality: A 10-year review. Social and Personality Psychology Compass, 7, 199–216. https://doi.org/10.1111/ spc3.12018 Glenn, A. L., & Sellbom, M. (2015). Theoretical and empirical concerns regarding the dark triad as a construct. Journal of Personality Disorders, 29(3), 360–377. Glover, N., Miller, J. D., Lynam, D. R., Crego, C., & Widiger, T. A. (2012). The Five-Factor Narcissism Inventory: A five-factor measure of narcissistic personality traits. Journal of Personality Assessment, 94, 500–512. https://doi.org/10.1080/ 00223891.2012.670680 Goncalves, M. K., & Campbell, L. (2014). The Dark Triad and the derogation of mating competitors. Personality and Individual Differences, 67, 42–46. Jonason, P. K., Kavanagh, P., Webster, G. D., & Fitzgerald, D. (2011). Comparing the measured and latent Dark Triad: Are three measures better than one? Journal of Methods and Measurement in the Social Sciences, 2, 28–44. Jonason, P. K., Slomski, S., & Partyka, J. (2012). The Dark Triad at work: How toxic employees get their way. Personality and Individual Differences, 52, 449–453. https://doi.org/10.1016/ j.paid.2011.11.008 Jonason, P. K., & Tost, J. (2010). I just cannot control myself: The Dark Triad and self-control. Personality and Individual Differences, 49, 611–615. https://doi.org/10.1016/j.paid.2010. 05.031
Jonason, P. K., & Webster, G. D. (2010). The Dirty Dozen: A concise measure of the Dark Triad. Psychological Assessment, 22, 420–432. https://doi.org/10.1037/a0019265 Jones, D. N., & Figueredo, A. J. (2013). The core of darkness: Uncovering the heart of the Dark Triad. European Journal of Personality, 27(6), 521–531. Jones, D. N., & Paulhus, D. L. (2009). Machiavellianism. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behavior (pp. 93–108). New York, NY: Guilford Press. Jones, D. N., & Paulhus, D. L. (2011). Differentiating the Dark Triad within the interpersonal circumplex. In L. M. Horowitz & S. Strack (Eds.), Handbook of interpersonal psychology: Theory, research, assessment, and therapeutic interventions (pp. 249–268). New York, NY: Wiley. Jones, D. N., & Paulhus, D. L. (2014). Introducing the short dark triad (SD3): A brief measure of dark personality traits. Assessment, 21, 28–41. https://doi.org/10.1177/1073191113514105 Jones, D. N., & Paulhus, D. L. (2017). Duplicity among the Dark Triad: Three faces of deceit. Journal of Personality and Social Psychology, 113, 329–342. https://doi.org/10.1037/ pspp0000139 Kerig, P. K., & Stellwagen, K. K. (2010). Roles of callousunemotional traits, narcissism, and Machiavellianism in childhood aggression. Journal of Psychopathology and Behavioral Assessment, 32, 343–352. https://doi.org/10.1007/s10862-0099168-7 Lau, K. S. L., & Marsee, M. A. (2013). Exploring narcissism, Machiavellianism, and psychopathy in youth: Examination of associations with antisocial behavior and aggression. Journal of Child and Family Studies, 22, 355–367. https://doi.org/ 10.1007/s10826-012-9586-0 Lilienfeld, S. O., Smith, S. F., Sauvigné, K. C., Patrick, C. J., Drislane, L. E., Latzman, R. D., & Krueger, R. F. (2016). Is boldness relevant to psychopathic personality? Meta-analytic relations with non-Psychopathy Checklist-based measures of psychopathy. Psychological Assessment, 28, 1172–1185. https://doi.org/10.1037/pas0000244 Miller, J. D., Few, L. R., Seibert, L. A., Watts, A., Zeichner, A., & Lynam, D. R. (2012). An examination of the Dirty Dozen measure of psychopathy: A cautionary tale about the costs of brief measures. Psychological Assessment, 24, 1048–1053. https:// doi.org/10.1037/a0028583 Miller, J. D., Lynam, D. R., Hyatt, C. S., & Campbell, W. K. (2017). Controversies in narcissism. Annual Review of Clinical Psychology, 13, 291–315. Millsap, R. E., & Yun-Tein, J. (2004). Assessing factorial invariance in ordered-categorical measures. Multivariate Behavioral Research, 39, 479–515. Neumann, C. S., Uzieblo, K., Crombez, G., & Hare, R. D. (2013). Understanding the Psychopathic Personality Inventory (PPI) in terms of the unidimensionality, orthogonality, and construct validity of PPI-I and -II. Personality Disorders: Theory, Research, and Treatment, 4, 77–79. https://doi.org/10.1037/a0027196 Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill. Patrick, C. J., & Drislane, L. E. (2015). Triarchic model of psychopathy: Origins, operationalizations, and observed linkages with personality and general psychopathology. Journal of Personality, 83, 627–643. https://doi.org/10.1111/jopy.12119 Paulhus, D. L., Neumann, C. S., & Hare, R. D. (2016). Manual for the Self-Report Psychopathy Scale (4th ed.). Toronto, Canada: Multi-Health Systems. Paulhus, D. L., & Williams, K. M. (2002). The Dark Triad of personality: Narcissism, Machiavellianism, and psychopathy.
Journal of Research in Personality, 36, 556–563. https://doi. org/10.1016/S0092-6566(02)00505-6 Raskin, R., & Hall, C. S. (1979). A Narcissistic Personality Inventory. Psychological Reports, 45, 590. https://doi.org/10.2466/ pr0.1979.45.2.590 Rauthmann, J. F. (2013). Investigating the Mach IV with item response theory and proposing the trimmed the Mach*. Journal of Personality Assessment. https://doi.org/10.1080/00223891. 2012.742905 Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696. Reise, S. P., Morizot, J., & Hays, R. D. (2007). The role of the bifactor model in resolving dimensionality issues in health outcomes measures. Quality of Life Research, 16(1), 19–31. Reise, S. P., & Rodriguez, A. (2016). Item response theory and the measurement of psychiatric constructs: Some empirical and conceptual issues and challenges. Psychological Medicine, 46, 2025–2039. https://doi.org/10.1017/S0033291716000520 Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48. https://doi.org/10.1146/annurev.clinpsy.032408. 153553 Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of item response theory (pp. 85–100). New York, NY: Springer. Schneider, T. J., McLarnon, M. J., & Carswell, J. J. (2017). Career interests, personality, and the Dark Triad. Journal of Career Assessment, 25, 338–351. https://doi.org/10.1177/ 1069072715616128 Sharp, C., Steinberg, L., Temple, J., & Newlin, E. (2014). An 11-item measure to assess borderline traits in adolescents: Refinement of the BPFSC using IRT. Personality Disorders: Theory, Research, and Treatment, 5(1), 70–78. https://doi.org/ 10.1037/per0000057 Sherman, E. D., Miller, J. D., Few, L. R., Campbell, W. K., Widiger, T. A., Crego, C., & Lynam, D. R. (2015). Development of a Short Form of the Five-Factor Narcissism Inventory: The FFNI-SF. Psychological Assessment, 27, 1110–1116. https://doi.org/ 10.1037/pas0000100 Sica, C., Drislane, L., Caudek, C., Angrilli, A., Bottesi, G., Cerea, S., & Ghisi, M. (2015). A test of the construct validity of the Triarchic Psychopathy measure in an Italian community sample. Personality and Individual Differences, 82, 163–168. https://doi.org/10.1016/j.paid.2015.03.015 Somma, A., Borroni, S., Drislane, L. E., & Fossati, A. (2016). Assessing the triarchic model of psychopathy in adolescence: Reliability and validity of the Triarchic Psychopathy Measure (TriPM) in three samples of Italian community-dwelling adolescents. Psychological Assessment, 28, e36–e43. https://doi. org/10.1037/pas0000184 van Baardewijk, Y., Vermeiren, R., Stegge, H., & Doreleijers, T. (2011). Self-reported psychopathic traits in children: Their stability and concurrent and prospective association with conduct problems and aggression. Journal of Psychopathology and Behavioral Assessment, 33, 236–245. https://doi.org/ 10.1007/s10862-010-9215-4 van der Ark, L. A. (2014). New developments in Mokken scale analysis in R. Journal of Statistical Software, 48, 1–27. Vernon, P. A., Villani, V. C., Vickers, L. C., & Harris, J. A. (2008). A behavioral genetic investigation of the Dark Triad and the Big 5. Personality and Individual Differences, 44, 445–452. https://doi. org/10.1016/j.paid.2007.09.007 Veselka, L., Giammarco, E. A., & Vernon, P. A. (2014). The Dark Triad and the seven deadly sins. Personality and Individual Differences, 67, 75–80. https://doi.org/10.1016/j.paid.2014.01.055
Visser, B. A., Book, A. S., & Volk, A. A. (2017). Is Hillary dishonest and Donald narcissistic? A HEXACO analysis of the presidential candidates’ public personas. Personality and Individual Differences, 106, 281–286.
ORCID Antonella Somma https://orcid.org/0000-0003-2982-505X
Received October 23, 2017 Revision received June 3, 2018 Accepted June 4, 2018 Published online February 19, 2019 EJPA Section/Category Personality
Antonella Somma Faculty of Psychology Vita-Salute San Raffaele University via Stamira d’Ancona 20 20127 Milan Italy somma.antonella@hsr.it
Multistudy Report
I Like Myself, I Really Do (at Least Right Now)
Development and Validation of a Brief and Revised (German-Language) Version of the State Self-Esteem Scale
Almut Rudolph,1 Michela Schröder-Abé,2 and Astrid Schütz3
1 Department of Psychology, Clinical Psychology and Psychotherapy, University of Leipzig, Germany
2 Department of Psychology, Personality Psychology and Psychological Assessment, University of Potsdam, Germany
3 Department of Psychology, Personality Psychology and Psychological Assessment, University of Bamberg, Germany
Abstract: In five studies, we evaluated the psychometric properties of a revised German version of the State Self-Esteem Scale (SSES; Heatherton & Polivy, 1991). In Study 1, the results of a confirmatory factor analysis on the original scale revealed poor model fit and poor construct validity in a student sample that resembled those in the literature; thus, a revised 15-item version was developed (i.e., the SSES-R) and thoroughly validated. Study 2 showed a valid three-factor structure (Performance, Social, and Appearance) and good internal consistency of the SSES-R. Correlations between subscales of trait and state SE empirically supported the scale’s construct validity. Temporal stability and intrapersonal sensitivity of the scale to naturally occurring events were investigated in Study 3. Intrapersonal sensitivity of the scale to experimentally induced changes in state SE was uncovered in Study 4 via social feedback (acceptance vs. rejection) and performance feedback (positive vs. negative). In Study 5, the scale’s interpersonal sensitivity was confirmed by comparing depressed and healthy individuals. Finally, the usefulness of the SSES-R was demonstrated by assessing SE instability as calculated from repeated measures of state SE. Keywords: self-esteem, state self-esteem, State Self-Esteem Scale, positive affect, negative affect
“Usually, I am contemplative and withdrawn, and I really feel inferior to others. In my class, I am afraid of what others think of me, I feel concerned about the impression I am making, and I tend to underestimate my skills. Following my last presentation, however, my professor gave me a positive evaluation that I really had not expected. In that moment, I really liked myself and was confident that I am capable of getting things done and even better than others in areas I am really interested in.” (self-description of a female student, 21 years old, when asked to describe herself in the “Who am I” task) As illustrated in the quotation above, self-esteem (SE), defined as a positive evaluation of the self (Baumeister, 1998), includes self-evaluative reactions that are provoked by particular positive or negative experiences. There are three basic ways of assessing SE (Leary & Tangney,
2003): as a trait, which refers to the way people generally feel about themselves (assumed to be relatively stable); as a state, by referring to a person’s current feelings of self-worth (assumed to vary across situations and time); and as fluctuations in state levels (as a result of situational and time-related state-level variation). Investigations of the latter two offer reliable and valid assessments of state SE. Heatherton and Polivy (1991) developed the State SelfEsteem Scale (SSES), a multidimensional 20-item scale that was designed to assess short-term fluctuations in SE by asking people to identify how they “feel at the moment.” The SSES is based on the Janis–Field Feelings of Inadequacy Scale (JFS; Janis & Field, 1959) and facilitates the assessment of state SE in three domains: Performance (e.g., “I feel confident about my abilities”; 7 items), Social (e.g., “I feel worried about what other people think of me”; 7 items), and Appearance (e.g., “I am pleased with my appearance
right now”; 6 items). The items are rated on a 5-point scale that ranges from 1 (= not at all) to 5 (= extremely). The SSES total score demonstrated a high internal consistency (α = .92) as well as temporal stability over the course of an academic semester (rtt = .72). Furthermore, Heatherton and Polivy (1991) presented evidence for the validity of the SSES by showing that the three independent factors were sensitive to change; for example, students’ Performance State SE was affected by naturally occurring and laboratory-induced failure in the academic environment such as examination grades, whereas Social State SE was affected by public failure (in the laboratory) but not private failure (on midterm examination), and Appearance State SE was relatively unaffected by academic failure in the laboratory. As of August 2017, the original publication on the SSES by Heatherton and Polivy (1991) had been cited more than 700 times in PsycINFO, and the SSES has been applied in various languages (e.g., Chinese: Fung, Lui, & Chau, 2006; Dutch: Jansen, Rijken, Heijmans, & Boeschoten, 2010; French: Bardel, Fontayne, & Colombel, 2008). Moreover, it has been repeatedly used to examine experimental manipulations in the laboratory (e.g., Kavanagh, Robins, & Ellis, 2010) and to assess SE reactions in diary studies (e.g., Zeigler-Hill & Besser, 2013). The factor structure of the SSES, however, has been comprehensively evaluated in only two studies (Chau, Thompson, Chang, & Woo, 2012; Heatherton & Polivy, 1991). In their original publication, Heatherton and Polivy (1991) conducted an exploratory factor analysis (EFA) of the SSES in an undergraduate sample (N = 428) and found a threefactor structure with all items except for two (Item 6: “I feel that others respect and admire me”; Item 10: “I feel displeased with myself”) loaded on their theoretical factor. Chau et al. (2012) investigated the psychometric properties of a Chinese version of the SSES in a clinical sample of stroke patients (N = 265) and excluded one item (Item 7: “I am dissatisfied with my weight”) due to low loadings in an EFA. Furthermore, 6 items were assigned to different subscales: Three items (Item 10: “I feel displeased with myself”; Item 15: “I feel inferior to others at this moment”; Item 20: “I am worried about looking foolish”) were assigned to the Performance factor instead of the Social factor, 1 item (Item 16: “I feel unattractive”) was assigned to the Performance factor instead of the Appearance factor, and two items (Item 1: “I feel confident about my abilities”; Item 14: “I feel confident that I understand things”) were assigned to the Appearance factor instead of the Performance factor. The internal consistency of the total scale was marginal (α = .70). Thus, the adapted Chinese version of the SSES did not demonstrate an optimal factor structure or good reliability.
To date, however, the factor structure of the SSES has not been confirmed using confirmatory factor analysis (CFA). The present research aimed to close this gap in the literature. The internal structure of the SSES was investigated in two German samples, yielding problems with the internal test structure that resembled those in the literature (Chau et al., 2012; Heatherton & Polivy, 1991). We thus decided to develop a revised 15-item version of the SSES that overcomes these problems. To ensure comparability with the original 20-item scale, we then thoroughly validated the revised scale in student, community, and clinical samples. Among other contributions, we provide findings on the sensitivity of the SSES to natural events and experimental manipulation and on its ability to discriminate between healthy and clinical samples; these findings are relevant to research on state SE in their own right and go well beyond scale validation.
Overview of the Present Studies
We conducted five studies to investigate the validity of a German version of the SSES. The German version of the SSES was based on a translation/back-translation procedure. Two experts in the field of self-esteem independently translated the items of the SSES into German to prepare a preliminary consensual translation. A bilingual native speaker of English evaluated the clarity of items and back-translated the items, which indicated a significant degree of equivalence. Minor discrepancies in wording were resolved based on consensus discussion. The comprehensibility of the scale was further evaluated with pilot testing (Schütz, 2000). In Study 1, we investigated the factor structure by applying a CFA in a student sample and developed a 15-item revised version of the SSES. In Study 2, we attempted to replicate the findings from Study 1 in a larger community sample and further investigated the internal consistency and convergent and discriminant validity of the SSES-R with relevant trait and mood measures. Study 3 investigated the temporal stability of the SSES-R over the course of an academic semester and included a field-based assessment of intrapersonal sensitivity (i.e., differential effects of naturally occurring events). In Study 4, we extended the findings of Study 3 by assessing intrapersonal sensitivity in the laboratory with the use of experimental manipulations of social feedback and performance feedback. Finally, in Study 5, we examined interpersonal sensitivity by comparing depressed and healthy individuals.
Study 1
In our first study, we aimed to evaluate the factor structure of the 20-item SSES in a CFA with three intercorrelated factors (cf. Heatherton & Polivy, 1991).1
Method
The sample comprised 227 students (148 female; Mage = 21.71, SD = 3.66) who participated in exchange for study participation credit and individually completed the German version of the SSES, with items answered on a 5-point scale (1 = not at all, 5 = extremely). AMOS and SPSS output are available in the Electronic Supplementary Material, ESM 1; data of Study 1 are available in ESM 2.
Results
The expected three-factor solution (i.e., Performance, Social, and Appearance) showed goodness-of-fit indices that did not meet the criteria for an adequate fit (Hu & Bentler, 1999): χ2(167) = 620.35, p < .001; CMIN/df = 3.72; RMSEA = .11, p < .001; SRMR = .12; CFI = .79. We then applied a stepwise examination of nonsignificant factor loadings and large modification indices to determine what caused the poor fit. We identified two items from the Social factor (Item 10: “I feel displeased with myself”; Item 15: “I feel inferior to others at this moment”) that had also shown cross-loadings in the EFA, one item from the Appearance factor (Item 7: “I am dissatisfied with my weight”) that had also shown the lowest loading on the Appearance factor in the EFA, and two items from the Performance factor (Item 5: “I feel that I am having trouble understanding things that I read”; Item 18: “I feel that I have less scholastic abilities now than others”). The majority of these items had also previously been found to show poor fit in the original SSES and in adaptations in another language (Chau et al., 2012; Heatherton & Polivy, 1991). The removal of the five items resulted in a revised version of the SSES (SSES-R) that comprised a total of 15 items, making up three subscales with five items each (see Table 1 for the items). A CFA investigating the hypothesized three-factor structure of the SSES-R showed an acceptable fit: χ2(87) = 220.71, p < .001; CMIN/df = 2.54; RMSEA = .08, p < .001; SRMR = .07; CFI = .92, and acceptable to good internal consistencies for all subscales: Performance (α = .79), Social (α = .87), Appearance (α = .84).
Discussion
The German version of the SSES did not show the expected three-factor structure. These findings parallel those of past research who found substantial cross-loadings in some of the items. Since past studies (Chau et al., 2012; Heatherton & Polivy, 1991) did not use CFA methods, it is not possible to directly compare our findings to the literature. Given substantial cross-loadings in the data by Heatherton and Polivy (1991), it is possible that the original scale would not stand a more in-depth test of factor structure using CFA. However, lacking more detailed information, language, and cultural issues cannot be completely ruled out. As a consequence, we removed items causing poor fit due to factor loadings that were not in accordance with the model. This resulted in a revised version of the SSES (SSES-R). With 15 instead of 20 items, the revised scale has the advantage of a clear factor structure, being more economic, and consisting of an identical number of items per subscale. The reduction of the scale, however, limits comparability with the original measure and thus calls for a thorough validation of the revised version which was performed in the following studies.
Study 2 In this study, we cross-validated the factor structure of the SSES-R in a larger sample and examined its internal consistency and intercorrelations with trait SE and mood measures. According to the model by Shavelson, Hubner, and Stanton (1976), self-esteem is hierarchically organized and includes evaluations of the self in specific domains (e.g., Performance, Social, Physical) that can be further differentiated into self-evaluations in specific situations. In addition, classical and more recent personality theories agree in the assumption that individual differences in states correspond to individual differences in traits (e.g., Fleeson, 2001; Mischel, 2004).With regard to convergent and discriminant validity, we thus expected higher correlations between equivalent subscales of state SE and trait SE than between measures of state SE and mood.
We also followed a suggestion by an anonymous reviewer and performed an approach that combined EFA and CFA techniques (cf. Hopwood & Donnellan, 2010). An initial examination of the data revealed that our data were suitable for factor analysis: First, all items were correlated at .3 with at least one other item. Second, Bartlett’s test of sphericity was significant, w2(190) = 2,280.01, p < .0001, and Kaiser–Meyer–Olkin measure of sampling adequacy was .89. For reasons of comparison with Heatherton and Polivy (1991), we first performed a principal-axis factor analysis with oblique rotation with extraction of components fixed to three that accounted for 50.09% of the overall variability in scores. With this procedure, we reproduced identical factor structure with similar corresponding loadings as in the original validation paper except for two items, i.e., item 10 (“I feel displeased with myself”) primarily loading on the Appearance factor (.41) instead of the Social factor (.26), and item 15 (“I feel inferior to others at this moment”) primarily loading on the Performance factor (.56) instead of the Social factor (.32).
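For readers who wish to reproduce the suitability checks and the principal-axis analysis described in the footnote, the sketch below shows one way to do so. It assumes the third-party factor_analyzer package and a pandas data frame named sses holding the 20 item responses; both are illustrative assumptions, since the original analyses were run in SPSS and AMOS.

```python
# Illustrative re-analysis sketch (assumes the `factor_analyzer` package and a
# pandas DataFrame `sses` with one column per SSES item); not the authors' code.
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

def efa_suitability_and_loadings(sses: pd.DataFrame, n_factors: int = 3):
    # Bartlett's test of sphericity: the correlation matrix should differ from an identity matrix.
    chi_square, p_value = calculate_bartlett_sphericity(sses)
    # Kaiser-Meyer-Olkin measure of sampling adequacy (values around .80 or higher are usually considered good).
    kmo_per_item, kmo_total = calculate_kmo(sses)
    # Principal-axis extraction with an oblique (oblimin) rotation, number of factors fixed to three.
    efa = FactorAnalyzer(n_factors=n_factors, method="principal", rotation="oblimin")
    efa.fit(sses)
    loadings = pd.DataFrame(efa.loadings_, index=sses.columns,
                            columns=[f"Factor{i + 1}" for i in range(n_factors)])
    return chi_square, p_value, kmo_total, loadings
```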
Table 1. Descriptive statistics and factor loadings of the SSES-R (Study 2)

Item                                         M      SD     Loading   rit
Performance State SE (M = 3.71, SD = 0.84)
  1. Confidence about abilities              3.53   1.08   .78       .66
  4. Frustration about performance (-)       3.57   1.23   .76       .62
  9. As smart as others                      3.52   1.26   .44       .35
  14. Understanding of things                4.22   0.88   .50       .39
  19. Not doing well (-)                     3.72   1.25   .75       .63
Social State SE (M = 2.93, SD = 1.11)
  2. Regarded as a success or failure (-)    2.99   1.39   .92       .70
  8. Feelings of self-consciousness (-)      3.31   1.37   .89       .66
  13. Worried about others’ thoughts (-)     2.94   1.37   .90       .65
  17. Concerns about impression (-)          2.34   1.23   .62       .37
  20. Worried about looking foolish (-)      3.06   1.44   .74       .69
Appearance State SE (M = 3.07, SD = 1.00)
  3. Body satisfaction                       2.87   1.29   .70       .51
  6. Admired by others                       2.68   1.13   .89       .55
  11. Feel good about myself                 3.03   1.28   .81       .66
  12. Pleased with appearance                3.01   1.18   .86       .62
  16. Feelings of unattractiveness (-)       3.75   1.34   .79       .63
Total State SE (M = 3.24, SD = 0.81)

Notes. N = 699, SE = self-esteem, descriptive statistics of scale means in parentheses, (-) reverse-coded, loading = the factor loading of the item on the respective subscale in the confirmatory factor analysis, rit = item-total correlation. The German items can be obtained from the authors upon request.
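To make the scoring and reliability statistics of Table 1 concrete, the following sketch computes SSES-R subscale means (after reverse-coding the items marked “(-)” above, using the original item numbering) plus Cronbach’s alpha and item-total correlations. The data frame and column names (item01, item02, ...) are illustrative assumptions, not the authors’ code.

```python
import pandas as pd

# Item keying taken from Table 1; column names such as "item01" are assumed for illustration.
SUBSCALES = {
    "performance": [1, 4, 9, 14, 19],
    "social": [2, 8, 13, 17, 20],
    "appearance": [3, 6, 11, 12, 16],
}
REVERSE_CODED = {2, 4, 8, 13, 16, 17, 19, 20}  # items marked "(-)" in Table 1

def score_sses_r(df: pd.DataFrame) -> pd.DataFrame:
    items = df.copy()
    for i in REVERSE_CODED:
        items[f"item{i:02d}"] = 6 - items[f"item{i:02d}"]  # reverse-code on the 1-5 response scale
    scores = pd.DataFrame(index=df.index)
    for name, nums in SUBSCALES.items():
        scores[name] = items[[f"item{i:02d}" for i in nums]].mean(axis=1)
    # With five items per subscale, the mean of the subscale means equals the mean of all 15 items.
    scores["total"] = scores[list(SUBSCALES)].mean(axis=1)
    return scores

def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

def item_total_correlations(items: pd.DataFrame) -> pd.Series:
    # Correlation of each item with the sum of the remaining items (cf. the rit column of Table 1).
    return pd.Series({c: items[c].corr(items.drop(columns=c).sum(axis=1)) for c in items.columns})
```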
Method Participants A total of 901 individuals consented to participate; however, data from 15.4% of them (N = 139) were excluded due to extensive missing data (i.e., more than 20%) on the SSES-R. A total of 699 participants between the ages of 18 and 67 years (455 female; Mage = 29.25, SD = 11.33) provided full data on the SSES-R. Furthermore, a subsample of 495 participants (331 female; Mage = 25.82, SD = 8.73) completed additional measures that were used in validity analyses. AMOS and SPSS output are available in ESM 1; data of Study 2 are available in ESM 3. Procedure Participants were recruited through public advertisements (e.g., bulletin boards). They were provided with a link to the Web-based survey and were offered to take part in a lottery (€25) upon completion as reimbursement. After providing informed consent, participants completed the measures described below, as well as some additional measures unrelated to the present study.
Measures
State SE was measured with the SSES-R. Trait SE was measured with the 32-item Multidimensional Self-Esteem Scale (MSES; German version: Schütz & Sellin, 2006). The MSES consists of six subscales (i.e., Self-Regard, Social Skills, Social Confidence, Performance SE, Physical Appearance, and Physical Abilities), and responses were given on a 7-point scale ranging from 1 (= not at all) to 7 (= very much) or 1 (= never) to 7 (= always). Mood was assessed with the Positive and Negative Affect Schedule (PANAS; German version: Krohne, Egloff, Kohlmann, & Tausch, 1996). Participants were asked to use a 5-point scale (1 = very slightly, 5 = very much) to rate the extent to which they were experiencing each of 20 emotions “at the moment.” Half of the items concerned negative affect (e.g., ashamed, nervous) and the other half positive affect (e.g., active, strong).
Results
Factor Structure
A second CFA supported the factor solution with items loading on only one of the three intercorrelated factors (see Table 1) and demonstrated an acceptable fit for the SSES-R (Hu & Bentler, 1999): χ2(87) = 548.16, p < .001; CMIN/df = 6.30; RMSEA = .09, p > .05; SRMR = .07; CFI = .92.
Reliability
All SSES-R subscales displayed acceptable to good internal consistencies (see Table 2).
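The approximate fit indices reported for the CFAs can be recovered from the chi-square statistics. The helper below uses the standard population-discrepancy formula for RMSEA and, as a check, reproduces the rounded values reported in Studies 1 and 2; it is an illustrative recomputation, not the AMOS output, and the CFI function needs the independence-model chi-square, which the article does not report.

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation from a model chi-square (AMOS-style formula)."""
    return math.sqrt(max((chi2 / df - 1.0) / (n - 1), 0.0))

def cfi(chi2: float, df: int, chi2_null: float, df_null: int) -> float:
    """Comparative fit index; chi2_null/df_null refer to the independence (null) model,
    which is not reported in the text, so these arguments are placeholders."""
    d_model = max(chi2 - df, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null

# Reproducing the RMSEA values reported in Studies 1 and 2 (N = 227 and N = 699):
print(round(rmsea(620.35, 167, 227), 2))  # 0.11 - original 20-item model, Study 1
print(round(rmsea(220.71, 87, 227), 2))   # 0.08 - 15-item SSES-R, Study 1
print(round(rmsea(548.16, 87, 699), 2))   # 0.09 - SSES-R, Study 2
print(round(620.35 / 167, 2))             # 3.72 - normed chi-square (CMIN/df), Study 1
```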
Validity We inspected the intercorrelations and found that the measures of state SE and trait SE were highly related (see
Table 2. Correlations, internal consistencies, and retest reliabilities of the state and trait measures of self-esteem (Studies 2 and 3)

Scale                        Performance   Social     Appearance   Total      Internal      Retest
                             State SE      State SE   State SE     State SE   consistency   reliability
State Self-Esteem (SSES-R)
  Performance State SE       –             .51**      .55**        .81**      .80           .44**
  Social State SE                          –          .48**        .83**      .87           .59**
  Appearance State SE                                 –            .83**      .88           .82**
  Total State SE                                                   –          .90           .74**
Trait Self-Esteem (MSES)
  Self-Regard                .73**         .59**      .70**        .82**      .92           .81**
  Performance Self-Esteem    .74**         .42**      .51**        .66**      .83           .69**
  Social Skills              .46**         .62**      .48**        .64**      .89           .83**
  Social Confidence          .52**         .84**      .50**        .76**      .89           .78**
  Physical Appearance        .52**         .52**      .87**        .78**      .89           .83**
  Physical Abilities         .47**         .46**      .55**        .60**      .81           .73**
  Total MSES                 .73**         .74**      .77**        .91**      .95           .87**
Mood (PANAS)
  Positive Affect            .10*          .14**      .02          .10*       .59           .45**
  Negative Affect            .17**         .13**      .34**        .26**      .55           .41**

Notes. Study 2: N = 495; Study 3: N = 63. SSES-R = Revised State Self-Esteem Scale; SE = Self-Esteem; MSES = Multidimensional Self-Esteem Scale; PANAS = Positive and Negative Affect Schedule. Correlations between corresponding subscales are printed in bold. *p < .05; **p < .01.
Table 2). The subscales of the state measure were most strongly related to the respective trait measure, confirming their convergent validity (e.g., Social State SE was most highly correlated with the Social Skills and Social Confidence subscales of the MSES, and Appearance State SE showed the highest correlations with the Physical Appearance and Physical Abilities subscales of the MSES). Correlations between state SE and mood were lower, indicating discriminant validity.
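A convergent-discriminant pattern like the one in Table 2 can be assembled directly from subscale scores. The sketch below assumes a data frame containing state SE, trait SE, and mood scale scores as columns; all column names are illustrative assumptions.

```python
import pandas as pd

def validity_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-correlate state SE subscales with trait SE and mood scales (cf. Table 2)
    and flag, for each criterion, the state SE subscale it correlates with most strongly.
    Column names are assumed for illustration only."""
    state_cols = ["performance_state", "social_state", "appearance_state", "total_state"]
    criterion_cols = ["self_regard", "performance_trait", "social_skills", "social_confidence",
                      "physical_appearance", "physical_abilities", "positive_affect", "negative_affect"]
    corr = df[criterion_cols + state_cols].corr().loc[criterion_cols, state_cols]
    corr["strongest_state_correlate"] = corr[state_cols[:3]].abs().idxmax(axis=1)
    return corr
```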
Discussion
The CFA supported the three-factor solution for the SSES-R. Moreover, convincing levels of internal consistency as well as evidence of convergent and discriminant validity were found.
Study 3
We investigated the temporal stability of the SSES-R when measured at the beginning and the end of the academic semester. Building on the results from the original scale (rtt = .48 for Performance SSE, rtt = .64 for Appearance SSE, and rtt = .72 for Social SSE), which were descriptively but not significantly higher than those found in mood measures (.43 < rtt < .56), we expected state SE to show a moderate degree of stability over the course of the semester.
Finally, we specifically investigated differential effects of naturally occurring events (i.e., at the end of the academic semester after students received their grades). A previous study showed that academic performance was related to moods that are strongly associated with self-esteem such as confidence, pride, or shame (McFarland & Ross, 1982). Assuming that the Performance State SE subscale would be more reactive to academic performance feedback, we expected grade point average to be strongly related to Performance State SE but not to the other SSES-R subscales or to mood (Heatherton & Polivy, 1991).
Method
Participants
A total of 221 students (186 female; Mage = 21.39, SD = 3.70) participated in the first wave of this study at the beginning of the semester. Four months later, at the end of the semester, a smaller subset of the sample comprising 63 students (57 female, Mage = 21.08, SD = 4.00) participated. The demographic data of participants taking part in the second wave did not differ significantly from the total sample (all ps > .11). SPSS output is available in ESM 1; data of Study 3 are available in ESM 4.
Procedure Participants completed paper-and-pencil measures of trait SE, state SE, and mood in a mass testing session in exchange for study participation credit. Sixteen weeks later,
at the end of the academic semester, the sample answered an online questionnaire to increase response rate comprising the same measures as well as a self-report of their grades. Participants reported the grades from their written examinations at the end of the semester, with the majority of students having taken seven examinations during examination period.
Measures
State SE was measured with the SSES-R, and trait SE was assessed with the MSES. Grade point average was calculated from self-reported grades, with lower scores indicating better academic performance in the German system. Grade point average was normally distributed (M = 2.31, SD = 0.62, z = 0.11, p = .07, d = 0.01).
Results
Reliability
As expected, over a 4-month interval, the SSES-R showed a moderate degree of stability, with Performance SSE being least stable (see Table 2).
Sensitivity
When analyzing the differential effects of self-reported academic performance on state SE, as expected, we found a significant correlation (one-tailed test with α = .05, tcritical(59) = 1.826, and level of baseline trait SE partialed out) between grade point average and Performance State SE, rpartial(59) = .25, p = .03, but not with any of the other SSES-R subscales or with mood, -.10 < rpartial(59) < .10, all ps > .22. These results indicate that Performance State SE, but not the other SE or mood measures, reflected recent academic outcomes.
Discussion
Over a 4-month interval, results on retest stability were comparable to those reported by Heatherton and Polivy (1991). Furthermore, the SSES-R was differentially sensitive to naturally occurring events: Performance State SE was more reactive to academic performance feedback than the other subscales were.
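The sensitivity analysis above rests on a partial correlation between grade point average and Performance State SE with baseline trait SE partialed out. A minimal sketch of that computation follows; the variable names are assumptions, and it is not the original SPSS syntax.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, covariate):
    """Correlation between the residuals of x and y after regressing each on the covariate,
    i.e., the partial correlation used in Study 3 (illustrative helper)."""
    x, y, c = (np.asarray(v, dtype=float) for v in (x, y, covariate))
    design = np.column_stack([np.ones_like(c), c])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r, _ = stats.pearsonr(res_x, res_y)
    # One-tailed p-value on n - 3 degrees of freedom (one covariate partialed out);
    # flip the sign of t for a directional hypothesis in the negative direction.
    n = len(x)
    t = r * np.sqrt((n - 3) / (1 - r ** 2))
    p_one_tailed = stats.t.sf(t, df=n - 3)
    return r, p_one_tailed
```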
Study 4
Study 3 provided initial evidence for the sensitivity of the SSES-R to variations in natural contexts. Study 4 followed up on intrapersonal sensitivity by applying experimental designs and laboratory manipulations of performance feedback (success vs. failure) and social feedback (acceptance vs. rejection). In replicating and expanding Heatherton and Polivy (1991), we expected that state SE would be sensitive to experimental manipulations used in psychological experiments. Therefore, we used the SSES-R to show that social feedback would affect the participants’ Social State SE (Study 4a) and that performance feedback would affect the participants’ Performance State SE (Study 4b).
Study 4a
Method
Participants
Eighty-six students (75 female, Mage = 22.44, SD = 4.42) participated in exchange for study participation credit in the randomized parallel group experiment. SPSS output is available in ESM 1; data of Study 4a are available in ESM 5.
Procedure
The study was approved by the ethics committee at the University of Bamberg. Before participants were individually subjected to a manipulation procedure involving social feedback, they completed a measure of trait SE. Participants received personal questions and were informed that answers would be sent electronically to another participant in the laboratory next door. In fact, there was no other participant. The feedback condition was randomly assigned, and feedback was provided after 3 min. In the acceptance condition, participants were evaluated positively, and the putative participant was reported to have expressed the desire to work with the participant. In the rejection condition, participants were evaluated negatively, and cooperation was refused. After the experimental manipulation, participants completed a manipulation check item that was rated on a 10-point scale (i.e., “How much did the other participant want to work with you?”) and measures of state SE and mood. Finally, participants were fully debriefed and were clearly informed that the feedback was part of the experimental manipulation.
Measures
State SE was measured with the SSES-R, trait SE was assessed with the MSES, and mood was assessed with the PANAS.
Results
Manipulation Check
Participants correctly identified the degree to which the putative participant wanted to work with them, t(84) = 42.33, p < .001, d = 9.17.
Social Feedback
Participants in the acceptance and rejection feedback conditions did not differ in their levels of trait SE. As expected, the results yielded significantly lower scores on Social State
Table 3. Differences in state and mood measures in experimental conditions (Studies 4a and 4b)

                         Study 4a                                                Study 4b
                         Acceptance      Rejection       t-testa                 Success         Failure         t-testa
Scale                    (N = 39) M (SD) (N = 47) M (SD) t(84)      d            (N = 37) M (SD) (N = 42) M (SD) t(77)     d
Performance State SE     4.09 (0.52)     3.89 (0.52)     1.82*      0.39         4.14 (0.59)     3.78 (0.74)     2.38*     0.54
Social State SE          3.34 (0.69)     3.03 (0.83)     1.86*      0.41         3.59 (0.91)     3.44 (0.95)     0.69      0.16
Appearance State SE      3.51 (0.63)     3.44 (0.68)     0.45       0.10         3.58 (0.70)     3.28 (0.72)     1.89*     0.43
Total State SE           3.65 (0.47)     3.46 (0.55)     1.74*      0.38         3.77 (0.59)     3.50 (0.62)     1.97*     0.44
Positive Affect          2.23 (0.30)     2.20 (0.40)     0.39       0.08         3.03 (0.81)     2.78 (0.63)     1.53      0.35
Negative Affect          2.19 (0.44)     2.06 (0.34)     1.46       0.32         1.42 (0.73)     1.39 (0.42)     0.29      0.07

Notes. NStudy 4a = 86; NStudy 4b = 79. SE = Self-Esteem; d = Cohen’s d (small effect = 0.10, medium effect = 0.30, large effect = 0.50). aOne-tailed t-test with α = .05 and tcritical(84) = 1.663 and tcritical(77) = 1.665. *p < .05.
SE after rejection feedback (medium effect; see Table 3). There was also a significant decrease in Performance State SE, which may have been due to an overlap between social and performance aspects in the feedback provided (i.e., the putative participant/confederate had refused to work with the participant, and evaluations referred to likability and intelligence). No differences between the conditions were found with respect to mood.
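The comparisons summarized in Table 3 are one-tailed independent-samples t-tests accompanied by pooled-SD Cohen’s d. A compact sketch of that analysis is shown below; it assumes two arrays of scores and is an illustrative helper rather than the original SPSS runs.

```python
import numpy as np
from scipy import stats

def one_tailed_comparison(group_a, group_b):
    """Independent-samples t-test (pooled variances, one-tailed: a > b) with Cohen's d,
    mirroring the analyses reported in Table 3 (illustrative helper, data assumed)."""
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    t, p_one_tailed = stats.ttest_ind(a, b, equal_var=True, alternative="greater")
    # Pooled-standard-deviation Cohen's d.
    n_a, n_b = len(a), len(b)
    pooled_sd = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    return t, p_one_tailed, d
```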
Study 4b
Method
Participants
In exchange for study participation credit, 79 students (55 female; Mage = 23.73, SD = 4.24) participated in the randomized parallel group experiment. SPSS output is available in ESM 1; data of Study 4b are available in ESM 6.
Procedure
The procedure in Study 4b was similar to Study 4a except that participants were subjected to a manipulation procedure involving performance feedback instead of social feedback. Participants were asked to summarize a scientific text. They were individually told that a trained research assistant would rate their summaries on style, language, and completeness. However, both the scientific text and performance feedback were part of the randomly assigned experimental manipulation. Two versions of the text that differed in length, structure, and complexity made the respective feedback more plausible. Feedback was given after participants had waited for 5 min. In the positive condition, the feedback indicated that the summary was rated as being in the 90th percentile for all students. In the negative condition, the feedback indicated that the summary was in the 10th percentile for all students. After the experimental manipulation, the participants rated a manipulation check item on a 10-point scale (i.e., “How did you perform in comparison with other students?”) and measures of state SE and mood. Finally, participants were fully debriefed and were clearly informed that the feedback was part of the experimental manipulation.
Measures
As in Study 4a, participants were assessed on state SE, trait SE, and mood with the SSES-R, MSES, and PANAS, respectively.
Results
Manipulation Check
Participants correctly identified the level at which the assumed trained assistant judged their performance, t(77) = 17.80, p < .001, d = 4.01.
Performance Feedback No differences were found between the positive and negative feedback conditions for the trait SE subscales except for the Physical Appearance Trait SE, which was significantly lower for participants in the negative feedback condition, t(77) = 2.33, p = .02, d = 0.53. As expected, participants in the negative feedback condition had significantly lower scores on Performance State SE (large effect) and lower scores on Appearance State SE (medium effect). The latter difference, however, might be attributable to the significantly lower Physical Appearance Trait SE (which occurred despite randomization) instead of the manipulation. Study 2 showed that the respective subscales of the trait SE and state SE measures were highly related. Thus, with participants in the negative feedback condition showing significantly lower levels of Appearance Trait SE, it was not surprising to find a significant difference in Appearance State SE. This finding was underscored by a nonsignificant effect yielded by an ANCOVA with Appearance State SE as the dependent variable, feedback condition as a between-subjects factor, and Physical Appearance Trait SE as a covariate, F(1, 76) = 0.06, p = .82. No differences between the two conditions were found for mood.
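The follow-up ANCOVA described above (Appearance State SE as the dependent variable, feedback condition as the between-subjects factor, Physical Appearance Trait SE as the covariate) can be sketched with statsmodels. The data frame and column names below are assumptions for illustration only.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def appearance_ancova(df: pd.DataFrame) -> pd.DataFrame:
    """One-way ANCOVA as described for Study 4b: feedback condition as factor,
    Physical Appearance Trait SE as covariate (column names are illustrative)."""
    model = smf.ols("appearance_state ~ C(condition) + appearance_trait", data=df).fit()
    # Type II sums of squares; the F test for C(condition) corresponds to the effect reported above.
    return sm.stats.anova_lm(model, typ=2)
```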
Discussion for Studies 4a and 4b
Studies 4a and 4b demonstrated the sensitivity of the SSES-R to an experimental manipulation such that negative social feedback decreased participants’ Social State SE, and negative performance feedback decreased participants’ Performance State SE.
Study 5: Interpersonal Sensitivity In Study 5, we aimed to comprehensively examine the interpersonal sensitivity of the SSES-R by comparing two samples of individuals who were expected to differ in their SE levels. To complement the previous studies that were mainly comprised of student samples, we recruited healthy individuals from a community sample and depressed individuals from a clinical sample. Patients with Major Depressive Disorder (MDD) typically report lower levels of SE than healthy individuals (Silverstone & Salsali, 2003). Depression is also associated with higher SE instability (i.e., the magnitude of fluctuations in contextually based SE; Kernis, 2003). Previous studies operationalized SE instability as the withinperson standard deviation of repeated measures of the SSES (e.g., Zeigler-Hill & Besser, 2013). Thus, we tested the usefulness of the SSES-R as a measure of state SE and SE instability and expected lower state SE and higher SE instability in the MDD sample compared with healthy controls.
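The instability index used here is simply the within-person standard deviation of repeated state SE scores. A minimal sketch is given below; it assumes a long-format data frame with one row per participant and measurement occasion, and the column names are illustrative.

```python
import pandas as pd

def se_instability(long_df: pd.DataFrame,
                   id_col: str = "participant",
                   score_col: str = "total_state_se") -> pd.Series:
    """Within-person standard deviation across the repeated SSES-R assessments
    (higher values = greater state self-esteem instability). Column names are assumed."""
    return long_df.groupby(id_col)[score_col].std(ddof=1)
```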
Method
Participants
In this cross-sectional control group design study, thirty-six patients (28 female; Mage = 40.25, SD = 11.17) with a diagnosis of MDD according to the DSM-IV-TR (APA, 2000) and 36 nonclinical controls (29 female; Mage = 36.56, SD = 10.26) participated. Exclusion criteria for the patients were a history of psychotic disorder, current mania or hypomania, and current substance-induced disorder. Nonclinical participants were excluded when they met any criteria for a current or lifetime Axis I disorder. The two groups did not differ in age, t(70) = 1.46, p = .15, d = 0.33, gender, χ2(1, 70) = 0.08, p = .77, φ = .11, or education, χ2(1, 70) = 1.35, p = .51, φ = .02 (SPSS output is available in ESM 1).
Procedure
The study was approved by the ethics committee at the Charité – University Medicine Berlin as the data presented here were part of a cooperation between our institutions. All patients were in inpatient treatment at a psychiatric clinic. The nonclinical participants were recruited from the general population using newspaper advertisements, flyers, and personal contacts. Patients and participants were given a thorough explanation of the study, and then they provided their written informed consent. Trained research assistants recruited the participants, reviewed the inclusion and exclusion criteria, and administered the measures.
Measures
The 10-item Rosenberg Self-Esteem Scale (RSES; German version: Von Collani & Herzberg, 2003) was used to assess trait SE (1 = strongly disagree, 5 = strongly agree) with higher mean scores representing higher SE. The Beck Depression Inventory (BDI; German version: Hautzinger, Bailer, Worall, & Keller, 1995) was employed to assess the severity of depressive symptoms with 21 items. Participants were asked to select statements that described how they had been feeling during the past week. Responses were coded on a 4-point scale with higher sum scores representing greater depressiveness. The sum score of the Symptom Check List-90-Revised (SCL-90-R; German version: Franke, 2002) indicates general psychopathological impairments during the last week with higher scores indicating higher psychological symptoms and distress. State SE was measured with the SSES-R. SE instability was calculated as the within-person standard deviation across five repeated measures of the SSES-R taken every other day over the course of 10 days (cf. Leary & Tangney, 2003). Higher scores indicate greater instability.
Results
Internal Consistency and Intercorrelations
All SSES-R subscales displayed satisfactory to good levels of internal consistency across all measurement occasions (Performance State SE: .77 < α < .87; Social State SE: .85 < α < .92; Appearance State SE: .79 < α < .89; Total State SE: .91 < α < .93). As expected, in both groups, depressive symptoms and general psychopathological impairment were highly correlated with the total trait SE and state SE scores, indicating that higher SE scores were associated with lower symptom severity. The correlations for the nonclinical participants were rBDI-TSE(35) = -.57, p < .001; rBDI-SSE(35) = -.49, p = .003; rSCL-90-R-TSE(35) = -.37, p = .031; rSCL-90-R-SSE(35) = -.46, p = .006. The correlations for the patients with MDD were rBDI-TSE(29) = -.79, p < .001; rBDI-SSE(29) = -.78, p < .001; rSCL-90-R-TSE(18) = -.32, p = .19; rSCL-90-R-SSE(18) = -.32, p = .001. Furthermore, trait SE and state SE were highly correlated in both groups: rcontrol(35) = .62 and rMDD(36) = .79 (all ps < .001).
Group Differences
Table 4 presents descriptive statistics and the results of the t-tests that compared the nonclinical control participants with the MDD patients. As expected, the BDI and SCL-90-R scores were significantly higher in the clinical
Table 4. Differences in state, trait, and mood measures in nonclinical control participants and in patients with Major Depressive Disorder (Study 5)

                                          HC              MDD              t-testa
                                          M (SD)          M (SD)           t(72)      d
Psychopathology and Trait Self-Esteem
  BDI                                     4.51 (4.90)     22.70 (11.05)    8.22***    2.20
  SCL-90-R                                19.54 (16.62)   95.89 (53.59)    5.90***    2.26
  RSES                                    4.31 (0.53)     2.83 (0.88)      8.68***    2.05
State Self-Esteem (SSES-R)b
  Performance State SE                    4.18 (0.53)     2.87 (0.75)      8.56***    2.02
  Social State SE                         3.53 (0.74)     2.42 (0.95)      5.49***    1.30
  Appearance State SE                     3.39 (0.62)     2.43 (0.83)      5.55***    1.31
  Total State SE                          3.70 (0.48)     2.57 (0.69)      8.02***    1.90
State Self-Esteem Instability (SSES-R)c
  State SE Instability                    0.19 (0.09)     0.25 (0.11)      2.62*      0.60

Notes. N = 72. HC = Healthy Controls, MDD = Major Depressive Disorder, BDI = Beck Depression Inventory, SCL-90-R = Symptom Checklist-90-Revised, RSES = Rosenberg Self-Esteem Scale, SSES-R = Revised State Self-Esteem Scale, SE = Self-Esteem. d = Cohen’s d (small effect = 0.10, medium effect = 0.30, large effect = 0.50). aOne-tailed t-test with α = .05 and tcritical(72) = 1.993. bMean score of the single baseline measure of the SSES-R. cWithin-person standard deviation across five repeated measures of the SSES-R over the course of 10 days. *p < .05; ***p < .001.
sample than in the nonclinical sample. In line with our hypothesis, patients with MDD had significantly lower scores on measures of trait SE and state SE than nonclinical participants. As hypothesized, we found significantly higher levels of SE instability in the clinical sample in comparison with the community sample.
Discussion
The SSES-R demonstrated interpersonal sensitivity in showing lower levels of state SE in individuals with MDD when compared with healthy individuals. Moreover, we demonstrated the validity of the SSES-R for assessing SE instability.
General Discussion
In the present studies, the factor structure of the German version of the SSES (Heatherton & Polivy, 1991) was investigated using CFA. Paralleling past findings from EFAs in US and Chinese samples (Chau et al., 2012; Heatherton & Polivy, 1991), we found that the assumed three-factor structure was not confirmed. We thus revised the SSES and developed a 15-item version with a clear three-factor structure. We administered the revised scale to replicate its factor structure and investigate its psychometric properties, known-group differences, and sensitivity to experimental and naturally occurring feedback and SE fluctuations. Taken together, the results of our studies provide converging evidence that the revised SSES (SSES-R) is a reliable and valid measure of state SE.
After the removal of five items with nonsignificant factor loadings or cross-loadings, we were able to devise a measure of state self-esteem with a clear three-factor structure. Moreover, convincing levels of internal consistency were found. Evidence of convergent and discriminant validity was provided as the equivalent subscales (i.e., Performance, Social, and Appearance) of trait SE and state SE measures were more highly related than the nonmatching scales. By contrast, measures of state SE and mood were less highly correlated, indicating the discriminant validity of the SSES-R. Furthermore, as expected, the SSES-R's stability across a 4-month semester was comparable to the values reported by Heatherton and Polivy (1991). Replicating previous studies with student samples, the retest reliabilities for Performance State SE and Performance Trait SE were lower than those for the other subscales (Heatherton & Polivy, 1991; Schütz & Sellin, 2006), indicating a certain amount of fluctuation. We also demonstrated the sensitivity of the SSES-R to both naturally occurring events (i.e., academic performance at the end of the semester) and experimental manipulation (i.e., rejection and failure feedback). Specifically, negative social feedback decreased the participants' Social State SE, and negative performance feedback decreased the participants' Performance State SE. Finally, all subscales of the SSES-R detected differences in state SE between patients with MDD and nonclinical participants (i.e., significantly lower state SE in MDD patients), thus indicating successful known-group validation. Most importantly, the SSES-R is a valid measure of SE instability as indicated by a higher level of SE instability in patients with MDD than in nonclinical participants (cf. Leary & Tangney, 2003).
The major strength of the study is that the analyses, study designs, and samples go far beyond the data presented by Heatherton and Polivy (1991). In detail, we applied experimental designs (e.g., differential social and performance feedback manipulations), conducted sophisticated statistical analyses (e.g., CFA and instability indices), and collected data in community and clinical samples. As a limitation, however, the analysis of temporal stability relied on a small sample, and a larger sample would increase the generalizability of the results. Moreover, the factor structure was examined only in healthy samples as our clinical sample was too small. Thus, a replication that tests the factor structure in a clinical sample would be helpful. Furthermore, it would be very interesting to compare different samples and conduct analyses testing the invariance of the factor structure, for example, comparing our data to the data collected by Heatherton and Polivy (1991) as well as Chau et al. (2012).
We hope that the present paper is helpful in providing a revised measure of state self-esteem that is shorter and has a clearer factor structure than the original scale. The findings may stimulate further research on the factor structure of the SSES in other languages. For example, the factor structure of the original version (Heatherton & Polivy, 1991) could be investigated using CFA. Given that the SSES has been used in hundreds of studies, this is an important question. Still, it would be premature to question the literature using the original SSES based on problematic factor structure (cf. Hopwood & Donnellan, 2010), and in fact, our results using the revised scale were very similar to those found with the original scale. If the factor structure is problematic in other language versions, too, researchers might draw on the present results for scale revisions. If the present revised version and the three-factor solution hold in other languages, researchers may build on the evidence regarding validity in diverse samples that we reported here.
References
ESM 1. Data (.pdf) SPSS and AMOS output file Study 1, Study 2, Study 3, Study 4a and 4b. ESM 2. Data (.sav) Data Study 1. ESM 3. Data (.sav) Data Study 2. ESM 4. Data (.sav) Data Study 3. ESM 5. Data (.sav) Data Study 4a. ESM 6. Data (.sav) Data Study 4b.
APA. (2000). Diagnostic and statistical manual of mental disorders. Text revision (4th ed.). Washington, DC: Author. Bardel, M.-H., Fontayne, P., & Colombel, F. (2008). The “EESES”: A French adaptation of the Sport-State Self-Esteem Scale. International Journal of Sport Psychology, 39, 77–95. Baumeister, R. F. (1998). The self. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), The handbook of social psychology (4th ed., Vol. 1, pp. 680–740). New York, NY: McGraw-Hill. Chau, J. P. C., Thompson, D. R., Chang, A. M., & Woo, J. (2012). Psychometric properties of the Chinese version of State SelfEsteem Scale: An analysis of data from a cross-sectional survey of patients in the first four months after stroke. Journal of Clinical Nursing, 21, 3268–3275. https://doi.org/10.1111/ j.1365-2702.2011.03724.x Fleeson, W. (2001). Toward a structure- and process-integrated view of personality: Traits as density distribution of states. Journal of Personality and Social Psychology, 80, 1011–1027. Franke, G. H. (2002). SCL-90-R – Die Symptom-Checkliste von Derogatis, Deutsche Version [SCL-90-R symptom checklist by Derogatis, German version]. Zeitschrift für Klinische Psychologie und Psychotherapie, 32, 333–334. Fung, L. C. L., Lui, M. H. L., & Chau, J. P. C. (2006). Relationship between post-stroke depression and self-esteem of stroke patients in Hong Kong. Journal of Clinical Nursing, 15, 505–506. Hautzinger, M., Bailer, M., Worall, H., & Keller, F. (1995). BeckDepressions-Inventar (BDI). Testhandbuch [Beck Depression Inventory manual]. Bern, Switzerland: Huber. Heatherton, T. F., & Polivy, J. (1991). Development and validation of a scale for measuring state self-esteem. Journal of Personality and Social Psychology, 60, 895–910. Hopwood, C. J., & Donnellan, M. B. (2010). How should the internal structure of personality inventories be evaluated? Personality and Social Psychological Review, 14, 332–346. https://doi.org/10.1177/1088868310361240 Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. Janis, I. L., & Field, P. B. (1959). Sex differences and factors related to persuability. In C. I. Hovland & I. L. Janis (Eds.), Personality and persuability (pp. 55–68). New Haven, CT: Yale University Press. Jansen, D. L., Rijken, M., Heijmans, M., & Boeschoten, E. W. (2010). Perceived autonomy and self-esteem in Dutch dialysis patients: The importance of illness and treatment perceptions. Psychology & Health, 25, 733–749. https://doi.org/10.1080/ 08870440902853215 Kavanagh, P. S., Robins, S. C., & Ellis, B. J. (2010). The mating sociometer: A regulatory mechanism for mating aspirations. Journal of Personality and Social Psychology, 99, 120–132. https://doi.org/10.1037/a0018188 Kernis, M. H. (2003). Toward a conceptualization of optimal selfesteem. Psychological Inquiry, 14, 1–26. Krohne, H. W., Egloff, B., Kohlmann, C.-W., & Tausch, A. (1996). Untersuchungen mit einer deutschen Version der “Positive and Negative Affect Schedule” (PANAS) [Investigations with a German version of the Positive and Negative Affect Schedule (PANAS)]. Diagnostica, 42, 139–156. Leary, M. R., & Tangney, J. P. (2003). Handbook of self and identity. New York, NY: Guilford Press. McFarland, C., & Ross, M. (1982). Impact of causal attributions on affective reactions to success and failure. Journal of Personality and Social Psychology, 43, 937–946.
Electronic Supplementary Materials The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/ 1015-5759/a000501
Mischel, W. (2004). Toward an integrative science of the person. Annual Review of Psychology, 55, 1–22. https://doi.org/ 10.1146/annurev.psych.55.042902.130709 Schütz, A. (2000). Deutsche Version der Heatherton-Polivy Zustandsselbstwertskala [German version of the Heatherton Polivy State Self-Esteem Scale]. Unpublished Manuscript, Chemnitz University of Technology. Schütz, A., & Sellin, I. (2006). Die multidimensionale Selbstwertskala (MSWS) [The Multidimensional Self-Esteem Scale]. Göttingen, Germany: Hogrefe. Shavelson, R. J., Hubner, J. J., & Stanton, G. C. (1976). Selfconcept: Validation of construct interpretations. Review of Educational Research, 46, 407–441. Silverstone, P. H., & Salsali, M. (2003). Low self-esteem and psychiatric patients: Part I – the relationship between low selfesteem and psychiatric diagnosis. Annals of General Hospital Psychiatry, 2, 2. Von Collani, G., & Herzberg, P. Y. (2003). Eine revidierte Fassung der deutschsprachigen Skala zum Selbstwertgefühl von Rosenberg [A revised Version of the German Adaptation of Rosenberg’s Self-Esteem Scale]. Zeitschrift für Differentielle und Diagnostische Psychologie, 24, 3–7. Zeigler-Hill, V., & Besser, A. (2013). A glimpse behind the mask: Facets of narcissism and feelings of self-worth. Journal of Personality Assessment, 95, 249–260. https://doi.org/10.1080/ 00223891.2012.717150
History
Received September 15, 2016
Revision received May 31, 2018
Accepted July 5, 2018
Published online December 19, 2018

EJPA Section/Category: Personality

Acknowledgments
We thank two anonymous reviewers for providing helpful comments on earlier drafts of the manuscript.

Funding
This research was supported by the German Research Foundation (DFG), grant SCHU 1459/2.

Almut Rudolph
Department of Clinical Psychology and Psychotherapy
University of Leipzig
Neumarkt 9-19
04081 Leipzig
Germany
almut.rudolph@uni-leipzig.de
Brief Report
Perfectionism in Italy and the USA: Measurement Invariance and Implications for Cross-Cultural Assessment

Sean P. M. Rice,¹ Yura Loscalzo,² Marco Giannini,² and Kenneth G. Rice³

¹ Department of Psychology, Washington State University, Vancouver, WA, USA
² Department of Health Sciences, School of Psychology, University of Florence, Firenze, Italy
³ Center for the Study of Stress, Trauma, and Resilience, Department of Counseling and Psychological Services, Georgia State University, Atlanta, GA, USA
Abstract: Perfectionism research has been recently extending its scope internationally. The short forms of the Almost Perfect Scale-Revised (APS-R; Slaney, Rice, Mobley, Trippi, & Ashby, 2001; Rice, Richardson, & Tueller, 2014) and the Multidimensional Perfectionism Scale (MPS; Cox, Enns, & Clara, 2002; Hewitt & Flett, 1990), originally validated with North American samples, have been translated for use on Italian samples. However, these tests have yet to be evaluated for measurement equivalence between the respective countries. Both scales were administered to undergraduate students in the USA (N = 336) and Italy (N = 201). Multiple group confirmatory factor analyses supported partial scalar invariance for both scales, indicating functional equivalence across cultures. Italian students reported lower levels of perfectionistic strivings. No meaningful differences in perfectionistic concerns were found between countries. Further study is needed to assess why some items and factors may differ between Italians and Americans. Keywords: perfectionism, cross-cultural, measurement invariance
Although predominately studied in English-speaking countries, perfectionism research has more recently begun to extend its reach across countries, languages, and cultures. However, psychometric evaluation should occur when using a perfectionism scale in one country that was developed in a different country or language. Specifically, the equivalence of item-factor loadings and item intercepts between groups is often used to support the adequacy of measures used in cross-cultural comparisons (Chen, 2008). The present study focuses on analyzing the consistency of the structure of perfectionism between Italian and American college students. To date, no self-report scale of perfectionism has been evaluated for measurement equivalence between Italians and Americans. In order to assess meaningful differences in a construct between cultures, we must first confirm that the scale itself functions in the same way among each group (Milfont & Fischer, 2010), because people who speak different languages or are part of different cultures may interpret self-report items in idiosyncratic ways (Chen, 2008). The present study is the first to report on the psychometric equivalence of two measures of perfectionism, the Short Multidimensional Perfectionism Scale (SMPS; Cox, Enns, & Clara, 2002) and the Short Almost Perfect Scale (SAPS; Rice, Richardson, & Tueller, 2014). The scales provide indicators
of two main perfectionism constructs: perfectionistic strivings, as represented by the Standards factor within the SAPS and the Self-Oriented Perfectionism factor within the SMPS, and perfectionistic concerns, as represented by the Discrepancy factor within the SAPS and the Socially-Prescribed Perfectionism factor within the SMPS (Stoeber, 2017). Other-Oriented Perfectionism, as measured by the SMPS, is an idiosyncratic sub-construct of perfectionism, in that it represents neither perfectionistic strivings nor concerns, but rather is associated with narcissistic personality (Stoeber, 2017). These scales were selected for evaluation due to application of their measured constructs in other cultures (e.g., Arana, Rice, & Ashby, 2017; Ghisi, Chiri, Marchetti, Sanavio, & Sica, 2010), as well as their consistency in factor measurement (Stoeber, 2017). With no reason to expect otherwise, we hypothesized that both scales would demonstrate strong measurement invariance between individuals from the USA and Italy. We also explored between-country factor mean differences for each perfectionism subscale. Both the USA and Italy are western, industrialized countries, with comparable levels of conscientiousness (McCrae & Terracciano, 2005). Because perfectionistic strivings might reflect conscientiousness (Stoeber, 2017), we had no reason to expect factor-level differences on strivings. McCrae and Terracciano (2005) found that,
compared with the USA, Italian college students had somewhat higher levels of Neuroticism, which has been strongly correlated with perfectionistic concerns (Rice et al., 2014). Thus, if factor differences in perfectionistic concerns were found, we anticipated a small effect size consistent with slightly higher levels for Italians than Americans.
Method

Participants and Procedure
The American sample included 336 undergraduate psychology students (66% women) at a university in the southeast USA. Ages ranged from 18–64 years (M = 24.0, SD = 6.6). They were part of an undergraduate research participation pool and received research credit for study participation. The Italian sample included 201 undergraduate students (73% women) from central Italian universities. Ages ranged from 18–58 years (M = 24.6, SD = 5.3). Italian students participated without additional incentive. Both samples represented a diverse set of majors. Surveys were administered online through web-based survey tools. Data collection was approved by Institutional Review Boards at each respective school.
Measures

SMPS
The SMPS (Cox et al., 2002) consists of 15 items from the original 45-item Multidimensional Perfectionism Scale (Hewitt & Flett, 1990). It measures three perfectionism factors: Self-Oriented Perfectionism (SOP; an individual’s own perfectionistic tendencies), Other-Oriented Perfectionism (OOP; perfectionistic expectations of others), and Socially-Prescribed Perfectionism (SPP; perception that others expect perfection). Participants respond on a 7-point scale (ranging from “agree” to “disagree”). The Italian translation of the MPS (Sica, 2004) has shown adequate reliability and validity (Ghisi et al., 2010). Although the Italian translated SMPS has not yet been formally evaluated, it follows that it would likely also possess adequate psychometrics because it uses the same, albeit fewer, items from the original MPS and the same factor structure has previously been supported (Cox et al., 2002). In the USA sample, Raykov’s reliability coefficient ρ was .82 for SOP, .76 for OOP, and .79 for SPP, and in the Italian sample, ρ was .85 for SOP, .71 for OOP, and .76 for SPP.

SAPS
The SAPS (Rice et al., 2014) is an 8-item scale of two perfectionism factors: Standards (high-performance goals) and Discrepancy (perceived gap between performance
and expectations). Participants respond using a 7-point scale, ranging from “strongly agree” to “strongly disagree.” The SAPS is a subset of items from the Almost Perfect Scale-Revised (APS-R; Slaney, Rice, Mobley, Trippi, & Ashby, 2001) and has shown similar psychometric properties to its parent measure (Rice et al., 2014). In the current USA sample, ρ was .91 for Standards and .84 for Discrepancy. In the Italian sample, ρ was .72 for Standards and .84 for Discrepancy. For both samples in the current study, positive correlations were observed between indicators of the respective striving and concerns factors (see Table 5 in Electronic Supplementary Material, ESM 1).
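For readers who want to see how the reliability coefficients reported above are obtained, the following minimal sketch computes Raykov's composite reliability for a congeneric factor from standardized loadings, assuming uncorrelated residuals. The item loadings are hypothetical placeholders; the article reports only the resulting coefficients, not the item-level estimates.

```python
# Sketch: Raykov's composite reliability (rho) for a congeneric factor,
# rho = (sum of loadings)^2 / ((sum of loadings)^2 + sum of residual variances).
# Loadings below are hypothetical; only the resulting rho values are reported
# in the article (e.g., rho = .91 for SAPS Standards in the USA sample).

def raykov_rho(loadings, residual_variances):
    total_loading = sum(loadings)
    return total_loading ** 2 / (total_loading ** 2 + sum(residual_variances))

# Hypothetical standardized loadings for a 4-item factor such as SAPS Standards.
loadings = [0.85, 0.82, 0.88, 0.80]
residuals = [1 - l ** 2 for l in loadings]  # standardized, uncorrelated residuals

print(round(raykov_rho(loadings, residuals), 2))
```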
Statistical Analyses
Analyses were conducted using Mplus version 7.4 (Muthén & Muthén, 1998–2015). Confirmatory factor analyses (CFAs) were first conducted on each scale within each sample separately in order to evaluate model fit. Based on recommendations by Hu and Bentler (1999), at least two fit indices were evaluated for acceptability. Measurement invariance was then assessed using multiple group CFAs, with maximum likelihood robust estimation (Brown, 2015). In separate steps, both scales were assessed for parameter pattern similarity (configural invariance), equality of factor loadings (metric invariance), and equality of intercepts (scalar invariance). Although change in the Comparative Fit Index (CFI) is often used as a criterion for invariance testing, it may be less effective under some conditions (Kang, McNeish, & Hancock, 2016). Therefore, McDonald’s Noncentrality Index (NCI; McDonald & Marsh, 1990) was used to evaluate invariance, in conjunction with CFI and the standardized root mean square residual (SRMR). Invariance was supported if at least two of the reported fit indices met an acceptable cutoff: .01 for NCI (Kang et al., 2016), .002 for CFI (Meade, Johnson, & Braddy, 2008), and between .01 and .03 for SRMR (Cheung & Rensvold, 2002). In instances of failed invariance, partial invariance models can be explored to locate sources of non-invariance. If partial scalar invariance was supported, latent factor differences between the USA and Italian samples were also evaluated. The effects coding approach, which uses the “optimally weighted average” of indicator means to estimate factor means, was used to evaluate mean differences (Little, Slegers, & Card, 2006, p. 63; see ESM 2). This method reduces the potential bias in arbitrarily assigning a marker indicator or assigning a fixed factor variance (Little et al., 2006).
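A minimal sketch of the decision rule just described, assuming McDonald's NCI is computed as exp(-(χ² - df) / (2N)) from the combined sample size; this assumption reproduces the NCI values reported in Table 1. The function names and the two-of-three check are our illustration, not Mplus syntax, and small rounding differences from the published Δ values are possible.

```python
from math import exp

# Sketch of the invariance decision rule: the more constrained model is retained
# if at least two of Delta-NCI, Delta-CFI, and Delta-SRMR stay within the stated
# cutoffs. NCI = exp(-(chi2 - df) / (2 * N)) reproduces Table 1 (e.g., SMPS
# configural: chi2 = 362.05, df = 172, N = 336 + 201 -> NCI = .838).

def mcdonald_nci(chi2, df, n):
    return exp(-(chi2 - df) / (2 * n))

def invariance_supported(base, nested, n, srmr_cutoff):
    """base/nested = (chi2, df, CFI, SRMR); returns decision plus the Delta indices."""
    d_nci = round(mcdonald_nci(base[0], base[1], n) - mcdonald_nci(nested[0], nested[1], n), 3)
    d_cfi = round(base[2] - nested[2], 3)
    d_srmr = round(nested[3] - base[3], 3)
    passed = sum([d_nci <= .01, d_cfi <= .002, d_srmr <= srmr_cutoff])
    return passed >= 2, d_nci, d_cfi, d_srmr

N = 537  # 336 (USA) + 201 (Italy)
configural = (362.05, 172, .900, .083)   # SMPS values from Table 1
metric     = (382.51, 184, .898, .087)
scalar     = (609.36, 196, .787, .105)

print(invariance_supported(configural, metric, N, srmr_cutoff=.03))  # metric supported
print(invariance_supported(metric, scalar, N, srmr_cutoff=.01))      # scalar not supported
```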
Results
Within the USA sample, the SAPS demonstrated good fit, but the SMPS did not (e.g., CFI = .882; SRMR = .085).
Table 1. Invariance analyses and Model Fit Indices of the SMPS and SAPS

Model | χ² | df | NCI | CFI | RMSEA (90% CI) | SRMR | Δχ²ᵃ | Δdf | p | ΔNCI | ΔCFI | ΔSRMR

SMPS
Configural invariance | 362.05 | 172 | .838 | .900 | .064 [.055, .073] | .083 | – | – | – | – | – | –
Italy | 155.88 | 87 | .938 | .904 | .063 [.047, .078] | .084 | – | – | – | – | – | –
USAᵇ | 204.11 | 85 | .900 | .900 | .065 [.053, .076] | .083 | – | – | – | – | – | –
Metric invariance | 382.51 | 184 | .831 | .898 | .063 [.054, .072] | .087 | 20.33 | 12 | .061 | .007 | .002 | .004
Scalar invariance | 609.36 | 196 | .681 | .787 | .089 [.081, .097] | .105 | 295.54 | 12 | <.001 | .150 | .111 | .018
Partial scalar invarianceᶜ | 401.15 | 190 | .822 | .891 | .064 [.056, .073] | .087 | 19.78 | 6 | .003 | .009 | .007 | .000

SAPS
Configural invariance | 101.21 | 36 | .941 | .951 | .082 [.063, .101] | .062 | – | – | – | – | – | –
Italyᵈ | 58.95 | 17 | .962 | .899 | .111 [.081, .142] | .076 | – | – | – | – | – | –
USA | 44.60 | 19 | .976 | .970 | .063 [.039, .088] | .052 | – | – | – | – | – | –
Metric invariance | 129.57 | 42 | .922 | .934 | .088 [.071, .106] | .089 | 28.45 | 6 | <.001 | .019 | .017 | .027
Partial metric invarianceᵉ | 118.05 | 41 | .931 | .942 | .084 [.066, .102] | .082 | 16.81 | 5 | .005 | .010 | .009 | .020
Scalar invariance | 161.79 | 47 | .899 | .914 | .095 [.080, .112] | .092 | 48.89 | 6 | <.001 | .032 | .028 | .010
Partial scalar invarianceᶠ | 132.63 | 45 | .922 | .934 | .085 [.069, .102] | .087 | 14.95 | 4 | .005 | .009 | .008 | .005

Notes. SMPS = Short Multidimensional Perfectionism Scale; SAPS = Short Almost Perfect Scale. All χ² values were significant, p < .001. ᵃΔχ² values were calculated with the Yuan-Bentler scaling correction for MLR estimation. ᵇTwo residuals correlated. ᶜSix freed intercepts. ᵈTwo residuals correlated. ᵉOne freed loading. ᶠTwo freed intercepts.
Allowing two SMPS item residuals to be correlated produced adequate fit. Conversely, within the Italian sample, we found adequate fit for the SMPS, but not for the SAPS (e.g., CFI = .861; SRMR = .089). Again, allowing two correlated residuals produced a model sufficient for further invariance testing. Correlated residuals were included within their respective sample in subsequent invariance tests (see Table 1 for revised fit statistics of each model). Adequate fit for the configural (unconstrained) model was found for the SAPS and the SMPS (see Table 1 in ESM 1 for item characteristics). Results supported metric invariance for the SMPS (e.g., ΔNCI = .007; ΔSRMR = .004), but not scalar invariance (ΔNCI = .150; ΔSRMR = .018). Modification indices indicated that 6 of the 15 item intercepts should be freely estimated. Consecutively freeing one Self-Oriented Perfectionism (SOP) item, one Socially-Prescribed Perfectionism (SPP) item, and four Other-Oriented Perfectionism (OOP) items, individually, resulted in partial scalar invariance for the SMPS (ΔNCI = .009; ΔSRMR = .000). However, the test failed according to the CFI (Δ = .007). For the SAPS, partial metric invariance (e.g., ΔNCI = .010; ΔSRMR = .020) was achieved by freeing one factor loading within the Discrepancy factor (Note: ΔCFI = .009). Partial scalar invariance for the SAPS (e.g., ΔNCI = .009; ΔSRMR = .005) was achieved by consecutively freeing two Discrepancy item intercepts (Note: ΔCFI = .008) (see ESM 1: Table 2 for fit statistics of each consecutive model and Table 3 for descriptions of non-invariant items. See ESM 2 for invariance outputs).
Italians had significantly lower levels of Standards, Cohen’s d = .27, 95% CI [.07, .46], and SOP, d = .29, 95% CI [.07, .50], than Americans on average. There were no meaningful differences in Discrepancy, d = .16, 95% CI [−.02, .48], or SPP, d = .05, 95% CI [−.15, .27], between groups (see Table 4 in ESM 1). The OOP factor means were not compared due to a majority of the indicator intercepts being freely estimated, therefore considered non-invariant (see Vandenberg & Lance, 2000, p. 38).
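As an illustration of how a latent mean difference translates into the effect sizes reported here, the following sketch converts factor means and variances estimated under effects coding into Cohen's d using the pooled latent standard deviation. This is one common way to express such a difference, and the numerical inputs are hypothetical placeholders; the article reports only the resulting d values.

```python
from math import sqrt

# Sketch: Cohen's d from latent (factor) means and variances, pooled across groups.
# All input values below are hypothetical; only the resulting effect sizes
# (e.g., d = .27 for Standards) are reported in the article.

def cohens_d(mean_1, var_1, n_1, mean_2, var_2, n_2):
    pooled_var = ((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / sqrt(pooled_var)

# Hypothetical latent Standards means/variances under effects coding.
print(round(cohens_d(mean_1=5.90, var_1=0.80, n_1=336,     # USA
                     mean_2=5.65, var_2=0.85, n_2=201), 2))  # Italy
```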
Discussion
To our knowledge, psychometric properties of the Short Almost Perfect Scale (SAPS) and the Short Multidimensional Perfectionism Scale (SMPS) have not before been evaluated for equivalence between the USA and Italy. We aimed to report novel and practical information on the use of two major perfectionism scales across cultures, in line with recent efforts to improve analytical rigor in cross-cultural assessment (e.g., Arana et al., 2017). We found preliminary support for cultural comparisons of perfectionism using either the SAPS or the SMPS. However, Other-Oriented Perfectionism was not invariant, and therefore, ambiguity remains with its interpretation in Italian samples. Results provide support for partial metric and scalar invariance for both scales. As such, the scales may represent relatively functionally equivalent measurements of perfectionism for American and Italian samples. Further study is needed to assess the possible translational and
conceptual differences in the non-invariant items before more thorough recommendations for measuring Other-Oriented Perfectionism can be made.

Because partial scalar invariance was established, we were able to compare factor means. American students had higher levels of perfectionistic strivings (Standards and Self-Oriented Perfectionism) than Italian students, indicating that they may be more inclined to set higher expectations for themselves and be driven by personal goals. However, effect sizes for these differences were small. Finally, the absence of meaningful group differences in Discrepancy or Socially-Prescribed Perfectionism indicates that the samples had similar levels of perfectionistic concerns (contrary to our hypotheses). Although both countries could be considered individualistic, Americans have presented higher levels of individualism (Hofstede, 2001). Perhaps these differences in individualism help explain the disparity found regarding strivings. This is obviously speculative because it is unknown how these constructs specifically relate. Nevertheless, individualism is described as promoting one’s goals and desires above others’ (Wagner & Moch, 1986), which seems to have conceptual overlap with strivings.

Some limitations should be considered. Samples larger than 200 are recommended as a lower bound for invariance testing, and our Italian sample only just met this threshold. However, because the scales were tested separately in each sample, each tested model had a relatively simple factor structure. As such, the sample sizes in the present study were acceptable. We also allowed correlated residuals within the models. Modification indices and item phrasing were evaluated to assess potential nonrandom relationships. Due to face similarity in certain items, we allowed their residuals to correlate (Brown, 2015), although some strongly advise against such procedures (Hermida, 2015).

We aimed to replicate the validated factor structure of the scales, and our results suggest that a lower-order model provides adequate fit. However, it should be noted that this may not be the best representation of the construct. Recent research has also provided evidence of a bifactor structure to perfectionism (see Gäde, Schermelleh-Engel, & Klein, 2017), which may be a more accurate representation of perfectionism than the lower-order model presented. Furthermore, a bifactor model may indicate to what extent items relate to their respective perfectionism factors (i.e., strivings, concerns, and a general factor). Using such a model, we may be able to directly compare scales such as the SMPS and SAPS in their abilities to represent perfectionistic personality.

Finally, there is debate over which fit indices and cutoffs should be used in invariance testing (e.g., Kang et al., 2016; Meade et al., 2008). Many of our models met acceptable cutoffs for at least two indices, but may not have met acceptable cutoffs for others. As such, model fit results in
future studies with different samples are likely to be of value, in addition to any substantive findings from future cross-cultural comparisons involving perfectionism. Future research should evaluate differences in potentially adaptive and maladaptive implications of perfectionism by examining associations with outcomes such as depression (e.g., Arana et al., 2017) and academic performance (e.g., Rice et al., 2014). Associations between other personality constructs (e.g., conscientiousness; individualism) and perfectionism should also be assessed, to inform theoretical consistency of these constructs across countries. Finally, both lower- and higher-order factor structures, such as bifactor models, should be assessed to evaluate the cross-cultural extent of such models.

Electronic Supplementary Materials
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000476
ESM 1. Tables (.pdf). Tables describing item characteristics (Table 1), consecutive partial invariance models (Table 2), a review of adjusted items (Table 3), factor mean comparisons (Table 4), and factor correlations (Table 5).
ESM 2. Data (.pdf). Original Mplus output for invariance analyses.
References Arana, F. G., Rice, K. G., & Ashby, J. S. (2017). Perfectionism in Argentina and the United States: Measurement structure, invariance, and implications for depression. Journal of Personality Assessment, 100, 219–230. https://doi.org/10.1080/ 00223891.2017.1296845 Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). New York, NY: Guilford Press. Chen, F. F. (2008). What happens if we compare chopsticks with forks? The impact of making inappropriate comparisons in cross-cultural research. Journal of Personality and Social Psychology, 95, 1005–1018. Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-offit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. Cox, B. J., Enns, M. W., & Clara, I. P. (2002). The multidimensional structure of perfectionism in clinically distressed and college student samples. Psychological Assessment, 14, 365–373. Gäde, J. C., Schermelleh-Engel, K., & Klein, A. G. (2017). Disentangling the common variance of perfectionistic strivings and perfectionistic concerns: A bifactor model of perfectionism. Frontiers in Psychology, 8, 1–13. https://doi.org/10.3389/ fpsyg.2017.00160 Ghisi, M., Chiri, L. R., Marchetti, I., Sanavio, E., & Sica, C. (2010). In search of specificity: “Not just right experiences” and obsessive-compulsive symptoms in non-clinical and clinical Italian individuals. Journal of Anxiety Disorders, 24, 879–886.
Hermida, R. (2015). The problem of allowing correlated errors in structural equation modeling: Concerns and considerations. Computational Methods in Social Sciences, 3, 5–17. Hewitt, P. L., & Flett, G. L. (1990). Perfectionism and depression: A multidimensional analysis. Journal of Social Behavior and Personality, 5, 423–438. Hofstede, G. (2001). Culture’s consequences: Comparing values, behaviors, institutions, and organizations across nations. Thousand Oaks, CA: Sage. Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. Kang, Y., McNeish, D. M., & Hancock, G. R. (2016). The role of measurement quality on practical guidelines for assessing measurement and structural invariance. Educational and Psychological Measurement, 76, 533–561. Little, T. D., Slegers, D. W., & Card, N. A. (2006). A non-arbitrary method of identifying and scaling latent variables in SEM and MACS models. Structural Equation Modeling, 13, 59–72. McCrae, R. R., & Terracciano, A. (2005). Universal features of personality traits from the observer’s perspective: Data from 50 cultures. Journal of Personality and Social Psychology, 88, 547–561. McDonald, R. P., & Marsh, H. W. (1990). Choosing a multivariate model: Noncentrality and goodness of fit. Psychological Bulletin, 107, 247–255. Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592. Milfont, T. L., & Fischer, R. (2010). Testing measurement invariance across groups: Applications in cross-cultural research. International Journal of Psychological Research, 3, 111–121. Muthén, L. K., & Muthén, B. O. (1998–2015). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén.
Rice, K. G., Richardson, C. M., & Tueller, S. (2014). The short form of the revised Almost Perfect Scale. Journal of Personality Assessment, 96, 368–379. Sica, C. (2004). The Italian version of questionnaires for the OCCWG cross-cultural project. Description and psychometric properties Unpublished manuscript, University of Florence, Firenze, Italy. Slaney, R. B., Rice, K. G., Mobley, M., Trippi, J., & Ashby, J. S. (2001). The revised Almost Perfect Scale. Measurement and Evaluation in Counseling and Development, 34, 130–145. Stoeber, J. (2017). The psychology of perfectionism: Theory, research, applications. London, UK: Routledge. Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. Wagner, J. A. III, & Moch, M. K. (1986). Individualism-collectivism: Concept and measure. Group and Organization Studies, 11, 280–303. Received August 5, 2017 Revision received December 8, 2017 Accepted December 10, 2017 Published online June 15, 2018 Sean P. M. Rice Department of Psychology Washington State University, Vancouver 14204 NE Salmon Creek Ave Vancouver, WA, 98686 USA sean.rice@wsu.edu
Brief Report
Reexamining the Factorial Validity of the 16-Item Scale Measuring Need for Cognition

Ying Zhang,¹ Eric Klopp,² Heike Dietrich,³ Roland Brünken,² Ulrike-Marie Krause,⁴ Birgit Spinath,³ Robin Stark,² and Frank M. Spinath¹

¹ Department of Psychology, Saarland University, Saarbrücken, Germany
² Department of Educational Research, Saarland University, Saarbrücken, Germany
³ Department of Psychology, Heidelberg University, Germany
⁴ Department of Education, University of Oldenburg, Germany
Abstract: A growing body of studies has emphasized the need to consider method effects due to positively and negatively worded items for a better understanding of the factorial structure of psychological constructs. In particular, several researchers identified such method factors besides the content factor for various scales measuring Need for Cognition (NFC). However, regarding the factorial validity of the 16-item NFC scale developed by Bless, Wänke, Bohner, Fellhauer, and Schwartz (1994), only a one-factor structure without the inclusion of possible method factors has been examined so far. Therefore, we considered such method factors in a broader reexamination of the factorial validity of this measure by investigating a range of structural models in two samples (n = 830, n = 500). We found that a one-factor solution as proposed by Bertrams and Dickhäuser (2010) and Bless et al. (1994) did not fit the data, whereas the inclusion of method factors improved the model fit significantly. According to our results, the model including both the content factor and two uncorrelated method factors yielded the best model fit. In sum, our results provide an extended view of the factorial validity of the 16-item scale of NFC. Keywords: need for cognition, measurement, factorial validity, item polarity
Need for Cognition (NFC) as a trait captures individual differences in the intrinsic tendency to engage in and enjoy cognitively challenging activities (Cacioppo & Petty, 1982). In particular, individuals scoring higher in NFC are more intrinsically motivated to engage in and enjoy effortful cognitive challenges (Cacioppo & Petty, 1982; Cacioppo, Petty, & Kao, 1984) since they process information more actively, conscientiously, and analytically (Cacioppo, Petty, Feinstein, & Jarvis, 1996). In contrast, individuals scoring lower in NFC spend less effort to process information since they prefer superficial and heuristic cognitive processing (Cacioppo et al., 1996). As a construct positively related to fluid intelligence (Fleischhauer et al., 2010), NFC has become a widely investigated construct in a range of research fields (see Cacioppo et al., 1996, for an overview). Accordingly, NFC scales providing adequate factorial validity are an important prerequisite for solid conclusions from empirical investigations of this construct.
Factorial Validity of NFC Measures
Cacioppo and Petty (1982) and Cacioppo et al. (1984) originally demonstrated the unidimensionality of the first established NFC scales comprising 34 items (hereafter: NFC34) and 18 items (hereafter: NFC18), respectively. Both scales have been translated into many languages (Forsterlee & Ho, 1999). Particularly for German-speaking samples, Bless, Wänke, Bohner, Fellhauer, and Schwartz (1994) developed the German adaptations including 33 (hereafter: NFC33) and 16 items (hereafter: NFC16), respectively. Consistent with the results of the original English language scales from Cacioppo and Petty (1982) and Cacioppo et al. (1984), Bertrams and Dickhäuser (2010) confirmed the unidimensionality of the NFC33 and the NFC16. Against the background that researchers are becoming aware of method effects due to item wording in the investigation of factorial validity of psychological scales (e.g., Rauch, Schweizer, & Moosbrugger, 2007; Tomás & Oliver,
1999), selected studies have extended previous investigations by demonstrating the need to consider method factors for a better understanding of the factorial validity of NFC scales. For example, Forsterlee and Ho (1999) demonstrated method effects due to negatively worded items of the NFC18. Moreover, Bors, Vigneau, and Lalande (2006) identified two uncorrelated method factors reflecting negatively and positively worded items besides the content factor of the NFC18, whereas Preckel (2014) demonstrated the same trait-method model for a German NFC scale for young adolescents. In general, scales including positively and negatively worded items often provide reduced psychometric quality (Schriesheim & Hill, 1981). However, the inclusion of often-neglected method factors serves to improve the factorial and internal validity of such scales (Quilty, Oakman, & Risko, 2006). Nonetheless, in most of the studies related to NFC, researchers ignored method factors of NFC measures by calculating sum scores or applying unidimensional modeling although such calculations are susceptible to imprecise measurement and therefore potentially misleading conclusions (Quilty et al., 2006). As an example, Bors et al. (2006) demonstrated that the relation of NFC to verbal intelligence was overestimated by Cacioppo et al. (1996) since the method factor referring to negatively worded items of the NFC18 correlated to the same extent with vocabulary as the content factor. Accordingly, the relations of NFC to other constructs may be underestimated if possible method factors showed correlations with these constructs in the opposite direction as the content factor.
The Current Study
To our knowledge, no study has investigated the factorial validity of the NFC16, a widely established short scale for German-speaking adults, while considering method factors. Therefore, we conducted the current study to provide an extended reexamination of the factorial validity of the NFC16 drawing on the studies by Bless et al. (1994) and Bertrams and Dickhäuser (2010). Based on the findings by Bors et al. (2006) and Preckel (2014), we expected that structural models including method factors provide better model fits than the trait model (Hypothesis 1). Moreover, based on the findings by Fleischhauer et al. (2010), we expected a moderate, positive correlation between the NFC16 and fluid intelligence (Hypothesis 2).
Method
Participants
First, we investigated the factorial validity of the NFC16 in a sample comprising 830 university students (76% females; M = 21.87 years, SD = 3.67 years). Second, we repeated the investigation in an additional sample comprising 500 participants (73% females; M = 30.10 years; SD = 11.83 years; educational level ranges from “without qualification” to “doctoral degree”). Third, we investigated the relation of the NFC16 to fluid intelligence in a subsample comprising 746 university students (75% females; M = 21.77 years; SD = 3.65 years; 92% native German speakers).
Measures

Need for Cognition
Need for Cognition (NFC) was assessed with the NFC16 (Bless et al., 1994) on a 5-point Likert scale.

Fluid Intelligence
Fluid intelligence was assessed with three subscales comprising analogies, numerical series, and matrices of the Intelligence Structure Test 2000 R (I-S-T 2000 R; Amthauer, Brocke, Liepmann, & Beauducel, 2001).
Data Analysis
We conducted confirmatory factor analyses employing maximum likelihood estimation to investigate the factor structure. Based on the previous studies (Bertrams & Dickhäuser, 2010; Bors et al., 2006; Forsterlee & Ho, 1999; Preckel, 2014), we selected the following five models: (1) a trait model including solely the content factor; (2) a method model including two correlated method factors reflecting negatively and positively worded items, respectively, but no content factor; (3) trait-method model A including the content factor and one method factor reflecting positively worded items; (4) trait-method model B including the content factor and one method factor reflecting negatively worded items; (5) trait-method model C including the content factor and two uncorrelated method factors. We evaluated model fit by the ratio of the chi-square (χ²) and the degrees of freedom (df), the root mean square error of approximation (RMSEA), the comparative fit index (CFI), and the Akaike information criterion (AIC).
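The fit summary later reported in Table 1 can be illustrated with a short sketch. The AIC values in the table are consistent with the χ²-based convention AIC = χ² + 2q, where q is the number of free parameters and q = p(p + 1)/2 − df for p = 16 items; this relation is inferred from the reported values and is an assumption, not necessarily the exact computation used by the analysis software.

```python
# Sketch: chi-square/df ratios and the chi-square-based AIC variant consistent
# with Table 1 (AIC = chi2 + 2 * q, q = free parameters inferred from df),
# followed by model selection via lowest AIC for Sample 1.

P = 16                        # NFC16 items
MOMENTS = P * (P + 1) // 2    # 136 non-redundant (co)variances

sample_1 = {                  # model: (chi2, df), Sample 1 (n = 830)
    "1. Trait model": (741.12, 104),
    "2. Method model": (598.39, 103),
    "3. Trait-method model A": (550.22, 98),
    "4. Trait-method model B": (450.43, 94),
    "5. Trait-method model C": (246.57, 88),
}

aics = {}
for name, (chi2, df) in sample_1.items():
    q = MOMENTS - df                      # free parameters implied by df
    aics[name] = chi2 + 2 * q
    print(f"{name}: chi2/df = {chi2 / df:.2f}, AIC = {aics[name]:.2f}")

print("Lowest AIC:", min(aics, key=aics.get))  # trait-method model C, as reported
```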
Results and Discussion

Results
The input files and the output files of the statistical analyses have been supplied as Electronic Supplementary Material (ESM 1). As shown in Table 1, the trait model showed an insufficient fit and models including method factors fit the data better than the trait model. However, only model 5 including the content factor and two uncorrelated method factors showed an acceptable model fit. The main findings were comparable in both samples. Moreover, we investigated the measurement invariance regarding gender on model 5 and found metric invariance based on the results of the χ² significance tests (see Table 1). Consistent with Hypothesis 2, we found a moderate, positive correlation between the NFC16 (content factor/scale mean) and fluid intelligence at r = .20 (p < .001)/r = .22 (p < .001). Moreover, the method factor regarding positively worded items showed a small, positive correlation to fluid intelligence at r = .12 (p < .01), whereas no significant correlation between the method factor regarding negatively worded items and fluid intelligence was observed.
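The gender-invariance decision just reported rests on χ² difference tests between nested versions of model 5 (see Table 1). A minimal sketch, assuming SciPy is available and that both reported differences are taken against the configural model, as the Δχ² values in Table 1 imply:

```python
from scipy.stats import chi2

# Sketch: chi-square difference tests for the gender-invariance rows of Table 1
# (model 5, N = 1,330). The significance level used in the table is p < .005;
# small rounding differences from the tabled delta values are possible.

configural = (440.97, 177)
comparisons = {
    "metric vs. configural": (482.72, 209),
    "scalar vs. configural": (525.74, 225),
}

for label, (chi2_value, df) in comparisons.items():
    d_chi2 = chi2_value - configural[0]
    d_df = df - configural[1]
    p = chi2.sf(d_chi2, d_df)  # survival function of the chi-square distribution
    print(f"{label}: delta chi2 = {d_chi2:.2f}, delta df = {d_df}, p = {p:.3f}")
```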
Table 1. Fit indices for the investigated models

Sample 1 (n = 830)
Model | χ² (df) | χ²/df | RMSEA | CFI | AIC
1. Trait model | 741.12 (104)* | 7.13 | .09 | .80 | 805.12
2. Method model | 598.39 (103)* | 5.81 | .08 | .84 | 664.39
3. Trait-method model A | 550.22 (98)* | 5.62 | .08 | .86 | 626.22
4. Trait-method model B | 450.43 (94)* | 4.79 | .07 | .89 | 534.43
5. Trait-method model C | 246.57 (88)* | 2.80 | .05 | .95 | 342.57

Sample 2 (n = 500)
Model | χ² (df) | χ²/df | RMSEA | CFI | AIC
1. Trait model | 429.87 (104)* | 4.13 | .08 | .84 | 493.87
2. Method model | 333.39 (103)* | 3.24 | .07 | .89 | 399.39
3. Trait-method model A | 308.95 (98)* | 3.15 | .07 | .90 | 384.95
4. Trait-method model B | 306.25 (94)* | 3.26 | .07 | .90 | 390.25
5. Trait-method model C | 263.03 (88)* | 3.00 | .06 | .92 | 359.03

Model 5 (N = 1,330), measurement invariance regarding gender
Model | χ² (df) | χ²/df | RMSEA | CFI | AIC | Δχ² (Δdf)
Configural | 440.97 (177)* | 2.49 | .03 | .95 | 630.97 | –
Metric | 482.72 (209)* | 2.31 | .03 | .95 | 608.72 | 41.74 (32)
Scalar | 525.74 (225)* | 2.34 | .03 | .94 | 619.74 | 84.77 (48)*

Note. *p < .005.
General Discussion
This study was the first to reexamine the factorial validity of the NFC16 while taking method factors into account. Thereby, we extended previous studies in several ways. First, according to our results, the trait model as proposed by Bertrams and Dickhäuser (2010) and Bless et al. (1994) did not fit the data, whereas the inclusion of method factors improved the factorial validity significantly. Accordingly, the unidimensional conceptualization of NFC was supported only when the method factors were specified. Second, we extended the study by Bors et al. (2006) by specifying three trait-method models to provide a more differentiated investigation, instead of fitting only one trait-method model. Third, we used a more comprehensive measure of fluid intelligence across three domains and aggregated the results to form a robust composite to validate our results, whereas Bors et al. (2006) considered two domains separately. Fourth, compared to Preckel (2014), our sample covering a wider range of age and a
wider range of educational background serves to enhance the generalizability of the results. Our results strengthen the notion that the acknowledgment of method factors improves factorial validity, in line with Quilty et al. (2006). Consequently, our results contribute to accurate measurement of constructs using scales including mixed item wording by modeling the method factors and content factor adequately. Moreover, our study adds to investigations on the relation of NFC to other constructs by reporting change of the relationship between NFC and fluid intelligence depending on whether method factors were included or not. Although method factors did not change the divergent validity drastically, they have psychological substance, since more intelligent individuals endorsed positively worded items more. However, several limitations should be noted. First, the first sample comprising solely university students reflects a positive selection from the population regarding cognitive ability. Therefore, the generalizability of our results based on the first sample is limited to comparable samples. Additionally, method effects may be influenced by sample properties, such as verbal ability (Corwyn, 2000; Marsh, 1996), social desirability (Rauch et al., 2007), and test-taking attitude (Wang, Siegal, Falck, & Carlson, 2001). Interestingly, the method effect is less pronounced for samples with higher verbal ability (Corwyn, 2000; Marsh, 1996). In this respect, our results based on the first sample support the influence of method factors, because we confirmed their existence although the sample represents a positive restriction from the population regarding verbal ability. However, the investigation of the factor structure over time should be addressed to establish our findings. Since we provided first evidence of the associations between method factors in the NFC16 and fluid intelligence, further studies should be conducted to establish our result. Moreover, we recommend further investigation on the associations between personality measures and the NFC16 to provide further evidence for the external validity of the NFC16.
In sum, we recommend precise measurement of NFC by modeling the content factor and method factors explicitly. We especially recommend specifying the contribution of NFC and method factors explicitly in research studies related to NFC. For investigations on NFC with insufficient sample sizes for structural equation modeling, we recommend cautious interpretation and the consideration of established findings regarding the relation between method factors and external constructs. Furthermore, we recommend investigations into method factors of other constructs aiming to improve factorial and construct validity in a wide range of research fields.
Acknowledgment
This research was in large part supported by Grants 01PK11008A and 01PK11008B from the German Federal Ministry of Education and Research.

Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/1015-5759/a000484
ESM 1. Text, Tables, and Figures (.pdf). The investigated models and the outputs of the investigations on the factorial validity of the NFC16.
References Amthauer, R., Brocke, B., Liepmann, D., & Beauducel, A. (2001). Intelligenz-Struktur-Test 2000 R (I-S-T 2000 R) – Handanweisung [Intelligence-Structure-Test 2000 R (I-S-T 2000 R) – Handbook]. Göttingen, Germany: Hogrefe. Bertrams, A., & Dickhäuser, O. (2010). University and school students’ motivation for effortful thinking: Factor structure, reliability, and validity of the German need for cognition scale. European Journal of Psychological Assessment, 26, 263–268. https://doi.org/10.1027/1015-5759/a000035 Bless, H., Wänke, M., Bohner, G., Fellhauer, R. F., & Schwartz, N. (1994). Need for cognition: Eine Skala zur Erfassung von Engagement und Freude bei Denkaufgaben [Need for cognition: A scale measuring engagement and happiness in cognitive tasks]. Zeitschrift für Sozialpsychologie, 25, 147–154. Bors, D. A., Vigneau, F., & Lalande, F. (2006). Measuring the need for cognition: Item polarity, dimensionality and the relation with ability. Personality and Individual Differences, 40, 819–828. https://doi.org/10.1016/j.paid.2005.09.007 Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116–131. https://doi.org/10.1037/0022-3514.42.1.116 Cacioppo, J. T., Petty, R. E., Feinstein, J. A., & Jarvis, W. B. G. (1996). Dispositional differences in cognitive motivation: The
life and times of individuals varying in need for cognition. Psychological Bulletin, 119, 197–253. https://doi.org/10.1037/ 0033-2909.119.2.197 Cacioppo, J. T., Petty, R. E., & Kao, C. F. (1984). The efficient assessment of need for cognition. Journal of Personality Assessment, 48, 306–307. https://doi.org/10.1207/ s15327752jpa4803_13 Corwyn, R. F. (2000). The factor structure of global self-esteem among adolescents and adults. Journal of Research in Personality, 34, 357–379. https://doi.org/10.1006/jrpe.2000.2291 Fleischhauer, M., Enge, S., Brocke, B., Ullrich, J., Strobel, A., & Strobel, A. (2010). Same or different? Clarifying the relationship of need for cognition to personality and intelligence. Personality and Social Psychological Bulletin, 36, 82–96. https://doi.org/ 10.1177/0146167209351886 Forsterlee, R., & Ho, R. (1999). An examination of the short form of the need for cognition scale applied in an Australian sample. Educational and Psychological Measurement, 59, 471–480. https://doi.org/10.1177/00131649921969983 Marsh, H. W. (1996). Positive and negative global self-esteem: A substantively meaningful distinction or artifactors? Journal of Personality and Social Psychology, 70, 810–819. https://doi. org/10.1037/0022-3514.70.4.810 Preckel, F. (2014). Assessing need for cognition in early adolescence: Validation of a German adaption of the Cacioppo/ Petty Scale. European Journal of Psychological Assessment, 30, 65–72. https://doi.org/10.1027/1015-5759/a000170 Quilty, L. C., Oakman, J. M., & Risko, E. (2006). Correlates of the Rosenberg Self-Esteem Scale and method effects. Structural Equation Modeling, 13, 99–117. https://doi.org/10.1207/ s15328007sem1301_5 Rauch, W. A., Schweizer, K., & Moosbrugger, H. (2007). Method effects due to social desirability as a parsimonious explanation of the deviation from unidimensionality in LOT-R scores. Personality and Individual Differences, 42, 1597–1607. https:// doi.org/10.1016/j.paid.2006.10.035 Schriesheim, C. A., & Hill, K. D. (1981). Controlling acquiescence response bias by item reversals: The effect on questionnaire validity. Educational and Psychological Measurement, 41, 1101–1114. https://doi.org/10.1177/001316448104100420 Tomás, J. M., & Oliver, A. (1999). Rosenberg’s Self-Esteem Scale: Two factors or method effects. Structural Equation Modeling, 6, 84–98. https://doi.org/10.1080/10705519909540120 Wang, J., Siegal, H. A., Falck, R. S., & Carlson, R. G. (2001). Factorial structure of Rosenberg’s Self-Esteem Scale among crack-cocaine drug users. Structural Equation Modeling, 8, 275–286. https://doi.org/10.1207/S15328007SEM0802_6 Received March 14, 2017 Revision received February 20, 2018 Accepted February 26, 2018 Published online September 18, 2018 EJPA Section/Category Methodological Topics in Assessment Ying Zhang Department of Psychology Saarland University 66041 Saarbrücken Germany ying.zhang@mx.uni-saarland.de
Instructions to Authors The main purpose of the European Journal of Psychological Assessment is to present important articles, which provide seminal information on both theoretical and applied developments in this field. Articles reporting the construction of new measures or an advancement of an existing measure are given priority. The journal is directed to practitioners as well as to academicians: The conviction of its editors is that the discipline of psychological assessment should, necessarily and firmly, be attached to the roots of psychological science, while going deeply into all the consequences of its applied, practice-oriented development. Psychological assessment is experiencing a period of renewal and expansion, attracting more and more attention from both academic and applied psychology, as well as from political, corporate, and social organizations. The EJPA provides a meeting point for this movement, contributing to the scientific development of psychological assessment and to communication between professionals and researchers in Europe and worldwide. European Journal of Psychological Assessment publishes the following types of articles: Original Articles, Brief Reports, Multistudy Reports, and Registered Reports. Manuscript submission: All manuscripts should in the first instance be submitted electronically at http://www.editorialmanager.com/ejpa. Detailed instructions to authors are provided at http://www.hogrefe.com/j/ejpa Copyright Agreement: By submitting an article, the author confirms and guarantees on behalf of him-/herself and any coauthors that the manuscript has not been submitted or published elsewhere, and that he or she holds all copyright in and titles to the submitted contribution, including any figures, photographs, line drawings, plans, maps, sketches, tables, and electronic supplementary material, and that the article and its contents do not infringe in any way on the rights of third parties. ESM will be published online as received from the author(s) without any conversion, testing, or reformatting. They will not be checked for typographical errors or functionality. The author indemnifies and holds harmless the publisher from any third-party claims. The author agrees, upon acceptance of the article for publication, to transfer to the publisher the exclusive right to reproduce and distribute the article and its contents, both physically and in nonphysical, electronic, or other form, in the journal to which it has
been submitted and in other independent publications, with no limitations on the number of copies or on the form or the extent of distribution. These rights are transferred for the duration of copyright as defined by international law. Furthermore, the author transfers to the publisher the following exclusive rights to the article and its contents: 1. The rights to produce advance copies, reprints, or offprints of the article, in full or in part, to undertake or allow translations into other languages, to distribute other forms or modified versions of the article, and to produce and distribute summaries or abstracts. 2. The rights to microfilm and microfiche editions or similar, to the use of the article and its contents in videotext, teletext, and similar systems, to recordings or reproduction using other media, digital or analog, including electronic, magnetic, and optical media, and in multimedia form, as well as for public broadcasting in radio, television, or other forms of broadcast. 3. The rights to store the article and its content in machinereadable or electronic form on all media (such as computer disks, compact disks, magnetic tape), to store the article and its contents in online databases belonging to the publisher or third parties for viewing or downloading by third parties, and to present or reproduce the article or its contents on visual display screens, monitors, and similar devices, either directly or via data transmission. 4. The rights to reproduce and distribute the article and its contents by all other means, including photomechanical and similar processes (such as photocopying or facsimile), and as part of so-called document delivery services. 5. The right to transfer any or all rights mentioned in this agreement, as well as rights retained by the relevant copyright clearing centers, including royalty rights to third parties.
Online Rights for Journal Articles: If you wish to post the article to your personal or institutional website or to archive it in an institutional or disciplinary repository, please use either a pre-print or a post-print of your manuscript in accordance with the publication release for your article and the document ‘‘Guidelines on sharing and use of articles in Hogrefe journals’’ on the journal’s web page at www.hogrefe.com/j/ejpa.
September 2019
EAPA
APPLICATION FORM EAPA membership includes a free subscription to the European Journal of Psychological Assessment. To apply for membership in the EAPA, please fill out this application form and return it together with your curriculum vitae to: David Gallardo-Pujol, PhD (EAPA Secretary General), Dept. of Clinical Psychology & Psychobiology, Campus Mundet, Pg. de la Vall d'Hebron, 171, 08035 Barcelona, Spain, E-mail secretary-general@eapa.science.
Family name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . First name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Affiliation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . City
. . . . . . . . . . . . . . . .
Postcode . . . . . . . . . . . . . . . . . . . .
Country . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Phone
. . . . . . . . . . . . . . .
Fax . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
ANNUAL FEES ◆ EURO 75.00 (US $ 98.00) – Ordinary EAPA members ◆ EURO 50.00 (US $ 65.00) – PhD students ◆ EURO 10.00 (US $ 13.00) – Undergraduate student members
FORM OF PAYMENT ◆ Credit card VISA
Mastercard/Eurocard
IMPORTANT! 3-digit security code in signature field on reverse of card (VISA/Mastercard) or 4 digits on the front (AmEx)
American Express
Number Expiration date
/
CVV2/CVC2/CID#
Card holder’s name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Signature . . . . . . . . . . . . . .
Date
. . . . . . . . . . . . . . . . . . . . .
◆ Cheque or postal order Send a cheque or postal order to the address given above Signature . . . . . . . . . . . . . .
Date
. . . . . . . . . . . . . . . . . . . . .
Hogrefe OpenMind Open Access Publishing? It’s Your Choice! Your Road to Open Access Authors of papers accepted for publication in any Hogrefe journal can now choose to have their paper published as an open access article as part of the Hogrefe OpenMind program. This means that anyone, anywhere in the world will – without charge – be able to read, search, link, send, and use the article for noncommercial purposes, in accordance with the internationally recognized Creative Commons licensing standards.
The Choice Is Yours 1. Open Access Publication: The final “version of record” of the article is published online with full open access. It is freely available online to anyone in electronic form. (It will also be published in the print version of the journal.) 2. Traditional Publishing Model: Your article is published in the traditional manner, available worldwide to journal subscribers online and in print and to anyone by “pay per view.” Whichever you choose, your article will be peer-reviewed, professionally produced, and published both in print and in electronic versions of the journal. Every article will be given a DOI and registered with CrossRef.
www.hogrefe.com
How Does Hogrefe’s Open Access Program Work? After submission to the journal, your article will undergo exactly the same steps, no matter which publishing option you choose: peer-review, copy-editing, typesetting, data preparation, online reference linking, printing, hosting, and archiving. In the traditional publishing model, the publication process (including all the services that ensure the scientific and formal quality of your paper) is financed via subscriptions to the journal. Open access publication, by contrast, is financed by means of a one-time article fee (€ 2,500 or US $3,000) payable by you the author, or by your research institute or funding body. Once the article has been accepted for publication, it’s your choice – open access publication or the traditional model. We have an open mind!
New edition of the popular text that separates the facts from the myths about drug and substance use “I highly recommend this book for an accurate, very readable, and useful overview of drug use problems and their treatment.” Stephen A. Maisto, PhD, ABPP, Professor of Psychology, Syracuse University, NY
Mitch Earleywine
Substance Use Problems (Series: Advances in Psychotherapy – Evidence-Based Practice – Volume 15) 2nd ed. 2016, viii + 104 pp. US $29.80 / € 24.95 ISBN 978-0-88937-416-4 Also available as eBook The literature on diagnosis and treatment of drug and substance abuse is filled with successful, empirically based approaches, but also with controversy and hearsay. Health professionals in a range of settings are bound to meet clients with troubles related to drugs – and this text helps them separate the myths from the facts. It provides trainees and professionals with a handy, concise guide for helping problem drug users build enjoyable, multifaceted lives using approaches based on decades of research.
www.hogrefe.com
Readers will improve their intuitions and clinical skills by adding an overarching understanding of drug use and the development of problems that translates into appropriate techniques for encouraging clients to change behavior themselves. This highly readable text explains not only what to do, but when and how to do it. Seasoned experts and those new to the field will welcome the chance to review the latest developments in guiding self-change for this intriguing, prevalent set of problems.
32nd International Congress of Psychology
July 19 - 24, 2020 Prague, Czech Republic 5-day Scientific Programme Over 25 State-of-the-Art Lectures Over 100 Keynote Addresses Over 190 Invited Symposia Over 5 Controversial Debates and much more …
Represent your country and join us at the ICP 2020! Follow us on Facebook, Twitter, Instagram!