Michael Bošnjak Edgar Erdfelder (Editors)
Zeitschrift für Psychologie Founded in 1890 Volume 226 / Number 1 / 2018 Editor-in-Chief Edgar Erdfelder Associate Editors Michael Bošnjak Herta Flor Benjamin E. Hilbig Heinz Holling Bernd Leplow Steffi Pohl Christiane Spiel Elsbeth Stern
Hotspots in Psychology 2018
Methodology
Free online sample issue
European Journal of Research Methods for the Behavioral and Social Sciences Official Organ of the European Association of Methodology (EAM)
ISSN-Print 1614-1881 ISSN-Online 1614-2241 ISSN-L 1614-1881 4 online issues and 1 print compendium per annum (= 1 volume)
Subscription rates (2018) Libraries / Institutions US $324.00 / € 249.00 Individuals US $159.00 / € 114.00 Postage / Handling US $8.00 / € 6.00
www.hogrefe.com
Editors Jost Reinecke, University of Bielefeld, Germany; José-Luis Padilla, University of Granada, Spain
Managing Editors Andreas Pöge, University of Bielefeld, Germany; Luis-Manuel Lozano, University of Granada, Spain
About the Journal
Methodology is the official organ of the European Association of Methodology (EAM), a union of methodologists working in different areas of the social and behavioral sciences (e.g., psychology, sociology, economics, educational and political sciences). The journal provides a platform for interdisciplinary exchange of methodological research and applications in the different fields, including new methodological approaches, review articles, software information, and instructional papers that can be used in teaching. Three main disciplines are covered: data analysis, research methodology, and psychometrics. The articles published in the journal are not only accessible to methodologists but also to more applied researchers in the various disciplines.
Manuscript Submissions
All manuscripts should be submitted online at www.editorialmanager.com/methodology, where full instructions to authors are also available.
Electronic Full Text
The full text of the journal – current and past issues (from 2005 onward) – is available online at econtent.hogrefe.com/loi/med (included in subscription price). A free sample issue is also available here.
Abstracting Services
The journal is abstracted / indexed in Social Sciences Citation Index (SSCI) and Current Contents / Social & Behavioral Sciences (CC / S&BS) (since 2009), PsycINFO, PSYNDEX, ERIH, and Scopus. Impact Factor (Journal Citation Reports®, Clarivate Analytics): 2016 = 1.143
Michael Bošnjak Edgar Erdfelder (Editors)
Hotspots in Psychology 2018
Zeitschrift für Psychologie Volume 226 /Number 1/2018
Library of Congress Cataloging in Publication information is available via the Library of Congress Marc Database under the LC Control Number 2018930286
© 2018 Hogrefe Publishing
Hogrefe Publishing Incorporated and registered in the Commonwealth of Massachusetts, USA, and in Göttingen, Lower Saxony, Germany
No part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the publisher.
Cover image © alvarez – istockphoto.com
Printed and bound in Germany
ISBN 978-0-88937-547-5
The Zeitschrift für Psychologie, founded by Hermann Ebbinghaus and Arthur König in 1890, is the oldest psychology journal in Europe and the second oldest in the world. Since 2007, it has appeared in English and is devoted to publishing topical issues that provide convenient state-of-the-art compilations of research in psychology, each covering an area of current interest. The Zeitschrift für Psychologie is available as a journal in print and online by annual subscription, and the different topical compendia are also available as individual titles by ISBN.
Editor-in-Chief
Edgar Erdfelder, University of Mannheim, Psychology III, Schloss, Ehrenhof-Ost, 68131 Mannheim, Germany, Tel. +49 621 181-2146, Fax +49 621 181-3997, erdfelder@psychologie.uni-mannheim.de
Associate Editors
Michael Bošnjak, Trier, Germany Herta Flor, Mannheim, Germany
Benjamin Hilbig, Landau, Germany Heinz Holling, Münster, Germany Bernd Leplow, Halle, Germany
Steffi Pohl, Berlin, Germany Christiane Spiel, Vienna, Austria Elsbeth Stern, Zurich, Switzerland
Editorial Board
G. M. Bente, Cologne, Germany D. Dörner, Bamberg, Germany N. Foreman, London, UK D. Frey, Munich, Germany J. Funke, Heidelberg, Germany W. Greve, Hildesheim, Germany W. Hacker, Dresden, Germany R. Hartsuiker, Ghent, Belgium J. Hellbrück, Eichstätt-Ingolstadt, Germany
F. W. Hesse, Tübingen, Germany R. Hübner, Konstanz, Germany A. Jacobs, Berlin, Germany M. Jerusalem, Berlin, Germany A. Kruse, Heidelberg, Germany W. Miltner, Jena, Germany T. Moffitt, London, UK A. Molinsky, Waltham, MA, USA H. Moosbrugger, Frankfurt/Main, Germany
W. Schneider, Würzburg, Germany B. Schyns, Durham, UK B. Six, Halle, Germany P. K. Smith, London, UK W. Sommer, Berlin, Germany A. von Eye, Vienna, Austria K. Wiemer-Hastings, DeKalb, IL, USA I. Zettler, Copenhagen, Denmark
Publisher
Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 999 50 0, Fax +49 551 999 50 425, publishing@hogrefe.com North America: Hogrefe Publishing, 7 Bulfinch Place, 2nd floor, Boston, MA 02114, USA Tel. +1 (866) 823 4726, Fax +1 (617) 354 6875, customerservice@hogrefe-publishing.com
Production
Christina Sarembe, Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 999 50 424, Fax +49 551 999 50 425, production@hogrefe.com
Subscriptions
Hogrefe Publishing, Herbert-Quandt-Str. 4, 37081 Göttingen, Germany, Tel. +49 551 999 50 900, Fax +49 551 999 50 998
Advertising/Inserts
Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 999 50 423, Fax +49 551 999 50 425, marketing@hogrefe.com
ISSN
ISSN-L 2151-2604, ISSN-Print 2190-8370, ISSN-Online 2151-2604
Copyright Information
© 2018 Hogrefe Publishing. This journal as well as the individual contributions and illustrations contained within it are protected under international copyright law. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without prior written permission from the publisher. All rights, including translation rights, reserved.
Publication
Published in 4 topical issues per annual volume.
Subscription Prices
Calendar year subscriptions only. Rates for 2018: Institutions US $372.00 / € 292.00; Individuals US $195.00 / € 139.00 (all plus US $16.00 / € 12.00 shipping & handling; € 6.00 in Germany). Single issue US $49.00 / € 34.95 (plus shipping & handling).
Payment
Payment may be made by check, international money order, or credit card, to Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany. US and Canadian subscriptions can also be ordered from Hogrefe Publishing, 7 Bulfinch Place, 2nd floor, Boston, MA 02114, USA
Electronic Full Text
The full text of Zeitschrift für Psychologie is available online at www.econtent.hogrefe.com and in PsycARTICLES™.
Abstracting Services
Abstracted/indexed in Current Contents/Social and Behavioral Sciences (CC/S&BS), Social Sciences Citation Index (SSCI), Research Alert, PsycINFO, PASCAL, PsycLit, IBZ, IBR, ERIH, and PSYNDEX. Impact Factor (2016): 1.830
Contents
Editorial
Hotspots in Psychology – 2018 Edition Michael Bošnjak and Edgar Erdfelder
1
Original Article
How to Identify Hot Topics in Psychology Using Topic Modeling André Bittermann and Andreas Fischer
3
Review Articles
The Structure of the Rosenberg Self-Esteem Scale: A Cross-Cultural Meta-Analysis Timo Gnambs, Anna Scharl, and Ulrich Schroeders
14
An Update of a Meta-Analysis on the Clinical Outcomes of Deep Transcranial Magnetic Stimulation (DTMS) in Major Depressive Disorder (MDD) Helena M. Gellersen and Karina Karolina Kedzior
30
A Meta-Analytic Re-Appraisal of the Framing Effect Alexander Steiger and Anton Kühberger
45
Effect Size Estimation From t-Statistics in the Presence of Publication Bias: A Brief Review of Existing Approaches With Some Extensions Rolf Ulrich, Jeff Miller, and Edgar Erdfelder
56
‘‘Sustainable Human Development: Challenges and Solutions for Implementing the United Nations’ Goals’’: A Topical Issue of the Zeitschrift für Psychologie Guest Editors: Suman Verma, Anne C. Petersen, and Jennifer E. Lansford
81
‘‘Advances in HEXACO Personality Research’’: A Topical Issue of the Zeitschrift für Psychologie Guest Editors: Reinout E. de Vries, Michael C. Ashton, and Kibeom Lee
83
Calls for Papers
Editorial
Hotspots in Psychology – 2018 Edition
Michael Bošnjak1 and Edgar Erdfelder2
1 Leibniz Institute for Psychology Information (ZPID), Trier, Germany
2 University of Mannheim, Department of Psychology, Mannheim, Germany
This is the second "Hotspots in Psychology" issue of the Zeitschrift für Psychologie. The format is devoted to systematic reviews and meta-analyses in research-active (i.e., hotspot) fields that have generated a considerable number of primary studies. The common denominator is the research synthesis nature of the articles included, not a specific psychological topic or theme that all articles have to address. As with the first Hotspots issue (Erdfelder & Bošnjak, 2016), the call for papers for this second Hotspots in Psychology issue sought to attract contributions related to at least one of the following four topics: (1) systematic reviews and meta-analyses on topics currently being debated in any subfield of psychology; (2) systematic reviews and meta-analyses contributing to the recent discussion about replicability, transparency, and research integrity in psychology; (3) meta-analytic replications and extensions of previously published syntheses, for example, by applying more recent approaches and/or by including more recent primary studies; and (4) methodological advances in research synthesis methods relevant for any subfield of psychology. Almost all of the papers accepted for publication relate to these areas. An exception is the first contribution in this issue, authored by André Bittermann and Andreas Fischer (2018) and entitled "How to identify hot topics in psychology using topic modeling." We editors consider Bittermann and Fischer's approach ground-breaking for identifying hotspot topics. The authors apply a text-mining technique called latent Dirichlet allocation to identify research topics of particular significance over time, using a bibliographic database corpus covering publications that appeared between 1980 and 2016. The authors provide evidence for the superiority of the text-mining approach in comparison to a more traditional content classification system and identify five psychological topics with increasing numbers of publications in recent years: neuropsychology, online therapy, cross-cultural research, traumatization, and visual attention.
The text-mining approach presented by the authors might serve as a blueprint for identifying promising research areas that attract considerable attention and may deserve to be synthesized systematically. One of the hotspot topics identified by Bittermann and Fischer is addressed in the second contribution by Timo Gnambs, Anna Scharl, and Ulrich Schroeders (2018), namely the cross-cultural comparability of a well-established psychometric instrument capturing self-esteem. In their contribution entitled "The structure of the Rosenberg Self-Esteem Scale: A cross-cultural meta-analysis," the authors employ an up-to-date meta-analytic structural equation modeling approach including more than 100 independent samples. While the unidimensional structure of the scale was corroborated, item interpretation was partly driven by culture, limiting the scale's usefulness for cross-cultural comparisons. Concerning the third and fourth categories of papers we invited, the present Hotspots issue is somewhat complementary to the first one. Whereas replications and extensions of previous meta-analyses were not represented in the first issue, two articles of the current issue fall in this category (i.e., Gellersen & Kedzior, 2018; Steiger & Kühberger, 2018). The contribution by Helena Gellersen and Karina Karolina Kedzior (2018) entitled "An update of a meta-analysis on the clinical outcomes of deep transcranial magnetic stimulation (DTMS) in major depressive disorder (MDD)" aims at replicating and extending previous meta-analytic findings on DTMS effectiveness in depression. Major depressive disorder is the most widespread mental illness, and for a significant number of patients antidepressant medication is ineffective. An alternative treatment is noninvasive DTMS over the prefrontal cortex. One limitation of the previous meta-analysis the authors refer to is that it reported only short-term outcomes of DTMS in mixed samples with unipolar and bipolar MDD. The current meta-analysis focuses on unipolar MDD and longer-term clinical outcomes. The findings corroborate the longer-term effectiveness of the treatment in question for
various symptom domains and should therefore decisively contribute to improving mental health interventions. The second meta-analytic replication and extension has been contributed by Alexander Steiger and Anton Kühberger (2018). In their article "A meta-analytic re-appraisal of the framing effect," the authors re-analyzed the data of Kühberger's (1998) meta-analysis on framing effects in risky decision making by using p-curve (Simonsohn, Nelson, & Simmons, 2014). This approach uses the distribution of only the significant p-values to correct the effect size, taking publication bias into account. The corrected overall effect size is considerably higher than the effect previously reported. In addition, the authors report on a new meta-analysis of risk framing experiments published in 2016 and again estimate a sizeable effect. They conclude that risky choice framing effects are highly reliable and robust. The replicability crisis found in some other domains of psychology does not seem to exist here. Whereas the first Hotspots issue included two methodological contributions, on possible research synthesis biases induced by narrative reviews (Kühberger, Scherndl, Ludwig, & Simon, 2016) and on avoiding meta-analytic biases by using individual participant data (Kaufmann, Reips, & Merki, 2016), the current issue provides one methodological contribution, on "Effect size estimation from t-test statistics in the presence of publication bias: A brief review of existing approaches with some extensions" by Rolf Ulrich, Jeff Miller, and Edgar Erdfelder (2018). Publication bias, the phenomenon that statistically significant findings are more likely to be published than nonsignificant ones, is a major threat to the validity of meta-analytic findings. In the paper by Ulrich and colleagues, an approach to estimating the true effect size from published test statistics is presented, which also allows one to estimate the percentage of researchers who consign undesired results in a research domain to the file drawer. R and MATLAB versions of syntax to estimate the true unbiased effect size and the prevalence of publication bias in the literature are provided, allowing all researchers to quantify the bias introduced by systematically missing publications and to employ the appropriate corrections. Overall, we editors are delighted about the vibrant meta-analytic research activity that is underway in psychology, including not only applications of state-of-the-art meta-analytic methods to a steadily increasing number of substantive research fields in psychology but also methodological innovations to address challenges of meta-analytic research. We are grateful for the appreciation shown to the Hotspots in Psychology format by all contributing authors, reviewers, and readers, and we are confident that this appreciation will contribute to fostering high-quality submissions to future editions of Hotspots in Psychology.
We conclude this editorial by inviting you to a newly established scientific conference associated with the Hotspots in Psychology format to be held in Trier, Germany, in June 2018 for the first time. Further details can be found here: http://researchsynthesis2018.leibnizpsychology.org/
References
Bittermann, A., & Fischer, A. (2018). How to identify hot topics in psychology using topic modeling. Zeitschrift für Psychologie, 226, 3–13. https://doi.org/10.1027/2151-2604/a000318
Erdfelder, E., & Bošnjak, M. (2016). Hotspots in psychology: A new format for special issues of the Zeitschrift für Psychologie. Zeitschrift für Psychologie, 224, 141–144. https://doi.org/10.1027/2151-2604/a000249
Gellersen, H., & Kedzior, K. K. (2018). An update of a meta-analysis on the clinical outcomes of deep transcranial magnetic stimulation (DTMS) in major depressive disorder (MDD). Zeitschrift für Psychologie, 226, 30–44. https://doi.org/10.1027/2151-2604/a000320
Gnambs, T., Scharl, A., & Schroeders, U. (2018). The structure of the Rosenberg Self-Esteem Scale: A cross-cultural meta-analysis. Zeitschrift für Psychologie, 226, 14–29. https://doi.org/10.1027/2151-2604/a000317
Kaufmann, E., Reips, U.-D., & Merki, K. M. (2016). Use of offline versus online individual participant data (IPD) meta-analysis in educational psychology. Zeitschrift für Psychologie, 224, 157–167. https://doi.org/10.1027/2151-2604/a000251
Kühberger, A. (1998). The influence of framing on risky decisions: A meta-analysis. Organizational Behavior and Human Decision Processes, 75, 23–55. https://doi.org/10.1006/obhd.1998.2781
Kühberger, A., Scherndl, T., Ludwig, B., & Simon, D. M. (2016). Comparative evaluation of narrative reviews and meta-analyses: A case study. Zeitschrift für Psychologie, 224, 145–156. https://doi.org/10.1027/2151-2604/a000250
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681. https://doi.org/10.1177/1745691614553988
Steiger, A., & Kühberger, A. (2018). A meta-analytic re-appraisal of the framing effect. Zeitschrift für Psychologie, 226, 45–55. https://doi.org/10.1027/2151-2604/a000321
Ulrich, R., Miller, J., & Erdfelder, E. (2018). Effect size estimation from t-test statistics in the presence of publication bias: A brief review of existing approaches with some extensions. Zeitschrift für Psychologie, 226, 56–80. https://doi.org/10.1027/2151-2604/a000319
Published online February 2, 2018
Michael Bošnjak FB I - Psychologie Universität Trier 54286 Trier Germany mb@leibniz-psychology.org
Original Article
How to Identify Hot Topics in Psychology Using Topic Modeling
André Bittermann1 and Andreas Fischer2
1 Leibniz Institute for Psychology Information (ZPID), Trier, Germany
2 Forschungsinstitut Betriebliche Bildung (f-bb), Nuremberg, Germany
Abstract: Latent topics and trends in psychological publications were examined to identify hotspots in psychology. Topic modeling was contrasted with a classification-based scientometric approach in order to demonstrate the benefits of the former. Specifically, the psychological publication output in the German-speaking countries containing German- and English-language publications from 1980 to 2016 documented in the PSYNDEX database was analyzed. Topic modeling based on latent Dirichlet allocation (LDA) was applied to a corpus of 314,573 publications. Input for topic modeling was the controlled terms of the publications, that is, a standardized vocabulary of keywords in psychology. Based on these controlled terms, 500 topics were determined and trending topics were identified. Hot topics, indicated by the highest increasing trends in this data, were facets of neuropsychology, online therapy, cross-cultural aspects, traumatization, and visual attention. In conclusion, the findings indicate that topics can reveal more detailed insights into research trends than standardized classifications. Possible applications of this method, limitations, and implications for research synthesis are discussed.
Keywords: topic modeling, hotspots, scientometrics, trends, controlled terms
Topics of particular significance in research-active fields have been referred to as "hotspots" (Erdfelder & Bošnjak, 2016). From a scientometric point of view, the occurrence of hotspots may reflect areas of current scientific discourse. On the other hand, hotspots may also derive from current needs of society; consider, for example, the impact of topics such as digitalization, terrorism, or the German "refugee crisis" (beginning in 2015) on psychological research. Thus, addressing hotspots might help to deliver research results that are interesting to the scientific community and/or the general public if the research is imparted comprehensibly (Friedman, 2008). Nevertheless, it is an open question how to identify the set of potentially hot topics in a domain of interest. In this paper, we will contrast two ways of identifying topics based on a corpus of scientific publications (manifest classifications vs. latent topics). A comparatively simple and straightforward approach for identifying research topics is based on existing classification systems, such as the "classification codes"1 outlined in the Thesaurus of Psychological Index Terms (Tuleya, 2007) published by the American Psychological Association (APA). Currently, this thesaurus provides 157 categories to
describe the content included in the publication database, and each category may be considered a research topic. However, with regard to identifying hotspots, the apparent simplicity of this approach is burdened with multiple drawbacks: First, the approach is based on an established classification system, and thus some of the most recent (and hot) topics may not be represented in the analysis until the classification system is expanded accordingly. Second, classifications may be too broad and abstract to capture the topics that are of particular significance in research-active fields (e.g., there is no classification code specific to "evaluation," even if some researchers may consider treatment evaluation an interesting topic). A third problem arises due to the fact that some publications address more than one topic. Consider a study that examines the neuropsychological correlates of emotional lability in traumatized refugees. If only one classification is assigned (e.g., "Neuropsychology & Neurology"), the information on disorders and migration-related aspects remains hidden; instead, when using additional classifications to categorize these contents, the respective proportions remain unspecified (i.e., there may be equal or varying shares of each content).
A list of all codes can be retrieved from http://www.apa.org/pubs/databases/training/class-codes.aspx. Each publication can be linked to one or more main classifications (e.g., “Psychometrics & Statistics & Methodology,” “Human Experimental Psychology,” or “Personality Psychology”) and/or respective subcategories (e.g., a publication classified as “Psychometrics & Statistics & Methodology” may be classified more specifically as “Sensory & Motor Testing” or “Clinical Psychology Testing”).
A more complex approach to identifying topics is to derive latent topics from the manifest content addressed within a corpus of publications through methods such as topic modeling (e.g., Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2004). The basic idea behind topic modeling is that every document can address different topics that are not known a priori. Thus, the goal is to identify these latent topics based on the documents’ manifest contents by employing algorithms that “analyze the words of the original texts to discover the themes that run through them” (Blei, 2012, p. 77). Since information on the level of the full text, abstract, or keywords can be used for topic modeling, the resulting topics have the potential to address specific subjects based on the corpus and independent from predefined classifications. In topic modeling, each document is assumed to address each topic to varying degrees (0–100%). For example, a paper might comprise an evaluation topic with a share of 10% and other topics with a share of 90%. This means that in contrast to a dichotomous classification (this publication is assigned or is not assigned to the classification) or multiple dichotomous classifications, a probabilistic approach such as topic modeling can deal with heterogeneous topics of a publication in terms of topic proportions. In this study, such a probabilistic method is applied for topic modeling, namely latent Dirichlet allocation (LDA; Blei et al., 2003). By applying statistical methods to the change of mean topic probabilities over time, rising and declining trends can be identified (Griffiths & Steyvers, 2004). Once trending topics are identified, scientific knowledge can be gathered from publications addressing these topics by conducting systematic reviews and meta-analyses to synthesize the results from related published research on a certain subject. The current study aims to deliver the foundation for such research synthesis techniques in the context of hotspots in psychology: a data-driven bottom-up approach for the identification of latent topics and trends in psychological research.
Topic Modeling in Psychological Research and Scientometrics
Big data and topic modeling represent a relatively new approach in psychological research methods that can be applied to various research questions (e.g., Chen & Wojcik, 2016; Kosinski, Wang, Lakkaraju, & Leskovec, 2016). For example, Griffiths, Steyvers, and Tenenbaum (2007) used topic models for predicting word association and the effects of semantic association and ambiguity on a variety of language-processing and memory tasks. Steyvers and Griffiths (2008) showed that both human memory and information retrieval face similar computational demands
by employing topic models. Topic models have also been used for modeling couple and family text data (Atkins et al., 2012), improving the prediction of neuroticism and depression (Resnik, Garron, & Resnik, 2013), investigating mental health signals in Twitter (Coppersmith, Dredze, & Harman, 2014), analyzing the linguistic data of patient-therapist interactions (Imel, Steyvers, & Atkins, 2015), and exploring differences in language use on Facebook across gender, affiliation, and assertiveness (Park et al., 2016). In the field of scientometric analysis, which is highly relevant for the present study, Griffiths and Steyvers (2004) applied LDA topic models to a corpus of abstracts published in the Proceedings of the National Academy of Sciences of the United States of America and identified "hot" and "cold" topics. A topic was defined as hot if it showed an increasing linear trend in popularity and cold if it showed a decreasing linear trend in popularity. This approach was adapted in several other research fields, for example, to identify the major biological concepts from a corpus of protein-related publication titles and abstracts (Zheng, McLean, & Lu, 2006), to conduct a bibliometric analysis of aquaculture literature (Natale, Fiore, & Hofherr, 2012), to analyze the field of development studies (Thelwall & Thelwall, 2016), or to explore hydropower research (Jiang, Qiang, & Lin, 2016). The current study is the first to apply LDA-based topic modeling for a scientometric analysis of psychological research in the German-speaking countries.
A Brief Illustration of LDA-Based Topic Modeling
In the following, a very brief and illustrative description of LDA and topic modeling is provided. Further details and more technical descriptions can be found in Blei (2012) and Blei et al. (2003). The underlying assumption of LDA is that a document represents a mixture of topics with different proportions (Blei et al., 2003). Using Bayesian probabilistic modeling, LDA aims to identify clusters of terms (i.e., topics) that tend to co-occur within documents (Park et al., 2016). Thus, topics are defined as a distribution over a fixed vocabulary (Blei, 2012). In a generative process, two kinds of probabilities are drawn from Dirichlet distributions: (1) from the prior weight of a certain word in a topic (β), the probabilities of terms occurring in a certain topic (φ); and (2) from the prior weight of a certain topic in a document (α), the probabilities of topics occurring in a certain document (θ), based on the terms within the document. "Prior" means that the α and β hyperparameters have to be set prior to the analysis. Lower values of α result in documents belonging to fewer topics, and lower values of β result in more separated topics.
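For orientation, this generative process can be restated compactly in the standard notation of Blei et al. (2003); the display below is our summary (not a formula from the original text), where z_{d,n} denotes the latent topic assignment of the n-th term in document d and w_{d,n} the observed term itself:

```latex
\varphi_k \sim \operatorname{Dirichlet}(\beta), \qquad k = 1, \ldots, K
\theta_d \sim \operatorname{Dirichlet}(\alpha), \qquad d = 1, \ldots, D
z_{d,n} \sim \operatorname{Multinomial}(\theta_d), \qquad
w_{d,n} \sim \operatorname{Multinomial}(\varphi_{z_{d,n}})
```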
Table 1. Example of four documents consisting of 16 terms each
Document 1: Love, Hate, Fear, Disgust, Intervention, Therapist, Client, Disorder, Mother, Brother, Sister, Father, School, Learning, Grades, Class
Document 2: Happiness, Joy, Serenity, Love, Therapy, Therapist, Client, Treatment, Intervention, Disorder, Parents, Siblings, Learning, Teacher, Class, College
Document 3: Disgust, Anger, Rage, Hate, Psychoanalysis, Transference, Client, Disorder, Treatment, Intervention, Mother, Father, Parents, Child, Grades, Achievement
Document 4: Amazement, Surprise, Joy, Happiness, Psychotherapy, Counseling, Disorder, Treatment, Outcome, Exposition, Client, Therapist, Parents, Mother, College, University
Table 2. Five most common terms of the resulting topics (idealization)
Topic 1 "Emotions": Love, Joy, Happiness, Disgust, Amazement
Topic 2 "Therapy": Client, Disorder, Therapist, Intervention, Treatment
Topic 3 "Family": Parents, Mother, Father, Child, Brother
Topic 4 "Education": Class, College, Grades, Learning, Teacher
Note. The topic titles are descriptive terms provided by the authors and were not generated by the model.
Table 3. Illustration of document-topic probabilities (θ)
Document    Topic 1    Topic 2    Topic 3    Topic 4    Sum
1           0.250      0.250      0.250      0.250      1
2           0.250      0.375      0.125      0.250      1
3           0.250      0.375      0.250      0.125      1
4           0.250      0.500      0.125      0.125      1
Mean        0.250      0.375      0.188      0.188      1
Notes. Document 1 addresses all four topics with equal shares (1/k, with k = number of topics), whereas Documents 2–4 show different topic probabilities. By mean probabilities, Topic 2 is addressed with more than average probability and, thus, can be interpreted as the most popular.
For a simplified illustration of the main idea behind topic modeling in a scientometric context, imagine a corpus consisting of four documents and a model of four topics. For the sake of brevity, each document shall consist of 16 terms (see Table 1). In this idealized example, LDA
reveals four topics by clustering co-occurring terms, of which the five most frequent terms are shown in Table 2. Note that each topic actually consists of all unique terms of the corpus, that is, of all four documents. The terms are sorted by frequency to best represent different topics. For the sake of illustration, the results presented in Table 2 can be considered ideal because these topics reflect optimal semantic differences. As a real LDA analysis for this very small sample corpus is based exclusively on term co-occurrences, one of the resulting topics would be "intervention, parents, disgust, love, hate," which includes different semantic meanings. Documents differ in topic proportions, and this is represented by the probability of a document belonging to a topic (θ). As shown in Table 3, Document 1 addresses all four topics with equal shares, whereas Document 2 mostly addresses Topic 2, and so on. The resulting mean document-topic probabilities by topic show that Topic 1 has a mean probability of 25%, which corresponds to the expected proportion 1/k (with k being the number of topics). Topic 2, with a mean probability of 37.5%, can be considered the most popular, whereas Topics 3 and 4 are less popular than average. LDA is an unsupervised method, but the number of topics (k) must be defined a priori by the analyst. Griffiths and Steyvers (2004) examined different values of k and compared the resulting log-likelihoods. Yet another approach would be to test various values of k and determine the optimal k intellectually, that is, by expert judgment on whether the topics strike a balance between being too broad and too specific (Thelwall & Thelwall, 2016). In this study, we follow the first approach.
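To make the toy example concrete, the following R sketch (our illustration, using the same packages as the study; with such a tiny corpus the estimates will not exactly reproduce the idealized Tables 2 and 3) fits a four-topic LDA model to the four documents of Table 1:

```r
library(tm)
library(topicmodels)

# The four toy "documents" from Table 1, each a bag of 16 terms
docs <- c(
  "love hate fear disgust intervention therapist client disorder mother brother sister father school learning grades class",
  "happiness joy serenity love therapy therapist client treatment intervention disorder parents siblings learning teacher class college",
  "disgust anger rage hate psychoanalysis transference client disorder treatment intervention mother father parents child grades achievement",
  "amazement surprise joy happiness psychotherapy counseling disorder treatment outcome exposition client therapist parents mother college university"
)

# Build a document-term matrix and fit a four-topic LDA model with Gibbs sampling
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
lda <- LDA(dtm, k = 4, method = "Gibbs",
           control = list(alpha = 0.1, delta = 0.01, seed = 1,
                          burnin = 1000, iter = 2000))

terms(lda, 5)                     # five most probable terms per topic (cf. Table 2)
round(posterior(lda)$topics, 3)   # document-topic probabilities theta (cf. Table 3)
```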
Using Controlled Terms for Topic Modeling
For a reliable identification of the representative topics within a research field, the information about the documents' content must be of high quality. For instance, if abstracts give a mere introduction rather than an objective summary, the resulting topics will reflect the theoretical background or the studies' raison d'être rather than their actual content. The same applies to keywords that, for instance, name the statistical methods that were used and do not represent the actual topic of the study (e.g., in the case of "analysis of variance," from a keyword point of view it remains unclear whether the method was simply applied, discussed, or further developed). To avoid latent semantic heterogeneity within a topic, all keywords should be chosen according to the same rules (e.g., keywords for statistical procedures are assigned only if the procedures themselves are the focus of the study, not if their mere application is referred to). In most studies, authors provide keywords that further summarize the document's content. If these keywords are uncontrolled (i.e., can be freely chosen),
(1) it is not guaranteed that they actually represent the main concepts, ideas, and topics of the publication; (2) they sometimes are long phrases and not terms; and (3) keywords from different authors might be different terms for the same idea (e.g., adaptation vs. adaption vs. adjustment) or, conversely, the same words for different ideas. These are problematic aspects for topic modeling using LDA, since topics are identified according to word co-occurrences (Blei et al., 2003). Topic models aim to capture semantically related topics (Wallach, Mimno, & McCallum, 2009), but they do not generate topics based on the words’ inherent semantic relations. The PSYNDEX database is developed and hosted by the Leibniz Institute for Psychology Information (ZPID; Trier, Germany) and is a comprehensive database containing German- and English-language publications in psychology and closely related disciplines from the German-speaking countries. In early July 2017, there were more than 327,400 documents indexed in PSYNDEX (accessible at https://www.pubpsych.eu/). The PSYNDEX editorial staff assigns controlled terms (CTs) from the aforementioned Thesaurus of Psychological Index Terms published by the APA (Tuleya, 2007; ZPID, 2016). In the context of topic modeling, this controlled vocabulary has several advantages: (1) The CTs correspond with the content of the publications. (2) The terms’ standardized spelling avoids synonyms or variations in expressions. (3) The corpus for topic modeling consists of only those words that are relevant to the content. Stop words that contain little topical content (e.g., “the,” “a,” “and”) have to be neither defined nor deleted. (4) All CTs are available in German and English; therefore, the whole corpus of publications can be used irrespective of the documents’ language. (5) In contrast to abstract texts, the terms do not have to be stemmed with the resulting problem of word fragments. (6) Since the corpus contains fewer words, computation time decreases and fewer memory resources are needed. In a pretest with 3,846 documents, LDA based on CTs took less than 7% of the time needed for an abstract-based LDA while revealing comparable results. Thus, in contrast to prior research using abstracts as primary data for topic modeling analysis (e.g., Griffiths & Steyvers, 2004; Jiang et al., 2016), the current study employs CTs for topic modeling.
Objectives
The objectives of the current study were twofold: (1) to examine trends of latent topics, and (2) to contrast latent topics with manifest classifications. LDA-based topic modeling will be applied to a corpus of psychological publications from the German-speaking
countries retrieved from PSYNDEX. Increasing and decreasing linear trends as well as nonlinear trends will be identified. Furthermore, the topics will be contrasted with classifications in terms of thematic specificity.
Method
Data
Data were extracted from the PSYNDEX database on July 3, 2017. A total of 316,996 of the indexed psychological articles, book chapters, reports, and dissertations were published between 1980 and 2016. Biographies or historical sources (reprints or selected readings) were excluded, since they usually address the topic retrospectively, resulting in N = 314,573 publications.
Software
Analyses were conducted in RStudio version 1.0.153 (RStudio Team, 2016) based on R version 3.4.2 (R Core Team, 2017). For text mining and topic modeling, the packages tm 0.7-1 (Feinerer, Hornik, & Meyer, 2008) and topicmodels 0.2-6 (Grün & Hornik, 2011) were used. Additional operations were conducted with packages dplyr 0.5.0, readr 1.1.0, splitstackshape 1.4.2, Xmisc 0.2.1, lattice 0.20-35, and nnet 7.3-12.
Topic Modeling
LDA was applied using Gibbs sampling with parameters as suggested by Awati (2015), that is, 4,000 omitted Gibbs iterations at the beginning, 2,000 Gibbs iterations with 500 omitted in-between iterations, and 5 repeated random starts. Parameters of the symmetric Dirichlet priors were set according to Tang, Meng, Nguyen, Mei, and Zhang (2014), that is, α = 0.1 (resulting in documents belonging to fewer topics) and β = 0.01 (resulting in well-separated topics). Concerning the number of topics k, we inspected the log-likelihood estimates for various values of k, which is referred to as the commonly used approach (Kosinski et al., 2016). We ran models with 100, 150, 200, 300, 400, and 500 topics, comparable to Griffiths and Steyvers (2004), who tested values of 50, 100, 200, 300, 400, 500, 600, and 1,000 topics. Values of k higher than 500 were discarded, since more topics decrease understanding and verifiability by experts (De Battisti, Ferrara, & Salini, 2015). Text input for the topic models consisted of the publications' controlled keyword terms (CTs). They were prepared for LDA by removing spaces, parentheses, hyphens, slashes, and apostrophes.
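As a rough sketch of how these settings map onto the topicmodels interface (our reconstruction for illustration; the authors' complete code is provided in ESM 1, and the object `dtm`, standing for the document-term matrix of controlled terms, is an assumption), the model selection loop could look like this:

```r
library(topicmodels)

# `dtm` is assumed to be the DocumentTermMatrix built from the controlled terms
gibbs_control <- list(
  alpha  = 0.1,   # symmetric Dirichlet prior: documents concentrate on fewer topics
  delta  = 0.01,  # topicmodels calls the beta prior "delta": well-separated topics
  burnin = 4000,  # omitted iterations at the beginning
  iter   = 2000,  # retained Gibbs iterations
  thin   = 500,   # omitted in-between iterations
  nstart = 5,     # repeated random starts
  seed   = 1:5,   # one seed per start (placeholder values)
  best   = TRUE   # keep the best-fitting of the five runs
)

ks   <- c(100, 150, 200, 300, 400, 500)
fits <- lapply(ks, function(k) LDA(dtm, k = k, method = "Gibbs", control = gibbs_control))
data.frame(k = ks, logLik = sapply(fits, logLik))   # compare fits across k (cf. Table 4)
```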
Table 4. Log-likelihoods (LL) of topic models by different numbers of topics (k)
k     100          150          200          300          400          500
LL    -8,234,278   -7,491,948   -7,032,583   -6,403,997   -5,978,100   -5,695,993
Modeling Trends
Previous research employed linear regression models for identifying increasing and decreasing trends (Griffiths & Steyvers, 2004; Paul & Girju, 2009; Ponweiser, Grün, & Hornik, 2014). Hot topics were defined by the highest linear slopes. We extended this approach by taking nonlinearity into account to identify nonlinear trends. Specifically, we applied multilayer perceptrons (MLPs) with two hidden units to model the average topic probability (mean of document-topic probabilities over all documents for each topic) as a nonlinear function of the year of publication. The MLPs applied provide nonlinear regression functions with a minimal sum of squared residuals for each topic, and thus provide an estimate of R², given an optimal nonlinear transformation of the year of publication (Fischer, 2015). Two hidden units were included to allow for nonmonotonic functions while at the same time minimizing the risk (and amount) of overfitting (Fischer, 2015). The difference between R²MLP and R²linear is applied as an indicator of the amount of nonlinearity that is not accounted for by the linear model. More specifically, nonlinearity is defined by R²MLP > 2 R²linear. Trends were estimated over a period of more than two years. Because of random fluctuations – and considering the duration of a typical publication cycle – an estimation over a shorter time span implies severe overfitting (Fischer, 2015) and may not represent a topic's significance well. The complete R code used in the analyses is provided in the Electronic Supplementary Material (ESM 1).
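A minimal sketch of this trend analysis for a single topic might look as follows (our illustration, not the code from ESM 1; the data frame `topic_trend`, holding one mean θ value per publication year, is an assumption filled here with placeholder numbers):

```r
library(nnet)

# Placeholder data: mean document-topic probability of one topic per year
topic_trend <- data.frame(
  year  = 1980:2016,
  theta = seq(0.0015, 0.0045, length.out = 37) + rnorm(37, 0, 2e-4)
)

# Linear trend: hot topics show the largest positive slopes
fit_lin <- lm(theta ~ year, data = topic_trend)
r2_lin  <- summary(fit_lin)$r.squared

# Nonlinear trend: multilayer perceptron with two hidden units (year standardized)
fit_mlp <- nnet(theta ~ scale(year), data = topic_trend, size = 2,
                linout = TRUE, trace = FALSE, maxit = 1000)
r2_mlp  <- 1 - sum(residuals(fit_mlp)^2) /
               sum((topic_trend$theta - mean(topic_trend$theta))^2)

coef(fit_lin)["year"]   # linear slope of the topic's popularity
r2_mlp - r2_lin         # nonlinearity not captured by the linear model
```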
Results
Model Selection
The corpus of N = 314,573 documents contained 6,073 unique terms. By comparing log-likelihoods of the resulting models (as shown in Table 4), k = 500 was determined as the optimal number of topics. A table containing the top 15 terms of all topics can be found in ESM 2.
Trends in Topics
Linear trends in changes of mean document-topic probabilities (θ) over time were analyzed according to Griffiths and Steyvers (2004) with an additional examination of
nonlinearity. Significantly increasing linear trends could be found for 128 of the topics, and significantly decreasing linear trends could be found for 135 of the topics, both at the p = .0001 level. The 10 topics with highest increasing linear trends (i.e., hot topics) are listed in Table 5. Figure 1 shows their mean document-topic probabilities (θ) by publication year. The major hot topics are neuropsychology and genetics, online therapy, human migration, traumatization, and visual attention. A closer look at the terms of these topics (Table 5) reveals that these major themes can be further specified. Traumatization, for example, can be further specified with three narrower topics: traumatization of refugees during war and torture (Topic 86), therapy of emotional trauma (Topic 344), and trauma-related disorders and processes (Topic 95). Since the focus of the current study was on hot topics, additional trend analyses are reported briefly (see ESM 2 for topic terms and more information on the following topics). Strongly decreasing linear trends (i.e., cold topics) could be found in topics referring to human-factors engineering (Topic 310), psychosomatic disorders (Topic 361), incarceration (Topic 472), social and political processes in West and East Germany (Topics 41, 393, and 186), experimental methodology (Topic 342), group psychotherapy (Topic 491), community mental health services (Topic 163), and infectious disorders (Topic 479). The comparison of R2linear and R2MLP revealed topics with a considerable amount of nonlinearity that is not accounted for by the linear model. The largest difference between R2linear and R2MLP (i.e., nonlinear trends) could be found for topics referring to psychodiagnosis and testing (Topic 467, with peaks in 2006 and 2011), outpatient psychotherapy (Topic 334, with peak in 1998), family relations (Topic 259), prevention and health promotion (Topic 162, highest peak in 1991), Internet and information systems (Topic 481, with peak in 1999), organizational psychology (Topic 345), sexual relations with clients in psychotherapy (Topic 237, peaks in 1995 and 1998), racial and ethnic attitudes (Topic 130, peak in 1993), health behavior and dental health (Topic 44), and relations between socioeconomic background of the family and education (Topic 138).
Relationship Between Topics and Classifications
According to the second objective of the current study, we investigated whether topics can be allocated to a specific
Table 5. Top 15 terms of the 10 hottest topics
Topic 364: Functional Magnetic Resonance Imaging, Cerebral Blood Flow, Prefrontal Cortex, Amygdala, Neuroanatomy, Biological Neural Networks, Cingulate Cortex, Brain, Oxygenation, Insula, Rewards, Striatum, Hippocampus, Brain Connectivity, Cognitive Control
Topic 249: Functional Magnetic Resonance Imaging, Cerebral Blood Flow, Brain, Parietal Lobe, Prefrontal Cortex, Neuroanatomy, Frontal Lobe, Temporal Lobe, Oxygenation, Neuroimaging, Magnetic Resonance Imaging, Biological Neural Networks, Occipital Lobe, Visual Cortex, Spectroscopy
Topic 386: Internet, Computer Mediated Communication, Online Therapy, Online Social Networks, Internet Usage, Electronic Communication, Communications Media, Websites, Social Media, Virtual Reality, Computer Assisted Therapy, Cellular Phones, Privacy, Telemedicine, Information Technology
Topic 459: Genes, Polymorphism, Genetics, Serotonin, Genotypes, Dopamine, Alleles, Biological Markers, Phenotypes, Attention Deficit Disorder With Hyperactivity, Susceptibility (Disorders), Neurotransmission, Brain Derived Neurotrophic Factor, Neural Receptors, Tryptophan
Topic 371: Cross-Cultural Differences, Human Migration, Cross-Cultural Communication, Cultural Sensitivity, Cross-Cultural Treatment, Multiculturalism, Expatriates, Transcultural Psychiatry, International Organizations, Cross-Cultural Counseling, Globalization, Multicultural Education, Foreign Workers, Acculturation, Racial And Ethnic Differences
Topic 323: Magnetic Resonance Imaging, Brain, Neuroimaging, Neuroanatomy, Hippocampus, Gray Matter, Brain Size, Tomography, Prefrontal Cortex, White Matter, Amygdala, Cingulate Cortex, Cerebral Cortex, Temporal Lobe, Morphology
Topic 86: Posttraumatic Stress Disorder, Emotional Trauma, Refugees, Trauma, War, Victimization, Torture, Persecution, Survivors, Violence, Injuries, Asylum Seeking, Exposure Therapy, Human Migration, Transgenerational Patterns
Topic 344: Posttraumatic Stress Disorder, Emotional Trauma, Trauma, Eye Movement Desensitization Therapy, Stress Reactions, Intrusive Thoughts, Adjustment Disorders, Acute Stress Disorder, Traumatic Neurosis, Posttraumatic Growth, Complex PTSD, Exposure Therapy, Accidents, Medical Personnel, Metaphor
Topic 365: Attention, Visual Attention, Selective Attention, Visual Search, Distraction, Cues, Reaction Time, Stimulus Parameters, Eye Movements, Attentional Capture, Visual Perception, Stimulus Salience, Visual Stimulation, Attentional Bias, Divided Attention
Topic 95: Emotional Trauma, Posttraumatic Stress Disorder, Trauma, Dissociation, Dissociative Disorders, Early Experience, Dissociative Identity Disorder, Depersonalization, Borderline Personality Disorder, Neurobiology, Introjection, Dissociative Patterns, Amnesia, Psychodynamic Psychotherapy, Depersonalization/Derealization Disorder
Figure 1. Mean values of document-topic probabilities θ by publication year for the 10 hottest topics with added linear regression line. The topics are described in Table 5.
PSYNDEX subject classification. If this is the case, topics either match the classifications' content or they provide more detailed information within the classification.
If this is not the case, topics cover themes that could only be matched by multiple classifications. For every document, the assigned classifications were compared to the
Figure 2. Level plot showing mean document-topic probabilities (θ) by topics and main classifications. Only the 10 hottest topics are displayed. Darker cells represent higher values of θ. For example, in the publications that were classified as “2500 Physiological Psychology & Neuroscience,” the highest mean θ resulted for Topics 364, 249, 459, and 323. The respective topics are described in Table 5.
documents' most probable topics in order to examine content similarities and differences. Similar to Griffiths and Steyvers' (2004) approach for identifying diagnostic topics, Figure 2 shows a level plot of mean document-topic probabilities (θ) by topics and main classifications for the hot topics. For creating the level plot, publications were grouped by classification (in case of multiple classifications, the document was assigned to each classification). Then, mean θ probabilities were determined by each classification. This allowed the investigation of the extent to which a topic's semantic content (as reflected by its top terms) corresponds with the classification system. For the sake of clarity, only main classification categories are included in Figure 2 (as the complete APA classification system consists of 157 codes). The darker the cells of the level plot, the higher the mean θ. If a topic column shows different colors, the θ values are not equally distributed over the classifications, that is, the topics' semantic content cannot be reflected by a single classification. Clearly, the topics do not match the classifications perfectly, but they do show correspondence with various classifications (in the case of a perfect match, only one dark cell would be observed for each topic). For example, the highest mean θ for Topic 371 (referring to human migration and cross-cultural aspects) can be observed in "2900 Social Processes & Social Issues." Since it also shows a relatively high mean θ in various other classifications, this topic
cannot be described by a single classification. The hot topics concerning neuropsychology (Topics 364, 249, and 323) and genetics (Topic 459) show their highest mean θ not only in "2500 Physiological Psychology & Neuroscience," but also in other classifications. No distinctively matching classification can be identified for Topics 86 (traumatization of refugees) and 95 (traumatization-related disorders).
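A rough sketch of this aggregation step (our illustration; the object names `doc_class` for the document-classification pairs and `theta` for the document-topic matrix are assumptions) could look like this:

```r
library(lattice)

# The 10 hottest topics, in the order of Table 5
hot_topics <- c(364, 249, 386, 459, 371, 323, 86, 344, 365, 95)

# `doc_class`: one row per document-classification pair (doc_id, main_class)
# `theta`:     documents x topics matrix of posterior probabilities
mean_theta <- aggregate(theta[doc_class$doc_id, hot_topics],
                        by = list(classification = doc_class$main_class),
                        FUN = mean)

# Level plot of mean theta by main classification and hot topic (cf. Figure 2)
m <- as.matrix(mean_theta[, -1])
rownames(m) <- mean_theta$classification
levelplot(m, xlab = "Main classification", ylab = "Topic",
          scales = list(x = list(rot = 45)))
```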
Selecting Publications for Research Synthesis
The publications related to a topic can be filtered by (1) using the document-topic probabilities (θ) or (2) using the keywords that constitute the topic for a literature search. We employed the first approach using the example of Hot Topic 386 (online therapy) and sorted documents by θ in decreasing order. This resulted in a list of all publications in the corpus, with the ones most likely addressing the topic ranking highest. The results were then filtered by selecting only empirical studies with values of θ higher than 1/k (i.e., the average document-topic probability). This means that Topic 386 occurs in these empirical studies with a probability above average. The distribution of θ values is shown in Figure 3. Inclusion criteria for subsequent research synthesis approaches can be applied to this subset
Figure 3. Document-topic probabilities (θ) of n = 1,083 empirical studies for Topic 386 (sorted by θ). Only documents with θ higher than average are shown.
of 1,083 documents. Since the documents are ranked by θ, a list can be generated that allows for an inspection of relevant documents in the order of their topic probabilities.
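In code, this filtering step might look roughly as follows (our illustration; `theta` and the metadata frame `meta`, including a document-type column, are assumptions):

```r
k     <- 500
topic <- 386                      # the online-therapy topic

# `theta`: documents x topics matrix; `meta`: one row of metadata per document
hits <- data.frame(id    = meta$id,
                   type  = meta$type,
                   theta = theta[, topic])

# Keep empirical studies in which the topic occurs with above-average probability
hits <- hits[hits$theta > 1 / k & hits$type == "empirical study", ]
hits <- hits[order(hits$theta, decreasing = TRUE), ]   # most relevant documents first
head(hits, 20)
```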
Discussion
The current study applied LDA-based topic modeling for a scientometric analysis of psychological research as a data-driven bottom-up approach for the identification of latent topics and trends. In a model with 500 topics, strongly increasing linear trends were found for topics addressing neuropsychology and genetics, online therapy, human migration and cross-cultural aspects, traumatization, and visual attention. These topics were referred to as hot topics in psychology. Additionally, it was shown how the resulting topics can be used for purposes of research synthesis. The topics' contents corresponded with respective classifications, but as expected, they could not be matched to a single subject classification. Thus, the topics provided information beyond the scope of a predefined classification system. Prior scientometric research in psychology used classifications for determining trends (e.g., Krampen, 2016; Krampen & Trierweiler, 2013). From our results, it can be concluded that this approach is feasible as long as the classifications' specificity is satisfactory. Using topic modeling, we were able to find specific topics that would not have been easy to detect by a classification-based approach, for example, lifestyle of adolescents and popular culture (Topic 49), attitude change of public opinion (Topic 72), values in individualism versus collectivism (Topic 144), or traumatization of refugees because of war and torture (Topic 86). Most topics represent a mixture of classifications.
Methodological Limitations
Topic models were based on standardized keywords (controlled terms, CTs) of the publications. This approach resulted in a much smaller number of corpus terms than would have resulted from using abstracts. CTs reflect a document's content in a condensed manner and offer several advantages in the context of topic modeling: computation times are shorter, there is no need for stop words, they offer excellent readability, and since the topics consist of CTs, they can be used directly for subsequent literature searches. A significant disadvantage of using CTs is that the date of a term's first inclusion in the thesaurus can act as an artifact. For instance, recently added CTs such as "Political Asylum" or "Asylum Seeking" (both included in 2015) cannot describe a topic during the years before their addition. Nevertheless, if one is interested in recent topics, the following approach for defining hotspots could be employed in addition to considering trends over time: By building a corpus for the respective recent years (e.g., 2015–2016), popular topics could be defined by examining the highest mean document-topic probabilities. This represents a cross-sectional approach using all currently available CTs. Similar to classifications, thesaurus-based CTs have limitations regarding their semantic detail. The uncontrolled keywords of the current study are "topic modeling, hotspots, scientometrics, trends, controlled terms," with the more or less corresponding CTs being "Mathematical Modeling, Scientific Communication, Trends" (no matches for the quite specific keywords "hotspots" and "controlled terms"). The use of words in the abstract would overcome these shortcomings, since every word of the original text can be included. Downsides, on the other hand, are the problem of defining stop words (e.g., Schofield, Magnusson, & Mimno, 2017) and a much larger corpus vocabulary with higher computational demands that would require several days of calculation time or the use of a computer cluster. The number of topics k was determined by computing models for various values of k and inspecting the respective log-likelihoods, which is referred to as the commonly used approach (Kosinski et al., 2016). The log-likelihoods increased with higher values of k, indicating that a model with more topics could show an even better fit. However, a model with more topics is more difficult to understand and verify by experts (De Battisti et al., 2015). Besides, inspecting the hot topics for the applied values of k in this study revealed stable themes of neuropsychology, online therapy, human migration, traumatization, and visual attention. In this study, basic LDA was employed using the R programming language. Newer developments such as dynamic topic modeling with a focus on changes over time
(Blei & Lafferty, 2006) or correlated topic models that aim to capture correlations between the occurrence of latent topics (Blei & Lafferty, 2007) could further improve the identification of hot topics in psychology. The analysis of abstracts from different languages by employing polylingual topic models (Mimno, Wallach, Naradowsky, Smith, & McCallum, 2009) or multilingual probabilistic topic modeling (Vulić, De Smet, Tang, & Moens, 2015) could be of interest for future research as well.
Implications for Research Synthesis
The topic modeling approach presented in this paper can be applied to the identification of hotspots in psychology. Erdfelder and Bošnjak (2016) related hotspots to the presence of a significant number of primary studies within a research-active field. We expanded the scope of hotspots by including all types of publications with the exception of historical studies and biographies in order to gain a comprehensive view on the topics that are addressed. For subsequent research synthesis purposes, primary studies which address hot topics can be easily identified in PSYNDEX by filtering the documents. This results in a list with the documents that show the highest document-topic probabilities at the highest ranks. A more common approach for selecting documents would be using the keywords (CTs) that constitute the topic. In this paper, only the top 15 terms of each hot topic were reported. Since a topic consists of a long list of terms, with various frequencies and term-to-topic probabilities, we encourage readers to take a closer look at the topics of interest. A sophisticated method for visualizing and interpreting topics is provided by "LDAvis" (Sievert & Shirley, 2014), which defines the relevance for ranking terms within topics based on weight parameters and can be employed in R with the "LDAvis" package (Sievert & Shirley, 2015).
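For readers who want to try this, a minimal sketch of preparing a topicmodels fit for LDAvis might look as follows (our illustration; `lda` and `dtm` are assumed to be the fitted topicmodels object and the corresponding document-term matrix):

```r
library(topicmodels)
library(LDAvis)
library(slam)   # sparse-matrix sums for DocumentTermMatrix objects

json <- createJSON(phi            = posterior(lda)$terms,    # topics x terms probabilities
                   theta          = posterior(lda)$topics,   # documents x topics probabilities
                   doc.length     = row_sums(dtm),           # number of terms per document
                   vocab          = colnames(dtm),           # corpus vocabulary
                   term.frequency = col_sums(dtm))           # corpus-wide term counts

serVis(json)   # opens the interactive topic browser
```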
Other Possible Applications

A researcher who wants to develop a new area of interest can learn more about the subject’s structure by looking at the underlying topics. A publication database could be explored in more depth using topics, as illustrated by topic-based web browsers for Wikipedia (Chaney & Blei, 2012) or the Signs journal (Goldstone, Galán, Lovin, Mazzaschi, & Whitmore, 2014). Moreover, for better navigation through a model with many topics, scientific documents could be divided into several topic clusters (Yau, Porter, Newman, & Suominen, 2014). Such clusters could constitute empirically derived themes as an alternative to manifest classifications.
Since a document-topic probability is computed for every publication, papers showing the highest probabilities for a given topic could be recommended to interested readers. A more sophisticated approach was presented by Wang and Blei (2011), who developed an algorithm for recommending scientific articles to users of an online community. Authors usually indicate their fields of interest, for example, “psychotherapy research” or “cognitive processes.” Here, topic modeling can be used to identify research topics based on the authors’ publications (Lu & Wolfram, 2012; Rosen-Zvi, Griffiths, Steyvers, & Smyth, 2004). This procedure results in a publication-based profile of authors, which can be applied to find experts for specific topics, to find authors working on similar topics, or to analyze how authors’ publication-based interests change over time.
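A simple recommendation of this kind could be implemented as follows. The objects fitted_lda and doc_ids (a vector of document identifiers in corpus order) are hypothetical placeholders, and the sketch only illustrates the ranking by document-topic probabilities described above.

library(topicmodels)

theta <- posterior(fitted_lda)$topics  # documents x topics probability matrix

# Return the n documents with the highest probability for a given topic.
recommend_documents <- function(topic, n = 10) {
  ranking <- order(theta[, topic], decreasing = TRUE)[seq_len(n)]
  data.frame(document = doc_ids[ranking],
             probability = theta[ranking, topic])
}

recommend_documents(topic = 42, n = 10)  # hypothetical topic number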
Conclusion

Topic modeling is a feasible method for an exploratory analysis of topics in psychological publications and for identifying hot research topics. The identification of specific topics in a large corpus of publications offers new possibilities of exploring research beyond predefined classifications. Furthermore, topics can be the starting point for subsequently applied research synthesis methods.

Acknowledgments
We thank Lisa Trierweiler and Katja Singleton for helpful comments and recommendations during the writing process, Jürgen Wiesenhütter and Veronika Kuhberg-Lasson for valuable input during early phases of this research, and Andreas Konz and Jannik Lorenz for hardware support. The action editors for this article were Edgar Erdfelder and Michael Bošnjak.

Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/2151-2604/a000318
ESM 1. Code (.R). R code of the analyses.
ESM 2. Text (.csv). List of topics for k = 500.
References

Atkins, D. C., Rubin, T. N., Steyvers, M., Doeden, M. A., Baucom, B. R., & Christensen, A. (2012). Topic models: A novel method for modeling couple and family text data. Journal of Family Psychology, 26, 816–827. https://doi.org/10.1037/a0029607
Awati, K. (2015, September 29). A gentle introduction to topic modeling using R [Blog post]. Retrieved from https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55, 77–84. https://doi.org/10.1145/2133806.2133826
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In W. Cohen & A. Moore (Eds.), Proceedings of the 23rd International Conference on Machine Learning (pp. 113–120). New York, NY: ACM. https://doi.org/10.1145/1143844.1143859
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1, 17–35. https://doi.org/10.1214/07-AOAS114
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993
Chaney, A. J. B., & Blei, D. M. (2012, March). Visualizing topic models. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM). Retrieved from https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/viewFile/4645/5021
Chen, E. E., & Wojcik, S. P. (2016). A practical guide to big data research in psychology. Psychological Methods, 21, 458–474. https://doi.org/10.1037/met0000111
Coppersmith, G., Dredze, M., & Harman, C. (2014). Quantifying mental health signals in Twitter. In P. Resnik, R. Resnik, & M. Mitchell (Eds.), Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (pp. 51–60). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W14-3207
De Battisti, F., Ferrara, A., & Salini, S. (2015). A decade of research in statistics: A topic model approach. Scientometrics, 103, 413–433. https://doi.org/10.1007/s11192-015-1554-1
Erdfelder, E., & Bošnjak, M. (2016). “Hotspots in Psychology”: A new format for special issues of the Zeitschrift für Psychologie. Zeitschrift für Psychologie, 224, 141–144. https://doi.org/10.1027/2151-2604/a000249
Feinerer, I., Hornik, K., & Meyer, D. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25, 1–54. https://doi.org/10.18637/jss.v025.i05
Fischer, A. (2015). How to determine the unique contributions of input-variables to the nonlinear regression function of a multilayer perceptron. Ecological Modelling, 309, 60–63. https://doi.org/10.1016/j.ecolmodel.2015.04.015
Friedman, D. P. (2008). Public outreach: A scientific imperative. Journal of Neuroscience, 28, 11743–11745. https://doi.org/10.1523/JNEUROSCI.0005-08.2008
Goldstone, A., Galán, S., Lovin, C. L., Mazzaschi, A., & Whitmore, L. (2014). An interactive topic model of signs. Signs at 40. Retrieved from http://signsat40.signsjournal.org/topicmodel
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1), 5228–5235. https://doi.org/10.1073/pnas.0307752101
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114, 211–244. https://doi.org/10.1037/0033-295X.114.2.211
Grün, B., & Hornik, K. (2011). Topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40, 1–30. https://doi.org/10.18637/jss.v040.i13
Imel, Z. E., Steyvers, M., & Atkins, D. C. (2015). Computational psychotherapy research: Scaling up the evaluation of patient-provider interactions. Psychotherapy, 52, 19–30. https://doi.org/10.1037/a0036841
Jiang, H., Qiang, M., & Lin, P. (2016). A topic modeling based bibliometric exploration of hydropower research. Renewable and Sustainable Energy Reviews, 57, 226–237. https://doi.org/10.1016/j.rser.2015.12.194
Kosinski, M., Wang, Y., Lakkaraju, H., & Leskovec, J. (2016). Mining big data to extract patterns and predict real-life outcomes. Psychological Methods, 21, 493–506. https://doi.org/10.1037/met0000105
Krampen, G. (2016). Scientometric trend analyses of publications on the history of psychology: Is psychology becoming an unhistorical science? Scientometrics, 106, 1217–1238. https://doi.org/10.1007/s11192-016-1834-4
Krampen, G., & Trierweiler, L. (2013). Research on emotions in developmental psychology contexts: Hot topics, trends, and neglected research domains. In C. Mohiyeddini, M. Eysenck, & S. Bauer (Eds.), Handbook of psychology of emotions: Recent theoretical perspectives and novel empirical findings (Vol. 1, pp. 63–79). New York, NY: Nova Science.
Lu, K., & Wolfram, D. (2012). Measuring author research relatedness: A comparison of word-based, topic-based, and author cocitation approaches. Journal of the Association for Information Science and Technology, 63, 1973–1986. https://doi.org/10.1002/asi.22628
Mimno, D., Wallach, H. M., Naradowsky, J., Smith, D. A., & McCallum, A. (2009). Polylingual topic models. In P. Koehn & R. Mihalcea (Eds.), Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 (pp. 880–889). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/old_anthology/D/D09/D09-1.pdf#page=918
Natale, F., Fiore, G., & Hofherr, J. (2012). Mapping the research on aquaculture: A bibliometric analysis of aquaculture literature. Scientometrics, 90, 983–999. https://doi.org/10.1007/s11192-011-0562-z
Park, G., Yaden, D. B., Schwartz, H. A., Kern, M. L., Eichstaedt, J. C., Kosinski, M., . . . Seligman, M. E. (2016). Women are warmer but no less assertive than men: Gender and language on Facebook. PLoS ONE, 11, e0155885. https://doi.org/10.1371/journal.pone.0155885
Paul, M. J., & Girju, R. (2009). Topic modeling of research fields: An interdisciplinary perspective. In R. Mitkov & G. Angelova (Eds.), Proceedings of the International Conference RANLP-2009 (pp. 337–342). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://www.anthology.aclweb.org/R/R09/R09-1.pdf#page=361
Ponweiser, M., Grün, B., & Hornik, K. (2014). Finding scientific topics revisited. In M. Carpita, E. Bentari, & E. Qannari (Eds.), Advances in latent variables (pp. 93–100). Cham, Switzerland: Springer International. https://doi.org/10.1007/10104_2014_11
R Core Team. (2017). R: A language and environment for statistical computing [Computer software]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Resnik, P., Garron, A., & Resnik, R. (2013). Using topic modeling to improve prediction of neuroticism and depression. In D. Yarowsky, T. Baldwin, A. Korhonen, K. Livescu, & S. Bethard (Eds.), Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1348–1353). New York, NY: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/D13-1133
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In M. Chickering & J. Halpern (Eds.), Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (pp. 487–494). Arlington, VA: AUAI Press. Retrieved from https://mimno.infosci.cornell.edu/info6150/readings/398.pdf
RStudio Team. (2016). RStudio: Integrated development for R [Computer software]. Boston, MA: RStudio, Inc. Retrieved from http://www.rstudio.com/
Schofield, A., Magnusson, M., & Mimno, D. (2017). Understanding text pre-processing for latent Dirichlet allocation. In M. Lapata, P. Blunsom, & A. Koller (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers (pp. 432–436). New York, NY: Association for Computational Linguistics. Retrieved from http://www.cs.cornell.edu/~xanda/winlp2017.pdf
Sievert, C., & Shirley, K. E. (2014). LDAvis: A method for visualizing and interpreting topics. In J. Chuang, S. Green, M. Hearst, J. Heer, & P. Koehn (Eds.), Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces (pp. 63–70). Stroudsburg, PA: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W14-3110
Sievert, C., & Shirley, K. E. (2015). LDAvis: Interactive visualization of topic models. R package version 0.3.2 [Computer software]. Retrieved from https://CRAN.R-project.org/package=LDAvis
Steyvers, M., & Griffiths, T. L. (2008). Rational analysis as a link between human memory and information retrieval. In N. Chater & M. Oaksford (Eds.), The probabilistic mind: Prospects for a Bayesian cognitive science (pp. 329–350). Oxford, UK: Oxford University Press.
Tang, J., Meng, Z., Nguyen, X., Mei, Q., & Zhang, M. (2014). Understanding the limiting factors of topic modeling via posterior contraction analysis. In E. P. Xing (Ed.), 31st International Conference on Machine Learning (ICML 2014) (pp. 190–198). Stroudsburg, PA: International Machine Learning Society. Retrieved from http://proceedings.mlr.press/v32/tang14.pdf
Thelwall, M., & Thelwall, S. (2016). Development studies research 1975–2014 in academic journal articles: The end of economics? El Profesional de la Información, 25, 47–58. https://doi.org/10.3145/epi.2016.ene.06
Tuleya, L. G. (Ed.). (2007). Thesaurus of psychological index terms (11th ed.). Washington, DC: American Psychological Association.
Vulić, I., De Smet, W., Tang, J., & Moens, M. F. (2015). Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications. Information Processing & Management, 51, 111–147. https://doi.org/10.1016/j.ipm.2014.08.003
Wallach, H. M., Mimno, D. M., & McCallum, A. (2009). Rethinking LDA: Why priors matter. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems 22 (NIPS 2009) (pp. 1973–1981). La Jolla, CA: Neural Information Processing Systems. Retrieved from http://dirichlet.net/pdf/wallach09rethinking.pdf
Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. In C. Apte, J. Ghosh, & P. Smyth (Eds.), Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 448–456). New York, NY: ACM. https://doi.org/10.1145/2020408.2020480
Yau, C. K., Porter, A., Newman, N., & Suominen, A. (2014). Clustering scientific documents with topic modeling. Scientometrics, 100, 767–786. https://doi.org/10.1007/s11192-014-1321-8
Zheng, B., McLean, D. C., & Lu, X. (2006). Identifying biological concepts from a protein-related corpus with a probabilistic topic model. BMC Bioinformatics, 7, 58. https://doi.org/10.1186/1471-2105-7-58
ZPID – Leibniz-Zentrum für Psychologische Information und Dokumentation. (Eds.). (2016). PSYNDEX terms (10th ed.). Trier, Germany: ZPID. Retrieved from https://www.zpid.de/pub/info/PSYNDEXterms2016.pdf

Received October 26, 2017
Revision received November 19, 2017
Accepted November 21, 2017
Published online February 2, 2018
André Bittermann
Leibniz Institute for Psychology Information (ZPID)
Universitätsring 15
54296 Trier
Germany
abi@leibniz-psychology.org
Review Article
The Structure of the Rosenberg Self-Esteem Scale: A Cross-Cultural Meta-Analysis

Timo Gnambs (1), Anna Scharl (1), and Ulrich Schroeders (2)
(1) Leibniz Institute for Educational Trajectories, Bamberg, Germany
(2) Psychological Assessment, University of Kassel, Germany

Abstract: The Rosenberg Self-Esteem Scale (RSES; Rosenberg, 1965) intends to measure a single dominant factor representing global self-esteem. However, several studies have identified some form of multidimensionality for the RSES. Therefore, we examined the factor structure of the RSES with a fixed-effects meta-analytic structural equation modeling approach including 113 independent samples (N = 140,671). A confirmatory bifactor model with specific factors for positively and negatively worded items and a general self-esteem factor fitted best. However, the general factor captured most of the explained common variance in the RSES, whereas the specific factors accounted for less than 15%. The general factor loadings were invariant across samples from the United States and other highly individualistic countries, but lower for less individualistic countries. Thus, although the RSES essentially represents a unidimensional scale, cross-cultural comparisons might not be justified because the cultural background of the respondents affects the interpretation of the items.

Keywords: self-esteem, factor analysis, wording effect, meta-analysis, measurement invariance
More than 50 years of research and hundreds of empirical studies have failed to solve the dispute surrounding the dimensionality of the Rosenberg Self-Esteem Scale (RSES). Originally, Rosenberg (1965) considered self-esteem a unitary construct reflecting individual differences in the evaluation of one’s self-worth and self-respect. In empirical studies, however, several researchers highlighted the need to acknowledge between one and four secondary dimensions, in addition to general self-esteem, to properly model responses to the RSES (e.g., Alessandri, Vecchione, Eisenberg, & Łaguna, 2015; Donnellan, Ackerman, & Brecheen, 2016; Tafarodi & Milne, 2002; Urbán, Szigeti, Kökönyei, & Demetrovics, 2014). Within the last decades the structural ambiguity of the RSES led to a form of “beauty contest” (Reise, Kim, Mansolf, & Widaman, 2016, p. 819) of factor analytic studies designed to explore the structure of the RSES in diverse samples. Although strict unidimensionality is hard to achieve for many psychological self-report scales (Reise, Moore, & Haviland, 2010), pronounced multidimensionality poses a frequently neglected problem for applied researchers using composite scores. In this instance, simple sum scores across all items can bias person estimates, because they reflect a blend of different latent traits. Further difficulties arise if the identified factor structure depends on important moderating
influences such as respondents’ cognitive abilities (Marsh, 1996; Gnambs & Schroeders, in press) or their cultural affiliation (Song, Cai, Brown, & Grimm, 2011; Supple, Su, Plunkett, Peterson, & Bush, 2013). Group comparisons that are based on instruments lacking measurement invariance can result in seriously biased (if not wrong) conclusions (see Chen, 2008; Kuha & Moustaki, 2015). Therefore, we present a meta-analytic summary on the factor structure of the RSES to evaluate whether the RSES scores reflect a single trait or a composite of different traits. Moreover, we explore the cross-cultural measurement invariance of the scale between culturally diverse countries from America, Europe, and Asia.
Dimensionality of the Rosenberg Self-Esteem Scale

Since its introduction, a wealth of exploratory and confirmatory factor studies have examined the structure of the RSES. In line with its original conception, many researchers identified a single factor explaining the covariances between the items of the scale (e.g., Franck, de Raedt,
Barbez, & Rosseel, 2008; Mimura & Griffiths, 2007; Schmitt & Allik, 2005). Global self-esteem, as identified in these studies, reflects an individual’s self-liking or, in Rosenberg’s words, the feeling that “one’s good enough” (1965, p. 31). For example, Schmitt and Allik (2005) reported the results of an international large-scale project that translated the RSES into 28 languages and administered the scale to almost 17,000 participants in 53 countries around the globe. The authors concluded that most samples supported a unidimensional structure for the RSES. However, a closer inspection of the reported analyses reveals that this conclusion is not warranted by the statistical methods used: first, competing theories about the dimensional structure should be tested with confirmatory factor analyses rather than exploratory factor analyses (e.g., Schmitt, 2011). Second, the authors used principal component analysis, which is a data reduction tool not suitable to discover underlying structures – a fact that has been stressed several times in the psychometric literature (e.g., Preacher & MacCallum, 2003). This study as well as many others (e.g., Mimura & Griffiths, 2007) exemplify that statements about the dimensionality of the RSES are often not based on appropriate statistical methods. In contrast to the monolithic conceptualization of the RSES, early factor analytic studies pointed to a different structure (e.g., Dobson, Goudy, Keith, & Powers, 1979; Goldsmith, 1986; Goldsmith & Goldsmith, 1982; Hensley & Roberts, 1976). Because the RSES assesses positive self-appraisals (e.g., “I feel that I have a number of good qualities.”) and negative self-appraisals (e.g., “At times, I think I am no good at all.”) with opposingly keyed items (see Appendix), exploratory factor analyses of the questionnaire typically reveal two separable factors, one for the positively worded items and the other for the negatively worded items. This pattern is often brought into connection with specific response styles such as acquiescence (DiStefano & Motl, 2006; Tomás, Oliver, Galiana, Sancho, & Lila, 2013). In this perspective, the multidimensionality of the RSES reflects mere method-specific variance that needs to be controlled for in empirical analyses (Marsh, 1996). However, some researchers challenged this interpretation and adhered to the view of qualitatively different types of self-esteem (e.g., Alessandri et al., 2015; Owens, 1994). They argued that these two dimensions imply a substantive distinction between positive and negative self-esteem. In line with this view, the negatively keyed items of the RSES, which can be interpreted as an expression of intense negative affect toward oneself as a form of self-derogation (Kaplan & Pokorny, 1969), predicted higher alcohol consumption and drug use among adolescents (Epstein, Griffin, & Botvin, 2004; Kaplan, Martin, & Robbins, 1982). In contrast, the factor associated with
positively worded items supposedly captures an individual’s self-appraisal of his or her competences (Alessandri et al., 2015). This two-dimensional model of self-esteem has been replicated across measurement occasions (Marsh, Scalas, & Nagengast, 2010; Michaelides, Koutsogiorgi, & Panayiotou, 2016), subgroups (DiStefano & Motl, 2009), and even different language versions (Supple et al., 2013). Moreover, evidence for positive and negative self-esteem was also found in a meta-analysis of exploratory factor analyses that scrutinized the configural measurement invariance of the RSES across 80 samples (Huang & Dong, 2012). However, in these studies the identification of positive and negative self-esteem as subcomponents of the RSES remained entirely data-driven and was only post hoc enriched with a potential theoretical foundation, which speaks in favor of the conceptualization of a method artifact. Other researchers offered a theoretical explanation for alternative facets of the RSES (Tafarodi & Milne, 2002; Tafarodi & Swann, 1995). According to these authors an individual “takes on value both by merit of what she can do and what she is” (Tafarodi & Milne, 2002, p. 444). Thus, self-esteem derives from one’s appraisal of observable skills and abilities as well as from intrinsic values such as character and morality. In this conceptualization, the RSES subsumes two distinct subscales, self-competence and self-liking, which are independent of any wording effects. Self-liking reflects one’s self-worth as an individual, similar to the original view of global self-esteem, whereas self-competence refers to one’s self-views as a source of power similar to Bandura’s (1977) concept of self-efficacy. Although initial confirmatory factor studies supported this theoretical model (Tafarodi & Milne, 2002; Tafarodi & Swann, 1995), replication attempts failed (e.g., Donnellan et al., 2016; Marsh et al., 2010). Therefore, it is unclear whether this theoretically motivated model provides a meaningful description of the RSES.
Cross-Cultural Replicability of the Factor Structure

The RSES has been translated into dozens of languages and is routinely administered in countries across the world (e.g., Alessandri et al., 2015; Baranik et al., 2008; Farruggia, Chen, Greenberger, Dmitrieva, & Macek, 2004; Schmitt & Allik, 2005; Song et al., 2011; Supple et al., 2013). In light of the inconsistent findings on the dimensionality of the original instrument, the structural ambiguity extends to the translated versions. Moreover, several caveats contribute to dimensional differences between language versions. For example, intercultural differences in the familiarity with
certain stimuli, response formats, or testing procedures can disadvantage certain groups (van de Vijver & Poortinga, 1997). Or, despite best efforts, translation errors can unintentionally change the meaning of specific items. But even correctly translated items might convey a different meaning within different societies because of nomothetic beliefs and value systems. In addition, the adoption of systematic response styles is subject to pronounced intercultural variations (e.g., He, Bartram, Inceoglu, & van de Vijver, 2014; He, Vliert, & van de Vijver, 2016; Johnson, Kulesa, Cho, & Shavitt, 2005; Smith et al., 2016). For example, acquiescence is more prevalent among members of harmonic societies that favor interrelatedness over independence, whereas extreme responding is more likely found in cultures emphasizing individualism and self-reliance (Johnson et al., 2005; Smith et al., 2016). Thus, intercultural differences in response styles can contribute to factorial differences in psychological measures. Regarding the RSES, several cross-cultural studies examined its measurement across cultural groups: for example, Farruggia and colleagues (2004) demonstrated strict measurement invariance for a bidimensional model of the RSES across four adolescent samples from China, Czech Republic, Korea, and the USA. However, this result was only achieved after removing a noninvariant item (“I wish I could have more respect for myself.”) due to extremely low factor loadings in the non-US samples. This finding was also replicated in a study comparing US immigrants with European, Latino, Armenian, and Iranian background (Supple et al., 2013). Short of the previously identified item, the RSES exhibited strong measurement invariance across the ethnic groups. However, other analyses revealed more severe cross-cultural differences: for two samples of US and Chinese college students only three items were fully measurement invariant (Song et al., 2011). Rather, the two groups used the scale very differently (see Baranik et al., 2008, for similar results). Thus, frequently observed cultural differences in self-esteem between Western and Eastern countries might be spurious effects from differential item functioning associated with cultural values.
The Present Study

In response to the ongoing controversy regarding the structure of the RSES, we scrutinized the dimensionality of the RSES in a meta-analytic structural equation modeling (MASEM; Cheung, 2014) framework. We conducted a systematic literature research to retrieve studies reporting on the dimensionality of the RSES. In contrast to Huang and Dong’s (2012) meta-analysis that simply aggregated the number of times two items exhibited their strongest
loading on the same factor across multiple exploratory factor analyses, we estimated a pooled variance-covariance matrix on an item-level (cf. Gnambs & Staufenbiel, 2016). This allowed us to derive an overall evaluation of the scale’s internal structure by investigating the configural model of the RSES (i.e., the number of factors) along with information on the size of the factor loadings (i.e., metric information). Moreover, we compared the different competing measurement models described in the literature. Given overwhelming evidence of secondary dimensions in the RSES (e.g., Alessandri et al., 2015; Marsh et al., 2010; Michaelides et al., 2016), we expected a worse fit of a single factor model as compared to models that also acknowledge different subdimensions of self-esteem (Hypothesis 1). Because several studies failed to identify self-liking and self-competence as subcomponents of self-esteem (e.g., Donnellan et al., 2016; Marsh et al., 2010), we expected more support for positive and negative self-esteem in the RSES (Hypothesis 2). In order to capture the multidimensionality in the presence of a strong overarching self-esteem factor, we also relied on bifactor models (Brunner, Nagy, & Wilhelm, 2012; Reise, 2012) that used each item as an indicator of a general dimension (i.e., global self-esteem) and an orthogonal specific factor (e.g., for negatively worded items). This allowed us to retain the goal of measuring a single trait common to all items and estimating the proportion of common variance explained by general self-esteem. Because bifactor models include fewer constraints than comparable correlated trait models (Reise, 2012), we expected better support for a bifactor structure of the RSES (Hypothesis 3). Finally, we explored the cross-cultural measurement invariance of the RSES by comparing its factor structure across samples from highly individualistic countries (e.g., USA, Germany) to those from less individualistic societies (e.g., China, Indonesia). Individualism refers to the degree of autonomy and self-actualization people in a given society strive for as compared to an emphasis of interrelatedness and group cohesion (Hofstede, Hofstede, & Minkov, 2010). Because expressions of overly positive self-views (i.e., self-enhancement) are typically seen as less appropriate among members of less individualistic societies (Heine, Lehman, Markus, & Kitayama, 1999; Markus & Kitayama, 1991), we expected cultural individualism to affect the loading structure of the RSES. However, short of item 8 that seems to convey a different meaning in Asian cultures (see Farruggia et al., 2004), we had no a priori hypotheses regarding the degree of measurement invariance across societies.
Method

Meta-Analytic Database
The search for primary studies reporting on the factor structure of the RSES included major scientific databases (ERIC, PsycINFO, PSYNDEX, Medline), public data archives (GESIS data catalog, Inter-university Consortium for Political and Social Research [ICPSR] data archive, UK data archive), and Google Scholar. Additional studies derived from the references of all identified articles (“rolling snowball method”). In January 2017, we identified 7,760 potentially relevant journal articles and data archives using the Boolean expression Rosenberg self-esteem AND (factor analysis OR factor structure OR principal component analysis). After reviewing the titles and abstracts of these results, we retained all studies that met the following criteria: (a) the study administered the original 10-item version of the RSES, (b) the questionnaire employed at least four response options (in order to implement linear factor analyses in subsequent analyses, see Rhemtulla, Brosseau-Liard, & Savalei, 2012), and (c) the loading pattern from an exploratory factor analysis or the full covariance matrix between all items was reported. In case the raw data of a study were available, we calculated the respective covariance matrix. If oblique factor rotations were used, we only considered studies that also reported the respective factor correlations. Moreover, the analyses were limited to (d) samples including healthy individuals without mental disorders. This literature search and screening process resulted in 34 eligible studies for our meta-analysis that reported on 113 independent samples (see Figure 1).

Figure 1. Flowchart of the search process.

Coding Process
In a coding protocol (available in the online data repository, see below), we defined all relevant information to be extracted from each publication and gave guidelines concerning the range of potential values for each variable. Since covariance matrices on an item-level were rarely reported, loading patterns from exploratory factor analyses
were the focal statistics. In case different factor solutions for one and the same sample were available, we used the factor loading pattern with the largest number of factors. Additionally, descriptive information was collected on the sample (e.g., sample size, country, mean age, percentage of female participants), the publication (e.g., publication year), and the reported factor analysis (e.g., factor analytic method, type of rotation). All studies were coded by the first author. To evaluate the coding process, two thirds of the studies were independently coded a second time by the second author. Intercoder agreement was quantified using two-way intraclass coefficients (ICC; Shrout & Fleiss, 1979), which indicate strong agreement for values exceeding .70 and excellent agreement for values greater than .90 (LeBreton & Senter, 2008). The intercoder reliabilities were generally high (approaching 1); for example, for the factor loadings the ICC was .99, 95% CI [.99, .99].

Meta-Analytic Procedure
Effect Size
The zero-order Pearson product moment correlations between the 10 items of the RSES were used as effect sizes. Ten samples reported the respective correlation matrices, whereas 26 samples provided raw data that allowed the calculation of these correlations. The remaining 77 samples reported factor pattern matrices that were used to reproduce the item-level correlations (Gnambs & Staufenbiel, 2016). One study (Rojas-Barahona, Zegers, & Förster, 2009) neglected to report the full factor loading pattern and excluded small loadings falling below .40. In this case, a value of zero was imputed for the missing factor loadings, because Monte Carlo simulations indicated that this approach results in unbiased estimates of meta-analytic factor patterns (Gnambs & Staufenbiel, 2016).
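The reproduction of item intercorrelations from a reported oblique factor pattern follows the usual factor analytic identity R = ΛΦΛ' with unit diagonal. The following sketch illustrates this principle with the first three items of the pattern later reported in Table 2 and a factor correlation of .68; it is an illustration under these assumptions, not the coding script used for the meta-analysis.

# 'Lambda' holds an oblique factor pattern (here: the first three items of Table 2,
# for illustration only) and 'Phi' the factor correlation matrix.
Lambda <- matrix(c(.23, .51,
                   .78, .02,
                   .08, .75),
                 ncol = 2, byrow = TRUE)
Phi <- matrix(c(1, .68,
                .68, 1), ncol = 2)

R_reproduced <- Lambda %*% Phi %*% t(Lambda)  # model-implied item correlations
diag(R_reproduced) <- 1                       # item variances fixed to unity
round(R_reproduced, 3)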
Figure 2. Single factor and acquiescence models for the RSES with standardized factor loadings.
Meta-Analytic Factor Analyses
The correlation matrices were pooled across samples using a recent development in MASEM (Cheung, 2014) that allows for the meta-analytic integration of correlation matrices and factor loading structures from exploratory factor analyses (see Gnambs & Staufenbiel, 2016). More precisely, for each item pair of the RSES the correlations were pooled using a fixed-effects model with a generalized least square estimator (Becker, 1992). Sampling error was accounted for by weighting each individual correlation using the sample size. The derived pooled correlation matrix for the RSES was used as input for confirmatory factor analyses with a maximum likelihood estimator. A series of simulation studies indicated that this meta-analytic procedure precisely
recovers the population factor structure of an instrument (Gnambs & Staufenbiel, 2016). Multiple criteria were used to evaluate the fit of competing factor models (see Figures 2 and 3). In line with conventional standards (see Schermelleh-Engel, Moosbrugger, & Müller, 2003), models with a comparative fit index (CFI) ≥ .95, a root mean square error of approximation (RMSEA) ≤ .08, and a standardized root mean square residual (SRMR) ≤ .10 were interpreted as “acceptable”, and models with CFI ≥ .97, RMSEA ≤ .05, and SRMR ≤ .05 as “good” fitting.

Figure 3. Multidimensional factor models for the RSES with standardized factor loadings.

Moderator Analyses
Cross-cultural measurement invariance was evaluated within the well-established framework of multigroup
confirmatory factor analysis (Wicherts & Dolan, 2010). First, each country was allotted the respective individualism score from Minkov et al. (2017), which reflects the relative standing of each country on this cultural dimension. Then, the samples were divided at the mean individualism score (M = 0) into two groups (low versus high). Because various factors (e.g., language, economic conditions, political systems) can contribute to cross-country differences, samples from the United States, as an example of a highly individualistic country, formed a third group. The latter was used as a homogeneous reference to gauge the robustness of the identified factor patterns. We expected negligible differences between the US samples and samples from other highly individualistic countries, whereas both groups should show similar differences in comparison to samples from less individualistic countries. Subsequently, we reestimated the pooled correlation matrices and fitted the factor models to the correlation matrices within each group. Different steps of invariance of the measurement models can be tested by applying increasingly restrictive constraints across groups. Because of the large sample size and the resulting excessive power of statistical tests, measurement invariance was evaluated based on differences in practical fit indices (Marsh, Nagengast, & Morin, 2013). Simulation studies indicated that differences in CFI of less than .002 between two hierarchically nested models indicate essential measurement invariance (Khojasteh & Lo, 2015; Meade, Johnson, & Braddy, 2008). Moreover, differences in factor loadings between groups of less than .10 are considered negligible (cf. Saris, Satorra, & van der Veld, 2009).

Sensitivity Analyses
The robustness of the identified factor structure was evaluated by subjecting the samples with complete correlation matrices (n = 36) to a random-effects meta-analysis (Cheung & Chan, 2005; Jak, 2015). Therefore, the pooled correlation matrix was estimated using a multivariate approach with a weighted least squares estimator. Subsequently, we repeated the factor analyses using the asymptotic covariance matrix derived in the previous step as the weight matrix for the factor models. Simulation studies indicated that this two-step approach is superior to univariate meta-analyses and more precisely recovers population effects (Cheung & Chan, 2005). However, as of yet, it cannot accommodate correlations reproduced from factor patterns.

Examined Factor Models for the RSES
We tested a series of structural models for the RSES that have been frequently applied in the literature (see Figures 2 and 3). If not stated otherwise, factor loadings and residual variances were freely estimated, whereas the
latent factor variances were fixed to 1 for identification purposes. Moreover, the residual variances for all items were uncorrelated.

Model 1: Single Factor Model
A single common factor was assumed to explain the covariances between the RSES items (see Figure 2). This model corresponds to the original construction rationale of the scale (Rosenberg, 1965) and implicitly guided most applied research that derived simple sum scores from the RSES items.

Model 2: Acquiescence Model
Self-reports are frequently distorted by systematic response styles such as acquiescence, that is, interindividual differences in the tendency to agree to an item independent of its content (Ferrando & Lorenzo-Seva, 2010). Therefore, we extended Model 1 by another orthogonal latent factor common to all items with factor loadings fixed to 1 (Aichholzer, 2014; Billiet & McClendon, 2000). The latent variance of the second factor was freely estimated and reflected differences in acquiescence.

Model 3: Correlated Trait Factors for Positive and Negative Self-Esteem
Two correlated latent factors were specified that represent positive and negative self-esteem (see Model 3 in Figure 3), indicated by either the five positively keyed items (1, 3, 4, 7, 10) or the five negatively keyed items (2, 5, 6, 8, 9), respectively. This model was suggested in early factor analytic studies (e.g., Dobson et al., 1979; Goldsmith, 1986; Goldsmith & Goldsmith, 1982; Hensley & Roberts, 1976) and reflects the assumption of qualitatively different types of self-esteem for differently worded items (see also Alessandri et al., 2015; Owens, 1994).

Model 4: Bifactor Model for Positive and Negative Self-Esteem
The bifactor structure (see Brunner et al., 2012; Reise, 2012) included a general factor for all items of the RSES and two specific factors for the positively and negatively keyed items (see Model 4 in Figure 3). In this model, the two method factors capture the residual variance that is attributed to the positively and negatively keyed items after accounting for the shared variance of all items. Trait and method factors were uncorrelated. This model is mathematically equivalent to the correlated trait model but does not include proportional constraints on the factor loadings (Reise, 2012). Because previous studies (e.g., Donnellan et al., 2016; Marsh et al., 2010) found more pronounced method effects for negatively keyed items and inconsistent loading patterns (i.e., nonsignificant or even negative) for the positively keyed items, we also estimated two nested factor
models (see Eid, Geiser, Koch, & Heene, 2017; Schulze, 2005) that included only one specific factor, either for the positively or for the negatively worded items (Models 4a and 4b). In this model, the general factor is understood as general self-esteem, which is orthogonal to a method factor capturing the residual variance of the items.

Model 5: Correlated Trait Factors for Self-Liking and Self-Competence
In line with Tafarodi and Milne (2002; see also Tafarodi & Swann, 1995), two qualitatively distinct subcomponents of self-esteem, self-liking and self-competence, were modeled with two correlated latent factors (see Model 5 in Figure 3). Self-liking was indicated by items 1, 2, 6, 8, and 10, whereas self-competence was formed by the remaining items (3, 4, 5, 7, 9).

Model 6: Bifactor Model for Self-Liking and Self-Competence
Similar to Model 4, the correlated trait model was reparameterized as a bifactor structure including a general self-esteem factor and two specific factors (see Model 6 in Figure 3). In this model, the two specific factors captured the residual variance that is attributed to self-liking and self-competence after accounting for the shared variance of all items. Again, we also estimated two nested factor models (Models 6a and 6b) that included only one specific factor, either for self-liking or for self-competence, to independently evaluate the relevance of each specific factor.

Model 7: Combined Bifactor Model
This model combined the bifactor model for positive and negative self-esteem (Model 4) with the bifactor model for self-liking and self-competence (Model 6). Following Tafarodi and Milne (2002), we modeled five orthogonal latent factors: all 10 items loaded on the general factor, whereas the four specific factors were defined by five items each, either the positively keyed items (1, 3, 4, 7, 10), the negatively keyed items (2, 5, 6, 8, 9), the items associated with self-liking (1, 2, 6, 8, 10), or the items referring to self-competence (3, 4, 5, 7, 9). However, in past research this model frequently failed to converge due to overfactorization (e.g., Alessandri et al., 2015; Donnellan et al., 2016; Marsh et al., 2010).

Statistical Software and Open Data
All analyses were conducted in R version 3.4.2 (R Core Team, 2017). The factor models were estimated in lavaan version 0.5-23.1097 (Rosseel, 2012) and metaSEM version 0.9.16 (Cheung, 2015). To foster transparency and reproducibility of our analyses (see Nosek et al., 2015), we provide all coded data and the R scripts in an online
repository of the Open Science Framework: https://osf.io/uwfsp.
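The two central analysis steps, fixed-effects pooling of the item correlations and fitting the factor models to the pooled matrix, could be sketched as follows with the packages named above. The objects R_list (a list of 10 x 10 item correlation matrices with variable names i1 to i10) and n_vec (the corresponding sample sizes) are placeholders; this is a minimal sketch under those assumptions, whereas the authors' actual scripts are available in the OSF repository.

library(metaSEM)  # meta-analytic SEM (Cheung, 2015)
library(lavaan)   # confirmatory factor analysis (Rosseel, 2012)

# Stage 1: pool the correlation matrices with a fixed-effects model.
stage1 <- tssem1(R_list, n_vec, method = "FEM")
pooled_R <- coef(stage1)  # pooled 10 x 10 correlation matrix

# Model 4: bifactor model with a general self-esteem factor and specific factors
# for positively and negatively keyed items; factors orthogonal, variances fixed to 1.
model4 <- '
  general  =~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 + i10
  positive =~ i1 + i3 + i4 + i7 + i10
  negative =~ i2 + i5 + i6 + i8 + i9
'
fit4 <- cfa(model4, sample.cov = pooled_R, sample.nobs = sum(n_vec),
            std.lv = TRUE, orthogonal = TRUE, estimator = "ML")
fitMeasures(fit4, c("chisq", "df", "cfi", "tli", "srmr", "rmsea"))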
Results

Study Characteristics
The meta-analysis included 113 independent samples that were published between 1969 and 2017 (Mdn = 2005). About half of the samples (n = 53) were from a single publication (Schmitt & Allik, 2005) that compared the RSES across several cultural groups. The remaining studies provided between 1 and 10 samples (Mdn = 1). In total, the samples included N = 140,671 participants; the median sample size was 380 (Min = 59, Max = 22,131). The samples included, on average, Mdn = 55% women (Min = 0%, Max = 100%) and had a mean age of M = 28.05 years (SD = 12.95, Min = 10.49, Max = 67.54). Most samples were from the United States (18%), the Netherlands (8%), and Germany (6%). Accordingly, the predominant language of the administered RSES was English (42%), followed by Dutch (10%) and German (8%). Thirty-two percent of the samples provided correlation matrices between the 10 items of the RSES, whereas the rest reported factor loading patterns. For the latter, about 86% reported one-factor structures and the others two-factor solutions with varimax rotation. The characteristics of each individual sample are given in Table S1 in the Electronic Supplementary Material, ESM 1.

Pooled Correlation Matrix for the RSES
Following Gnambs and Staufenbiel (2016), we pooled the (reproduced) correlations between the 10 items of the RSES across all samples. The respective correlation matrix is given in Table 1 (lower off diagonal). All items were substantially correlated, with correlations ranging from .21 to .61 (Mdn = .40). Given the large overall sample size, the respective standard errors were small (all SEs < .001). Moreover, Kaiser’s measure of sampling adequacy (MSA; Kaiser & Rice, 1974) indicated substantial dependencies between the items (all MSAs > .89), thus demonstrating the adequacy of the pooled correlation matrix for further factor analytic examinations. The eigenvalues of the first two unrotated factors exceeded 1 (λ1 = 4.61 and λ2 = 1.10), whereas the third did not (λ3 = 0.68). Accordingly, we conducted an exploratory maximum likelihood factor analysis with oblimin rotation that extracted two factors (see Table 2). These factors closely mirrored the correlated trait model for positive and negative self-esteem (see Model 3 in Figure 3). The five negatively worded items had salient loadings on one factor, Mdn(|λ|) = .57 (Min = .45, Max = .80), whereas the positively worded items primarily loaded on the second factor, Mdn(|λ|) = .61 (Min = .51, Max = .75). All cross-loadings were small, Mdn(|λ|) = .07 (Min = .01, Max = .23).
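The exploratory step reported above can be sketched from the pooled matrix as follows. The object pooled_R is a placeholder for the pooled correlation matrix in Table 1, and the psych package used here is not part of the toolchain named in the Method section; the sketch only illustrates the kind of analysis described.

# Eigenvalues of the pooled correlation matrix (the first two exceeded 1 in this study).
eigen(pooled_R)$values

library(psych)  # fa() requires the GPArotation package for oblimin rotation
efa2 <- fa(r = pooled_R, nfactors = 2, n.obs = 140671,
           fm = "ml", rotate = "oblimin")
print(efa2$loadings, cutoff = .40)  # salient pattern coefficients (cf. Table 2)
efa2$Phi                            # factor correlation (reported as .68)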
Table 1. Pooled correlation matrices for the items of the Rosenberg Self-Esteem Scale

          Item 1   Item 2   Item 3   Item 4   Item 5   Item 6   Item 7   Item 8   Item 9   Item 10
Item 1       –      .407     .428     .382     .368     .394     .434     .301     .437     .621
Item 2     .423       –      .310     .281     .459     .651     .334     .423     .534     .433
Item 3     .419     .330       –      .470     .347     .287     .538     .201     .347     .466
Item 4     .374     .298     .449       –      .288     .268     .418     .181     .304     .399
Item 5     .395     .446     .363     .313       –      .450     .338     .346     .490     .383
Item 6     .411     .605     .312     .292     .446       –      .312     .418     .526     .414
Item 7     .424     .355     .498     .400     .362     .335       –      .227     .373     .476
Item 8     .328     .405     .230     .206     .345     .394     .259       –      .397     .335
Item 9     .446     .516     .360     .322     .479     .510     .386     .396       –      .480
Item 10    .589     .448     .457     .398     .408     .434     .468     .361     .473       –

Notes. Correlations for 113 independent samples (N = 140,671) pooled with a fixed-effects model below the diagonal and correlations for 36 independent samples reporting full correlation matrices (N = 109,988) pooled with a random-effects model above the diagonal.
Table 2. Meta-analytic exploratory factor analysis of the Rosenberg Self-Esteem Scale

                     Factor 1   Factor 2    h²
Item 1                  .23        .51     .47
Item 2#                 .78        .02     .59
Item 3                  .08        .75     .49
Item 4                  .03        .61     .35
Item 5#                 .45        .23     .39
Item 6#                 .80        .05     .58
Item 7                  .01        .67     .46
Item 8#                 .51        .04     .29
Item 9#                 .57        .19     .50
Item 10                 .23        .56     .54
Eigenvalue             2.36       2.29
Explained variance      24%        23%

Notes. N = 140,671. Maximum likelihood factor analysis with oblimin rotation (factor correlation: .68) based upon pooled correlation matrix. Gray cells indicate salient pattern coefficients > .40; #negatively keyed items.
Because the two factors were substantially correlated (r = .68), the covariances between the RSES items were at least partially attributable to a common factor.

Evaluation of Structural Models for the RSES
Given the correlated factor structure, we examined to what degree the item variances could be explained by a general factor underlying all 10 items of the RSES. To this end, we fitted 11 different structural models to the pooled correlation matrix. The fit statistics in Table 3 highlight several notable results. First, the single factor model (see Figure 2) exhibited a rather inferior fit: CFI = .90, TLI = .87, and RMSEA = .10. This is in line with our exploratory analyses and the prevalent factor analytic literature on the RSES (e.g., Donnellan et al., 2016; Marsh et al., 2010; Michaelides et al., 2016). Second, although modeling an acquiescence factor improved the model fit (CFI = .97,
TLI = .95, RMSEA = .06), the latent variance was rather small (Var = 0.049). The acquiescence factor explained less than 5% of the common variance (ECV; Rodriguez, Reise, & Haviland, 2016). Third, all multidimensional models for wording effects outperformed the respective models for self-liking and self-competence. Thus, there was more support for negative and positive self-esteem than for Tafarodi’s self-esteem facets (Tafarodi & Milne, 2002; Tafarodi & Swann, 1995). Finally, Model 7 with specific factors for wording effects, self-liking, self-competence, and a general self-esteem factor showed the best fit in terms of the information criteria. However, the practical fit indices indicated only a marginally better fit than the more parsimonious bifactor model with wording effects (Model 4). The loading patterns for all examined models are summarized in Table S2 in ESM 1. Despite the empirical preference for the more complex multidimensional models as compared to the single factor model and the acquiescence model, most specific factors had issues with factor loadings (see Figure 3). The specific positive factor (Model 4) exhibited only a single substantial loading greater than .40 (item 3) and even two loadings close to zero. This corroborates previous findings (e.g., Donnellan et al., 2016; Marsh et al., 2010) that demonstrated rather unclear loading patterns for the positively keyed items. Similarly, the items showed only weak (or even negative) specific factor loadings for self-liking and self-competence (Model 6). Only negative self-esteem captured substantial residual variance over and above the general factor. However, the ECV for the bifactor models highlighted that most variance was captured by the general factor: in Model 4, ECV was .88 for the general, .02 for the positive, and .10 for the negative factor, whereas in Model 6 ECV was .95 for the general, .00 for the self-liking, and .04 for the self-competence factor. Thus, the multidimensionality in the RSES was predominantly attributable to the negatively keyed items.
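Explained common variance is commonly computed from the standardized bifactor loadings as the share of the summed squared loadings attributable to each factor (Rodriguez et al., 2016). The following sketch shows this generic computation; the loading vectors are placeholders (with zeros for items that do not load on a specific factor), and the authors' own computations are documented in their OSF scripts.

# ECV from standardized bifactor loadings; items not loading on a specific factor enter as 0.
ecv <- function(lambda_general, lambda_positive, lambda_negative) {
  ss <- c(general  = sum(lambda_general^2),
          positive = sum(lambda_positive^2),
          negative = sum(lambda_negative^2))
  round(ss / sum(ss), 2)
}

# Hypothetical usage with loading vectors extracted from a fitted bifactor model:
# ecv(lambda_general, lambda_positive, lambda_negative)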
Table 3. Fit statistics for different factor models for the Rosenberg Self-Esteem Scale

Model                                          χ²           df   CFI    TLI    SRMR   RMSEA   90% CI          AIC            BIC
1. Single factor model                         48,361.93*   35   .902   .874   .053   .099    [.098, .100]    3,547,789.43   3,547,896.51
2. Acquiescence model                          17,474.60*   34   .965   .953   .030   .060    [.060, .061]    3,516,904.10   3,517,111.03
Positive and negative self-esteem
3. Correlated traits model                     17,856.52*   34   .964   .952   .032   .061    [.060, .062]    3,517,286.02   3,517,492.96
4. Bifactor model                               3,518.86*   25   .993   .987   .013   .032    [.031, .032]    3,502,966.63   3,503,261.99
4a. Nested factor for positive self-esteem     14,160.36*   30   .971   .957   .026   .058    [.057, .059]    3,514,597.86   3,514,844.21
4b. Nested factor for negative self-esteem     13,701.92*   30   .972   .958   .027   .057    [.056, .058]    3,513,139.42   3,513,306.32
Self-liking and self-competence
5. Correlated traits model                     45,840.92*   34   .907   .877   .051   .098    [.097, .099]    3,545,270.42   3,545,477.36
6. Bifactor model                              10,680.82*   25   .978   .961   .029   .055    [.054, .056]    3,510,128.32   3,510,423.95
6a. Nested factor for self-liking              29,326.87*   30   .941   .911   .044   .083    [.083, .084]    3,528,764.37   3,529,010.72
6b. Nested factor for self-competence          29,229.22*   30   .941   .911   .040   .083    [.082, .084]    3,528,666.72   3,528,913.08
7. Combined bifactor model                        269.32*   15   .999   .998   .003   .011    [.010, .012]    3,499,736.82   3,500,002.87

Notes. N = 140,671. CFI = comparative fit index; TLI = Tucker-Lewis index; SRMR = standardized root mean square residual; RMSEA = root mean square error of approximation with 90% confidence interval; AIC = Akaike’s information criterion; BIC = Bayesian information criterion. *p < .05.
Sensitivity Analyses
The robustness of the identified factor structure was studied by repeating the meta-analytic factor analyses for the subgroup of samples reporting full correlation matrices using a random-effects model. The pooled correlation matrix (upper off diagonal in Table 1) closely mirrored the previously derived pooled correlations. On average, the difference in correlations was M(|Δr|) = .02 (SD = .01, Max = .05). As a result, the competing factor models exhibited a highly similar pattern of results (see Table S3 as well as Figures S1 and S2 in ESM 1). However, the most complex Model 7 failed to converge, indicating a serious misspecification (for similar problems, see Donnellan et al., 2016; Marsh et al., 2010). The best fit was achieved by the bifactor model for wording effects (Model 4). Again, the general factor explained most of the common variance (ECV = .84) as compared to the specific factors (ECV = .03 and .13). Because the number of response options can affect factor analytic results (Beauducel & Herzberg, 2006; Rhemtulla et al., 2012), we compared samples administering four- versus five-point response scales. Multigroup modeling of the bifactor structure for positive and negative self-esteem (Model 4) showed metric measurement invariance for the general factor (ΔCFI = .003, ΔSRMR = .020). Moreover, the difference in factor loadings between the two groups was small, M(Δβ) = .05. Thus, the response format had a negligible effect on our results.

Cross-Cultural Measurement Invariance
From the United States, 20 independent samples (total N = 36,131) were available, whereas 38 samples (total N = 73,796) and 31 samples (total N = 19,900) stemmed from highly and less individualistic countries, respectively. An unconstrained multigroup model for these groups resulted in an excellent
fit of the bifactor model for positive and negative self-esteem (Model 4), χ²(df = 75) = 3,918, CFI = .992, TLI = .986, SRMR = .014, RMSEA = .034. Equality constraints on the general factor loadings across all three groups led to a noticeable decline in fit (ΔCFI = .007, ΔSRMR = .038), whereas respective constraints that were limited to the United States and highly individualistic countries showed a comparable fit (ΔCFI = .002, ΔSRMR = .020). Thus, in less individualistic countries the general factor loadings were, on average, M(Δβ) = .16 smaller than in the United States (see Table 4). In particular, negatively worded items exhibited smaller loadings, M(Δβ) = .24, and to a lesser degree also positively worded items, M(Δβ) = .08. Item 8 even showed a general factor loading around zero. As a consequence, the common variance explained by the general factor was higher in the United States (ECV = .92) and other individualistic countries (ECV = .88) as compared to less individualistic countries (ECV = .82). At the same time, ECV for the negative factor showed a reversed pattern with values of .07, .10, and .15 for the three groups.
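The type of multigroup comparison reported here could be sketched in lavaan as follows: a configural model and a model with loadings constrained to equality are fitted to the group-specific pooled correlation matrices, and the change in CFI is inspected against the .002 criterion. R_groups (a named list of pooled correlation matrices for the three groups) and n_groups (the corresponding sample sizes) are placeholder objects, and the sketch constrains all loadings, whereas the reported analysis constrained the general factor loadings only; it is an illustration, not the published analysis script.

library(lavaan)

model4 <- '
  general  =~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 + i10
  positive =~ i1 + i3 + i4 + i7 + i10
  negative =~ i2 + i5 + i6 + i8 + i9
'

# Configural model: same structure in all groups, all loadings free.
configural <- cfa(model4, sample.cov = R_groups, sample.nobs = n_groups,
                  std.lv = TRUE, orthogonal = TRUE)

# Constrained model: loadings equal across groups (a simplification of the reported analysis).
constrained <- cfa(model4, sample.cov = R_groups, sample.nobs = n_groups,
                   std.lv = TRUE, orthogonal = TRUE, group.equal = "loadings")

# A decrease in CFI of less than .002 is taken as essential invariance
# (Meade, Johnson, & Braddy, 2008).
fitMeasures(configural, "cfi") - fitMeasures(constrained, "cfi")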
Discussion

The present study provided a meta-analytic perspective on the structure of one of the most popular instruments for the assessment of self-esteem, the RSES. The novel meta-analytic approach (Gnambs & Staufenbiel, 2016; see also Cheung, 2014) was based on item-level variance-covariance matrices and, thus, allowed us to compare several competing measurement models for the RSES that have been proposed in the recent literature (see Donnellan et al., 2016; Urbán et al., 2014).
Table 4. Bifactor loadings for positive and negative self-esteem by individualism

            General factor             Positive factor            Negative factor
            βUS    ΔβHI   ΔβLO         βUS    ΔβHI   ΔβLO         βUS    ΔβHI   ΔβLO
Item 1      .77    .03    .09          .13    .08    .11
Item 2#     .64    .07    .18                                     .51    .00    .06
Item 3      .67    .10    .06          .39    .11    .01
Item 4      .58    .09    .01          .23    .12    .04
Item 5#     .65    .10    .23                                     .23    .07    .11
Item 6#     .58    .02    .11                                     .54    .00    .05
Item 7      .71    .12    .15          .31    .04    .00
Item 8#     .53    .02    .48                                     .29    .02    .01
Item 9#     .68    .06    .18                                     .24    .13    .15
Item 10     .81    .00    .09          .07    .03    .07
ECV         .92    .88    .82          .01    .03    .02          .07    .10    .15

Notes. NUS = 36,131 in 20 samples, NHI = 73,796 in 38 samples, NLO = 19,900 in 31 samples. βUS = standardized factor loading in US samples; ΔβHI = difference in standardized factor loading between US samples and highly individualistic samples; ΔβLO = difference in standardized factor loading between US samples and less individualistic samples. ECV = explained common variance (Rodriguez et al., 2016). #Negatively keyed items.
The current findings warrant four main conclusions. First, a single latent factor is insufficient to adequately describe responses to the RSES (Hypothesis 1). Rather, the scale exhibits multidimensionality with regard to the wording of the items. Because these wording effects predominantly pertain to the negatively keyed items, they can be interpreted as method effects such as response styles (i.e., acquiescence). Second, the theoretically derived facets of self-liking and self-competence (Tafarodi & Milne, 2002; Tafarodi & Swann, 1995) received only limited support (Hypothesis 2). The respective models generally exhibited worse fits than comparable models including wording effects (or even failed to converge). In view of these results, independent subscale scores for self-liking and self-competence should not be used. Third, most of the common variance in the RSES was explained by a general self-esteem factor and only up to 15% by specific factors (Hypothesis 3), which is in line with Rosenberg's (1965) original notion of self-esteem as a unitary construct. The strong general factor also suggests that it is not useful to distinguish between positive and negative aspects of self-esteem in empirical analyses, because little variance is unique to each subscale. Finally, the general factor loadings were subject to strong cross-cultural variability. In less individualistic countries, the respective factor loadings were significantly smaller, particularly for the negatively keyed items. The noninvariance of the RSES challenges its usefulness for cross-cultural comparisons because different measurement models across countries can lead to seriously biased test statistics and, consequently, wrong conclusions (see Chen, 2008; Kuha & Moustaki, 2015). What are the practical implications of these results for the measurement of self-esteem? Although the RSES is not strictly unidimensional, secondary dimensions have only a modest impact on the item responses and, thus,
introduce a seemingly small bias in composite scores of the RSES. In fact, some authors argue that the validity of the general self-esteem factor is hardly affected when wording effects are not controlled for (Donnellan et al., 2016). More troublesome is the lack of cross-cultural measurement invariance. If members of different cultural groups (i.e., individualistic versus collectivistic) interpret items of the RSES differently, the resulting scale scores cannot be meaningfully compared (van de Vijver & Poortinga, 1997). In particular, negatively worded items exhibited smaller loadings on the general self-esteem factor among members of less as compared to highly individualistic societies. These results fall in line with an international large-scale administration of the RSES (Schmitt & Allik, 2005) that found negatively worded items to be interpreted differently across culturally heterogeneous groups. Moreover, items referring to pride and respect exhibited significantly lower loadings on the general self-esteem factor. Presumably, these concepts convey a different meaning in less individualistic societies. Whereas pride in one's accomplishments might reflect a healthy form of self-confidence in individualistic countries such as the United States, it might be conceived as presumptuous and arrogant in societies valuing modesty (Wu, 2008). Thus, out of modesty, people from less individualistic countries might be unwilling to emphasize their self-worth. Although the reasons for the observed noninvariance remain speculative, the bottom line is that cross-cultural research with the RSES might unjustifiably align incomparable concepts, unless measurement invariance has been explicitly corroborated for the countries at hand. Finally, we want to acknowledge some limitations of our study that might open avenues for future research. Meta-analytic conclusions can only be as good as the quality of
the included primary studies. For example, intense random responding in some samples (Huang, Liu, & Bowling, 2015) or different assessment contexts (see also Gnambs & Kaspar, 2015, 2017) might have distorted the reported effect sizes and, consequently, biased the meta-analytic factor models. Similarly, splitting continuous moderators into qualitatively distinct groups is associated with several methodological problems (see MacCallum, Zhang, Preacher, & Rucker, 2002). Therefore, the present results should be replicated with individual-participant data, preferably from representative large-scale assessments (cf. Cheung & Jak, 2016; Kaufmann, Reips, & Merki, 2016), that allow for an appropriate modeling of moderated factor structures (see Klein & Moosbrugger, 2000; Molenaar, Dolan, Wicherts, & van der Maas, 2010). However, we also think that the adopted meta-analytic approach provides excellent possibilities to aggregate inconsistent results. MASEM allows scrutinizing the heterogeneity of published studies in search of potential moderators. Accordingly, in our opinion it is time to abandon simple factor analytic research on the RSES in yet another sample and, rather, move on to identify moderating influences that explain why the scale exhibits, for example, strong wording effects in some samples and not in others (Gnambs & Schroeders, in press; Marsh, 1996). In addition, it seems important to evaluate under what circumstances neglecting to model secondary factors does not, in fact, lead to substantial bias in applied settings. Finally, we hope to see more research tackling the problem of measurement invariance in the assessment of noncognitive abilities (van de Vijver & He, 2016), particularly for the coherent measurement of self-esteem across culturally diverse groups. There is ample evidence that cross-group comparisons may be severely distorted (Chen, 2008; Kuha & Moustaki, 2015), unless measurement equivalence has been corroborated for the samples at hand. Therefore, we hope that the presented methods will stimulate further research on the measurement of self-esteem across different cultures and societies.
Acknowledgments
In this paper, we make use of data of the LISS (Longitudinal Internet Studies for the Social Sciences) panel administered by CentERdata (Tilburg University, The Netherlands) through its MESS project funded by the Netherlands Organization for Scientific Research. Moreover, this paper also uses data from the National Educational Panel Study (NEPS): Starting Cohort Grade 5, https://doi.org/10.5157/NEPS:SC3:6.0.0; Starting Cohort Grade 9, https://doi.org/10.5157/NEPS:SC4:7.0.0; Starting Cohort First-Year Students, https://doi.org/10.5157/NEPS:SC5:8.0.0; Starting Cohort Adults, https://doi.org/10.5157/NEPS:SC6:7.0.0. From 2008 to 2013, NEPS data were collected as part of the Framework Program for the Promotion of Empirical
Educational Research funded by the German Federal Ministry of Education and Research (BMBF). As of 2014, NEPS is carried out by the Leibniz Institute for Educational Trajectories (LIfBi) at the University of Bamberg, Germany, in cooperation with a nationwide network. The action editor for this article was Michael Bošnjak.
Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/2151-2604/a000317
ESM 1. Text (.pdf). Tables S1–S3 and Figures S1–S2.
References *References marked with an asterisk were included in the metaanalysis. Aichholzer, J. (2014). Random intercept EFA of personality scales. Journal of Research in Personality, 53, 1–4. https://doi.org/ 10.1016/j.jrp.2014.07.001 Alessandri, G., Vecchione, M., Eisenberg, N., & Łaguna, M. (2015). On the factor structure of the Rosenberg (1965) General Self-Esteem Scale. Psychological Assessment, 27, 621–635. https://doi.org/10.1037/pas0000073 *Bagley, C., Bolitho, F., & Bertrand, L. (1997). Norms and construct validity of the Rosenberg Self-Esteem Scale in Canadian high school populations: Implications for counseling. Canadian Journal of Counseling, 31, 82–92. Bandura, A. (1977). Self-efficacy toward a unifying theory of behavioral change. Psychology Review, 84, 191–215. https:// doi.org/10.1037/0033-295X.84.2.191 Baranik, L. E., Meade, A. W., Lakey, C. E., Lance, C. E., Hu, C., Hua, W., & Michalos, A. (2008). Examining the differential item functioning of the Rosenberg Self-Esteem Scale across eight countries. Journal of Applied Social Psychology, 38, 1867–1904. https://doi.org/10.1111/j.1559-1816.2008.00372.x Beauducel, A., & Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Structural Equation Modeling, 13, 186–203. https://doi.org/10.1207/ s15328007sem1302_2 Becker, B. J. (1992). Using results from replicated studies to estimate linear models. Journal of Educational Statistics, 17, 341–362. https://doi.org/10.3102/10769986017004341 Billiet, J. B., & McClendon, M. J. (2000). Modeling acquiescence in measurement models for two balanced sets of items. Structural Equation Modeling, 7, 608–628. https://doi.org/ 10.1207/S15328007SEM0704_5 *Blossfeld, H.-P., Roßbach, H.-G, & von Maurice, J. (Eds.). (2011). Education as a lifelong process – The German National Educational Panel Study (NEPS) [Special Issue]. Zeitschrift für Erziehungswissenschaft, 14(2 Suppl.), 1–330. Brunner, M., Nagy, G., & Wilhelm, O. (2012). A tutorial on hierarchically structured constructs. Journal of Personality, 80, 796–846. https://doi.org/10.1111/j.1467-6494.2011.00749.x *Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Beverly Hills, CA: Sage University Press. *CentERdata. (2008). Longitudinal Internet studies for the social sciences [Computer file]. Tilburg, The Netherlands: Tilburg University [Distributor]. Retrieved from http://www.lissdata.nl
*Chao, R. C.-L., Vidacovich, C., & Green, K. E. (2017). Rasch analysis of the Rosenberg Self-Esteem Scale with African Americans. Psychological Assessment, 29, 329–342. https:// doi.org/10.1037/pas0000347 Chen, F. F. (2008). What happens if we compare chopsticks with forks? The impact of making inappropriate comparisons in cross-cultural research. Journal of Personality and Social Psychology, 95, 1005–1018. https://doi.org/10.1037/a0013193 Cheung, M. W. L. (2014). Fixed-and random-effects meta-analytic structural equation modeling: Examples and analyses in R. Behavior Research Methods, 46, 29–40. https://doi.org/ 10.3758/s13428-013-0361-y Cheung, M. W. L. (2015). metaSEM: An R package for metaanalysis using structural equation modeling. Frontiers in Psychology, 5, 1521. https://doi.org/10.3389/fpsyg.2014.01521 Cheung, M. W. L., & Chan, W. (2005). Meta-analytic structural equation modeling: A two-stage approach. Psychological Methods, 10, 40–64. https://doi.org/10.1037/1082-989X.10.1.40 Cheung, M. W. L., & Jak, S. (2016). Analyzing big data in psychology: A split/analyze/meta-analyze approach. Frontiers in Psychology, 7, 738. https://doi.org/10.3389/fpsyg.2016. 00738 DiStefano, C., & Motl, R. W. (2006). Further investigating method effects associated with negatively worded items on self-report surveys. Structural Equation Modeling, 13, 440–464. https:// doi.org/10.1207/s15328007sem1303_6 DiStefano, C., & Motl, R. W. (2009). Self-esteem and method effects associated with negatively worded items: Investigating factorial invariance by sex. Structural Equation Modeling, 16, 134–146. https://doi.org/10.1080/10705510802565403 *Dobson, C., Goudy, W. J., Keith, P. M., & Powers, E. (1979). Further analyses of Rosenberg’s self-esteem scale. Psychological Reports, 44, 639–641. https://doi.org/10.2466/pr0.1979.44.2.639 *Donnellan, M. B., Ackerman, R. A., & Brecheen, C. (2016). Extending structural analyses of the Rosenberg Self-Esteem Scale to consider criterion-related validity: Can composite self-esteem scores be good enough? Journal of Personality Assessment, 98, 169–177. https://doi.org/10.1080/00223891.2015.1058268 Eid, M., Geiser, C., Koch, T., & Heene, M. (2017). Anomalous results in g-factor models: Explanations and alternatives. Psychological Methods, 22, 541–562. https://doi.org/10.1037/met0000083 Epstein, J. A., Griffin, K. W., & Botvin, G. J. (2004). Efficacy, selfderogation, and alcohol use among inner-city adolescents: Gender matters. Journal of Youth and Adolescence, 33, 159–166. https://doi.org/10.1023/B:JOYO.0000013427.31960.c6 *Farid, M. F., & Akhtar, M. (2013). Self-esteem of secondary school students in Pakistan. Middle-East Journal of Scientific Research, 14, 1325–1330. Farruggia, S. P., Chen, C., Greenberger, E., Dmitrieva, J., & Macek, P. (2004). Adolescent self-esteem in cross-cultural perspective: Testing measurement equivalence and a mediation model. Journal of Cross-Cultural Psychology, 35, 719–733. https://doi.org/10.1177/0022022104270114 Ferrando, P. J., & Lorenzo-Seva, U. (2010). Acquiescence as a source of bias and model and person misfit: A theoretical and empirical analysis. British Journal of Mathematical and Statistical Psychology, 63, 427–448. https://doi.org/10.1348/ 000711009X470740 *Franck, E., de Raedt, R., Barbez, C., & Rosseel, Y. (2008). Psychometric properties of the Dutch Rosenberg Self-Esteem Scale. Psychologica Belgica, 48, 25–35. https://doi.org/ 10.5334/pb-48-1-25 Gnambs, T., & Kaspar, K. (2015). 
Disclosure of sensitive behaviors across self-administered survey modes: A meta-analysis. Behavior Research Methods, 47, 1237–1259. https://doi.org/ 10.3758/s13428-014-0533-4
Gnambs, T., & Kaspar, K. (2017). Socially desirable responding in web-based questionnaires: A meta-analytic review of the candor hypothesis. Assessment, 24, 746–762. https://doi.org/ 10.1177/1073191115624547 *Gnambs, T., & Schroeders, U. (in press). Cognitive abilities explain wording effects in the Rosenberg Self-Esteem Scale. Assessment. https://doi.org/10.1177/1073191117746503 Gnambs, T., & Staufenbiel, T. (2016). Parameter accuracy in metaanalyses of factor structures. Research Synthesis Methods, 7, 168–186. https://doi.org/10.1002/jrsm.1190 *Goldsmith, R. E. (1986). Personality and adaptive-innovative problem solving. Journal of Social Behavior and Personality, 1, 95–106. *Goldsmith, R. E., & Goldsmith, E. B. (1982). Dogmatism and selfesteem: Further evidence. Psychological Reports, 51, 289–290. https://doi.org/10.2466/pr0.1982.51.1.289 *Gray-Little, B., Williams, V. S. L., & Hancock, T. D. (1997). An item response theory analysis of the Rosenberg Self-Esteem Scale. Personality and Social Psychology Bulletin, 23, 443–451. https://doi.org/10.1177/0146167297235001 He, J., Bartram, D., Inceoglu, I., & van de Vijver, F. J. (2014). Response styles and personality traits: A multilevel analysis. Journal of Cross-Cultural Psychology, 45, 1028–1045. https:// doi.org/10.1177/0022022114534773 He, J., Vliert, E., & van de Vijver, F. J. (2016). Extreme response style as a cultural response to climato-economic deprivation. International Journal of Psychology. Advance online publication. https://doi.org/10.1002/ijop.12287 Heine, S. J., Lehman, D. R., Markus, H. R., & Kitayama, S. (1999). Is there a universal need for positive self-regard? Psychological Review, 106, 766–794. https://doi.org/10.1037/0033-295X. 106.4.766 *Hensley, W. E. (1977). Differences between males and females on Rosenberg scale of self-esteem. Psychological Reports, 41, 829–830. https://doi.org/10.2466/pr0.1977.41.3.829 *Hensley, W. E., & Roberts, M. K. (1976). Dimensions of Rosenberg’s self-esteem scale. Psychological Reports, 38, 583–584. https://doi.org/10.2466/pr0.1976.38.2.583 *Hesketh, T., Lu, L., & Dong, Z. X. (2012). Impact of high sex ratios on urban and rural China, 2009–2010 [Computer file]. Colchester, UK: UK Data Archive [Distributor]. https://doi.org/ 10.5255/UKDA-SN-7107-1 Huang, C., & Dong, N. (2012). Factor structures of the Rosenberg Self-Esteem Scale: A meta-analysis of pattern matrices. European Journal of Psychological Assessment, 28, 132–138. https://doi.org/10.1027/1015-5759/a000101 Huang, J. L., Liu, M., & Bowling, N. A. (2015). Insufficient effort responding: Examining an insidious confound in survey data. Journal of Applied Psychology, 100, 828–845. https://doi.org/ 10.1037/a0038510 Hofstede, G., Hofstede, G. J., & Minkov, M. (2010). Cultures and organizations. New York, NY: McGraw Hill. Jak, S. (2015). Meta-analytic structural equation modelling. Berlin, Germany: Springer. Johnson, T., Kulesa, P., Cho, Y. I., & Shavitt, S. (2005). The relation between culture and response styles: Evidence from 19 countries. Journal of Cross-Cultural Psychology, 36, 264–277. https://doi.org/10.1177/0022022104272905 Kaplan, H. B., Martin, S. S., & Robbins, C. (1982). Application of a general theory of deviant behavior: Self-derogation and adolescent drug use. Journal of Health and Social Behavior, 23, 274–294. https://doi.org/10.2307/2136487 *Kaplan, H. B., & Pokorny, A. D. (1969). Self-derogation and psychological adjustment. Journal of Nervous and Mental Disease, 149, 421–434. 
https://doi.org/10.1097/00005053-196911000-00006
Kaiser, H. F., & Rice, J. (1974). Little Jiffy, Mark IV. Educational and Psychological Measurement, 34, 111–117. https://doi.org/ 10.1177/001316447403400115 Kaufmann, E., Reips, U. D., & Merki, K. M. (2016). Avoiding methodological biases in meta-analysis. Zeitschrift für Psychologie, 224, 157–167. https://doi.org/10.1027/2151-2604/a000251 Khojasteh, K., & Lo, W.-J. (2015). Investigating the sensitivity of goodness-of-fit indices to detect measurement invariance in a bifactor model. Structural Equation Modeling, 22, 531–541. https://doi.org/10.1080/10705511.2014.937791 Kuha, J., & Moustaki, I. (2015). Non-equivalence of measurement in latent variable modeling of multi group data: A sensitivity analysis. Psychological Methods, 20, 523–536. https://doi.org/ 10.1037/met0000031 Klein, A., & Moosbrugger, H. (2000). Maximum likelihood estimation of latent interaction effects with the LMS method. Psychometrika, 65, 457–474. https://doi.org/10.1007/BF02296338 LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11, 815–852. https://doi.org/ 10.1177/1094428106296642 MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40. https://doi.org/ 10.1037/1082-989X.7.1.19 Markus, H. R., & Kitayama, S. (1991). Culture and the self: Implications for cognition, emotion, and motivation. Psychological Review, 98, 224–253. https://doi.org/10.1037/0033-295X. 98.2.224 Marsh, H. W. (1996). Positive and negative global self-esteem: A substantively meaningful distinction or artifactors? Journal of Personality and Social Psychology, 70, 810–9. https://doi.org/ 10.1037/0022-3514.70.4.810 Marsh, H. W., Nagengast, B., & Morin, A. J. (2013). Measurement invariance of Big-Five factors over the life span: ESEM tests of gender, age, plasticity, maturity, and la dolce vita effects. Developmental Psychology, 49, 1194–1218. https://doi.org/ 10.1037/a0026913 Marsh, H. W., Scalas, L. F., & Nagengast, B. (2010). Longitudinal tests of competing factor structures for the Rosenberg SelfEsteem Scale: Traits, ephemeral artifacts, and stable response styles. Psychological Assessment, 22, 366–381. https://doi.org/ 10.1037/a0019225 Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592. https:// doi.org/10.1037/0021-9010.93.3.568 *Meurer, S. T., Luft, C. B., Benedetti, T. R., & Mazo, G. Z. (2012). Validade de construto e consistência interna da escala deautoestima de Rosenberg para uma população de idososbrasileiros praticantes de atividades físicas [Construct validity and reliability in Rosenberg’s self-steem scale for Brazilian older adults who practice physical activities]. Motricidade, 8, 5–15. https://doi.org/10.6063/motricidade.8(4).1548 Michaelides, M. P., Koutsogiorgi, C., & Panayiotou, G. (2016). Method effects on an adaptation of the Rosenberg Self-Esteem Scale in Greek and the role of personality traits. Journal of Personality Assessment, 98, 178–188. https://doi.org/10.1080/ 00223891.2015.1089248 *Mimura, C., & Griffiths, P. (2007). A Japanese version of the Rosenberg Self-Esteem Scale: Translation and equivalence assessment. Journal of Psychosomatic Research, 62, 589–594. https://doi.org/10.1016/j.jpsychores.2006.11.004 Minkov, M., Dutt, P., Schachner, M., Morales, O., Sanchez, C., Jandosova, J., . 
. . Mudd, B. (2017). A revision of Hofstede’s individualism-collectivism dimension: A new national index from
a 56-country study. Cross Cultural & Strategic Management, 24, 386–404. https://doi.org/10.1108/CCSM-11-2016-019 *Mlačić, B., Milas, G., & Kratohvil, A. (2007). Adolescent personality and self-esteem – An analysis of self-report and parentalratings. Društvena istraživanja-Časopis za opća društvena pitanja, 1, 213–236. Molenaar, D., Dolan, C. V., Wicherts, J. M., & van der Maas, H. L. (2010). Modeling differentiation of cognitive abilities within the higher-order factor model using moderated factor analysis. Intelligence, 38, 611–624. https://doi.org/10.1016/j.intell. 2010.09.002 Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., . . . Yarkoni, T. (2015). Promoting an open research culture. Science, 348, 1420–1422. https://doi.org/ 10.1126/science.aab2374 *O’Brien, E. J. (1985). Global self-esteem scales: Unidimensional or multidimensional? Psychological Reports, 57, 383–389. https://doi.org/10.2466/pr0.1985.57.2.383 *Open Psychology Data. (2014). Answers to the Rosenberg SelfEsteem Scale. Retrieved from http://personality-testing.info/ _rawdata/ Owens, T. J. (1994). Two dimensions of self-esteem: Reciprocal effects of positive self-worth and self-deprecation on adolescent problems. American Sociological Review, 59, 391–407. https://doi.org/10.2307/2095940 *Portes, A., & Rumbaut, R. G. (2012). Children of Immigrants Longitudinal Study (CILS), 1991–2006. ICPSR20520–v2. Ann Arbor, MI: Inter-University Consortium for Political and Social Research [Distributor]. https://doi.org/10.3886/ICPSR20520.v2 Preacher, K. J., & MacCallum, R. C. (2003). Repairing Tom Swift’s electric factor analysis machine. Understanding Statistics, 2, 13–43. https://doi.org/10.1207/S15328031US0201_02 *Pullmann, H., & Allik, J. (2000). The Rosenberg Self-Esteem Scale: Its dimensionality, stability and personality correlates in Estonian. Personality and Individual Differences, 28, 701–715. https://doi.org/10.1016/S0191-8869(99)00132-4 R Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https//www.R-project.org/ Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47, 667–696. https://doi.org/10.1080/00273171.2012.715555 Reise, S. P., Kim, D. S., Mansolf, M., & Widaman, K. F. (2016). Is the bifactor model a better model or Ii it just better at modeling implausible responses? Application of iteratively reweighted least squares to the Rosenberg Self-Esteem Scale. Multivariate Behavioral Research, 51, 818–838. https://doi.org/10.1080/ 00273171.2016.1243461 Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92, 544–559. https://doi.org/10.1080/ 00223891.2010.496477 Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17, 354–373. https://doi.org/10.1037/a0029315 Rodriguez, A., Reise, S. P., & Haviland, M. G. (2016). Evaluating bifactor models: Calculating and interpreting statistical indices. Psychological Methods, 21, 137–150. https://doi.org/ 10.1037/met0000045 *Rojas-Barahona, C. A., Zegers, B., & Förster, C. A. (2009). 
La escala de autoestima de Rosenberg: Validación para Chile en una muestra de jóvenes adultos, adultos y adultos mayores [Rosenberg Self-Esteem Scale: Validation in a representative
sample of Chilean adults]. Revista Médica de Chile, 137, 791–800. https://doi.org/10.4067/S0034-98872009000600009 Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press. Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36. https://doi. org/10.18637/jss.v048.i02 Saris, W. E., Satorra, A., & van der Veld, W. (2009). Testing structural equation models or detection of misspecifications? Structural Equation Modeling, 16, 561–582. https://doi.org/ 10.1080/10705510903203433 *Sarkova, M., Nagyova, I., Katreniakova, Z., Geckova, A. M., Orosova, O., van Dijk, J. P., & van den Heufel, W. (2006). Psychometric evaluation of the General Health Questionnaire12 and Rosenberg Self-Esteem Scale in Hungarian and Slovak early adolescents. Studia Psychologica, 48, 69–79. Schermelleh-Engel, K., Moosbrugger, H., & Müller, H. (2003). Evaluating the fit of structural equation models: Test of significance and descriptive goodness-of-fit measures. Methods of Psychological Research Online, 8, 23–74. *Schmitt, D. P., & Allik, J. (2005). Simultaneous administration of the Rosenberg Self-Esteem Scale in 53 nations: Exploring the universal and culture-specific features of global self-esteem. Journal of Personality and Social Psychology, 89, 623–642. https://doi.org/10.1037/0022-3514.89.4.623 Schmitt, T. A. (2011). Current methodological considerations in exploratory and confirmatory factor analysis. Journal of Psychoeducational Assessment, 29, 304–321. https://doi.org/ 10.1177/0734282911406653 Schulze, R. (2005). Modeling structures of intelligence. In O. Wilhelm & R. W. Engle (Eds.), Handbook of understanding and measuring intelligence (pp. 241–263). Thousand Oaks, CA: Sage. *Shahani, C., Dipboye, R. L., & Phillips, A. P. (1990). Global selfesteem as a correlate of work-related attitudes: A question of dimensionality. Journal of Personality Assessment, 54, 276–288. https://doi.org/10.1207/s15327752jpa5401&2_26 Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428. https://doi.org/10.1037/0033-2909.86.2.420 *Sinclair, S. J., Blais, M. A., Gansler, D. A., Sanderber, E., Bistis, K., & LoCicero, A. (2010). Psychometric properties of the Rosenberg Self-Esteem Scale: Overall and across demographic groups living within the United States. Evaluation and the Health Professions, 33, 56–80. https://doi.org/10.1177/0163278709356187 Smith, P. B., Vignoles, V. L., Becker, M., Owe, E., Easterbrook, M. J., Brown, R., . . . Yuki, M. (2016). Individual and culturelevel components of survey response styles: A multi-level analysis using cultural models of selfhood. International Journal of Psychology, 51, 453–463. https://doi.org/10.1002/ ijop.12293 *Song, H., Cai, H., Brown, J. D., & Grimm, K. J. (2011). Differential item functioning of the Rosenberg Self-Esteem Scale in the US and China: Measurement bias matters. Asian Journal of Social Psychology, 14, 176–188. https://doi.org/10.1111/j.1467-839X. 2011.01347.x Supple, A. J., Su, J., Plunkett, S. W., Peterson, G. W., & Bush, K. R. (2013). Factor structure of the Rosenberg Self-Esteem Scale. Journal of Cross-Cultural Psychology, 44, 748–764. https://doi. org/10.1177/0022022112468942 Tafarodi, R. W., & Milne, A. B. (2002). Decomposing global selfesteem. Journal of Personality, 70, 443–484. https://doi.org/ 10.1111/1467-6494.05017
Tafarodi, R. W., & Swann, W. B. Jr. (1995). Self-linking and self-competence as dimensions of global self-esteem: Initial validation of a measure. Journal of Personality Assessment, 65, 322–342. https://doi.org/10.1207/s15327752jpa6502_8 Tomás, J. M., Oliver, A., Galiana, L., Sancho, P., & Lila, M. (2013). Explaining method effects associated with negatively worded items in trait and state global and domain-specific self-esteem scales. Structural Equation Modeling, 20, 299–313. https://doi. org/10.1080/10705511.2013.769394 Urbán, R., Szigeti, R., Kökönyei, G., & Demetrovics, Z. (2014). Global self-esteem and method effects: Competing factor structures, longitudinal invariance, and response styles in adolescents. Behavior Research Methods, 46, 488–498. https://doi.org/10.3758/s13428-013-0391-5 van de Vijver, F. J. R., & He, J. (2016). Bias assessment and prevention in noncognitive outcome measures in context assessments. In S. Kuger, E. Klieme, N. Jude, & D. Kaplan (Eds.), Assessing contexts of learning (pp. 229–253). Berlin, Germany: Springer. https://doi.org/10.1007/978-3-319-453576_9 van de Vijver, F. J. R., & Poortinga, Y. H. (1997). Towards an integrated analysis of bias in cross-cultural assessment. European Journal of Psychological Assessment, 13, 29–37. https://doi.org/10.1027/1015-5759.13.1.29 *Vasconcelos-Raposo, J., Fernandes, H. M., Teixeira, C. M., & Bertelli, R. (2012). Factorial validity and invariance of the Rosenberg Self-Esteem Scale among Portuguese youngsters. Social Indicators Research, 105, 482–498. https://doi.org/ 10.1007/s11205-011-9782-0 *Welsh Assembly Government, Social Research Division. (2011). National Survey for Wales, 2009–2010: Pilot study [Computer file]. Colchester, UK: UK Data Archive [Distributor]. https://doi. org/10.5255/UKDA-SN-6720–1 *Whiteside-Mansell, L., & Corwyn, R. F. (2003). Mean and covariance structure analyses: An examination of the Rosenberg self-esteem scale among adolescents and adults. Educational and Psychological Measurement, 63, 163–173. https://doi.org/10.1177/0013164402239323 Wicherts, J. M., & Dolan, C. V. (2010). Measurement invariance in confirmatory factor analysis: An illustration using IQ test performance of minorities. Educational Measurement: Issues and Practice, 29, 39–47. https://doi.org/10.1111/j.1745-3992. 2010.00182.x Wu, C.-H. (2008). An examination of the wording effect in the Rosenberg Self-Esteem Scale among culturally Chinese people. Journal of Social Psychology, 148, 535–551. https://doi.org/ 10.3200/SOCP.148.5.535-552 *Yaacob, M. J. (2006). Validity and reliability study of Rosenberg Self-Esteem Scale in Seremban school children. Malaysian Journal of Psychiatry, 15, 35–39. Received June 15, 2017 Revision received November 2, 2017 Accepted November 7, 2017 Published online February 2, 2018 Timo Gnambs Leibniz Institute for Educational Trajectories Wilhelmsplatz 3 96047 Bamberg Germany timo.gnambs@lifbi.de
Appendix
Rosenberg (1965) Self-Esteem Scale
To what extent do the following statements apply to you?
1. On the whole, I am satisfied with myself. (P)
2. At times, I think I am no good at all. (N)
3. I feel that I have a number of good qualities. (P)
4. I am able to do things as well as most other people. (P)
5. I feel I do not have much to be proud of. (N)
6. I certainly feel useless at times. (N)
7. I feel that I'm a person of worth, at least on an equal plane with others. (P)
8. I wish I could have more respect for myself. (N)
9. All in all, I am inclined to feel that I am a failure. (N)
10. I take a positive attitude toward myself. (P)
Response categories: 1 = applies not at all, 2 = does not really apply, 3 = partly, 4 = rather applies, 5 = applies completely. P = positively worded, N = negatively worded (reverse scored for creating a sum score).
Review Article
An Update of a Meta-Analysis on the Clinical Outcomes of Deep Transcranial Magnetic Stimulation (DTMS) in Major Depressive Disorder (MDD)
Helena M. Gellersen (1) and Karina Karolina Kedzior (2)
(1) Behavioural and Clinical Neuroscience Institute (BCNI), Department of Psychology, University of Cambridge, UK
(2) Institute of Psychology and Transfer, University of Bremen, Germany
Abstract: Deep transcranial magnetic stimulation (DTMS) is a noninvasive therapy for treatment-resistant major depressive disorder (MDD). The current study aimed to update a previous meta-analysis by investigating the acute and longer-term clinical outcomes of DTMS and their possible predictors (patient characteristics and stimulation parameters) in unipolar MDD. A systematic literature search identified 11 studies with 282 treatment-resistant, unipolar MDD patients. The clinical outcomes (depression severity, response and remission rates) were evaluated using random-effects meta-analyses. A high-frequency, high-intensity DTMS protocol with the H1-coil had significant acute antidepressant outcomes and improved some cognitive functions after 20 daily sessions in unipolar MDD. Response rates tended to increase with lower severity of illness. Antidepressant effects were prolonged if maintenance DTMS was used after the daily stimulation phases. DTMS consistently improves various symptom domains (antidepressant, cognitive) in treatment-resistant unipolar MDD.
Keywords: major depressive disorder (MDD), deep transcranial magnetic stimulation (DTMS), systematic review, meta-analysis
Major depressive disorder (MDD) is one of the most widespread mental illnesses and poses an enormous economic burden worldwide (Kleine-Budde et al., 2013). Up to 30% of MDD patients suffer from treatment-resistant MDD that cannot be alleviated with antidepressant medication (Fava & Davidson, 1996). An alternative treatment for such treatment-resistant MDD is noninvasive, high-frequency repetitive transcranial magnetic stimulation (rTMS) over the left dorsolateral prefrontal cortex (DLPFC), most commonly delivered using a figure-of-eight (F8-) coil, which has been approved in the United States by the US Food and Drug Administration (FDA, December 16, 2008). Some 25 years of research have shown that rTMS has short-term (acute) antidepressant outcomes (Kedzior, Azorina, & Reitz, 2014; Kedzior & Reitz, 2014; Kreuzer, Höppner, et al., 2015; Kreuzer, Padberg, et al., 2015) and produces clinically relevant response and remission rates of 29% and 19%, respectively, in MDD (Berlim, van den Eynde,
Tovar-Perdomo, & Daskalakis, 2014). The informal German working group "Transcranial Magnetic Stimulation in Psychiatry" has concluded that it is safe to use rTMS in the clinical setting in Germany (Hajak et al., 2005). Indeed, rTMS is available in Germany, albeit as a clinical treatment without coverage from the public health insurance. Current efforts are geared toward designing new coils to target brain regions beyond the DLPFC (Ziemann, 2017). One promising development is the H1-coil designed by Brainsway Ltd. in Israel, which stimulates the entire brain with a focus on the left DLPFC (Zangen, Roth, Voller, & Hallett, 2005). Conventional rTMS is thought to normalize the frontal hypoactivity associated with depression, which in turn may regulate the activity of other mood-regulation regions, such as the anterior cingulate cortex and the hypothalamic-pituitary-adrenal axis (Baeken & De Raedt, 2011). As the latter lies in deeper brain regions, H1-coils may have an added benefit over F8-coils.
Specifically, H1-coils stimulate a larger, less focal brain surface relative to the more focal stimulation delivered with F8-coils (Fadini et al., 2009). Such broad stimulation may increase the electrical field in deeper, subcortical brain regions compared to the cortical stimulation achieved with F8-coils (Parazzini et al., 2017; Zangen et al., 2005). Although the depth of stimulation needs to be further investigated in head-to-head trials with similar stimulation protocols (Schmittwilken, Schuchinsky, & Kedzior, 2017), the H1-coil provides a new alternative to the existing rTMS coils. The therapeutic application of the H1-coil, referred to as deep transcranial magnetic stimulation (DTMS), was approved for treatment-resistant unipolar MDD by the FDA in 2013 (FDA, 2013). DTMS produces moderate acute antidepressant outcomes in unipolar MDD according to the first double-blind, randomized-controlled trial (RCT) with inactive sham (Levkovitz et al., 2015). Similarly, the efficacy of DTMS in MDD was also shown in a meta-analysis of open-label studies (Kedzior, Gellersen, Brachetti, & Berlim, 2015). DTMS also appears to acutely reduce anxiety in unipolar MDD (Kedzior, Gellersen, Roth, & Zangen, 2015) and improves some cognitive deficits in the disorder (Kedzior, Gierke, Gellersen, & Berlim, 2016; Levkovitz et al., 2015). One limitation of the previous meta-analysis (Kedzior, Gellersen, Brachetti, et al., 2015) is that it reported only acute (short-term) outcomes of DTMS in mixed samples with unipolar and bipolar MDD. This is problematic because functional and structural differences between the two disorders exist, including a reduction in the ability to deactivate the default mode network, lower cortical thickness, and smaller amplitudes of low-frequency fluctuations of activation in frontal regions in bipolar MDD (Niu et al., 2017; Rodriguez-Cano et al., 2017; Yu et al., 2017). These differences could affect the efficacy of DTMS and might have contributed to the moderate heterogeneity among the effect sizes in the previous meta-analysis (Kedzior, Gellersen, Brachetti, et al., 2015). Furthermore, considering the duration and the cost of treatment, the predictors of antidepressant response to DTMS need to be systematically assessed. The optimization of stimulation protocols and the identification of patients who benefit most from brain stimulation are currently the key foci in the field (Ziemann, 2017). The current study aimed to update the previous meta-analysis (Kedzior, Gellersen, Brachetti, et al., 2015) to assess both the acute and the longer-term antidepressant outcomes of DTMS in unipolar MDD alone. The second aim was to systematically explore the possible predictors of antidepressant outcomes, including patient characteristics and stimulation parameters. We expected DTMS to produce homogeneous antidepressant outcomes (change in depression severity, remission, response) as well as cognitive benefits in studies with
unipolar MDD compared to the mixed samples with unipolar and bipolar MDD. Due to the exploratory nature of the second aim we had no specific hypotheses regarding the factors that could optimize the clinical outcomes of DTMS.
Method
The current study adheres to Meta-Analysis Reporting Standards (MARS; APA, 2008) and Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA; Moher, Liberati, Tetzlaff, & Altman, 2009) guidelines for systematic reviews and meta-analyses.
Inclusion and Exclusion Criteria
Inclusion criteria for the current analysis were:
(1) major depressive disorder (MDD) according to the Diagnostic and Statistical Manual of Mental Disorders – Fourth Edition (DSM-IV; either unipolar or with bipolar depression being an exclusion criterion),
(2) DTMS performed with H1-coil,
(3) at least five patients,
(4) any study design: open-label, single-blind, or double-blind RCT with inactive sham,
(5) parallel designs (in RCTs),
(6) clinical outcomes (depression, cognitive functioning) assessed at baseline, after a course of daily treatment, and at follow-up of any length using any standardized scale.
Exclusion criteria were:
(1) fewer than five patients (case-study designs),
(2) study reporting previously published data already included in the analysis,
(3) reviews without new primary data,
(4) other primary diagnoses than MDD.
Search Strategy
The last electronic search was conducted on April 26, 2017 (refer to Table 1 for the search strategy) and k = 24 studies published in peer-reviewed journals were assessed for inclusion (Table S1 in the Electronic Supplementary Material, ESM 1). Study assessment and data coding were conducted by both authors independently and any discrepancies were resolved by consensus. Following the study assessment (Figure 1), k = 1 RCT (Levkovitz et al., 2015) and k = 10 open-label studies (Berlim, van den Eynde, Tovar-Perdomo, Chachamovich, et al., 2014; Feffer et al., 2017; Harel et al., 2014; Isserles et al., 2011; Levkovitz
Table 1. Search strategy

k studies | Search terms | Databases (time frame)
23 (no duplicates) | TI ("deep transcranial magnetic stimulation" OR "deep repetitive transcranial magnetic stimulation" OR deep rTMS OR deepTMS OR deep TMS OR H-coil) AND TI (depress* OR dysthymi* OR MDD) | PsycInfo, Medline (EBSCO) (any date – 24.06.2016)
1 (additional study to the 23 studies above) | "deep transcranial magnetic stimulation" AND depression | Google Scholar (26.04.2017)

Notes. The search was performed in English by one author (KKK). There were no language restrictions or any other limits. k = number of studies, MDD = major depressive disorder, rTMS = repetitive transcranial magnetic stimulation, TI = title, TMS = transcranial magnetic stimulation.
Figure 1. Study selection procedure (PRISMA flowchart). k = number of studies; RCT = randomized-controlled trial.
Identification: k = 24 studies from database search and k = 1 study from Google Scholar; k = 24 studies after removal of duplicates.
Screening: k = 24 titles/abstracts screened; k = 9 excluded with reasons (k = 3 reviews with no new data, k = 4 case studies, k = 2 other primary diagnoses: bipolar depression, alcohol use disorder).
Eligibility: k = 15 full-text studies assessed; k = 4 excluded because they reported previously published data already included in the analysis.
Included: k = 11 studies included in the review (k = 10 open-label and k = 1 RCT).
et al., 2009; Naim-Feil et al., 2016; Rapinesi, Bersani, et al., 2015; Rapinesi, Curto, et al., 2015; Rosenberg, Shoenfeld, Zangen, Kotler, & Dannon, 2010; Rosenberg, Zangen, Stryjer, Kotler, & Dannon, 2010) were selected for inclusion in the current review.
Coding Procedures
The following data were coded from each study: designs and patients (demographic and clinical characteristics; Table 2), stimulation parameters (Table 3), and the clinical outcome measures (Table 4). In case of dropout,
intention-to-treat or last observation carried forward (LOCF) methods were used to analyze the data. The coding rules are listed in the notes to tables. Additional data were obtained from the authors via email.
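As an illustration of the LOCF rule mentioned above, the following minimal R sketch carries the last observed score forward over missing assessments; the small example data frame is invented and does not stem from the included studies.

# Minimal sketch of last observation carried forward (LOCF) for one outcome
# measured at three time points; rows are patients, columns are assessments.
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]  # replace a missing value with the previous one
  }
  x
}

scores <- data.frame(baseline = c(31, 27, 24),
                     week2    = c(25, NA, 20),
                     week4    = c(NA, NA, 14))
t(apply(scores, 1, locf))  # apply the rule within each patient (row)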
Outcome Measures
The current analysis focuses on the following outcome measures assessed using any standardized scales: (1) primary outcomes: depression severity, response and remission rates; (2) secondary outcomes: cognitive functioning.
Table 2. Study designs and patient characteristics (demographic and clinical) in k = 11 DTMS studies

Study | Study design | Sample size at baseline | Age (M ± SD) at baseline | Female (% at baseline) | Concurrent antidepressants (% of patients at baseline)
Levkovitz et al., 2009; Israel (a) | OL | 23 | 46 ± 13 | 48 | 0
Rosenberg, Shoenfeld, et al., 2010; Israel | OL | 7 | 47 ± 12 | 14 | 0
Rosenberg, Zangen, et al., 2010; Israel | OL | 6 | 41 ± 13 | 67 | 50
Isserles et al., 2011; Israel (b) | OL | 25 | 45 ± 13 | 45 | 100
Berlim, van den Eynde, Tovar-Perdomo, Chachamovich, et al., 2014; Canada | OL | 17 | 47 ± 13 | 76 | 100
Harel et al., 2014; Israel | OL | 29 | 41 ± 11 | 48 | 62
Levkovitz et al., 2015; multicenter (c) | RCT | 101 | 45 ± 12 | 48 | 0
Rapinesi, Bersani, et al., 2015; Italy (d) | OL | 9 | 54 ± 6 | 44 | 89
Rapinesi, Curto, et al., 2015; Italy (e) | OL | 12 | 51 ± 8 | 42 | 100
Naim-Feil et al., 2016; Israel (f) | OL | 21 | 44 ± 9 | 52 | 100
Feffer et al., 2017; Israel (d, g) | OL | 32 | 48 ± 17 | 50 | 87

Table 2 also reports, for each study, the dropouts during the daily stimulation phase (number of patients and reasons), the definition of treatment resistance, the mean duration of illness, and the mean age of onset (years).

Notes. Some data in this table are also shown in the previous meta-analysis (Kedzior, Gellersen, Brachetti, et al., 2015). All patients had a diagnosis of MDD according to DSM-IV. If not reported, then mean onset age (mean age – mean illness duration) or mean illness duration (mean age – mean onset age) was computed by the authors. DTMS = deep repetitive transcranial magnetic stimulation, HDRS = Hamilton Depression Rating Scale, MDD = major depressive disorder, MT = motor threshold, OL = open-label, RCT = double-blind randomized-controlled trial with an inactive sham group. (a) Data from the H1-coil group (other groups were stimulated with different H-coils). (b) Data from the control group ("No cognitive-emotional reactivation"); other groups received cognitive-emotional priming prior to DTMS. (c) Data from the active DTMS group. (d) Data from unipolar MDD patients. (e) Data from patients without alcohol use disorders. (f) Data from MDD patients (the healthy control group did not receive DTMS). (g) One patient started on a new antidepressant; doses of medication were changed in three patients (unclear in which group).
Table 3. Stimulation parameters in k = 11 DTMS studies

Study | PFC location | Location definition | Frequency (Hz) | Intensity (% MT) | Coil type | Total stimuli | Stimuli/session | Trains/session | Inter-train interval (s) | Daily sessions
Levkovitz et al., 2009 (a) | L | 5.5 cm | 20 | 120 | H1 | 33,600 | 1,680 | 42 | 20 | 20
Rosenberg, Shoenfeld, et al., 2010 | L | 5.5 cm | 20 | 120 | H1 | 33,600 | 1,680 | 42 | 20 | 20
Rosenberg, Zangen, et al., 2010 | L | 5.5 cm | 20 | 120 | H1 | 33,600 | 1,680 | 42 | 20 | 20
Isserles et al., 2011 (b) | L | 5.5 cm | 20 | 120 | H1 | 33,600 | 1,680 | 42 | 20 | 20
Berlim, van den Eynde, Tovar-Perdomo, Chachamovich, et al., 2014 | L | 6.0 cm | 20 | 120 | H1 | 60,000 | 3,000 | 75 | 20 | 20
Harel et al., 2014 | L | 6.0 cm | 20 | 120 | H1 | 33,600 | 1,680 | 42 | 20 | 20
Levkovitz et al., 2015 (c) | L | 6.0 cm | 18 | 120 | H1 | 39,600 | 1,980 | 55 | 20 | 20
Rapinesi, Bersani, et al., 2015 (d) | L | 5.5 cm | 18 | 120 | H1 | 39,600 | 1,980 | 55 | 20 | 20
Rapinesi, Curto, et al., 2015 (e) | L | 5.5 cm | 18 | 120 | H1 | 39,600 | 1,980 | 55 | 20 | 20
Naim-Feil et al., 2016 (f) | L | 6.0 cm | 20 | 120 | H1 | 33,600 | 1,680 | 42 | 20 | 20
Feffer et al., 2017 (d) | L | 5.0 cm | 18 | 120 | H1 | 39,600 | 1,980 | 55 | 20 | 20

Notes. Some data in this table are also shown in the previous meta-analysis (Kedzior, Gellersen, Brachetti, et al., 2015). For the definition of location, "5.5 cm" refers to 5.5 cm away from the motor "hot-spot." DTMS = deep repetitive transcranial magnetic stimulation, H1 = H1-coil, L = left PFC, MDD = major depressive disorder, MT = resting motor threshold, PFC = prefrontal cortex. (a) Data from the H1-coil group. (b) Data from the control group ("No cognitive-emotional reactivation"). (c) Active DTMS parameters (sham stimulation was performed using an H1-sham coil). (d) Data from unipolar MDD patients. (e) Data from patients without alcohol use disorders. (f) Data from MDD patients (the healthy control group did not receive DTMS).
Statistical Methods
Meta-analysis was conducted according to the same procedures as in the previous meta-analysis (see the supplementary materials in Kedzior, Gellersen, Brachetti, et al., 2015) using the Comprehensive Meta-Analysis 3.0 (CMA; Biostat, Englewood, NJ, USA). First, the effect sizes were computed per study. Depression severity was expressed as standardized paired difference scores (post-pre DTMS) corrected for the sample size (Hedges' g). Hedges' g is considered small (.20–.49), moderate (.50–.79), or large (≥ .80) according to the criteria for Cohen's d (Borenstein, Hedges, Higgins, & Rothstein, 2009). Response and remission rates were expressed as event rates (the number of responders or remitters out of the total sample per study). Second, the effect sizes were weighted using the inverse-variance method and pooled according to a random-effects model of meta-analysis (Borenstein et al., 2009). Specifically, each effect size was multiplied by the inverse of the sum of the estimated within- and between-study variance. This weighting method assigns higher weights to studies with low estimated variance and vice versa. The random-effects model was chosen a priori because it was assumed that (1) a random sample of existing studies was included in this analysis, (2) the effect sizes could be heterogeneous if systematic differences exist among studies, and (3) the meta-analysis estimates the mean true effect in the population. Third, heterogeneity among the effect sizes was evaluated using the I² index derived from the Q statistic (Borenstein et al., 2009). The I² index indicates that there is little (25%), moderate (50%), or high (75%) heterogeneity in effect sizes due to systematic differences among studies (Borenstein et al., 2009). Fourth, factors that could optimize the clinical outcomes of DTMS, including patient characteristics and stimulation parameters, were identified post hoc after all data were coded and investigated using univariate subgroup analyses and meta-regressions. The univariate approach was used due to a low volume of available data. Mixed-effects subgroup analyses were conducted to compare subgroups of studies based on the therapy type (DTMS as a monotherapy vs. DTMS as an add-on therapy to stable antidepressants) and the number of stimuli per session (1,680 vs. 1,980–3,000). In the mixed-model analysis the effect sizes in each subgroup are pooled using the random-effects model and the pooled effects are compared using the Q statistic derived from a fixed-effect model because the number of study subgroups is fixed. Furthermore, meta-regressions were used to test whether any study characteristic (mean age, illness duration, onset age, percentage of female patients per study) could predict the weighted effect sizes. The significance of the regression slope was tested using the Q statistic.
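The weighting and pooling steps described above can be sketched as follows. The analyses reported here were run in CMA 3.0; the sketch below instead uses the R package metafor, purely to illustrate the random-effects logic, and the Hedges' g values and their sampling variances are invented.

# Minimal sketch of a random-effects meta-analysis of Hedges' g values;
# the g and vi (sampling variance) values are invented for illustration only.
library(metafor)

dat <- data.frame(study = paste("Study", 1:5),
                  g     = c(0.9, 1.2, 0.7, 1.5, 1.1),
                  vi    = c(0.08, 0.12, 0.05, 0.20, 0.10))

res <- rma(yi = g, vi = vi, data = dat, method = "REML")  # random-effects model
summary(res)   # pooled g with confidence interval, tau^2, Q statistic, and I^2
weights(res)   # inverse-variance weights (in %) assigned to each study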
Ó 2018 Hogrefe Publishing 30% 6/20
HDRS 10
60% 12/20
44% all 50% M 40% NM
8%
HDRS 10
HDRS 7
58%
Rapinesi, Curto, et al., 2015e
63%
Levkovitz et al., 2009a
78% all 100% M 60% NM
52%
HDRS 10
Response rate (follow-up)
Antidepressant outcomes (longer-term effects)
Rapinesi, Bersani, et al., 2015d
Remission rate (follow-up)
Remission definition
28% 9/32
26% DTMS (22% sham)
– 18% 6/32
– HDRS 7
–
Feffer et al., 2017d
HDRS < 10
0%
HDRS 7
58% 7/12
44% DTMS (26% sham)
11% 1/9
HDRS 7
Rapinesi, Bersani, et al., 2015d Rapinesi, Curto, et al., 2015e Naim-Feil et al., 2016f
Levkovitz et al., 2015c
33% 29/89 DTMS (15% sham)
27% 7/26
HDRS 10
46% 12/26 HDRS < 10
41% 7/17
HDRS < 10
71% 12/17
38% 34/89 DTMS (21% sham) 100% 9/9
Levkovitz et al., 2015c
Berlim, van den Eynde, Tovar-Perdomo, Chachamovich, et al., 2014 Harel et al., 2014
17% 1/6
HDRS < 10
67% 4/6
14% 1/7
42% 8/19
HDRS < 10
57% 4/7
Rosenberg, Shoenfeld, et al., 2010 Rosenberg, Zangen, et al., 2010 Isserles et al., 2011b
HDRS 10
47% 9/19
Levkovitz et al., 2009a
Remission rates
Remission definition
Response rates
Antidepressant outcomes (acute effects)
Table 4. Clinical outcome measures in k = 11 DTMS studies
HDRS17
HDRS21
HDRS21
HDRS24
Scale
HDRS21
BDI
HDRS17
HDRS21
HDRS21
HDRS21
HDRS21
HDRS24
HDRS24
HDRS24
HDRS24
Scale
14 ± 7 (32)
21 ± 10 (13)
13 ± 4 (12)
10 ± 2 (9)
14 ± 8 (80) DTMS completers [16 ± 7 (66) sham]
9 ± 1 (26) completers
11 ± 5 (17)
16 ± 2 (20) completers
16 ± 10 (6) LOCF
17 ± 7 (7) LOCF
16 ± 13 (19) completers
End of daily stimulation phase M ± SD (n)
(Continued on next page)
6 months, no maintenance Outcome assessment: 6 months since baseline
Table 4. (Continued)

[Depression outcomes, continued. Columns: Baseline M ± SD (n); Follow-up procedure after daily stimulation phase; Last outcome assessment (months since baseline). The follow-up procedures reported in the individual studies were: 3 months of maintenance (M: 2/week for 1 month, then 1/week for the next 2 months) or no maintenance (NM), with outcome assessment at 12 months since baseline; 3 months of maintenance at 2/week (DTMS or sham), with outcome assessment at 4 months since baseline; and 3 months with no maintenance (41% resumed antidepressants after the acute phase), with outcome assessment at 3 months since baseline.]

Cognitive outcomes (end of daily stimulation phase)

Harel et al., 2014 | Groups: MDD (n = 10) | Test: CANTAB: visuospatial memory (PRM, SRM); sustained attention (RVP); executive functions (SOC, SWM, SSP); psychomotor speed (RTI) | Summary of results: After 20 DTMS sessions: no significant changes in any function.

Naim-Feil et al., 2016f | Groups: MDD (n = 13) vs. healthy (n = 26; no DTMS) | Test: SART: attention (reaction times, performance variability, omission errors) | Summary of results: Baseline (MDD vs. healthy): significantly higher performance variability and more omission errors. After 20 DTMS sessions (MDD only): significantly fewer omission errors. After 20 DTMS sessions (MDD vs. healthy): no difference in omission errors.

Isserles et al., 2011b | Groups: MDD (n = 26) | Test: Mindstreams Cognitive Battery: memory; executive, visuospatial, verbal functions; attention; information processing speed | Summary of results: After 20 DTMS sessions: significant improvement in memory (in nonremitters) and information processing speed (in remitters); no change in other functions.

Levkovitz et al., 2009a | Groups: MDD/H1-coil (n = 23) vs. healthy (n = 20; no DTMS) | Test: CANTAB: visuospatial memory (PAL), sustained attention (RVP), executive functions (SOC, SWM), psychomotor speed (RTI) | Summary of results: Baseline (MDD vs. healthy): significant deficits in all functions except RTI. After 20 DTMS sessions (MDD only): significant improvement in all functions except for RTI. After 20 DTMS sessions (MDD vs. healthy): no difference on SOC and SWM.

Notes. Some data in this table are also shown in the previous meta-analysis (Kedzior, Gellersen, Brachetti, et al., 2015) and another systematic review (Kedzior et al., 2016). Response rates are defined as at least 50% reduction in scale scores at the end of the daily stimulation phase relative to baseline. If the baseline HDRS scores are missing, the final scores are mean change from baseline ± SEM. The final sample size was used to compute the effect sizes. BDI = Beck Depression Inventory; DTMS = deep repetitive transcranial magnetic stimulation; HDRS = Hamilton Depression Rating Scale; LOCF = last observation carried forward; M = maintenance treatment; MDD = major depressive disorder; n = sample size; NM = no maintenance treatment; PAL = paired associative learning (total errors); PRM = pattern recognition memory; RTI = reaction time (psychomotor speed: five-choice reaction time or movement time or latency); RVP = rapid visual information processing; SART = Sustained Attention to Response Task; SOC = Stockings of Cambridge Test (cognitive planning: problems solved in minimum moves); SRM = spatial recognition memory; SSP = spatial memory span; SWM = spatial working memory. a Data from the H1-coil group. b Data from the control group ("No cognitive-emotional reactivation"). c Data from the active DTMS group. d Data from unipolar MDD patients. e Data from patients without alcohol use disorders. f Data from MDD patients (the healthy control group did not receive DTMS).
Publication bias was assessed using funnel plots and Orwin’s Fail-Safe N (Rothstein, Sutton, & Borenstein, 2005). Funnel plots show whether the distribution of study effect sizes and their estimated variability (expressed as the standard error of the mean, SEM) is symmetric around the pooled effect size of all studies in the analysis. It was assumed that a lack of symmetry could be due to publication bias. Since the statistical power for Egger’s regression (a test for funnel plot symmetry) was too low in this analysis, we assessed the influence of a putative publication bias using trim-and-fill analysis (Rothstein et al., 2005). Specifically, the pooled effect sizes were adjusted for studies, theoretically missing from the analysis, that would be required to make the funnel plots mathematically symmetrical. If this adjustment changes the interpretation of the pooled effect, then the impact of publication bias is severe and invalidates the meta-analysis (Rothstein et al., 2005). Orwin’s Fail-Safe N is the number of studies with trivial effect sizes, theoretically missing from the analysis, that would reduce the pooled effect to less than a trivial effect (Borenstein et al., 2009). The criteria for “trivial” effect sizes are shown for each analysis in ESM 1.
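To make these steps concrete, the following minimal sketch shows how a trim-and-fill adjustment and Orwin’s Fail-Safe N can be obtained for a handful of study-level effect sizes. It assumes the metafor R package and uses made-up Hedges’ g values and standard errors; it illustrates the general procedure and is not the authors’ own analysis code (which is not reported here).

# Illustration only: trim-and-fill and Orwin's Fail-Safe N with 'metafor'.
# The effect sizes and standard errors below are hypothetical.
library(metafor)

g  <- c(1.2, 1.5, 0.9, 1.8, 1.4)       # hypothetical study-level Hedges' g
se <- c(0.30, 0.45, 0.25, 0.50, 0.35)  # hypothetical standard errors

res <- rma(yi = g, vi = se^2, method = "DL")  # random-effects pooled effect
funnel(res)                                   # visual check of funnel symmetry

# Trim-and-fill: impute the studies needed for a symmetric funnel and
# report the pooled effect adjusted for those "missing" studies.
trimfill(res)

# Orwin's Fail-Safe N: how many missing studies with a trivial effect
# (here assumed to be g = 0.20) would pull the pooled effect below that level?
fsn(yi = g, vi = se^2, type = "Orwin", target = 0.20)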
Results

Study Characteristics

Compared to the previous meta-analysis (Kedzior, Gellersen, Brachetti, et al., 2015), the current analysis includes data from k = 2 additional studies (Feffer et al., 2017; Naim-Feil et al., 2016) and the active DTMS group in the RCT (Levkovitz et al., 2015). Thus, the analyses reported here include new data from up to 142 patients and exclude data from 34 patients (with bipolar MDD). All k = 11 studies included 282 patients with treatment-resistant MDD who received active DTMS treatment with the H1-coil (data from control groups stimulated with other coils or with conditions other than MDD were excluded from the current analysis). The majority (k = 10; 91%) of studies utilized open-label designs and only one study was a sham-controlled RCT (Levkovitz et al., 2015; Table 2 only shows data from the active DTMS group, which were used in the current analysis). The patients were on average at least 40 years old and about 50% were female in the majority of studies (k = 8; 73%). The average MDD onset was at 17–45 years of age and the average MDD duration was at least 10 years in the majority of studies (k = 9; 82%). DTMS was most often applied as an add-on treatment to stable antidepressants (k = 8; 73%). DTMS appeared to be relatively safe, with a dropout rate of 17% (47 out of 282 patients) throughout the daily stimulation phases (Table 2). Most patients dropped out due to treatment-unrelated reasons or lack of response, although two seizures were reported and two other patients reported suicidal ideation out of a total of 47 patients who dropped out from all studies (Table 2).
Stimulation Parameters

All k = 11 studies utilized H1-coils with relatively homogeneous protocols, including high-frequency (18–20 Hz) and high-intensity (120% of the resting motor threshold) stimulation over 20 daily sessions (Table 3). Each session consisted of either 1,680 stimuli in 42 trains (k = 6; 54%) or 1,980–3,000 stimuli in 55–75 trains (k = 5; 46%), with a 20 s inter-train interval used in all studies.
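As a quick plausibility check of the session structure just described, the short sketch below derives the stimulus count and approximate session duration from the protocol parameters. The 2-s train duration is an assumption (it is not stated in the text); the frequency, number of trains, and 20-s inter-train interval are taken from the protocols summarized above.

# Arithmetic sketch of one common H1-coil protocol; the 2-s train duration is assumed.
freq     <- 20   # stimulation frequency (Hz)
train_s  <- 2    # assumed train duration (s)
n_trains <- 42   # trains per session
iti_s    <- 20   # inter-train interval (s)

pulses_per_train <- freq * train_s               # 40 pulses per train
total_pulses     <- pulses_per_train * n_trains  # 1,680 stimuli per session
session_min      <- n_trains * (train_s + iti_s) / 60
c(total_pulses = total_pulses, session_min = round(session_min, 1))  # 1680, ~15.4 min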
Primary Outcome Measures: Depression (Acute Effects of DTMS)

Depression outcomes were assessed using the Hamilton Depression Rating Scale (HDRS; Hamilton, 1960) in k = 10 studies and the Beck Depression Inventory (BDI; Beck, Ward, Mendelson, Mock, & Erbaugh, 1961) in k = 1 study (see Table 4).

Depression Severity

The inspection of all 11 effect sizes (standardized change scores) revealed that one study (Rapinesi, Bersani, et al., 2015) was an outlier with a statistically significantly (p = .006) higher Hedges’ g (4.78) relative to all other effect sizes (1.43; Figure S1 in ESM 1). Preliminary analyses showed that this study had little effect on the interpretation of the results, but it inflated the pooled effects and was therefore excluded from the analysis to maintain statistical conservativeness. There was a statistically significant and large pooled antidepressant effect of 20 daily sessions of DTMS relative to baseline (Hedges’ g = 1.43, k = 10 studies, n = 232 patients; see Table 5 and Figure 2a). Heterogeneity among the effect sizes was low (I² = 17%; Table 5). The antidepressant effect did not depend on therapy type (monotherapy vs. add-on therapy) or stimuli per session (1,680 vs. 1,980–3,000), and was not predicted by mean age, percentage of female patients, mean illness duration, or mean onset age per study (Table 5). There was little evidence for publication bias in this analysis (see Table 5 and Figure S2 in ESM 1).

Depression: Response Rates

There were 112 out of 237 patients who responded to DTMS (pooled response rate of 51%, k = 10 studies, n = 237 patients; see Table 5 and Figure 2b). Heterogeneity among the effect sizes was moderate (I² = 50%; Table 5).
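For illustration, pooled response rates of this kind are often obtained by meta-analyzing logit-transformed proportions under a random-effects model and back-transforming the pooled estimate. The sketch below shows that general approach with the metafor R package and invented counts; it is not the software or the data actually used for Table 5.

# Illustration only: random-effects pooling of response rates (logit scale).
library(metafor)

responders <- c(10, 6, 15, 12, 30)   # hypothetical responders per study
total      <- c(20, 12, 28, 25, 80)  # hypothetical patients per study

dat <- escalc(measure = "PLO", xi = responders, ni = total)  # logit event rates + variances
res <- rma(yi, vi, data = dat, method = "DL")                # random-effects model

predict(res, transf = transf.ilogit)  # pooled response rate with 95% CI, back-transformed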
Table 5. Random-effects meta-analyses of acute antidepressant outcomes in k = 11 DTMS studies

Random-effects analyses | Depression severity (standardized change scores; Hedges’ g) | Response rates (responders/total n) | Remission rates (remitters/total n)

Pooled weighted effect, mean (95% CI), k, n | 1.43 (1.22–1.64), k = 10, n = 232 | 51% (40–61%), k = 10, n = 237 (112/237) | 29% (23–36%), k = 10, n = 237 (66/237)
Heterogeneity statistics | Q(df = 9) = 10.79, p = .290, I² = 17% | Q(df = 9) = 18.02, p = .035*, I² = 50% | Q(df = 9) = 9.85, p = .363, I² = 9%

Publication bias analysis
Orwin Fail-Safe N | NOrwin = 122 | NOrwin = 28 | NOrwin = 18
Funnel plot symmetric? Number of missing studies | No: k = 3 missing with small effect sizes | No: k = 5 missing with large response rates | No: k = 3 missing with large remission rates
Mean effect (95% CI) adjusted for missing studies | 1.32 (1.09–1.55) | 40% (30–52%) | 31% (24–39%)

Subgroup analyses: Therapy
Add-on | 1.52 (1.17–1.86), k = 7, n = 126 | 56% (40–71%), k = 7, n = 122 (65/122) | 25% (18–35%), k = 7, n = 122 (28/122)
Monotherapy | 1.39 (1.13–1.66), k = 3, n = 106 | 41% (32–50%), k = 3, n = 115 (47/115) | 33% (25–43%), k = 3, n = 115 (38/115)
Add-on vs. monotherapy | Q(df = 1) = 0.05, p = .831 | Q(df = 1) = 2.32, p = .127 | Q(df = 1) = 1.68, p = .195

Subgroup analyses: Stimuli (or trains)/session
1,680 (42 trains) | 1.44 (1.15–1.72), k = 6, n = 91 | 52% (41–63%), k = 5, n = 78 (41/78) | 30% (21–42%), k = 5, n = 78 (23/78)
1,980–3,000 (55–75 trains) | 1.51 (1.07–1.96), k = 4, n = 141 | 52% (32–70%), k = 5, n = 159 (71/159) | 26% (16–40%), k = 5, n = 159 (43/159)
1,680 vs. 1,980–3,000 | Q(df = 1) = 0.04, p = .836 | Q(df = 1) = 2.05, p = .152 | Q(df = 1) = 0.02, p = .887

Meta-regression predictors
Mean age | b < .01, p = .942, R² = 0%, k = 10 | b = .05, p = .471, R² = 0%, k = 10 | b = .06, p = .341, R² = 0%, k = 10
% Female | b < .01, p = .800, R² = 0%, k = 10 | b = .01, p = .582, R² = 0%, k = 10 | b = .02, p = .238, R² = 41%, k = 10
Mean illness duration | b = .01, p = .496, R² = 0%, k = 9 | b = .05, p = .060, R² = 41%, k = 10 | b < .01, p = .987, R² = 0%, k = 10
Mean onset age | b = .01, p = .543, R² = 0%, k = 9 | b = .05, p = .090, R² = 35%, k = 10 | b < .01, p = .980, R² = 0%, k = 10

Notes. b = unstandardized weighted regression coefficient, CI = confidence interval, df = degrees of freedom, DTMS = deep transcranial magnetic stimulation, g = effect size Hedges’ g (standardized paired difference in means), HDRS = Hamilton Depression Rating Scale, k = number of studies, n = sample size. *p (two-tailed) < .05.
Response rates did not depend on therapy type (monotherapy vs. add-on therapy) or stimuli per session (1,680 vs. 1,980–3,000), and were not predicted by mean age or percentage of female patients per study (Table 5). However, there was a trend toward higher response rates in studies with lower mean illness duration (p = .06) and higher mean onset age (p = .09; Figure S3 in ESM 1). There was little evidence for publication bias in this analysis (Table 5, Figure S4 in ESM 1).

Depression: Remission Rates

There were 66 out of 237 patients who remitted after DTMS (pooled remission rate of 29%, k = 10 studies, n = 237 patients; see Table 5 and Figure 2c). Heterogeneity among the effect sizes was low (I² = 9%; Table 5).
Remission rates did not depend on therapy type (monotherapy vs. add-on therapy), stimuli per session (1,680 vs. 1,980–3,000) and were not predicted by mean age, percentage female patients, mean illness duration, or mean onset age per study (Table 5). There was little evidence for publication bias in this analysis (see Table 5 and Figure S5 in ESM 1).
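The subgroup contrasts and meta-regressions summarized in Table 5 can be expressed as mixed-effects models with a single study-level moderator. The following sketch again assumes the metafor R package and uses hypothetical data; QM in the output corresponds to the between-group Q reported for the subgroup comparisons, and the model with a continuous predictor returns the coefficient b, its p value, and an R² analog.

# Illustration only: one categorical and one continuous moderator on logit response rates.
library(metafor)

dat <- data.frame(
  xi      = c(10, 6, 15, 12, 30, 9),                        # hypothetical responders
  ni      = c(20, 12, 28, 25, 80, 30),                       # hypothetical totals
  therapy = c("addon", "addon", "mono", "addon", "mono", "addon"),
  dur     = c(12, 15, 9, 18, 11, 14)                         # hypothetical mean illness duration (years)
)
dat <- escalc(measure = "PLO", xi = xi, ni = ni, data = dat)

rma(yi, vi, mods = ~ therapy, data = dat, method = "DL")  # QM (df = 1): add-on vs. monotherapy
rma(yi, vi, mods = ~ dur,     data = dat, method = "DL")  # b, p, and R^2 for the predictor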
[Figure 2. Acute antidepressant outcomes of DTMS relative to baseline. All forest plots show the results of random-effects meta-analyses. (A) Depression severity (standardized HDRS change scores from baseline), (B) response rates, (C) remission rates. CI = confidence interval, DTMS = deep transcranial magnetic stimulation, HDRS = Hamilton Depression Rating Scale, Hedges’ g = standardized paired difference in means corrected for sample size (effect size), Total = patients who received DTMS with H1-coil at baseline. Due to space limitations on the figure, citations to multiple studies with the same first authors include the year and the letter “a” or “b”. These studies are: Berlim et al., 2014a = Berlim, van den Eynde, Tovar-Perdomo, Chachamovich, et al., 2014; Rapinesi et al., 2015a = Rapinesi, Bersani, et al., 2015; Rapinesi et al., 2015b = Rapinesi, Curto, et al., 2015; Rosenberg et al., 2010a = Rosenberg, Shoenfeld, et al., 2010; Rosenberg et al., 2010b = Rosenberg, Zangen, et al., 2010.]

Primary Outcome Measures: Depression (Longer-Term Effects of DTMS)

Only k = 4 studies assessed depression severity beyond the daily treatment phases (at 3–12 months since baseline; Table 4). There was no maintenance treatment in k = 2 studies (although 41% of patients resumed antidepressants in one study), while 1–2 weekly DTMS sessions were offered for at least three months in the further k = 2 studies (Table 4). Even without maintenance, response rates remained high (58–63%) at 2–5 months since the last DTMS, although remission dropped to 8% five months after the last DTMS. Maintenance treatment prolonged the acute efficacy of DTMS for three months since the last daily stimulation in the RCT (Levkovitz et al., 2015). The benefits of maintenance were also shown in one study (Rapinesi, Bersani, et al., 2015) in terms of higher response and remission rates at 12 months since baseline in patients on maintenance (1–2 weekly sessions for 3 months) relative to no maintenance.
Secondary Outcome Measures: Cognitive Functioning

The cognitive outcomes were assessed using different standardized instruments in only k = 4 studies, and only in subsamples of the patients who underwent DTMS (Table 4). Cognitive functioning (memory, executive functioning, attention) was worse in MDD patients relative to healthy controls at baseline in k = 2 studies. Cognitive deficits improved after 20 DTMS sessions in MDD patients in k = 3 out of four studies. Specifically, MDD patients improved relative to baseline or no longer differed from controls at the end of daily stimulation in terms of memory (in k = 2 studies), attention (k = 2), executive functioning (k = 1), and processing speed (k = 1).
Discussion

Noninvasive brain stimulation (NIBS) methods have been known for at least 30 years and are considered viable alternatives for patients with treatment-resistant MDD, among other clinical disorders (Ziemann, 2017). As one of the newer additions to the NIBS toolkit, DTMS requires evaluation in terms of its clinical efficacy and the potential sources of interindividual variability in response to therapy. These findings will help to determine what position DTMS might take in the landscape of treatment options for MDD relative to rTMS with conventional F8-coils (Kreuzer, Höppner, et al., 2015; Kreuzer, Padberg, et al., 2015). Compared to the previous meta-analysis regarding the clinical outcomes of DTMS in MDD (Kedzior, Gellersen, Brachetti, et al., 2015), the current study focuses on unipolar MDD and includes additional data from up to 142 patients. In general, the current meta-analysis confirms and extends the findings of the previous meta-analysis (Kedzior, Gellersen, Brachetti, et al., 2015). The homogeneous, FDA-approved protocols (20 days of high-frequency and high-intensity stimulation) used in all studies produced reasonably homogeneous acute effect sizes that may be translated to clinically relevant antidepressant outcomes. DTMS also appears to be safe and tolerable. However, the high frequency and intensity of stimulation may cause scalp discomfort and, in the worst case, could contribute to seizures and intolerance, which have also been reported as reasons for dropping out (see Table 2). Therefore, it would be interesting to investigate whether stimulation protocols with lower frequencies and/or intensities could reduce the side effects without compromising the efficacy of DTMS.

In addition to the previous meta-analysis (Kedzior, Gellersen, Brachetti, et al., 2015), we have attempted to assess the longer-term clinical outcomes of DTMS beyond the daily stimulation phases in unipolar MDD. In general, evidence from mixed samples with unipolar and bipolar MDD as well as from case studies indicates that the antidepressant outcomes of DTMS persist beyond the acute treatment phases for up to 12 months (Gellersen & Kedzior, 2015). Conventional rTMS studies with F8-coils show that the acute clinical benefits decrease over time if maintenance (continuation) treatment is not offered (Kedzior, Reitz, Azorina, & Loo, 2015). The current findings in unipolar MDD suggest that maintenance treatment can prolong the antidepressant outcomes of DTMS for up to 12 months (Rapinesi, Bersani, et al., 2015) and contribute to a delayed response in those who did not respond during the acute treatment phases (Yip et al., 2017). However, it is difficult to predict who may relapse after daily DTMS treatment based on demographic or clinical characteristics of patients (Rosenberg et al., 2015) and whether patients who relapse eventually develop resistance to subsequent treatments (Rosenberg et al., 2011). Further research is required to investigate how to dose the maintenance treatment to achieve the best cost-benefit balance and how long the effects of DTMS last once maintenance ceases in MDD.

Only one new study (Naim-Feil et al., 2016) addressed the effects of DTMS on the cognitive outcomes in unipolar MDD since a previous systematic review including mixed samples with unipolar and bipolar MDD (Kedzior et al., 2016). Although limited, the evidence regarding the cognitive outcomes is interesting. The trends toward improvements in multiple cognitive abilities associated with the prefrontal cortex, such as working memory, psychomotor speed, and general executive functions, suggest that DTMS may contribute to a normalization of prefrontal activity in unipolar MDD. The consistent improvements in various symptom domains (antidepressant, cognitive) suggest that DTMS may affect multiple neural systems, because cognitive deficits commonly persist even after mood has stabilized in depression (Kaser, Zaman, & Sahakian, 2017). Furthermore, there is little evidence for a causal relationship between such changes in the clinical and cognitive domains after DTMS (Kedzior et al., 2016; Levkovitz et al., 2009). For example, increases in sustained attention after a single DTMS session and after acute four-week treatment were not related to antidepressant effects (Naim-Feil et al., 2016). Therefore, DTMS-induced improvements in each domain may result from plasticity changes in different brain networks that do not necessarily depend on each other. Future sham-controlled studies are required to investigate the neurobiological bases of DTMS-induced cognitive effects across different domains.

Although the clinical outcomes of some NIBS methods, such as conventional rTMS with F8-coils, are reasonably well established in MDD, predictors of response in terms of patient characteristics and stimulation protocols are poorly understood (Ziemann, 2017). These issues are particularly important given the inconvenience of daily treatment in terms of time and costs. Given that all studies in the current meta-analysis utilized homogeneous stimulation parameters, it is disappointing that we identified only a few weak trends regarding predictors of clinical outcomes in terms of patient characteristics. Although the statistical power was low in the current analysis, previous meta-analyses using markedly larger volumes of data also failed to find consistent predictors of antidepressant response to conventional rTMS (for review see Kedzior et al., 2014). In general, our results show that the acute antidepressant outcomes did not depend on concurrent treatment with antidepressants, age, female gender, or severity of illness according to univariate statistical analyses. Although not significant, the response rates tended to be higher in studies with less severely ill patients. Similarly, lower illness severity or duration was associated with higher antidepressant outcomes in studies with conventional rTMS coils (for review see Kedzior et al., 2014). Ideally, predictors of clinical outcomes should be tested using multivariate statistical approaches, which require a higher volume of data. Given the heterogeneous nature of depression, another approach to finding predictors of response may be the identification of neurobiological markers or patterns of symptoms rather than focusing on crude clinical characteristics, such as overall illness severity. One possible approach could be the categorization of patients into “biotypes” that are defined by their respective pattern of brain dysfunction. A recent study followed this approach by correlating resting-state network activity measured with functional magnetic resonance imaging (fMRI) with clinical symptoms (Drysdale et al., 2017). Interestingly, the authors identified four biotypes, which predicted responsiveness to conventional rTMS over the dorsomedial prefrontal cortex. Their work suggests that patterns of dysfunctional connectivity, rather than patterns of symptoms, resulted in stable patient categorization.

As in rTMS, further research is required to investigate the neurobiological mechanisms through which DTMS exerts its clinical effects (Schmittwilken et al., 2017; Ziemann, 2017). Modeling studies suggest that DTMS stimulates wider and possibly deeper (subcortical) brain regions (Parazzini et al., 2017; Roth et al., 2014; Zangen et al., 2005). Interestingly, the electrical field of the H1-coil was deeper and more wide-reaching in the model based on a younger female brain relative to an older male brain, suggesting that patient demographics may play a role in shaping the electrical field and potentially treatment outcomes (Parazzini et al., 2017). Therefore, optimal stimulation paradigms and predictors of antidepressant effects may only be determined once future technological advancement improves our understanding of the neurobiology of depression and the mechanisms of the NIBS methods.

Our meta-analysis has several limitations. First, we included only published studies; gray literature (ongoing and/or unpublished studies) was not considered. Although it cannot be ruled out, it is unlikely that a publication bias due to unpublished studies would have substantially altered the outcomes of our analyses. In general, the current focus in the literature is on the use of DTMS beyond depression in other neuropsychiatric disorders (Tendler, Barnea Ygael, Roth, & Zangen, 2016), including substance use disorders, which are difficult to treat (Kedzior, Gerkensmeier, & Schuchinsky, 2017). Second, the study quality and the risk of bias were not assessed using any standardized criteria because our analysis contained mostly non-blind, open-label studies without control groups. Instead, our weighting method (lower weights in studies with higher variance) was an indirect measure of study quality. We also assessed the influence of individual studies on the pooled effects using the one-study-removed analysis. Third, although the pooled effect sizes are probably inflated by placebo and expectation effects in the mostly open-label studies included in this analysis, the magnitudes of these effects reflect real-world practice, where the therapeutic effect combines the real and the placebo effects (Kedzior, Gellersen, Brachetti, et al., 2015). Although a supportive patient-clinician relationship can trigger placebo effects (Colloca, Jonas, Killen, Miller, & Shurtleff, 2014), such a relationship is unavoidable when applying NIBS treatment. Further investigations of the placebo effects arising from NIBS are required before such methods become available for home use and/or with less direct involvement of clinicians. Data from the RCT (Levkovitz et al., 2015) suggest that the placebo effect alone cannot explain the antidepressant outcomes because DTMS had better efficacy than sham. Furthermore, preliminary evidence shows that the antidepressant outcomes of DTMS lasted beyond the acute stimulation phases without frequent contact with the clinicians. Therefore, more effort is required to evaluate the longer-term effects of DTMS, which may be less confounded by placebo effects. Finally, the small volume of data allowed us to conduct only underpowered univariate analyses. Possibly because DTMS is already FDA-approved for unipolar MDD, it is unlikely that sufficient primary data for multivariate analyses will become available in the foreseeable future. Thus, predictors of clinical outcomes may need to be investigated using individual-patient rather than study-level data.
In conclusion, 20 days of high-frequency DTMS with the H1-coil appear to acutely alleviate clinical symptoms and may improve some cognitive functions in unipolar MDD. The current results provide empirical evidence for clinicians interested in the use of DTMS in patients with treatment-resistant unipolar MDD in research settings as well as, potentially, in clinical practice. Given that DTMS shows promising efficacy and acceptable tolerability, the clinical application of this method could be considered in Germany.

Acknowledgments

Some results presented in this manuscript were presented by both authors at the 2nd International Brain Stimulation Conference in Barcelona, Spain (March 2017). The action editor for this article was Edgar Erdfelder.

Electronic Supplementary Material

The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/2151-2604/a000320

ESM 1. Text (.docx). 1 Table and 5 Figures.
References

APA. (2008). Reporting standards for research in psychology: Why do we need them? What might they be? The American Psychologist, 63, 839–851. https://doi.org/10.1037/0003-066X.63.9.839

Baeken, C., & De Raedt, R. (2011). Neurobiological mechanisms of repetitive transcranial magnetic stimulation on the underlying neurocircuitry in unipolar depression. Dialogues in Clinical Neuroscience, 13, 139–145. Retrieved from https://www.dialoguescns.org/wp-content/uploads/issues/13/DialoguesClinNeurosci13-139.pdf

Beck, A. T., Ward, C. H., Mendelson, M., Mock, J., & Erbaugh, J. (1961). An inventory for measuring depression. Archives of General Psychiatry, 4, 561–571. https://doi.org/10.1001/archpsyc.1961.01710120031004

Berlim, M. T., van den Eynde, F., Tovar-Perdomo, S., Chachamovich, E., Zangen, A., & Turecki, G. (2014). Augmenting antidepressants with deep transcranial magnetic stimulation (DTMS) in treatment-resistant major depression. The World Journal of Biological Psychiatry, 15, 570–578. https://doi.org/10.3109/15622975.2014.925141

Berlim, M. T., van den Eynde, F., Tovar-Perdomo, S., & Daskalakis, Z. J. (2014). Response, remission and drop-out rates following high-frequency repetitive transcranial magnetic stimulation (rTMS) for treating major depression: A systematic review and meta-analysis of randomized, double-blind and sham-controlled trials. Psychological Medicine, 44, 225–239. https://doi.org/10.1017/S0033291713000512

Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2009). Introduction to meta-analysis. Chichester, UK: Wiley.

Colloca, L., Jonas, W. B., Killen, J., Miller, F. G., & Shurtleff, D. (2014). Reevaluating the placebo effect in medical practice. Zeitschrift für Psychologie, 222, 124–127. https://doi.org/10.1027/2151-2604/a000177
Drysdale, A. T., Grosenick, L., Downar, J., Dunlop, K., Mansouri, F., Meng, Y., . . . Liston, C. (2017). Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nature Medicine, 23, 28–38. https://doi.org/10.1038/nm.4246 Fadini, T., Matthaus, L., Rothkegel, H., Sommer, M., Tergau, F., Schweikard, A., . . . Nitsche, M. A. (2009). H-coil: Induced electric field properties and input/output curves on healthy volunteers, comparison with a standard figure-of-eight coil. Clinical Neurophysiology, 120, 1174–1182. https://doi.org/ 10.1016/j.clinph.2009.02.176 Fava, M., & Davidson, K. G. (1996). Definition and epidemiology of treatment-resistant depression. Psychiatric Clinics of North America, 19, 179–200. https://doi.org/10.1016/S0193-953X(05) 70283-5 Feffer, K., Lapidus, K. A. B., Braw, Y., Bloch, Y., Kron, S., Netzer, R., & Nitzan, U. (2017). Factors associated with response after deep transcranial magnetic stimulation in a real-world clinical setting: Results from the first 40 cases of treatment-resistant depression. European Psychiatry, 44, 61–67. https://doi.org/ 10.1016/j.eurpsy.2017.03.012 Gellersen, H. M., & Kedzior, K. K. (2015, May). Durability of the antidepressant effect of high-frequency deep repetitive transcranial magnetic stimulation (HF-DTMS) in major depression: A systematic review. Paper presented at the Magstim 2015 Neuroscience Conference, Oxford, UK. Retrieved from https:// www.researchgate.net/publication/275973399_Durability_of_ the_antidepressant_effect_of_high-frequency_deep_repetitive_ transcranial_magnetic_stimulation_HF-DTMS_in_major_ depression_a_systematic_review Hajak, G., Padberg, F., Herwig, U., Eschweiler, G. W., Cohrs, S., Langguth, B., . . . Eichhammer, P. (2005). Repetitive transkranielle Magnetstimulation [Repetitive transcranial magnetic stimulation]. Nervenheilkunde, 24, 48–58. Retrieved from http://www. schattauer.de/t3page/1214.html?manuscript=1998 Hamilton, M. (1960). A rating scale for depression. Journal of Neurology, Neurosurgery, and Psychiatry, 23, 56–62. https:// doi.org/10.1136/jnnp.23.1.56 Harel, E. V., Rabany, L., Deutsch, L., Bloch, Y., Zangen, A., & Levkovitz, Y. (2014). H-coil repetitive transcranial magnetic stimulation for treatment resistant major depressive disorder: An 18-week continuation safety and feasibility study. The World Journal of Biological Psychiatry, 15, 298–306. https://doi.org/ 10.3109/15622975.2011.639802 Isserles, M., Rosenberg, O., Dannon, P., Levkovitz, Y., Kotler, M., Deutsch, F., . . . Zangen, A. (2011). Cognitive-emotional reactivation during deep transcranial magnetic stimulation over the prefrontal cortex of depressive patients affects antidepressant outcome. Journal of Affective Disorders, 128, 235–242. https:// doi.org/10.1016/j.jad.2010.06.038 Kaser, M., Zaman, R., & Sahakian, B. J. (2017). Cognition as a treatment target in depression. Psychological Medicine, 47, 987–989. https://doi.org/10.1017/S0033291716003123 Kedzior, K. K., Azorina, V., & Reitz, S. K. (2014). More female patients and fewer stimuli per session are associated with the short-term antidepressant properties of repetitive transcranial magnetic stimulation (rTMS): A meta-analysis of 54 shamcontrolled studies published between 1997–2013. Neuropsychiatric Disease and Treatment, 10, 727–756. https://doi.org/ 10.2147/NDT.S58405 Kedzior, K. K., Gellersen, H., Brachetti, A., & Berlim, M. T. (2015). 
Deep transcranial magnetic stimulation (DTMS) in the treatment of major depression: An exploratory systematic review and meta-analysis. Journal of Affective Disorders, 187, 73–83. https://doi.org/10.1016/j.jad.2015.08.033 Kedzior, K. K., Gellersen, H., Roth, Y., & Zangen, A. (2015). Acute reduction in anxiety after deep transcranial magnetic
stimulation (DTMS) in unipolar major depression – A systematic review and meta-analysis. Psychiatry Research, 230, 971– 974. https://doi.org/10.1016/j.psychres.2015.11.032 Kedzior, K. K, Gerkensmeier, I., & Schuchinsky, M. (2017). How deep is the deep transcranial magnetic stimulation (DTMS)? Putative stimulation of reward pathways in substance use disorders: A systematic review and meta-analysis. Brain Stimulation, 10, 355. https://doi.org/10.1016/j.brs.2017.01.044 Kedzior, K. K., Gierke, L., Gellersen, H. M., & Berlim, M. T. (2016). Cognitive functioning and deep transcranial magnetic stimulation (DTMS) in major psychiatric disorders: A systematic review. Journal of Psychiatric Research, 75, 107–115. https://doi.org/ 10.1016/j.jpsychires.2015.12.019 Kedzior, K. K., & Reitz, S. K. (2014). Short-term efficacy of repetitive transcranial magnetic stimulation (rTMS) in depressionreanalysis of data from meta-analyses up to 2010. BMC Psychology, 2, 39–58. https://doi.org/10.1186/s40359-0140039-y Kedzior, K. K., Reitz, S. K., Azorina, V., & Loo, C. (2015c). Durability of the antidepressant effect of the high-frequency repetitive transcranial magnetic stimulation (rTMS) in the absence of maintenance treatment in major depression. A systematic review and meta-analysis of 16 double-blind, randomised, sham-controlled trials. Depression and Anxiety, 32, 193–203. https://doi.org/10.1002/da.22339 Kleine-Budde, K., Müller, R., Kawohl, W., Bramesfeld, A., Moock, J., & Rössler, W. (2013). The cost of depression – A cost analysis from a large database. Journal of Affective Disorders, 147, 137–143. https://doi.org/10.1016/j.jad.2012.10.024 Kreuzer, P. M., Höppner, J., Kammer, T., Schönfeldt-Lecuona, C., Padberg, F., Bajbouj, M., . . . Langguth, B. (2015). rTMS in der Therapie psychiatrischer Erkrankungen. Grundlagen und Methodik [rTMS in the treatment of psychiatric disorders. Basics and methods]. Nervenheilkunde, 34, 965–975. Retrieved from http://www.schattauer.de/t3page/1214.html?manuscript=25237 Kreuzer, P. M., Padberg, F., Schönfeldt-Lecuona, C., Höppner, J., Zwanzger, P., Bajbouj, M., . . . Langguth, B. (2015). Repetitive transkranielle Magnetstimulation in der Behandlung depressiver Störungen. Eine systematische Literaturrecherche [Repetitive transcranial magnetic stimulation in the treatment of depression. A systematic literature search]. Nervenheilkunde, 34, 978–986. Retrieved from http://www.schattauer.de/t3page/ 1214.html?manuscript=25235 Levkovitz, Y., Harel, E. V., Roth, Y., Braw, Y., Most, D., Katz, L. N., . . . Zangen, A. (2009). Deep transcranial magnetic stimulation over the prefrontal cortex: Evaluation of antidepressant and cognitive effects in depressive patients. Brain Stimulation, 2, 188–200. https://doi.org/10.1016/j.brs.2009.08.002 Levkovitz, Y., Isserles, M., Padberg, F., Lisanby, S. H., Bystritsky, A., Xia, G., . . . Zangen, A. (2015). Efficacy and safety of deep transcranial magnetic stimulation for major depression: A prospective multicenter randomized controlled trial. World Psychiatry, 14, 64–73. https://doi.org/10.1002/wps.20199 Moher, D., Liberati, A., Tetzlaff, J., & Altman, D. (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. British Medical Journal, 339, b2535. https://doi.org/10.1136/bmj.b2535 Naim-Feil, J., Bradshaw, J. L., Sheppard, D. M., Rosenberg, O., Levkovitz, Y., Dannon, P., . . . Zangen, A. (2016). Neuromodulation of attentional control in major depression: A pilot deepTMS study. Neural Plasticity, 2016, 5760141. 
https://doi.org/ 10.1155/2016/5760141 Niu, M., Wang, Y. L., Jia, Y., Wang, J., Zhong, S., Lin, J., . . . Huang, R. (2017). Common and specific abnormalities in cortical thickness in patients with major depressive and bipolar
disorders. EBioMedicine, 16, 162–171. https://doi.org/10.1016/ j.ebiom.2017.01.010 Parazzini, M., Fiocchi, S., Chiaramello, E., Roth, Y., Zangen, A., & Ravazzani, P. (2017). Electric field estimation of deep transcranial magnetic stimulation clinically used for the treatment of neuropsychiatric disorders in anatomical head models. Medical Engineering & Physics, 43, 30–38. https://doi.org/ 10.1016/j.medengphy.2017.02.003 Rapinesi, C., Bersani, F., Kotzalidis, G., Imperatori, C., Del Casale, A., Di Pietro, S., . . . Girardi, P. (2015). Maintenance deep transcranial magnetic stimulation sessions is associated with reduced depressive relapses in patients with unipolar or bipolar depression. Frontiers in Neurology, 6, 16. https://doi.org/10.3389/ fneur.2015.00016 Rapinesi, C., Curto, M., Kotzalidis, G. D., Del Casale, A., Serata, D., Ferri, V. R., . . . Girardi, P. (2015). Antidepressant effectiveness of deep transcranial magnetic stimulation (dTMS) in patients with major depressive disorder (MDD) with or without alcohol use disorders (AUDs): A 6-month, open label, follow-up study. Journal of Affective Disorders, 174, 57–63. https://doi.org/ 10.1016/j.jad.2014.11.015 Rodriguez-Cano, E., Alonso-Lana, S., Sarro, S., FernandezCorcuera, P., Goikolea, J. M., Vieta, E., . . . Pomarol-Clotet, E. (2017). Differential failure to deactivate the default mode network in unipolar and bipolar depression. Bipolar Disorders, 19, 386–395. https://doi.org/10.1111/bdi.12517 Rosenberg, O., Dinur Klein, L., Gersner, R., Kotler, M., Zangen, A., & Dannon, P. (2015). Long-term follow-up of MDD patients who respond to deep rTMS: A brief report. The Israel Journal of Psychiatry and Related Sciences, 52, 17–23. Retrieved from https://doctorsonly.co.il/wp-content/uploads/2015/03/05_Longterm-Follow-up.pdf Rosenberg, O., Isserles, M., Levkovitz, Y., Kotler, M., Zangen, A., & Dannon, P. N. (2011). Effectiveness of a second deep TMS in depression: A brief report. Progress in Neuro-Psychopharmacology and Biological Psychiatry, 35, 1041–1044. https://doi. org/10.1016/j.pnpbp.2011.02.015 Rosenberg, O., Shoenfeld, N., Zangen, A., Kotler, M., & Dannon, P. N. (2010). Deep TMS in a resistant major depressive disorder: A brief report. Depression and Anxiety, 27, 465–469. https://doi.org/10.1002/da.20689 Rosenberg, O., Zangen, A., Stryjer, R., Kotler, M., & Dannon, P. N. (2010). Response to deep TMS in depressive patients with previous electroconvulsive treatment. Brain Stimulation, 3, 211–217. https://doi.org/10.1016/j.brs.2009.12.001 Roth, Y., Pell, G. S., Chistyakov, A. V., Sinai, A., Zangen, A., & Zaaroor, M. (2014). Motor cortex activation by H-coil and figure-8 coil at different depths. Combined motor threshold and electric field distribution study. Clinical Neurophysiology, 125, 336–343. https://doi.org/10.1016/j.clinph.2013.07.013 Rothstein, H. R., Sutton, A., & Borenstein, M. (2005). Publication bias in meta-analysis. Prevention, assessment and adjustments. Chichester, UK: Wiley. Schmittwilken, L., Schuchinsky, M., & Kedzior, K. (2017). Neurobiological mechanisms of deep transcranial magnetic stimulation (DTMS): A systematic review. Brain Stimulation, 10, 379. https://doi.org/10.1016/j.brs.2017.01.123 Tendler, A., Barnea Ygael, N., Roth, Y., & Zangen, A. (2016). Deep transcranial magnetic stimulation (dtms) – beyond depression. Expert Review of Medical Devices, 13, 987–1000. https://doi. org/10.1080/17434440.17432016.11233812 Yip, A. G., George, M. S., Tendler, A., Roth, Y., Zangen, A., & Carpenter, L. L. (2017). 
61% of unmedicated treatment resistant depression patients who did not respond to acute TMS treatment responded after four weeks of twice weekly deep
TMS in the Brainsway pivotal trial. Brain Stimulation, 10, 847–849. https://doi.org/10.1016/j.brs.2017.1002.1013 Yu, H.-L., Liu, W.-B., Wang, T., Huang, P.-Y., Jie, L.-Y., Sun, J.-Z., . . . Zhang, M.-M. (2017). Difference in resting-state fractional amplitude of low-frequency fluctuation between bipolar depression and unipolar depression patients. European Review for Medical and Pharmacological Sciences, 21, 1541–1550. Retrieved from http://www.europeanreview.org/article/12522 Zangen, A., Roth, Y., Voller, B., & Hallett, M. (2005). Transcranial magnetic stimulation of deep brain regions: Evidence for efficacy of the H-coil. Clinical Neurophysiology, 116, 775–779. https://doi.org/10.1016/j.clinph.2004.11.008 Ziemann, U. (2017). Thirty years of transcranial magnetic stimulation: Where do we stand? Experimental Brain Research, 235, 973–984. https://doi.org/10.1007/s00221-016-4865-4
Received June 7, 2017 Revision received September 25, 2017 Accepted October 30, 2017 Published online February 2, 2018
Helena M. Gellersen Behavioural and Clinical Neuroscience Institute (BCNI) Department of Psychology University of Cambridge Downing Site Cambridge CB2 3EB UK hg424@cam.ac.uk
Review Article
A Meta-Analytic Re-Appraisal of the Framing Effect

Alexander Steiger1 and Anton Kühberger1,2

1 Department of Psychology, University of Salzburg, Austria
2 Centre of Cognitive Neurosciences, University of Salzburg, Austria

Abstract: We reevaluated and reanalyzed the data of Kühberger’s (1998) meta-analysis on framing effects in risky decision making by using p-curve. This method uses the distribution of only significant p-values to correct the effect size, thus taking publication bias into account. We found a corrected overall effect size of d = 0.52, which is considerably higher than the effect reported by Kühberger (d = 0.31). Similarly to the original analysis, most moderators proved to be effective, indicating that there is not the risky-choice framing effect. Rather, the effect size varies with different manipulations of the framing task. Taken together, the p-curve analysis shows that there are reliable risky-choice framing effects, and that there is no evidence of intense p-hacking. Comparing the corrected estimate to the effect size reported in the Many Labs Replication Project (MLRP) on gain-loss framing (d = 0.60) shows that the two estimates are surprisingly similar in size. Finally, we conducted a new meta-analysis of risk framing experiments published in 2016 and again found a similar effect size (d = 0.56). Thus, although there is discussion on the adequate explanation for framing effects, there is no doubt about their existence: risky-choice framing effects are highly reliable and robust. No replicability crisis there.

Keywords: effect size, framing effect, meta-analysis, p-curve, prospect theory
About 20 years ago, Kühberger (1998) published a meta-analysis on framing effects in risky gain-loss situations. The analysis not only calculated an overall effect size of the framing effect, but also evaluated the influence of some moderators, mainly based on differences in the ways and means by which framing is operationalized. The meta-analysis identified 136 published papers on risk framing in a period of about 15 years and extracted 236 effect sizes based on over 30,000 participants. Kühberger reported an overall weighted framing effect size of d = 0.31. That is, a difference of about 1/3 of a standard deviation can be expected, on average, after changing the description of two identical risky options from gains to losses. Kühberger’s (1998) meta-analysis became one of the primary sources for estimating the size of the framing effect in situations involving the framing of risk information, and it still is (see Kühberger, 2017). In the meantime, however, it became obvious that meta-analyses suffer from a specific drawback: they tend to overestimate effect sizes due to publication bias (Asendorpf et al., 2013). Thus, Kühberger also may have overestimated the effect size in his analysis. A precise and timely estimate of the framing effect is called for, based on recently developed methods for addressing the problem of publication bias in meta-analysis. Here we reanalyze the data of Kühberger (1998) with a focus on correcting for possible overestimation of the original effect due to publication bias. In addition, as our reassessment is based on studies that are relatively old, we compare them with recent findings of gain-loss framing effects. Specifically, the Many Labs Replication Project (MLRP; Klein et al., 2014) replicated the classic risky-choice framing effect with the Asian Disease Task (see below). We will use these recent data as a comparison group which is (i) recent, and (ii) by way of preregistration, largely immune to publication bias. In addition, we collected the papers published in 2016 on risky-choice framing and did a meta-analysis on those studies.
The Framing Effect

The classic risky-choice framing task is the Asian Disease Problem, introduced by Tversky and Kahneman (1981). Here it is, stated in positive terms (lives saved):

Imagine that the US is preparing for the outbreak of an unusual disease, which is expected to kill 600 people. Two alternative programs to combat the disease have been proposed. Assume that the exact scientific estimate of the consequences of the programs are as follows:
– If Program A is adopted, 200 people will be saved.
– If Program B is adopted, there is 1/3 probability that 600 people will be saved, and 2/3 probability that no people will be saved.
– Which program would you choose?

The negative framing condition comes with the same cover story, but the options are described in terms of lives lost:
– If Program C is adopted 400 people will die.
– If Program D is adopted there is 1/3 probability that nobody will die, and 2/3 probability that 600 people will die.

This problem confronts participants with two objectively identical options that are described as if they were different, namely as gains or as losses. Framing thus varies the description of a choice situation, but not its outcome. Participants then have to decide between a sure and an equivalent risky gain, or a sure and an equivalent risky loss. The typical finding is that with positive framing participants are more likely to choose the sure option, whereas participants in the negative frame are more likely to choose the risky option. This is, on logical grounds, irrational, and therefore the framing effect has attracted a lot of attention as an exemplary case of human irrationality not only in psychology, but also in economics, philosophy, and linguistics.

Framing effects come in many varieties. Kühberger (1998) used a variety of moderators, mostly based on differences in operationalization and content of tasks. Subsequent research introduced other distinctions. For instance, Druckman (2004) distinguished between emphasis frames and equivalency frames. Emphasis frames highlight a subset of potentially relevant considerations. For instance, the issue of giving assistance to the poor by the government can be framed either as a humanitarian act or as a government expenditure problem. These different frames simply highlight different aspects of the same issue, but they do not convey logically equivalent information. Thus, holding different preferences is perfectly rational. In contrast, equivalency framing involves casting the same information differently, typically positively or negatively. For instance, the chances of a project can be 70% success, or 30% failure. Here it is more difficult to rationally defend different preferences. Borah (2011) reports that between 1997 and 2007 about 60% of 380 papers published on framing were on emphasis framing, while only about 20% were experiments on equivalence framing.

Probably the best-known typology of equivalency-framing tasks was proposed by Levin, Schneider, and Gaeth (1998). Their distinction is among attribute framing, goal framing, and risky-choice framing. Attribute framing is the simplest variety of equivalency framing. A single attribute is framed (e.g., % success vs. % failure), and evaluative ratings of favorability, or acceptance, are collected. In goal framing, messages are framed to either stress the positive (negative) consequences of performing (failing to perform) an act, focusing attention either on attaining the positive goal or on preventing the negative outcome. Finally, in risky-choice framing tasks (e.g., the Asian Disease Task), identical choice options are framed either as gains or as losses relative to a reference point, and the effect on preferences is observed.

The leading explanation of framing effects is based on prospect theory (Kahneman & Tversky, 1979). The theory fits best for risky-choice framing tasks and goes like this: a positive frame makes people adopt a reference point such that the outcomes of the options are gains. Outcomes near the reference point get disproportionately high value, and therefore a gain of 200 is valued higher than a gain of 600 with probability ⅓. In terms of the theory’s value function: v(+200) > ⅓v(+600). A negative frame leads to the adoption of a reference point such that the outcomes are perceived as losses. However, sure losses are especially aversive, and therefore people prefer the risky loss over the sure loss: v(−400) < ⅔v(−600). We end up with the risky-choice framing effect: risk aversion in the domain of gains and risk seeking in the loss domain.
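The prospect-theory account sketched above can be made concrete with a small numerical example. The value-function parameters below (alpha = beta = 0.88, lambda = 2.25) are commonly cited estimates from cumulative prospect theory and are assumed here purely for illustration; the argument in the text requires only the qualitative shape of v (concave for gains, convex and steeper for losses).

# Illustration only: prospect-theory values for the Asian Disease options.
v <- function(x, alpha = 0.88, beta = 0.88, lambda = 2.25) {
  ifelse(x >= 0, x^alpha, -lambda * (-x)^beta)
}

# Gain frame: sure gain of 200 vs. 1/3 chance of gaining 600
c(sure = v(200), risky = (1/3) * v(600))    # v(+200) > 1/3 v(+600): sure option preferred

# Loss frame: sure loss of 400 vs. 2/3 chance of losing 600
c(sure = v(-400), risky = (2/3) * v(-600))  # v(-400) < 2/3 v(-600): risky option preferred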
Varieties of Framing Tasks and Effect Sizes

Kühberger (1998) reports a general weighted framing effect of Cohen’s d = .31 (Cohen, 1988). Piñon and Gambara (2005) essentially repeated Kühberger’s (1998) meta-analysis with studies that were published between 1997 and 2003. They used the typology of Levin et al. (1998) and reported a mean weighted d = .44 for risky-choice framing, d = .26 for attribute framing, and d = .44 for goal framing. For the subgroup of risky-choice framing tasks of the Asian disease type, Kühberger, Schulte-Mecklenbeck, and Perner (1999) reported a non-standardized effect size: with gains, 63% of participants opted for the sure option, while with losses only 41% opted for the sure option. Other meta-analyses investigated different subgroups of framing tasks. O’Keefe and Jensen (2008, 2009) investigated message framing effects. They reported greater message engagement with gain-framed messages than with loss-framed messages (d = .12), while loss-framed appeals were more persuasive than gain-framed appeals (d = .08). Gallagher and Updegraff (2012) did a meta-analysis on health-message framing and found nonsignificant effects of framing on the persuasiveness of health messages when persuasiveness was assessed as either attitudes toward behavior or intentions toward behavior. Taken together, there does not seem to be the framing effect, but a population of different effects, ranging from d = .50 (presumably in the Asian disease type of task; note that Tversky and Kahneman (1981), in their classic demonstration, report an effect size of d = 1.13) to practical nonexistence (presumably in goal-framing tasks).

Here we replicate and extend Kühberger’s meta-analysis. He distinguishes between two potential groups of moderators: risk characteristics and task characteristics. Among the risk characteristics are: (i) whether the risk manipulation refers to a future risky outcome (e.g., “you will win 200 with probability ⅓”; risk manipulation by reference to a risky event), or is simply used as the label for some possible outcome (e.g., “chances to suffer from a particular disease are 10%”; risk manipulation by labeling). A second risk characteristic is (ii) the quality of the risk: whether there is a riskless option. Indeed, most research presents two options: one sure, the other risky. However, there are also studies where participants choose between risky options, which vary in their degree of risk. Finally, Kühberger considered (iii) the number of risky events, distinguishing studies that presented one single risky event from studies that presented multiple risky events. Important task characteristics are the following: whether the framing manipulation is done by explicitly mentioning gains and losses (gain/loss framing) or only implicitly (e.g., a public-goods game is positive while a commons-dilemma is negative; task-responsive framing); whether the response mode was a choice or a judgment/rating; whether the statistical comparison was done between- or within-subjects; whether the unit of analysis was individuals or groups; and whether the problem domain pertained to health, business, gambling, or to the social domain. We will use the same moderators here. Specifically, Kühberger considers the manipulation of risk, by reference point or by labeling, to be essential, arguing that only the reference point manipulation produces reliable framing effects while the labeling manipulation fails, or even produces contrasting risk attitudes.
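As a side note on how such choice proportions relate to the d metric used throughout, one common conversion (via the log odds ratio; see Borenstein et al., 2009) is sketched below for the 63% vs. 41% sure-option rates mentioned above. This is only an illustration; Kühberger (1998) may have computed effect sizes differently.

# Illustration only: converting a difference in choice proportions to Cohen's d
# via the log odds ratio (logOR * sqrt(3)/pi).
p_gain <- 0.63   # sure-option choices under gain framing
p_loss <- 0.41   # sure-option choices under loss framing

logit <- function(p) log(p / (1 - p))
d_from_props <- (logit(p_gain) - logit(p_loss)) * sqrt(3) / pi
round(d_from_props, 2)   # about 0.49, i.e. roughly half a standard deviation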
Publication Bias

Publication bias is apparent when research findings are held back from publication. Reasons for this are many; the strongest of them is failure to reach significance. For instance, Dickersin (2005; Dickersin, Chan, Chalmers, Sacks, & Smith, 1987) showed that when a paper with (highly) significant p-values is submitted for publication, its chances of being published are up to three times higher than for papers without this “quality.” While the advantage gained by significance differs between disciplines, Psychology and Psychiatry seem to be in the lead, with 91% positive results (Fanelli, 2010). Indeed, research implies that authors may be tempted to use “creative” ways of reaching significance (e.g., Gelman & Loken, 2014; Hartgerink, van Aert, Nuijten, Wicherts, & van Assen, 2016; Kühberger, Fritz, & Scherndl, 2014; Simmons, Nelson, & Simonsohn, 2011). In reaction, much has been done in recent years to prevent and recognize publication bias (see Rothstein, Sutton, & Borenstein, 2005; World Health Organization, 2015).
Correcting for Publication Bias A classic method for detecting bias in published research is the inspection of a funnel graph, which plots effect size against some measure of precision (usually sample size, or the inverse of the standard error of the effect size). In the absence of bias this graph looks like a funnel, since studies based on large samples do cluster near the average, while studies using smaller samples have widely dispersed effect sizes. Bias is indicated by skewness at the base of the funnel. Statistical analysis of the presence of skewness in funnel graphs is possible, but identification of publication bias is difficult, as funnel graphs use effect sizes rather than p-values. Trim-and-fill (Duval & Tweedie, 2000) can also be used to correct for bias, but is problematic since this method also corrects by using effect sizes, and publication bias does not operate on effect sizes, but rather on p-values. A different idea, called precision-effect test, and later precision-effect estimate with standard error (PET-PEESE), was proposed by Stanley (2008) and Stanley and Doucouliagos (2014). This method uses the association between effect size and sample size as the focal variable. The idea is to investigate the relationship between effect size and sample size. Publication bias exists if studies with smaller samples are reporting larger effects, because, presumably, (small) studies that are published can reach significance only if they overestimate the true effect size. Specifically, to the degree that effect sizes are dependent on sample sizes (or to the inverse of the square root of the sample size, more precisely), a regression exists between these two variables. The PET/PEESE procedure consists of running a meta-regression in which studies are the unit of analysis, effect size is the dependent variable, and its variance is the key predictor. The clever insight is that the intercept of this regression is the effect we would expect in the absence of bias. Thus, the intercept tells us the publication-bias-corrected true effect. This sounds nice but there is criticism on the usefulness of the method especially for (social) psychology (e.g., Inzlicht, Gervais, & Berkman, 2015). Most importantly, PET/PEESE’s performance seems to depend on the distribution of sample sizes. The method suffers considerably when there are more small than large Zeitschrift für Psychologie (2018), 226(1), 45–55
48
n studies (nicely simulated and discussed in a blog article by Simonson; http://datacolada.org/59), as is the case in our data. In addition, PET/PEESE requires large differences in sample sizes, a requirement that is rarely found in psychology and surely not in framing research. Thus, we opted not to use this technique. A similar argument applies for cumulative meta-analysis, which also uses the precision of studies as the central measure. The method consists in sorting effect sizes by precision, and then including them in a meta-analysis, one by one, from most to least precise. Results can be plotted and statistically tested for a drift (Borenstein, Hedges, Higgins, & Rothstein, 2009), indicating that precise studies tend to report different effect sizes than imprecise studies. Again, this method works best if there is much variation in sample sizes, rendering it nonoptimal for our purpose. Simonsohn, Nelson, and Simmons (2014a, 2014b, 2016) and Simonsohn, Simmons, and Nelson (2015) introduced pcurve as a method for identifying publication bias. The method’s goal is “to help distinguish between sets of significant findings that are likely versus unlikely to be the result of selective reporting” (2014b, p. 535). Shortly afterwards, p-uniform was introduced (van Assen, van Aert, & Wicherts, 2015), using the same logic, but allowing the calculation of effect sizes from the p-curve of a set of data (Simonsohn et al., 2014a, also show how this is possible with p-curve). The basic idea of both methods is to take the distribution of significant p-values, as the significant studies are not plagued by publication bias, and to compare them to the distribution of significant p-values that are expected under the null. Imagine a researcher who investigates an effect that is actually nonexistent. The significance test will result in some p-value, but lacking an effect, any p-value is equally likely. Thus, p-values are distributed evenly, or uniformly. In contrast, if a true effect is existing, small p-values are more likely. Simonsohn et al. (2014b) use the following example: if a researcher was investigating the effect of gender on height with a sample size of 100,000, she would – because of a true effect – in all likelihood find strongly significant evidence (p < .01) rather than weakly significant evidence (.04 < p < .05). In contrast, in a study where no effect was existent, the p-value of a study would with 1% be .01, and with another 1% it would be .02, with 1% it would be .03, and so on. The shape of the distribution of p-values thus would be uniform, or flat: a p-value of, say, .03 would be expected in 1% of all cases, a p-value of .78 would be expected in 1% of the cases; and any other p-value would also be expected in 1% of the cases. P-curve and p-uniform apply this logic to only the significant results, that is, for 0 < p < .05, on the assumption that without an underlying effect, only in this range p-hacking is a plausible
explanation for a deviation from uniformity. That is, only the findings that have actually survived the significance threshold of the publication phase are used, and no assumptions about how many nonsignificant results might exist in file drawers are necessary. Any deviation from uniformity in the 0 < p < .05 interval indicates either a true effect (if low p-values are overrepresented and the p-curve is right-skewed) or p-hacking (if high p-values are overrepresented and the p-curve is left-skewed). In addition, since p-curve's shape is a function only of effect size and sample size, the shape of the p-curve becomes more markedly right-skewed as power (i.e., sample size) increases. For instance, for a power of 80%, with a true effect size of d = 0.9 and a sample size of n = 20, we can expect about 18 results with p < .01 for every result with .04 < p < .05 (see Simonsohn et al., 2014b). Taken together, only right-skewed p-curves, that is, curves that contain more low (.01s) than high (.04s) p-values, indicate true evidence, while a left skew, resulting from more high than low p-values, indicates p-hacking and biased publication. The stronger the effect, the more right-skewed the distribution of p-values becomes (i.e., small and very small p-values are more frequent), contingent on power. The idea is therefore to test the distribution of only the significant p-values within the range 0 < p < .05. If a uniform distribution of p-values is found in this interval, these p-values cannot be based on an effect of nonzero size; only a right-skewed p-curve indicates evidence for a true effect. The propensity of obtaining small p-values increases with the size of the effect, keeping sample size constant. Thus, the size of the effect can be estimated from the distribution of p-values, given the sample size. P-curve is a good method to correct for publication bias, but it has drawbacks. Most importantly, the method cannot be meaningfully applied if there is no unitary basic phenomenon. For instance, using p-curve to investigate p-hacking over a variety of different topics and disciplines, as in Head, Holman, Lanfear, Kahn, and Jennions (2015), may not be appropriate (Simonsohn et al., 2015, p. 7; see also Bishop & Thompson, 2016). In the case of the framing effect, p-curve is best suited for the analysis of specific moderators and subgroups (e.g., tasks using only the Asian disease problem), and we will reanalyze the dataset used by Kühberger (1998) with the p-curve method. The interesting question is how the new analysis changes the effect size estimate. The obvious expectation is that it will decrease if publication bias is present in the dataset. We will see how much the effect changes, and a sample of framing studies published in 2016 will be used for comparison. In addition, another comparison group is provided by the results of the MLRP, which include a multi-lab replication of the classic Asian disease study.
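The uniform-versus-right-skewed logic is easy to verify by simulation. The following R sketch is an illustration constructed for this purpose (it is not taken from the article or from the p-curve app) and assumes simple two-sample t-tests with equal group sizes:

```r
# Distribution of significant p-values: roughly uniform when d = 0,
# piling up near zero when a true effect exists
set.seed(1)
sim_sig_p <- function(d, n, nsim = 2e4) {
  # nsim two-sample t-tests with n participants per group and true effect d
  x <- matrix(rnorm(n * nsim, mean = 0), nrow = nsim)
  y <- matrix(rnorm(n * nsim, mean = d), nrow = nsim)
  tval <- (rowMeans(y) - rowMeans(x)) /
          sqrt((apply(x, 1, var) + apply(y, 1, var)) / n)
  p <- 2 * pt(-abs(tval), df = 2 * n - 2)
  p[p < .05]                       # keep only the "published" significant results
}
hist(sim_sig_p(d = 0.0, n = 20), breaks = seq(0, .05, .01))  # roughly flat
hist(sim_sig_p(d = 0.5, n = 20), breaks = seq(0, .05, .01))  # right-skewed
```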
Reanalysis of Kühberger's Meta-Analysis
We accessed all papers and obtained the database that was used in Kühberger's (1998) meta-analysis. However, the relevant meta-analytic statistics were recalculated completely from scratch. That is, beginning with Kühberger's papers, we selected all significant results, reassessed them for possible inclusion, and recalculated the meta-analytic statistics of all included papers. This work was done independently by AS; AK was involved only occasionally, to clarify how he had calculated the original statistics. We used the following criteria to select p-values suitable for p-curve analysis: (i) the study is included in the sample of Kühberger (1998); (ii) the study is significant in the digital dataset provided by Kühberger (1998); (iii) the study reports a test statistic (t-test, χ² test, or F-test) based on a difference between two groups (between design) or two measurement points in time (within design). Extracted statistics included the test statistic, the total N of participants used for the calculation of the test statistic, and the relevant degrees of freedom. In cases of multiple comparisons, only one result per experiment was selected. This was, in hierarchical order, either (1) the result of the Asian disease task; (2) the result of the task using a design most similar to the Asian disease task; (3) the first relevant test statistic reported; or (4) the first relevant computable test statistic. Appendix A in the Electronic Supplementary Material, ESM 1, gives the result of this procedure, and Appendix C in ESM 1 provides the references of the studies included in Kühberger's (1998) meta-analysis. All effects entered into p-curve have to be significant and of the same sign; thus, all significant findings contradicting the prospect theory direction were excluded from the p-curve. To estimate the corrected effect size, we used a slightly modified R script from Simonsohn et al. (2014a). The modifications enabled the estimation of average statistical power as well as the entry of F-values and χ²-values in addition to t-values. The logic of finding effect sizes with p-curve and p-uniform follows directly from the fact that the p-curve is uniform for d = 0. For any effect d ≠ 0, the p-curve is right-skewed (indicating an effect) or left-skewed (indicating biased publication). Since the shape of the p-curve depends only on sample size and effect size, knowing its shape and the sample size enables the computation of the effect size. The idea is to find the effect size that produces the most uniform p-curve, given the sample size. In other words, the effect size that produces the most uniform p-curve is the best estimator of the overall effect size, since at exactly this effect the p-curve shows the least deviation from uniformity. Note that this calculation is different from the
traditional calculation of the average effect size in fixed- and random-effects models, because it takes into account that only significant results are included.
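To make this estimation logic concrete, here is a minimal R sketch in the spirit of p-uniform (an illustration under simplifying assumptions: equal-n two-sample t-tests and a two-sided α of .05; it is not the modified Simonsohn et al. script that was actually used):

```r
# Conditional p-values q are uniform on (0, 1) exactly when d equals the true
# effect size; the estimate is the d that minimizes the Kolmogorov-Smirnov D
ks_D <- function(d, tval, n) {
  df    <- 2 * n - 2
  ncp   <- d * sqrt(n / 2)
  tcrit <- qt(.975, df)
  q <- (1 - pt(tval, df, ncp)) / (1 - pt(tcrit, df, ncp))
  suppressWarnings(as.numeric(ks.test(q, "punif")$statistic))
}
# tval: vector of significant t-values; n: per-group sample sizes (same length)
# d_hat <- optimize(ks_D, interval = c(0, 2.5), tval = tval, n = n)$minimum
```

The estimate is the value of d at which the conditional p-values deviate least from uniformity, which mirrors the search over d ∈ [0.0, 2.5] reported in the Results below.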
Results
From the original sample of 230 effect sizes from 136 studies, 81 were eligible for p-curve analysis. Using the above logic of seeking maximal uniformity and searching the range d ∈ [0.0, 2.5], a local minimum of the Kolmogorov-Smirnov D statistic was found at d = 0.522, which is the best estimate of the true effect size (see Figure 1). The lowest D-value indicates the point of least deviation of the estimated p-curve from the observed p-curve and hence the best estimate of the corrected effect size. Tables 1 (risk characteristics) and 2 (task characteristics) give the original and the p-curve findings, and their respective differences, for the moderators that were used in Kühberger's analysis. Note that, due to the estimation method used in p-curve analysis, the estimated effect size is not a mean, and the method does not provide a dispersion measure such as a standard deviation. Table 1 shows that, compared to the original findings reported in Kühberger, the effect sizes are larger in most subgroups. Notably, the effect size estimated for studies manipulating risk by reference to a risky event (like the Asian disease task) is d = .56, which is quite close to Kühberger's original finding (d = .50).
Figure 1. Plot of estimated effect size. The lowest D-value of the Kolmogorov-Smirnov test indicates the least deviation of the estimated p-curve from the observed p-curve and thus the best estimate of the true effect size.
Table 1. Effect sizes for risk characteristics

Characteristic            | Original analysis k(a) | p-values in re-analysis(b) | Original analysis d(c) | p-curve d(d) | Δd(e)
Risk manipulation
  Risky event             | 157 | 72 | +.50 [+.48, +.53] | .564 | +.064
  Labelling               |  72 | 10 | −.11 [−.15, −.07] | .400 | +.510
Quality of risk
  Riskless/risky          | 122 | 55 | +.46 [+.43, +.49] | .522 | +.062
  Risky/risky             | 108 | 27 | +.12 [+.09, +.15] | .537 | +.417
Number of risky events
  Single risky event      | 176 | 63 | +.34 [+.32, +.37] | .651 | +.311
  Multiple risky events   |  54 | 19 | +.17 [+.29, +.33] | .357 | +.187

Notes. (a) Number of effect sizes included in Kühberger's analysis. (b) Number of p-values included in the re-analysis. (c) Average effect size reported by Kühberger (1998), with 95% confidence intervals given in brackets. (d) Estimated effect size based on p-curve. (e) Difference in effect sizes.
Table 2. Effect sizes for task characteristics

Characteristic            | Original analysis k(a) | p-values in re-analysis(b) | Original analysis d(c) | p-curve d(d) | Δd(e)
Framing manipulation
  Gain/loss               | 191 | 67 | +.33 [+.30, +.35] | .613 | +.283
  Task-responsive         |  38 | 15 | +.21 [+.15, +.26] | .334 | +.124
Response mode
  Choice                  | 150 | 59 | +.40 [+.38, +.43] | .681 | +.281
  Rating/judgment         |  61 | 18 | +.08 [+.03, +.12] | .389 | +.309
  Other/mixed             |  19 |  5 | +.14 [+.07, +.21] | .427 | +.287
Comparison
  Between-subjects        | 178 | 62 | +.27 [+.24, +.30] | .430 | +.160
  Within-subjects         |  52 | 20 | +.41 [+.37, +.46] | .911 | +.501
Unit of analysis
  Individual              | 215 | 79 | +.31 [+.29, +.33] | .522 | +.212
  Groups                  |  15 |  3 | +.27 [+.18, +.36] | .569 | +.299
Problem domain
  Business                |  66 | 30 | +.34 [+.30, +.38] | .423 | +.083
  Gambling                |  57 | 23 | +.32 [+.28, +.36] | .831 | +.511
  Health                  |  75 | 29 | +.26 [+.23, +.30] | .648 | +.388
  Social                  |  16 |  4 | +.16 [+.06, +.26] | .159 | −.001
  Other/mixed             |  16 | 10 | +.45 [+.38, +.53] | .477 | +.027

Notes. (a) Number of effect sizes included in Kühberger's analysis. (b) Number of p-values included in the re-analysis. (c) Average effect size reported by Kühberger (1998), with 95% confidence intervals given in brackets. (d) Estimated effect size based on p-curve. (e) Difference in effect sizes.
However, for the labeling studies the p-curve estimates a substantial effect (d = .40). Indeed, according to the p-curve findings this distinction is practically useless, although it was considered central in the original analysis. In contrast to the original findings, the distinction between different risk qualities (riskless vs. risky option; more vs. less risky options) was also of minor influence. The influence of the number of risky events (single vs. multiple) remained important, with single risky events showing nearly double the effect size of multiple risky events. Taken together, the p-curve qualifies the original findings: the risk manipulation was
of much lesser importance, the quality of risk was of no importance, and the difference related to the number of risky events remained important. The results for task characteristics (Table 2) are also noteworthy. The picture is clear: all moderators were found to be highly influential (probably with the exception of the difference between individual and group studies, but note the small frequency in group studies). Note especially the strong effect in within-subject studies, and the big differences between problem domains.
Figure 2. p-curve. A right-skewed distribution of p-values indicates a true effect of framing. The observed p-curve includes 81 statistically significant (p < .05) results, of which 71 are p < .025. There were no nonsignificant results entered.
A specific aim of p-curve is to evaluate whether there is a true effect in the sample of studies. Figure 2 (p-curve web app, version 4.05; Simonsohn, Nelson, & Simmons, 2016; http://www.p-curve.com/app4/) shows the result. Two binomial tests were done, following the advice of the authors. The first compares the proportion of observed significant p-values below .025 to the expected proportion under the null. This test found significantly (p < .0001) more studies with p-values below .025 than expected (50%). The second test compares the observed p-curve to a p-curve with only 1/3 power (given the same sample sizes). This test is useful if the first test finds no evidence for a true effect: it can then decide whether the lack of evidence is due to a very small or nonexistent effect, or due to a lack of information (e.g., not enough p-values). This test was nonsignificant (p > .999), indicating that the observed p-curve is not significantly flatter than the 1/3 power p-curve. Further tests for the full p-curve (p < .05) and the half p-curve (p < .025) found large negative Z-scores, indicating a strong deviation from the null in favor of the alternative of a right skew (Z = -24.11, p < .0001, and Z = -23.96, p < .0001, respectively). Note, however, that the frequency of p-values increases just below the significance level of p = .05, indicating some p-hacking. As reported above, Kühberger (1998) considered the distinction between risk manipulation by reference to a risky event versus risk manipulation by labeling essential. Our analysis does not seem to support this, as the effect sizes were similar (d = .40 vs. d = .56, respectively). However, this could be an artifact, since many labeling studies showed nonsignificant or opposite effects, all of which are excluded from the p-curve analysis.
We therefore ran a separate p-curve on the labeling studies, including only the significant effects in the opposite direction, to investigate whether there is also evidence for an effect in the opposite direction, as suggested by Kühberger (1998). The p-curve estimate of the labeling studies indeed showed a moderate effect size in the opposite direction (d = .26; see Figure 3). Figure 4 indicates that these studies contain reliable evidence for an effect opposite to the traditional framing effect.
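The first binomial test is easy to reproduce from the counts given in the Figure 2 caption; the following one-liner is a base R illustration, not the web app's internal code:

```r
# 71 of the 81 significant p-values fall below .025; under the null of no true
# effect, only about half of them should
binom.test(x = 71, n = 81, p = 0.5, alternative = "greater")
```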
Figure 3. Plot of estimated effect size for labeling studies.
Figure 4. p-curve for labeling studies. The observed p-curve includes 12 statistically significant (p < .05) results, of which 10 are p < .025. There were no nonsignificant results entered.
This adds credibility to the distinction made in Kühberger (1998) and calls for further investigation. We used only a subsample of Kühberger's effects here, namely the k = 81 significant ones. If only those had been published (i.e., assuming perfect publication bias), we would have found a mean (uncorrected) effect size of d = .899 [0.759, 1.040]. p-curve corrects this down to d = .522, that is, a reduction of 0.377. Note that we independently reassessed the effect sizes of all studies; Kühberger's effect size for these 81 studies is smaller, d = .657 [0.562, 0.753]. That is, the reassessment, which is based on the strict regime proposed by Simonsohn et al. (2014b), led to considerable differences and in most cases to higher estimates. Note also that the reassessment uses only one effect per experiment, frequently the first effect reported in a series of studies; this effect often is larger than the effects found in subsequent experiments of a paper.
Recent Evidence I: The Many Labs Replication Project
The results of the Many Labs Replication Project (MLRP; Klein et al., 2014) offer a welcome standard for comparing our findings to unbiased evidence. The MLRP consisted of 36 different research groups and investigated the variability and replicability of 13 important published effects. Gain versus loss framing (Tversky & Kahneman, 1981), operationalized as the original Asian disease task, was among those effects. The replication attempt of all 36 groups was preregistered, and the results are publicly available at the Open Science Framework website (https://osf.io/wx7ck/). We accessed the website and analyzed the data. Klein et al. (2014; Table 2) report the following for the Asian disease task: (i) original effect size (Tversky & Kahneman, 1981): d = 1.13; (ii) median replication effect size: d = .58; (iii) mean (weighted) effect size: d = .62 (.60); (iv) proportion p < .05, opposite direction: 0; (v) proportion p < .05, same direction: .86; (vi) proportion nonsignificant: .14; (vii) overall χ²(N = 6,271) = 516.4, p < .001. Our analysis of the dataset led to a somewhat lower estimate of the mean (weighted) effect size, d = .57 (.58). This difference is presumably due to an ambiguity in the data file: Klein et al. (2014) report a single χ² test collapsing over all experiments, with 36 experiments containing N = 6,271 participants. However, the file downloaded from the OSF website contains 36 studies including 6,344 participants (this number is also reported by Klein et al. on p. 144). An interesting finding appears if we run p-curve on the MLRP data. Note that this is, strictly speaking, nonsensical: in a preregistered many-labs study there is no room for publication bias. There is, of course, room for p-hacking, although this too is unlikely. It is instructive to
see, however, that p-curve does actually correct the effect: the method leads to an effect size of d = .523. This is remarkably close to our estimate, and our estimate is within the 99% confidence limits of the unweighted effect size based on 6,271 participants [.52, .71]. However, there is room for speculation about why the method corrects when there is, in all likelihood, nothing to correct. Note, however, that Simonsohn et al. (2014a) also report a p-curve on these data and find an effect size of d = .60; that is, their p-curve did not correct the effect size. Interestingly, Simonsohn's analysis is based on 34 studies, also excluding some participants (see above). Note that in most cases p-curve corrects by a much larger amount; in our case we saw a reduction of 0.377 (see also van Aert, Wicherts, & van Assen, 2016, for an example of substantial correction). In sum, the result of the MLRP is an effect size that is comparable to the effect found in research done about 30 years earlier. Both effects are only about half the size of the effect in the original study (d = 1.13) of Tversky and Kahneman (1981), indicating once more that original findings tend to report effects that are too large.
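As a rough cross-check (a back-of-the-envelope calculation that assumes the pooled χ² stems from a single 2 × 2 frequency table), the overall χ² reported above can be converted into a standardized effect size:

```r
# phi coefficient from the pooled chi-square, then Cohen's d from phi
chi2 <- 516.4
N    <- 6271
phi  <- sqrt(chi2 / N)            # about .29
d    <- 2 * phi / sqrt(1 - phi^2)
d                                 # about .60, close to the reported estimates
```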
Recent Evidence II: A Meta-Analysis of Framing Studies Published in 2016
The effect size estimates reported so far correspond nicely. As a further comparison, we report the findings of a meta-analysis that includes only recent studies. We searched PsycINFO for framing and decision making in 2016 and found 15 papers eligible for inclusion (see Appendix B in ESM 1 for a table disclosing the extracted effect sizes, and Appendix D in ESM 1 for the references). We extracted the meta-analytic statistics and analyzed the data with p-curve (see Figures 5 and 6). We found an estimated effect size for the framing studies published in 2016 of d = .56. The p-curve implies that the effect is different from zero: the binomial test for the half p-curve found significantly (p < .0001) more studies with p-values below .025 than expected (50%), and the test for the full curve was also significant. In sum, there is clear evidence for a framing effect of similar size in the recently published studies: no decline effect (Schooler, 2011).
Discussion
We reevaluated the meta-analysis of Kühberger (1998) on the effect size of framing using the p-curve method to test for the influence of publication bias. We included only
statistically significant results, which were re-extracted independently from the original papers. Using p-curve on a sample of 81 effect sizes, we found an overall framing effect size of d = .52, considerably higher than Kühberger's (1998) original finding. According to p-curve, the selected studies provide evidence of a reliable framing effect. The same conclusion follows from the MLRP: the close replication of the Asian disease task was successful, showing an effect size of d = .60. Finally, a meta-analysis of 15 effects reported in papers published in 2016 found an effect size of d = .56. Overall, there is a robust and reliable framing effect in risky decision making of about half a standard deviation. To put this into context, consider the effect expressed as a binomial effect size display (Rosenthal & Rubin, 1982). Imagine a framing study where 100 participants have to choose between a sure outcome of €200 and a risky outcome of €600 with probability 1/3, while another 100 participants have to choose between a sure outcome of €400 and a risky outcome of €600 with probability 2/3. Imagine further that there exists a neutral, unframed condition in which the options are equally attractive, such that 50 participants prefer the sure option and 50 prefer the risky option in either condition. Framing the outcomes as gains increases risk aversion in about 13 people (i.e., 63 participants expressing a preference for the sure gain). In contrast, framing the outcomes as losses increases risk seeking in about 13 people (i.e., only 37 participants opting for the sure loss).
Figure 5. Plot of estimated effect size for studies published in 2016.
Figure 6. p-curve for studies published in 2016. The observed p-curve includes 15 statistically significant (p < .05) results, of which 14 are p < .025. There were no nonsignificant results entered.
Size and Replicability of the Framing Effect
We found considerable differences for different experimental procedures, with effect sizes varying over a large interval (.159 ≤ d ≤ .911). That is, a random-effects model is appropriate. This may not be surprising, but it is theoretically challenging, since it is unlikely that prospect theory can accommodate this variation. The most plausible interpretation of our findings is that the framing effect is the consequence of a blend of multiple independent features: cognitive, motivational, emotional, and pragmatic (see Kühberger, 2017), translating into differences in effect size due to variability in procedural features. Kühberger's analysis implies that the distinction between framing by manipulation of risky events versus labeling is important theoretically and practically, and the specific p-curve analysis of only the labeling studies corroborated this. Thus, there is a subset of framing studies that differs from the majority of studies, but both come under the heading of "framing." The labeling effects are opposite in direction and only about half the size of the classic effects (d = .26), but they are not statistical flukes. Interpretation of this finding is complicated, as there are multiple correlations between moderators. For instance, risky event manipulations tend to come with gain/loss framing tasks where people indicate preferred choices in gambling tasks, presumably constituting an optimal environment for breeding strong classic framing effects. In contrast, labeling manipulations are frequently paired with task-responsive framing, requiring judgment in business domains. This combination presumably breeds smaller, or even opposite, effects.
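Given this heterogeneity, a random-effects model is the natural summary. A minimal sketch of how such a model might be fit with the metafor package is shown below; the data frame dat and its columns d and vd are hypothetical placeholders, not the dataset analyzed here:

```r
library(metafor)
# dat$d: study effect sizes (Cohen's d); dat$vd: their sampling variances
res <- rma(yi = d, vi = vd, data = dat, method = "REML")
summary(res)   # reports tau^2 (between-study variance) alongside the mean effect
```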
Our meta-analysis of the papers published in 2016 offers only limited insights. First, we were not too picky in the choice of keywords, which led to a subset of very diverse papers; p-curve may not be optimal for such a set. However, we did not run subgroup analyses, because the number of effects is too small to do this seriously. Nevertheless, the results are largely similar. Given that there is a reliable effect, it should replicate in studies with appropriate power. Indeed, framing effects do replicate: the MLRP reports a successful replication rate of 86%, based on p < .05, which is quite impressive. To exemplify matters of replication in terms of power, consider the following: a study picked at random from the population of studies to which our subset belongs has a median sample size of N = 104 and is tested at α = .05. Such a study has a power of 0.84 (one-sided) to find a significant difference between framing conditions if the effect size is d = 0.52. Thus, replication is more likely than is usual in psychology, where power levels tend to linger at about 0.40 (e.g., Sedlmeier & Gigerenzer, 1989) and replication failures are reported with disturbing frequency. Not that bad an expectation for doing a framing study, is it?
Acknowledgments
The action editor for this article was Edgar Erdfelder.
Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/2151-2604/a000321
ESM 1. Text (.docx). Tables and references.
References
Asendorpf, J. B., Connor, M., De Fruyt, F., De Houwer, J., Denissen, J. J. A., Fiedler, K., . . . Wicherts, J. M. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27, 108–119. https://doi.org/10.1002/per.1919
Bishop, D. V. M., & Thompson, P. A. (2016). Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value. PeerJ, 4, e1715. https://doi.org/10.7717/peerj.1715
Borah, P. (2011). Conceptual issues in framing theory: A systematic examination of a decade's literature. Journal of Communication, 61, 246–263. https://doi.org/10.1111/j.1460-2466.2011.01539.x
Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. (2009). Introduction to meta-analysis. Chichester, UK: Wiley.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). London, UK: Erlbaum.
Dickersin, K. (2005). Publication bias: Recognizing the problem, understanding its origins and scope, and preventing harm. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 11–34). Chichester, UK: Wiley.
Dickersin, K., Chan, S., Chalmers, T. C., Sacks, H. S., & Smith, H. (1987). Publication bias and clinical trials. Controlled Clinical Trials, 8, 343–353. https://doi.org/10.1016/0197-2456(87)90155-3
Druckman, J. N. (2004). Political preference formation: Competition, deliberation, and the (ir)relevance of framing effects. American Political Science Review, 98, 671–686. https://doi.org/10.1017/S0003055404041413
Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56, 455–463. https://doi.org/10.1111/j.0006-341X.2000.00455.x
Fanelli, D. (2010). "Positive" results increase down the hierarchy of the sciences. PLoS One, 5(4), e10068. https://doi.org/10.1371/journal.pone.0010068
Gallagher, K. M., & Updegraff, J. A. (2012). Health message framing effects on attitudes, intentions, and behavior: A meta-analytic review. Annals of Behavioral Medicine, 43, 101–116. https://doi.org/10.1007/s12160-011-9308-7
Gelman, A., & Loken, E. (2014). The statistical crisis in science [online]. American Scientist, 102. Retrieved from http://www.americanscientist.org/issues/feature/2014/6/the-statisticalcrisis-in-science
Hartgerink, C. H., van Aert, R. C., Nuijten, M. B., Wicherts, J. M., & van Assen, M. A. (2016). Distributions of p-values smaller than .05 in psychology: What is going on? PeerJ, 4, e1935. https://doi.org/10.7717/peerj.1935
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13, e1002106. https://doi.org/10.1371/journal.pbio.1002106
Inzlicht, M., Gervais, W., & Berkman, E. (2015). Bias-correction techniques alone cannot determine whether ego depletion is different from zero: Commentary on Carter, Kofler, Forster, & McCullough, 2015. Social Sciences Research Network. https://doi.org/10.2139/ssrn.2659409
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–291.
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B. Jr., Bahník, Š., Bernstein, M. J., . . . Cemalcilar, Z. (2014). Investigating variation in replicability. Social Psychology, 45, 142–152. https://doi.org/10.1027/1864-9335/a000178
Kühberger, A. (1998). The influence of framing on risky decisions: A meta-analysis. Organizational Behavior and Human Decision Processes, 75, 23–55. https://doi.org/10.1006/obhd.1998.2781
Kühberger, A. (2017). Framing. In R. Pohl (Ed.), Cognitive illusions (2nd ed., pp. 79–98). New York, NY: Psychology Press.
Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS One, 9, e105825. https://doi.org/10.1371/journal.pone.0105825
Kühberger, A., Schulte-Mecklenbeck, M., & Perner, J. (1999). The effects of framing, reflection, probability, and payoff on risk preference in choice tasks. Organizational Behavior and Human Decision Processes, 78, 204–231. https://doi.org/10.1006/obhd.1999.2830
Levin, I. P., Schneider, S., & Gaeth, G. J. (1998). All frames are not created equal: A typology and critical analysis of framing effects. Organizational Behavior and Human Decision Processes, 76, 149–188. https://doi.org/10.1006/obhd.1998.2804
O'Keefe, D. J., & Jensen, J. D. (2008). Do loss-framed persuasive messages engender greater message processing than do gain-framed messages? A meta-analytic review. Communication Studies, 59, 51–67.
https://doi.org/10.1080/10510970701849388
O'Keefe, D. J., & Jensen, J. D. (2009). The relative persuasiveness of gain-framed and loss-framed messages for encouraging disease detection behaviors: A meta-analytic review. Journal of Communication, 59, 296–316. https://doi.org/10.1111/j.1460-2466.2009.01417.x
Piñon, A., & Gambara, H. (2005). A meta-analytic review of framing effect: Risky, attribute and goal framing. Psicothema, 17, 325–331.
Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166–169.
Rothstein, H. R., Sutton, A. J., & Borenstein, M. (2005). Publication bias in meta-analysis: Prevention, assessment and adjustments. Hoboken, NJ: Wiley. https://doi.org/10.1002/0470870168
Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437. https://doi.org/10.1038/470437a
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316. https://doi.org/10.1037/0033-2909.105.2.309
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. https://doi.org/10.1177/0956797611417632
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). p-Curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681. https://doi.org/10.1177/1745691614553988
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). p-Curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143, 534–547. https://doi.org/10.1037/a0033242
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2016). p-Curve app (Version 4.05) [Web application]. Retrieved from http://www.p-curve.com/app4/
Simonsohn, U., Simmons, J. P., & Nelson, L. D. (2015). Better p-curves: Making p-curve analysis more robust to errors, fraud, and ambitious p-hacking, a reply to Ulrich and Miller (2015).
Journal of Experimental Psychology: General, 144, 1146–1152. https://doi.org/10.1037/xge0000104
Stanley, T. D. (2008). Meta-regression methods for detecting and estimating empirical effects in the presence of publication selection. Oxford Bulletin of Economics and Statistics, 70, 103–127. https://doi.org/10.1111/j.1468-0084.2007.00487.x
Stanley, T. D., & Doucouliagos, H. (2014). Meta-regression approximations to reduce publication selection bias. Research Synthesis Methods, 5, 60–78. https://doi.org/10.1002/jrsm.1095
Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science, 211, 453–458. https://doi.org/10.1126/science.7455683
van Aert, R. C., Wicherts, J. M., & van Assen, M. A. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science, 11, 713–729. https://doi.org/10.1177/1745691616650874
van Assen, M. A., van Aert, R., & Wicherts, J. M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293–309. https://doi.org/10.1037/met0000025
World Health Organization. (2015). WHO statement on public disclosure of clinical trial results. Retrieved from http://www.who.int/ictrp/results/reporting/en/
Received June 13, 2017
Revision received October 31, 2017
Accepted October 31, 2017
Published online February 2, 2018
Anton Kühberger
Department of Psychology
University of Salzburg
Hellbrunnerstr. 34
5020 Salzburg, Austria
anton.kuehberger@sbg.ac.at
Review Article
Effect Size Estimation From t-Statistics in the Presence of Publication Bias: A Brief Review of Existing Approaches With Some Extensions
Rolf Ulrich (1), Jeff Miller (2), and Edgar Erdfelder (3)
(1) Department of Psychology, University of Tübingen, Germany
(2) Department of Psychology, University of Otago, Dunedin, New Zealand
(3) Department of Psychology, University of Mannheim, Germany
Abstract: Publication bias hampers the estimation of true effect sizes. Specifically, effect sizes are systematically overestimated when studies report only significant results. In this paper we show how this overestimation depends on the true effect size and on the sample size. Furthermore, we review and follow up methods originally suggested by Hedges (1984), Iyengar and Greenhouse (1988), and Rust, Lehmann, and Farley (1990) that allow the estimation of the true effect size from published test statistics (e.g., from the t-values of reported significant results). Moreover, we adapt these methods to allow meta-analysts to estimate the percentage of researchers who consign undesired results in a research domain to the file drawer. We also apply the same logic to the case in which significant results tend to be underreported. We demonstrate the application of these procedures for conventional one-sample and two-sample t-tests. Finally, we provide R and MATLAB versions of a computer program to estimate the true unbiased effect size and the prevalence of publication bias in the literature.
Keywords: file-drawer problem, publication bias, effect size estimation
Introduction
Meta-analysis is the preferred method of summarizing results within a research area (see Borenstein, Hedges, Higgins, & Rothstein, 2009). For example, in medical research one might be interested in the efficacy of a specific drug. In this case, meta-analysts often combine the results of several studies to obtain a particularly reliable and comprehensive assessment of the drug's efficacy. Large numbers of meta-analytic studies have been conducted in various fields, such as clinical psychology (e.g., Shadish, Navarro, Matt, & Phillips, 2000), differential psychology (e.g., Hyde, Fennema, & Lamon, 1990), educational research (e.g., Kroesbergen & Van Luit, 2003), medical research (e.g., Lodewijkx, Brouwer, Kuipers, & van Hezewijk, 2013), neuropsychology (e.g., Frazier, Demaree, & Youngstrom, 2004), neuroscience (e.g., Molenberghs, Cunnington, & Mattingley, 2009), and social psychology (e.g., Richard, Bond, & Stokes-Zoota, 2003). Meta-analysis, however, may not provide an accurate assessment of a certain effect when the meta-analyst only
has access to a subset of the studies conducted to investigate the effect. Specifically, if nonsignificant results or results in an unexpected direction are underreported in the literature, a meta-analysis produces a distorted impression of the size or even the reality of the effect under study. In particular, selective reporting of positive results is known to bias the estimated effect size in the positive direction (e.g., Driessen, Hollon, Bockting, Cuijpers, & Turner, 2015). In extreme cases, biased reporting may even create the impression that a nonexistent effect is real (Simonsohn, Nelson, & Simmons, 2014b). Collectively, such biased estimates of effect sizes by selective publishing are said to result from publication bias (Begg & Berlin, 1988; Rothstein, Sutton, & Borenstein, 2005). This bias promotes false conclusions and thus threatens scientific progress. Moreover, this bias may be especially harmful in clinical and other applied research domains, and may promote the waste of public resources (Ioannidis, 2014). The reasons for such publication bias are manifold (e.g., Coburn & Vevea, 2015; Dickersin, 2005). Presumably, the major reason is that negative results are less likely to be Ó 2018 Hogrefe Publishing
published because they fail to support the proposed hypothesis (Fanelli, 2010b; Greenwald, 1975). Also, critics have long suspected that journal editors prefer to publish positive results and that therefore researchers are reluctant to submit negative results (Franco, Malhotra, & Simonovits, 2014; Smart, 1964; Sterling, 1959). Consequently, negative results may be underreported and thus underrepresented in meta-analyses ("file-drawer effect"; Rosenthal, 1979). Results from surveys about publication preferences clearly support this notion (e.g., Cooper, DeNeve, & Charlton, 1997; Coursol & Wagner, 1986; Dickersin, 2005; Greenwald, 1975; John, Loewenstein, & Prelec, 2012). The notion also receives support from the surprisingly high rate of positive results reported in journals (e.g., Fanelli, 2010b, 2012; Francis, 2014; Ioannidis, 2015; Smart, 1964; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995), and it appears that growing competition in science increases the file-drawer effect even further (Fanelli, 2010a, 2012). Moreover, the low replication rate of published experimental results, in combination with the high power of the replication studies (Open Science Collaboration, 2015), might reflect a strong publication bias, because the low replication rate suggests that many published studies are false positives and that the corresponding negative results went into the file drawer (Ioannidis, 2005; Pashler & Harris, 2012). Conflicts of interest and industry sponsorship (e.g., Noury et al., 2015) are additional reasons why research may not be published or negative results might be suppressed (see Coburn & Vevea, 2015). Despite its harmful effects, publication bias is often not adequately addressed by researchers conducting meta-analyses. These researchers usually consider the possibility that publication bias is present, but they less often consider the effect of publication bias on the estimated effect size (Sutton, 2005), perhaps partially due to a lack of available methods for correcting the estimate. As a result, the development of such methods is currently an active research area (e.g., van Assen, van Aert, & Wicherts, 2015; Citkowicz & Vevea, 2017; Hedges, 2017; McShane, Böckenholt, & Hansen, 2016; Simonsohn, Nelson, & Simmons, 2014a; Simonsohn et al., 2014b; Vevea & Woods, 2005), with a wide variety of methods with various advantages and disadvantages currently being discussed and examined. Thus it is a major aim of this paper to suggest a further refined method that not only allows one to detect publication bias but that can also be used to estimate its probability and to correct estimated effect sizes for publication bias, thereby overcoming limitations of existing methods. In the remainder of this introduction, we review several methods that have been developed for detecting publication bias in meta-analysis. Naturally, our review focuses on the models that are most closely related to the present
work; for more exhaustive reviews, the interested reader should consult Hedges and Vevea (2005) and Jin, Zhou, and He (2015). For the present review, we have categorized the methods into two groups. The first group embraces rather general methods that (usually) do not require specific assumptions about the nature of publication bias – they can be regarded as universal tools for assessing publication bias. The disadvantages of these tools are that they often have low power for detecting publication bias and they do not usually provide any principled way to correct the observed mean effect size for publication bias. The second group is based on selection models, and these invoke specific assumptions about how publication bias distorts the distribution of test statistics that are actually reported. These methods generally use maximum likelihood techniques to recover the true distribution from the observed distorted distribution and – provided that their assumptions are met – can provide accurate estimates of the true effect size and the extent of publication bias, together with standard errors of these quantities. Even when their assumptions are not met, these methods are still useful for sensitivity analyses (Borenstein et al., 2009; Duval, 2005) that provide guidance about how strongly the meta-analytic conclusions might depend on publication bias. The method that we advance falls into this second group, because it is also based on a selection model. Like other models in this group, this method allows meta-analysts to assess publication bias if researchers primarily publish significant results while putting nonsignificant results into the file drawer. In addition, and in contrast to the previous models, it also allows the assessment of publication bias when researchers suppress significant results because they expect null results. As will be described in detail after we present the method, it has many mathematical similarities with previous selection-method approaches but also some distinctive features that would be advantageous under certain circumstances. For example, in contrast to the classical selection-method approach, it allows one to estimate the probability that a nonsignificant result will end up in the file drawer, instead of being published.
Global Methods for Detecting Publication Bias
Because of the threatening influence of publication bias on research progress, several authors have devised various methods to assess the extent of this bias within a given research domain (Rothstein et al., 2005). The fail-safe N method suggested by Rosenthal (1979) is one of the earliest methods (see Becker, 2005, for a review, variations, and drawbacks of this method). Rosenthal provided a formula to
estimate the number of studies hidden in the file drawer (i.e., the fail-safe N) that would be required to reduce the overall significance in a meta-analysis to a nonsignificant level. If this estimate is relatively large with respect to the number of studies included in the meta-analysis, then it is implausible that publication bias would be entirely responsible for the effect, because it is unlikely that so many unpublished studies exist. One major drawback of this method, however, is that there exists no justifiable criterion for deciding when the fail-safe N is excessively large. For this and other reasons, the fail-safe N method is no longer recommended for assessing publication bias, although it has played an important historical role (Borenstein et al., 2009, p. 285). Another method frequently used for investigating publication bias is the funnel plot (Light & Pillemer, 1984), which is a simple scatter plot and thus a nonparametric approach (Sterne, Becker, & Egger, 2005). Each individual study in the meta-analysis is plotted as one point, with its estimated effect size on the x-axis and a measure of precision (such as the inverse standard error or the sample size) on the y-axis (Sterne et al., 2005; Thornton & Lee, 2000). In the absence of any publication bias, the resulting plot should resemble the shape of an inverted symmetrical funnel (i.e., ^), because the accuracy of the estimated effect should increase with precision. To the extent that studies with small samples remain unpublished because of nonsignificant results and publication bias, however, funnel plots become asymmetrical. When a funnel plot exhibits an asymmetry, the overall effect size will be overestimated unless methods are used to correct for the asymmetry. Statistical tools for detecting funnel plot asymmetry have been proposed by Begg and Mazumdar (1994; adjusted rank correlation method) and by Egger, Davey Smith, Schneider, and Minder (1997; linear regression method) – for reviews, see Coburn and Vevea (2015), Sterne and Egger (2005) and especially Jin et al. (2015) for a discussion of the advantages and limitations of these and other funnel plot tools. Once asymmetry is detected, the associated overestimation may be corrected by the trim and fill algorithm (Duval & Tweedie, 2000). This method adjusts the estimated overall effect for the possible effects of the missing studies, and it thereby also estimates the number of studies in the file drawer.1 Cumulative meta-analysis is a further, recently invented nonparametric graphical tool for investigating the existence of publication bias (see Coburn & Vevea, 2015). In a first step, this procedure rank orders the individual studies
according to the precision (i.e., 1/SE) of their effect estimates, from the most precise to the least precise, which is almost identical to ranking the studies from largest to smallest by sample size. In a second step, the effect sizes of these studies are averaged in a cumulative fashion, starting with the most precise study and including at each step the study with the next-most-precise estimate. If the trajectory of these cumulative means reveals a positive shift, the shift provides evidence for a publication bias in which small-sample studies with negative results are suppressed. Coburn and Vevea (2015) argued that this analysis is especially useful when the number of studies in a meta-analysis is small. Recently, some researchers have proposed that the analysis of p-values can open a window into the file-drawer problem (e.g., van Assen et al., 2015; Simonsohn et al., 2014b). The true theoretical distribution of p-values is determined by the true effect size and sample size; the effects of publication bias can be seen as a distortion of this distribution (e.g., a pronounced density increase at the significance level of .05; de Winter & Dodou, 2015; Ginsel, Aggarwal, Xuan, & Harris, 2015; Lakens, 2015a; Masicampo & Lalande, 2012; Ulrich & Miller, 2017). Simonsohn et al. (2014a) and van Assen et al. (2015) have suggested that the true effect size can be estimated by comparing observed curves of significant p-values with those predicted from a certain underlying true effect size, and in some simulations these procedures recover the true underlying effect sizes quite precisely for one-tailed statistical tests. Unfortunately, these procedures do not incorporate the results of any studies with nonsignificant effects, and they do not provide estimates of either the extent of the publication bias or the standard error of the estimated effect size.
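To make the funnel plot regression idea from earlier in this section concrete, here is a minimal, self-contained R sketch of an Egger-type asymmetry test on simulated study data (the variable names and numbers are illustrative, not taken from the studies reviewed here):

```r
# Egger et al. (1997): regress the standardized effect (effect / SE) on
# precision (1 / SE); an intercept reliably different from zero signals
# funnel plot asymmetry
set.seed(2)
k  <- 40
n  <- sample(20:200, k, replace = TRUE)       # per-group sample sizes
se <- sqrt(2 / n)                              # approximate SE of Cohen's d
yi <- rnorm(k, mean = 0.3, sd = se)            # unbiased set of study effects
egger <- lm(I(yi / se) ~ I(1 / se))
summary(egger)$coefficients["(Intercept)", ]   # should be near zero here
```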
Selection Models for Detecting and Correcting for Publication Bias
In contrast to these global tools, selection models (see Hedges & Vevea, 2005) make specific assumptions about the selective publication process that determines the observed results available for a meta-analysis. The assumed selection process in these models is implemented via weight functions (Hedges & Vevea, 2005; Iyengar & Greenhouse, 1988). Methods based on this approach can be used to detect and correct for publication bias. Starting with the seminal work of Hedges (1984), many selection models have been developed based on different assumptions and weight functions (e.g., Citkowicz & Vevea, 2017; Hedges,
All of these funnel plot methods proceed from the assumption that the true (unbiased) funnel plot is symmetric. Some writers, however, have argued that this assumption is not always met (e.g., when several subgroups with different effect sizes contribute to the funnel plot). In this case incorrect conclusions can be drawn if methods are used to remove the asymmetries from funnel plots (e.g., Lau, Ioannidis, Terrin, Schmid, & Olkin, 2006; Peters, Sutton, Jones, Abrams, & Rushton, 2007; Vevea & Woods, 2005).
Figure 1. Illustration of weight functions (left panels) and the resulting probability density function (PDF) of observed t-values along with its undistorted PDF (right panels). The undistorted PDF is identical in all three examples, that is, a noncentral t-distribution with ν = 18 degrees of freedom and a noncentrality parameter of ε = 0.22, which corresponds to an effect size of δ = 0.10 for a two-sample t-test with n = 10 per group and α = 0.05. The critical t-values are t18,0.025 = ±2.10. Upper two panels: Step weight function suggested by Hedges (1984). Middle two panels: Step weight function suggested by Iyengar and Greenhouse (1988). Lower two panels: Gradual weight function suggested by Iyengar and Greenhouse (1988).
2017; Hedges & Vevea, 2005; Iyengar & Greenhouse, 1988; Rust, Lehmann, & Farley, 1990; Vevea & Hedges, 1995; Vevea & Woods, 2005). In this model-based approach (for a review, see Hedges & Vevea, 2005), it is generally assumed that publication bias systematically distorts the shape of a theoretical sampling distribution that is associated with a certain test statistic x (e.g., the shape of a noncentral t-distribution). This approach then uses reverse engineering to reconstruct the true undistorted sampling distribution f(x) from the observed (i.e., distorted) sampling distribution g(x) by invoking a specific weight function w(x) to model the selection process. If g(x) denotes the distorted probability density function (PDF) and f (x) the corresponding undistorted PDF of a test statistic x, then g(x) is given by
g(x) = \frac{f(x)\, w(x)}{\int_{-\infty}^{\infty} f(x)\, w(x)\, dx} \qquad (1)
where w(x) is a nonnegative weight function. The integral in the denominator of this equation is needed in Ó 2018 Hogrefe Publishing
order to scale g(x) such that the area under this function equals one. For example, Hedges (1984) analyzed the situation of a two-sided two-sample t-test and considered a scenario in which only significant results in both positive and negative directions would be published. For this scenario, the weight function w(t) is given by (Iyengar & Greenhouse, 1988)
w(t) = \begin{cases} 0, & |t| \le t_{\nu,\alpha} \\ 1, & \text{otherwise} \end{cases} \qquad (2)
where tν,α is the critical value associated with significance level α. The left upper panel in Figure 1 illustrates this weight function, and the distorted PDF of published t-values in the right panel emerges from the true undistorted noncentral t-distribution of all observed t-values that is also depicted in this panel. In this case, this PDF corresponds to a distorted t-distribution, because only significant t-values would be available for a meta-analysis. Hedges (1984) also proposed a maximum likelihood approach to infer the noncentrality parameter (and thus the effect size) associated with the undistorted Zeitschrift für Psychologie (2018), 226(1), 56–80
t-distribution f(t) by maximizing the likelihood function of the truncated t-distribution g(t) when only significant t-values are available for a meta-analysis. On the basis of the truncated distribution, he also examined the expected publication bias and demonstrated that this bias can be severe (see also Chapter 14 in Hedges & Olkin, 1985). Iyengar and Greenhouse (1988) extended the work of Hedges (1984) by considering the case in which a meta-analysis includes some statistically nonsignificant results. One of their models assumed a step weight function (see middle panels in Figure 1)
w(t) = \begin{cases} e^{-\gamma}, & |t| \le t_{\nu,\alpha} \\ 1, & \text{otherwise} \end{cases} \qquad (3)
and another model used a gradual weight function (see lower panels in Figure 1)
w(t) = \begin{cases} \left( \dfrac{|t|}{t_{\nu,\alpha}} \right)^{\beta}, & |t| \le t_{\nu,\alpha} \\ 1, & \text{otherwise} \end{cases} \qquad (4)
where γ ≥ 0 and β ≥ 0 are constants. Both of these functions give some weight to nonsignificant t-values, but these weights are smaller than those given to significant t-values. No publication bias is present for γ = 0 or β = 0, whereas if these parameters approach infinity, the two weight functions mimic the all-or-none weight functions suggested by Hedges (1984). Iyengar and Greenhouse (1988) also employed a maximum likelihood approach to estimate the effect size δ of the undistorted PDF and the parameters γ and β. When these authors applied this approach to a meta-analysis examining the effects of open versus traditional education on creativity, the estimated effect size was close to zero, and the estimates of β and γ suggested the presence of publication bias. A further extension of such weight models was introduced by Hedges (1992) and emphasized especially by Hedges and Vevea (2005). This extension assumes that the selection process depends on the level of significance, because reviewers and authors consider results as more conclusive when they are more strongly significant (see Hedges & Vevea, 1996). For example, the general form of a weight function for a two-sided t-test is given by
w(t) = \begin{cases} w_1, & 0 \le |t| < t_{\nu,\alpha_1} \\ w_2, & t_{\nu,\alpha_1} \le |t| < t_{\nu,\alpha_2} \\ \ \vdots \\ w_m, & t_{\nu,\alpha_{m-1}} \le |t| < t_{\nu,\alpha_m} \end{cases} \qquad (5)
with larger weights w1 ≤ w2 ≤ w3 ≤ . . . ≤ wm for smaller levels of statistical significance α1 ≥ α2 ≥ α3 ≥ . . . ≥ αm. These extended weight models are mathematically and computationally complex, and the authors say that they
may only be successfully applied when the number of studies entering a meta-analysis is large (cf. Hedges & Vevea, 2005; Sutton, Song, Gilbody, & Abrams, 2000; see Vevea & Woods, 2005, for a more parsimonious version of weight models). Rust et al. (1990) developed a model with a weight function not depending on the statistical significance of an outcome but rather on the size of the test statistic. Specifically, Rust et al. (1990) proceeded from a fixed threshold c. If the observed test statistic of a study is below this threshold, that is, t ≤ c, the outcome of the study will be put into the file drawer with probability p, whereas when the test statistic is above this threshold, the outcome will definitely be published. As in the previous studies, they employed a maximum likelihood approach to estimate the parameters p and c. Moreover, they also developed a likelihood ratio test to examine whether the value of p was significantly larger than zero, which would indicate the presence of publication bias. There are two major limitations of the model of Rust et al. (1990) that might render it unattractive for meta-analysts. First, it assumes that all studies in a meta-analysis have the same sample size, although these sizes sometimes vary meaningfully across the studies in a meta-analysis (see Hedges & Olkin, 1985, Chapter 2). However, Rust et al. (1990) mentioned that it would be easy to adjust the model to handle sample size variation. Second, for mathematical tractability they assumed test statistic distributions that are not actually appropriate for standard statistical tests such as the t-test. Moreover, this assumption appeared to be important. When they applied their model to two published large-scale meta-analyses in the domain of marketing research, the conclusions from the meta-analyses depended on the distributional assumption. Recently, Citkowicz and Vevea (2017) proposed a parsimonious weight function that can capture gradual publication biases (but see Hedges, 2017). Specifically, these authors replaced the unspecific weight function in (1) with a flexible beta density function that only depends on two parameters. This parametric function addresses the selection process with fewer parameters than selection models with multiple steps (see Hedges & Vevea, 2005) but does not address cliff effects that are typically observed in empirical p-curves at the significance level of α = 0.05 (for a review see Ulrich & Miller, 2017). Finally, Guan and Vandekerckhove (2016) proposed a Bayesian approach to the mitigation of publication bias by encompassing weight functions such as the one proposed by Iyengar and Greenhouse (1988). This approach, although computationally complex, allows the estimation of the true underlying effect size given that the prior distribution is specified appropriately. An application (Etz & Vandekerckhove, 2016) of this approach suggests that the failure to replicate
studies in psychology can be explained by the overestimation of effect sizes caused by small sample sizes, together with publication bias. In sum, all of these approaches are either highly complex or make strong and somewhat implausible assumptions about the weight functions, thus calling for an alternative approach that is based on weaker and more flexible assumptions but still relies on a simple selection model with clear-cut interpretations of the parameters. This motivated our own approach presented in the following section.
Elaborations and Extensions

The present paper reviews and follows up on the weight function approach to the publication bias problem and thus builds on the approaches of Hedges (1984), Iyengar and Greenhouse (1988), and especially on Rust et al. (1990). First, we consider one-sided testing scenarios where only positive results (i.e., significant results in the expected direction) are published instead of statistically significant positive or negative results as in the model of Hedges (1984). It seems important to study the effects of one-sided publication bias because this type of bias is arguably more common than two-sided publication bias. One-sided bias is common because, after a significant finding in one direction has been published, various factors increase the difficulty of publishing a significant result in the opposite direction (Pashler & Harris, 2012). It also seems important to study one-sided publication bias because it is much more likely to produce an illusory effect than two-sided publication bias.2 For this case we provide analytical results showing how publication bias will distort the estimated effect size with one-sample and two-sample t-tests. We also show how a maximum likelihood procedure can be used to estimate the true overall effect in this case. This procedure also enables one to estimate the standard error of the estimated effect size, and it thereby also provides a likelihood ratio test for the null hypothesis that the true effect size is zero, since this is an important objective in most meta-analyses. This model can be regarded as the worst-case scenario of a
publication bias when only positive results are published and thus may help researchers to assess the maximum possible publication bias. Second, we extend this approach to a scenario that incorporates some nonsignificant and negative results – an approach similar to the ones suggested by Rust et al. (1990) and Iyengar and Greenhouse (1988). Instead of assuming that the publication probability depends on the p-value, however, we assume that the probability of publishing nonsignificant results depends on the publication strategy employed by the researcher. Specifically, we assume that published results represent a mixture of studies that are published irrespective of their outcomes and others that are published only when their outcomes are significant in the predicted direction. Hence, although all significant results will be published, only a certain proportion of nonsignificant results will be published. Importantly, in contrast to traditional selection models, the present model also allows the estimation of this probability along with the standard error of this estimate. For this mixture model, we again investigate the impact of publication bias on the estimated effect size and develop a maximum likelihood procedure for estimating and significance testing of the true overall effect size with t-tests. We also investigate the model's predictions with regard to funnel plots and compare its power to detect publication bias against standard methods. In addition, we illustrate the model using the data of a recent meta-analysis. Third, we also generalize the models to scenarios where researchers tend to suppress significant effects because such effects may refute their hypotheses or simply may create a conflict of interest. The mixture model is again applied to a recent meta-analysis in which significant effects might well have been suppressed. For both models, R and MATLAB code is provided for estimating the true effect sizes and standard errors, for computing confidence intervals, and for hypothesis testing (see the Electronic Supplementary Materials, ESM 1 (R scripts) and ESM 2 (MATLAB code)). Monte-Carlo simulations are employed to evaluate the performance of all procedures.3
2 Although Hedges and Vevea (2005, p. 152) mentioned a possible extension of the maximum likelihood approach to one-sided tests, they suggested that it would not be feasible because it would result in numerical problems. Contrary to their suggestion, the simulations reported in this article demonstrate that the one-sided selection model works well with the standard distributional assumptions for t-tests. These problems seem to vanish with multiple studies, as discussed in the appendix of McShane et al. (2016).
3 It must be stressed that alternative R code is available. First, Kathleen M. Coburn and Jack L. Vevea developed the R package "weightr" (April 4th, 2017). This is a powerful statistical package for performing sensitivity analyses with the model suggested by Vevea and Hedges (1995). Furthermore, it also allows users to estimate parameters from the modified model by Vevea and Woods (2005). These models proceed from the assumption that the distribution of effect sizes is normal or, at least, can be approximated by a normal distribution. One technical advantage of this assumption is that it also enables the computation of random-effects meta-analyses. Second, McShane et al. (2016) provide similar R code that also proceeds from normally distributed effect sizes, but they commented that this code appears to result in estimates that are less stable than those produced by weightr. Nevertheless, neither of these codes can be used to estimate effect size when significant results are suppressed, and neither provides an estimate for the probability that a study will end up in the file drawer. The present software handles both of these issues and thus supplements these routines. Moreover, it proceeds from the exact sampling distribution of t-tests rather than the normal approximation, an approximation that might be poor because of the small sample sizes in many studies (Szucs & Ioannidis, 2017).
Meta-Analysis With Only Significant Results
Computer simulations have revealed that effect sizes are strongly overestimated when the estimates are based exclusively on significant results (Anderson, Kelley, & Maxwell, 2017; Lane & Dunlap, 1978; Simonsohn et al., 2014b; van Assen, van Aert, Nuijten, & Wicherts, 2014). In particular, these results have indicated that the overestimation can be quite large for commonly employed significance levels. For the case of two-sided tests, Hedges (1984) and Hedges and Olkin (1985, Chapter 14) studied the bias of these estimates analytically and confirmed the simulation results reported by Lane and Dunlap (1978). Specifically, they showed that the absolute bias tends to zero when the true standardized effect size approaches zero or when it becomes large. A maximum bias emerges between d = 0.2 and d = 0.3 depending on sample size. This pattern is specific to two-sided t-tests, however; as will be shown below, the picture is very different with one-sided tests. Extending the work of Hedges and Olkin (1985), the first part of this section provides a detailed bias analysis for one-sided one-sample and two-sample t-tests. Subsequently, we further extend that work by introducing a maximum likelihood procedure to estimate the true effect size from studies that only report significant positive results in a specific direction (i.e., one-sided). This procedure also provides an estimate of the standard error of the estimated effect size, and a test of whether the true effect size is significantly larger than zero. The analysis in each of these two parts is based on the following stepped weight function
$$
w(t) = \begin{cases} 0, & t \le t_{\nu,\alpha}, \\ 1, & \text{otherwise}, \end{cases} \tag{6}
$$
which expands Hedges's weight function (cf. Equation 2) to situations in which only positive results are published. Note that this weight function must be considered the most extreme form of publication bias, because meta-analysts may also include nonsignificant results in their analysis. Nevertheless, the results of the present section illustrate the maximal attainable publication bias, knowledge that could also facilitate sensitivity analyses.
Bias

This section analyzes the bias of the estimated effect size when only significant results constitute the empirical basis for a meta-analysis. We first consider one-sample t-tests and thereafter two-sample tests.
One-Sample t-Tests
Assume a researcher conducts a single study by sampling n independent paired observations (X, Y) = ((X1, Y1), ..., (Xn, Yn)) with the difference scores D̃ = X̃ − Ỹ = (D1, ..., Dn). According to the researcher's hypothesis the true mean of X should be larger than the true mean of Y. Consequently, the researcher will only publish the results of this study if the mean M of the difference scores is significantly larger than zero. A one-sided one-sample t-test is employed for testing the null hypothesis with the test statistic T = M/(S/√n), where S is the unbiased sample standard deviation of the difference scores. The null hypothesis is rejected when this statistic is larger than the critical value tα that is associated with a prespecified significance level α, usually 5%. Given that the results are significant, the researcher uses Cohen's formula d = T/√n to estimate the effect size from the observed data. For example, let n = 10 and
D̃ = (13.4, −3.9, 6.0, −0.4, 8.0, −1.0, 9.9, 12.4, 22.1, 3.1); then M = 6.96 and S = 7.90. Thus the statistic T = 6.96/(7.90/√10) = 2.787 is significant, that is, larger than the critical value t9,0.05 = 1.833, letting our researcher reject the null hypothesis. The effect size is estimated to be d = 2.787/√10 = 0.881, which is a large effect according to Cohen (1992, Table 1).4 As mentioned before, if only d values from significant studies enter a meta-analysis, the true effect δ is overestimated. Specifically, if the difference scores are normally distributed with mean μ and standard deviation σ, then the T statistics follow a noncentral t-distribution f_T(t | ε, ν) with the noncentrality parameter ε = √n · δ and with ν = n − 1 degrees of freedom, where δ = (μ − 0)/σ denotes the true effect size. Thus, when published results are based only on significant positive results, the conditional PDF of T given that T > tα is
$$
f_T(t \mid T > t_\alpha; \varepsilon, \nu) = \frac{f_T(t \mid \varepsilon, \nu)}{P(T > t_\alpha)} = \frac{f_T(t \mid \varepsilon, \nu)}{1 - F_T(t_\alpha \mid \varepsilon, \nu)}, \qquad t > t_\alpha, \tag{7}
$$
where F_T(t | ε, ν) denotes the cumulative distribution function associated with T. Using (7) and the Law of the Unconscious Statistician (Ross, 1980, p. 39), it is straightforward to
4 Cohen (1988, p. 48) suggested using the symbol dz rather than d for matched-pairs t-tests in order to distinguish this measure from the effect size measure for two-sample t-tests with independent means. However, for the present paper this distinction is not relevant unless one wants to compare the effect sizes between these two types of t-tests.
Figure 2. Expected estimated effect size E(d | T > tα) of a one-sample t-test as a function of true effect size δ, sample size n, and significance level α (A: α = 0.05 and B: α = 0.01).
compute the conditional expected value of Cohen’s effect size as
$$
E(d \mid T > t_\alpha) = \frac{1}{\sqrt{n}} \int_{t_\alpha}^{\infty} t \, f_T(t \mid T > t_\alpha; \varepsilon, \nu) \, dt. \tag{8}
$$
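As an illustration of Equation 8, the following R sketch computes this conditional expectation for a one-sample t-test using the noncentral t-distribution. It is only a sketch written for this purpose (the function name and defaults are ours), not the routines provided in the ESM.

# Illustrative R sketch (ours, not the ESM code): numerical evaluation of
# Equation 8 for a one-sample t-test that is published only when significant.
expected_d_onesample <- function(delta, n, alpha = 0.05) {
  nu    <- n - 1                                  # degrees of freedom
  eps   <- sqrt(n) * delta                        # noncentrality parameter
  t_cut <- qt(1 - alpha, df = nu)                 # one-sided critical value t_alpha
  p_sig <- 1 - pt(t_cut, df = nu, ncp = eps)      # P(T > t_alpha)
  f     <- function(t) t * dt(t, df = nu, ncp = eps) / p_sig
  integrate(f, lower = t_cut, upper = Inf)$value / sqrt(n)
}
# Example from the text: delta = 0.25 and n = 10 give an expected estimate
# of roughly 0.8 (cf. Figure 2A).
expected_d_onesample(0.25, 10)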
The value of this integral can be evaluated numerically (e.g., Shampine, 2008). Thus, the average value estimated from significant results, E(d | T > tα), can be compared directly to the true value δ, without resorting to simulations. Figure 2 depicts the conditional expected mean of d as a function of true effect size δ, sample size n, and significance level α. Several points can be noted. First, the expected estimate generally overestimates the true effect size. For example, for a true effect size of δ = 0.25, the expected estimate would be almost 0.80 with a sample size of n = 10 and α = 0.05. Second, this bias becomes smaller with increasing sample size. Third, the bias is especially pronounced for true effects smaller than about 0.5. In fact, for larger sample sizes, the bias is almost nil for true effects larger than 0.5. Finally, the bias gets larger as α is decreased. Trivially, as α approaches 1, the bias must vanish. The bias associated with significant positive results differs from the bias for two-sided t-tests (Hedges, 1984; Hedges & Olkin, 1985), which diminishes to zero when δ approaches zero.

Two-Sample t-Tests
It is easy to extend the preceding analysis to two-sample t-tests with sample sizes of n1 and n2 for independent samples 1 and 2, respectively. In this case, effect size is defined as δ = (μ1 − μ2)/σ, where μ1 and μ2 denote the
expected means of the two samples and σ is their common standard deviation. Assume that the expected means under the alternative hypothesis are μ1 > μ2 and that a researcher observed M1 = 25, S1 = 15, n1 = 30 and M2 = 15, S2 = 10, n2 = 20. The resulting t-value for this example is t = 2.61, which is computed via the test statistic
$$
T = \frac{M_1 - M_2}{S_p} \sqrt{\frac{n_1 n_2}{n_1 + n_2}}, \tag{9}
$$
where Sp is the pooled estimate of the common σ. Therefore, the preceding equation can be rewritten as
$$
T = d \sqrt{\frac{n_1 n_2}{n_1 + n_2}}, \tag{10}
$$
and thus the estimated effect size may be obtained from the statistic T
$$
d = T \sqrt{\frac{n_1 + n_2}{n_1 n_2}}. \tag{11}
$$
The test statistic T follows a noncentral t-distribution with noncentrality parameter
$$
\varepsilon = \frac{\mu_1 - \mu_2}{\sigma} \sqrt{\frac{n_1 n_2}{n_1 + n_2}} = \delta \sqrt{\frac{n_1 n_2}{n_1 + n_2}}, \tag{12}
$$
and with degrees of freedom
$$
\nu = (n_1 - 1) + (n_2 - 1). \tag{13}
$$
Therefore, the expected estimated effect size for significant studies is given by
Figure 3. Expected estimated effect size E(d | T > tα) of a two-sample t-test as a function of true effect size δ, sample size n = n1 = n2, and significance level α (A: α = 0.05 and B: α = 0.01).
$$
E(d \mid T > t_\alpha) = \sqrt{\frac{n_1 + n_2}{n_1 n_2}} \int_{t_\alpha}^{\infty} t \, f_T(t \mid T > t_\alpha; \varepsilon, \nu) \, dt, \tag{14}
$$
which simplifies for n1 = n2 = n to
$$
E(d \mid T > t_\alpha) = \sqrt{\frac{2}{n}} \int_{t_\alpha}^{\infty} t \, f_T(t \mid T > t_\alpha; \varepsilon, \nu) \, dt. \tag{15}
$$
This latter equation was used to compute the expected estimate d for a two-sample t-test, and the results are depicted in Figure 3. The results are similar to those obtained for the one-sample t-test. However, the overestimation of the true effect size is even larger for the two-sample t-test than for the one-sample t-test. This difference arises mainly because of the extra multiplier √2 that appears in Equation 15 compared to Equation 8. Note that publication bias not only biases the estimate d but also the sampling variance of d. Specifically, when a meta-analysis is performed on exclusively significant results, the standard error of d will be overestimated (see Appendix E in the Electronic Supplementary Material, ESM 3, for a derivation of this result).
Maximum Likelihood Estimation of True Effect Size δ

In the previous section, the analysis showed exactly how the conditional mean of d may strongly overestimate the
true effect size δ when positive results are published selectively. In this section we employ the method of maximum likelihood to estimate δ, and we illustrate this method for two-sample t-tests. We also provide R code to compute these estimates and their confidence intervals (see Appendix H in the Electronic Supplementary Material, ESM 3).5 Assume the data corpus of a meta-analytic researcher consists of significant positive t-values t̃ = (t1, ..., tk) from k two-sample t-tests. Each t-value comes from a different (independent) study, and the sample sizes may differ across studies. The associated sizes are denoted by (ñ1, ñ2) = ((n1,1, n2,1), ..., (n1,k, n2,k)), where n1,i and n2,i are the sample sizes of the experimental and control groups for the ith study, respectively. Because the noncentrality parameter ε is a function of both the unknown effect size δ and the sample sizes of the ith study, the noncentrality parameter will also be indexed with i. An illustrative numerical example with k = 15 studies is shown in Table 1.6 Ignoring any effects of publication bias, a traditional fixed-effect meta-analysis would yield a weighted estimate of 0.40 for δ, with 95% CI [0.30, 0.51] (Hedges & Olkin, 1985, Chapter 6). For the following maximum likelihood estimation (MLE) procedure, we formally considered the case of homogeneity in δ across all studies, as would be assumed in a fixed-effect meta-analysis. (In the Discussion, we will report simulations showing that the resulting procedure is rather robust against a violation of this assumption.) In general,
5 We appreciate the help of Daniel Heck for translating our MATLAB code to R.
6 The data in this table were simulated under the assumption of δ = 0.
Table 1. Illustrative example. n1 and n2 denote the number of participants in the experimental group and control group, respectively. Column t contains the observed t-value and column d the estimated effect size of each study (see Equation 10 in Hedges & Olkin, 1985, p. 81)

Study   n1   n2     t      d
1       34   41   2.04   0.47
2       55   50   2.21   0.43
3       28   19   1.85   0.54
4       32   25   2.01   0.53
5       88   90   2.58   0.39
6       70   40   2.61   0.51
7       33   45   1.73   0.39
8       35   22   1.79   0.48
9       81   93   1.68   0.25
10      62   48   1.93   0.37
11      40   50   1.68   0.35
12      80   80   1.94   0.31
13      25   38   2.16   0.55
14      42   44   1.82   0.39
15      18   22   2.22   0.69
the likelihood function of the data t̃ included in a meta-analysis can be written as
$$
L(\tilde{t} \mid \delta, \tilde{n}_1, \tilde{n}_2, \alpha) = \prod_{i=1}^{k} f_T(t_i \mid T > t_\alpha; \varepsilon_i, \nu_i) \tag{16}
$$
$$
= \prod_{i=1}^{k} \frac{f_T(t_i \mid \varepsilon_i, \nu_i)}{1 - F_T(t_\alpha \mid \varepsilon_i, \nu_i)}. \tag{17}
$$
The value of δ which maximizes this function is the MLE d, and this value also maximizes the log-likelihood log(L), which is more convenient to compute numerically and is given by
$$
\log L(\tilde{t} \mid \delta, \tilde{n}_1, \tilde{n}_2, \alpha) = \sum_{i=1}^{k} \log f_T(t_i \mid \varepsilon_i, \nu_i) - \sum_{i=1}^{k} \log\left[1 - F_T(t_\alpha \mid \varepsilon_i, \nu_i)\right]. \tag{18}
$$
Maximization can be performed numerically, for example, with the Simplex method (Nelder & Mead, 1965). R code that numerically estimates δ for one- and two-sample t-tests is contained in Appendix H in the Electronic Supplementary Material, ESM 3. In each case, the code also provides the estimated standard error of the MLE d. This standard error is calculated from the observed Fisher information, that is, SE = 1/√I(d), which also allows the calculation of a confidence interval for the true effect size. Moreover, the code performs an asymptotic likelihood ratio
test for evaluating the null hypothesis δ = 0. We employed the data in Table 1 to illustrate the code for the two-sample case. For this example, the program computes d = −0.11, SE = 0.17, and 95% CI [−0.45, 0.23]. The likelihood ratio test is not significant, χ² = 0.47, df = 1, p = .49. Note that these results contrast sharply with the outcome of the aforementioned traditional fixed-effect meta-analysis, which does not correct for publication bias. We also conducted Monte-Carlo simulations to assess the performance of this MLE procedure for the two-sample t-test (see Appendix A in the Electronic Supplementary Material, ESM 3).
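To make the estimation step concrete, the following R sketch maximizes the log-likelihood of Equation 18 for a set of significant two-sample t-values. It is only an illustration under the fixed-effect assumption; the function name is ours, and the full routines (including standard errors and likelihood ratio tests) are those in ESM 1.

# Illustrative R sketch (ours): ML estimate of delta from k two-sample t-tests
# of which only significant results were published (Equation 18).
mle_delta_sig_only <- function(t, n1, n2, alpha = 0.05) {
  nu    <- n1 + n2 - 2
  a     <- sqrt(n1 * n2 / (n1 + n2))       # epsilon_i = a_i * delta
  t_cut <- qt(1 - alpha, df = nu)          # study-specific critical values
  negLL <- function(delta) {
    eps <- a * delta
    -sum(dt(t, df = nu, ncp = eps, log = TRUE)) +
      sum(log(1 - pt(t_cut, df = nu, ncp = eps)))
  }
  optimize(negLL, interval = c(-2, 3))$minimum
}
# Applied to the k = 15 studies of Table 1, this yields an estimate close to
# the value reported in the text:
# t  <- c(2.04, 2.21, 1.85, 2.01, 2.58, 2.61, 1.73, 1.79, 1.68, 1.93,
#         1.68, 1.94, 2.16, 1.82, 2.22)
# n1 <- c(34, 55, 28, 32, 88, 70, 33, 35, 81, 62, 40, 80, 25, 42, 18)
# n2 <- c(41, 50, 19, 25, 90, 40, 45, 22, 93, 48, 50, 80, 38, 44, 22)
# mle_delta_sig_only(t, n1, n2)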
Application

In order to illustrate the model with real data, we reanalyzed the data considered by Shanks et al. (2015). These authors investigated 43 independent studies of risk-taking and conspicuous consumption in men. Specifically, these studies reported experiments testing the hypothesis that these behaviors increase when men are exposed to mating primes. This hypothesis has typically been investigated by comparing the behaviors between males exposed to mating primes and males exposed to neutral primes. All studies examined by Shanks et al. (2015) reported significant results. They performed a random-effects meta-analysis (see Borenstein et al., 2009, Chapter 12) and found a mean priming effect of d = 0.57 with 95% CI [0.49, 0.65], which is of medium size according to Cohen's classification scheme. Note that this standard analysis implicitly neglects the possibility that significant results were published selectively. An additional funnel plot analysis performed by these authors suggested, however, that nonsignificant results were suppressed in this field of research. We applied the MLE procedure of Equation 18 to k = 32 studies from this meta-analysis (i.e., we excluded correlational studies and some other studies that were not suitable). The MLE of δ was 0.44 with 95% CI [0.31, 0.57]. These results contrast with the random-effects meta-analysis of these same 32 studies that implicitly neglects a file-drawer effect, d = 0.61, 95% CI [0.52, 0.70]. Therefore, the present analysis supports the conclusion of Shanks et al. (2015) that the estimated size of this priming effect has been inflated by publication bias.
Meta-Analysis With Significant and Nonsignificant Results

The preceding scenario assumed that the average effect size is estimated using only studies that produced significant results, but some meta-analyses would also include
studies with nonsignificant results, though usually not as many as studies with significant results (e.g., Fanelli, 2010b; Open Science Collaboration, 2015; Sterling et al., 1995). As was discussed in the Introduction, it is possible to extend the preceding analysis to incorporate this possibility (e.g., Iyengar & Greenhouse, 1988; Rust et al., 1990). Therefore we now take up the model originally suggested by Rust et al. (1990) and extend it so that it can be applied to meta-analyses in which studies employ t-tests and moreover differ in their sample sizes.

This section is organized as follows. First, we review the basic assumptions of the extended model and derive its basic predictions. Second, the predicted bias in estimating δ is investigated as a function of the model's parameters. Third, the maximum likelihood estimation of true effect sizes for t-tests is introduced and demonstrated with a numerical example using R code. Fourth, results are reported for Monte-Carlo simulations that examined the model's power to detect potential publication biases and to correct d when such bias is present.

The extended version of this model allows for two paths to publishing. According to the SP-path, a study is only published if the outcome statistically confirms the study's hypothesis ("selective publishing," SP), or in other words, if T > tα is observed. By contrast, if a study takes the PE-path, its results are published whether the results are significant or not ("publishing everything," PE). Studies entering a given meta-analysis constitute a probabilistic mixture of these two paths. The meta-analyst may obtain significant outcomes from either path but may only obtain nonsignificant outcomes from the PE-path.7

In order to quantify this scenario, it is important to distinguish four possible outcomes (Figure 4). With probability psp, a randomly selected study takes the SP-path. This study will only contribute a t-value to the literature if its result is significant. The compound probability P(SP ∩ s) of this outcome is psp · (1 − β), where 1 − β is the power of the statistical test, that is, P(T > tα) = 1 − β = 1 − F_T(tα | ε, ν). However, if a nonsignificant result is obtained in this study, it is not published but is instead stored in the file drawer. The compound probability of this outcome is P(SP ∩ ns) = psp · β. With probability 1 − psp, a study takes the PE-path. In this case, the study is published whether its result is significant or not; the corresponding probabilities are P(PE ∩ s) = (1 − psp) · (1 − β) and P(PE ∩ ns) = (1 − psp) · β, respectively.

It should be noted that the model suggested by Rust et al. (1990) is conceptually different from the mixture model
Figure 4. Possible outcomes within a certain research domain when scientific publishing represents a mixture of two paths to publication: Selective publishing (SP, upper path) versus publish everything (PE, lower path). A randomly selected study takes the SP-path with probability psp and the PE-path with probability 1 − psp. For studies on the SP-path, only significant results are published, and studies with nonsignificant results end up in the file drawer. Significant outcomes are observed with probability 1 − β, that is, with the statistical power of the t-test. Nonsignificant results are obtained with probability β, that is, the Type II error probability of the t-test. A study that takes the PE-path is published whether its outcome is significant or not.
just outlined. The Rust model assumes that all significant outcomes are published, whereas nonsignificant outcomes are put in the file drawer with probability psp and are published with the complementary probability 1 − psp. According to this alternative view, the SP-path is only determined after the result is known. In contrast, our mixture model distinguishes between the SP- and PE-paths on an a priori basis (i.e., before the results are known). It can be shown that these two alternative conceptualizations are mathematically equivalent. Therefore, the mathematical results and MLE procedure developed in this section also apply to the Rust model. Moreover, the present mixture model is mathematically similar to the weight model of Iyengar and Greenhouse (1988). In contrast to this previous model, however, the present mixture model (a) applies to one-sided tests and (b) reparameterizes γ as psp, which is easier to interpret (see Equation 3). Figure 5 depicts the four compound probabilities generated by this mixture model for a one-sample t-test as a function of effect size δ, sample size n, and probability psp (left panels psp = 0.5, right panels psp = 0.8); the significance level is α = 0.05. The probability P(SP ∩ ns) represents the probability that a study will be stored in the file drawer.
7 There could be at least two reasons for selective publishing (i.e., the SP-path). First, selective publishing might be solely attributed to the researcher who conducted the study – a researcher could simply be reluctant to publish nonsignificant results (see Cooper et al., 1997; Dickersin, 2005). Second, selective publishing could also reflect the review process, that is, editors and reviewers may be more likely to accept submissions reporting significant results (see Sterling et al., 1995). The present mixture model does not distinguish between these two cases, and hence the model applies in either case.
Figure 5. Compound probabilities P(SP ∩ s), P(SP ∩ ns), P(PE ∩ s), and P(PE ∩ ns) as a function of effect size δ, sample size n, and probability psp (left panels psp = 80%, right panels psp = 50%). P(SP ∩ ns) represents the probability of a file-drawer result.
Naturally, this probability is larger for smaller effect sizes and sample sizes, because there are more nonsignificant results in these cases. Conversely, as one expects, the probability of a significant effect increases with effect size and sample size, whether the study takes the SP- or PE-path. The overall probability of publishing an article and thus contributing to a meta-analysis is consequently
$$
P(\text{Result is published}) = 1 - p_{sp} \, \beta. \tag{19}
$$
Bias

Inclusion of nonsignificant results in meta-analyses may substantially reduce the bias in estimating small effects. According to the above mixture model, the expected estimate of d in a meta-analytic study can be computed via the standard formula for conditional expectations
$$
E(d) = E(d \mid SP) \, P(SP) + E(d \mid PE) \, P(PE). \tag{20}
$$
The components of this equation will be explained next, but it should be emphasized that P(SP) and P(PE) reflect the proportions of all published results, so P(SP) is not the same as the model parameter psp. First, E(d | SP) and E(d | PE) denote the expected means of d from published studies taking the SP- and PE-paths, respectively. Note that E(d | SP) simply corresponds to Expression 8 or 14, depending on whether a one-sample or two-sample t-test is used. Furthermore, E(d | PE) is equal to √(1/n) · E(T | ε, ν) for the one-sample t-test and to √((n1 + n2)/(n1 n2)) · E(T | ε, ν) for the two-sample t-test, where E(T | ε, ν) is the mean of the noncentral t-distribution. Second, P(SP) and P(PE) in Equation 20 are the probabilities that a t-value within a meta-analysis came from studies taking the SP- and PE-paths, respectively. These probabilities are computed as
$$
P(SP) = \frac{P(SP \cap s)}{P(\text{Result is published})} = \frac{p_{sp} \, [1 - F_T(t_\alpha \mid \varepsilon, \nu)]}{1 - p_{sp} \, F_T(t_\alpha \mid \varepsilon, \nu)}, \tag{21}
$$
Figure 6. Expected estimated effect size E(d) of a one-sample t-test as a function of true effect size, δ, sample size n, and percentage of studies with selective reports psp (A: psp = 80% and B: psp = 50%). The significance level is α = 0.05.
and
$$
P(PE) = \frac{P(\{PE \cap s\} \cup \{PE \cap ns\})}{P(\text{Result is published})} = \frac{1 - p_{sp}}{1 - p_{sp} \, F_T(t_\alpha \mid \varepsilon, \nu)}. \tag{22}
$$
Note that P(SP) and P(PE) are conditional probabilities and thus differ from psp and 1 − psp, respectively.

One-Sample t-Tests
The parameters ε and ν are defined as before, that is, ε = √n · δ and ν = n − 1. For example, assume psp = 0.8, n = 10, δ = 0.25, and α = 0.05. In this case, one computes P(SP) = 0.42, E(d | SP) = 0.82, P(PE) = 0.58, E(d | PE) = 0.27, and thus E(d) = 0.50, which still is a sizeable overestimation of the true effect size δ = 0.25.8 Figure 6 shows the expected estimated effect size as a function of the true effect size δ and sample size n. A strong bias is present for psp = 0.8, but it diminishes substantially for psp = 0.5. A comparison of these results with those of Figure 2 clearly suggests that the inclusion of nonsignificant results can drastically reduce the bias of the estimated effect size.

Two-Sample t-Tests
The preceding analysis of the mixture scenario is analogous for the two-sample t-test. The parameters ε and ν are defined as before, that is, ε = δ · √(n1 n2/(n1 + n2)) and ν = n1 + n2 − 2. For example, assume psp = 0.8,
n1 = n2 = 10, δ = 0.25, and α = 0.05. For this example, one obtains P(SP) = 0.35, E(d | SP) = 1.04, P(PE) = 0.65, E(d | PE) = 0.26, and thus E(d) = 0.53, which is again clearly larger than the true effect size δ = 0.25. Figure 7 provides a more complete picture of the resulting bias as a function of δ, n = n1 = n2, and psp. Again, the inclusion of nonsignificant effects in the meta-analysis greatly reduces the estimation bias, as a comparison with Figure 3 reveals.
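The numerical examples above can be reproduced directly from Equations 20–22. The following R sketch does this for the one-sample case; it is merely an illustration (the function name and defaults are ours), not the ESM code.

# Illustrative R sketch (ours): expected estimate E(d) under the mixture model
# (Equations 20-22) for a one-sample t-test.
expected_d_mixture <- function(delta, n, p_sp, alpha = 0.05) {
  nu    <- n - 1
  eps   <- sqrt(n) * delta
  t_cut <- qt(1 - alpha, df = nu)
  beta  <- pt(t_cut, df = nu, ncp = eps)            # Type II error probability
  P_SP  <- p_sp * (1 - beta) / (1 - p_sp * beta)    # Equation 21
  P_PE  <- (1 - p_sp) / (1 - p_sp * beta)           # Equation 22
  E_SP  <- integrate(function(t) t * dt(t, df = nu, ncp = eps) / (1 - beta),
                     lower = t_cut, upper = Inf)$value / sqrt(n)   # Expression 8
  E_PE  <- integrate(function(t) t * dt(t, df = nu, ncp = eps),
                     lower = -Inf, upper = Inf)$value / sqrt(n)    # E(T)/sqrt(n)
  E_SP * P_SP + E_PE * P_PE                         # Equation 20
}
# Example from the text: p_sp = .8, n = 10, delta = .25 gives E(d) of about 0.50.
expected_d_mixture(0.25, 10, 0.8)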
Maximum Likelihood Estimation of True Effect Size δ and psp

The maximum likelihood procedure can again be used to estimate the parameters psp and δ within this mixture model of selective publishing. To obtain maximum likelihood estimates, the PDF of the t-values is needed. This PDF describes the theoretical distribution of t-values on which meta-analytic researchers base their analyses in this scenario, and this PDF can be regarded as a probabilistic mixture distribution,
$$
f_T(t) = f_T(t \mid SP) \, P(SP) + f_T(t \mid PE) \, P(PE), \tag{23}
$$
where t-values in a meta-analysis are contributed with probabilities P(SP) and P(PE) from the SP- and PE-paths, respectively (see Figure 5 and Equations 21 and 22). Because the SP-path contributes only significant t-values, the conditional PDF for this path is (cf. Equation 7)
8 Some readers may be surprised that the expected mean E(d | PE) is not equal to but larger than the true effect size 0.25, demonstrating that d = (M − 0)/S is not an unbiased estimate of δ. This bias has already been noted by Hedges (1981); see also Richardson (1996).
Figure 7. Expected estimated effect size E(d) of a two-sample t-test as a function of true effect size δ, sample size n = n1 = n2, and percentage of studies with selective reports psp (A: psp = 80% and B: psp = 50%). The significance level is α = 0.05.
$$
f_T(t \mid SP) = \frac{f_T(t \mid \varepsilon, \nu)}{P(T > t_\alpha)} = \frac{f_T(t \mid \varepsilon, \nu)}{1 - F_T(t_\alpha \mid \varepsilon, \nu)} \, I(t > t_\alpha), \tag{24}
$$
where the index function is I = 1 for t > tα and otherwise zero. Furthermore, because studies taking the PE-path are published whether the results are significant or not, the corresponding PDF is simply
$$
f_T(t \mid PE) = f_T(t \mid \varepsilon, \nu). \tag{25}
$$
Combining the preceding results, the unconditional PDF f_T(t) of the probability mixture is
$$
f_T(t) = \frac{p_{sp} \, [1 - F_T(t_\alpha \mid \varepsilon, \nu)]}{1 - p_{sp} \, F_T(t_\alpha \mid \varepsilon, \nu)} \cdot \frac{f_T(t \mid \varepsilon, \nu)}{1 - F_T(t_\alpha \mid \varepsilon, \nu)} \, I(t > t_\alpha) + \frac{1 - p_{sp}}{1 - p_{sp} \, F_T(t_\alpha \mid \varepsilon, \nu)} \, f_T(t \mid PE). \tag{26}
$$
After simplifying the preceding expression, one obtains
$$
f_T(t) = \frac{f_T(t \mid \varepsilon, \nu) \, \left[1 - p_{sp} \, I(t \le t_\alpha)\right]}{1 - p_{sp} \, F_T(t_\alpha \mid \varepsilon, \nu)}, \tag{27}
$$
with I = 1 for t ≤ tα and otherwise 0. Illustrations of this PDF in Figure 8 nicely show how the effect of selective publication diminishes with increasing δ. This happens because almost all results will be published by virtue of being significant if the effect is large, so almost no studies are hidden in the file drawer. In that case, the distribution of t-values available to the meta-analyst will no longer depend much on psp.

Maximum likelihood estimates (MLEs) of the δ and psp values can be obtained using the PDF of t shown in Equation 27. Let t̃ = (t1, ..., tk) again be the observed t-values that enter a meta-analysis. Then, the MLEs d and p̂sp are the values of δ and psp that maximize
$$
\log L(\tilde{t} \mid \delta, p_{sp}, \tilde{n}_1, \tilde{n}_2, \alpha) = \sum_{i=1}^{k} \log\!\left( f_T(t_i \mid \varepsilon_i, \nu_i) \left[1 - p_{sp} \, I(t_i \le t_\alpha)\right] \right) - \sum_{i=1}^{k} \log\!\left[1 - p_{sp} \, F_T(t_\alpha \mid \varepsilon_i, \nu_i)\right]. \tag{28}
$$
Appendix I in the Electronic Supplementary Material, ESM 3, contains the R code for one- and two-sample t-tests for estimating δ and psp under the assumptions of this mixture model. In addition, this code again provides the standard error of these estimates by numerically evaluating the corresponding observed Fisher information matrix and also confidence intervals for the two parameters; simulations reported next show that the confidence intervals have the desired coverage probabilities except when p̂sp is close to zero or one, in which case a bootstrap procedure is preferable. We illustrate the MLE procedure with the numerical example shown in Table 2. These data were simulated under the assumptions of the mixture model with δ = 0, psp = 0.8, and α = 0.05. A traditional fixed-effect meta-analysis yields d = 0.20, 95% CI [0.11, 0.30], and this outcome would incorrectly indicate the presence of a reliable effect. By contrast, the MLE procedure gives d = 0.09, SE = 0.07, 95% CI [−0.04, 0.22], and the
Figure 8. Probability density functions predicted by the random mixture model. Each panel depicts the PDF for various values of the mixture probability psp = (0, 0.25, 0.50, 0.75, 1). A: δ = 0.25. B: δ = 0.50. The underlying t-distribution is associated with a one-sample t-test and n = 40.
Table 2. Illustrative example for the mixture model. n1 and n2 denote the number of participants in the experimental and control groups, respectively. For each study, column t contains the observed t-value and column d the estimated effect size (see Equation 10 in Hedges & Olkin, 1985, p. 81)

Study   n1   n2     t      d
1       20   30   0.59   0.17
2       30   35   1.84   0.45
3       35   35   1.72   0.41
4       25   20   0.40   0.12
5       60   50   0.10   0.02
6       40   40   0.48   0.11
7       45   50   1.17   0.24
8       35   30   2.18   0.54
9       70   80   0.15   0.02
10      65   60   1.76   0.31
11      30   25   1.90   0.50
12      40   40   0.50   0.11
13      30   25   1.27   0.34
14      20   20   1.71   0.53
15      90   80   1.17   0.18
16      70   75   1.91   0.32
17      50   50   0.08   0.02
18      65   70   0.48   0.08
19      50   50   0.98   0.19
20      45   55   2.28   0.45
likelihood ratio test does not reject the null hypothesis δ = 0, χ² = 2.16, df = 1, p = 0.141. In addition, the MLE of psp is 0.81, SE = 0.14, 95% CI [0.40, 0.96], and the associated likelihood ratio test indicates that the null hypothesis psp = 0 should be rejected, χ² = 4.88, df = 1, p = 0.027.
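For readers who want to see the mechanics of Equation 28, the following R sketch jointly estimates δ and psp by numerical maximization. It is an illustration only (the function name and parameterization are ours); the full routines, including standard errors from the observed Fisher information, are in ESM 1.

# Illustrative R sketch (ours): joint ML estimation of delta and p_sp under the
# mixture model by maximizing the log-likelihood of Equation 28 (two-sample case).
mle_mixture <- function(t, n1, n2, alpha = 0.05) {
  nu    <- n1 + n2 - 2
  a     <- sqrt(n1 * n2 / (n1 + n2))
  t_cut <- qt(1 - alpha, df = nu)
  negLL <- function(par) {
    delta <- par[1]
    p_sp  <- plogis(par[2])                         # keeps p_sp within (0, 1)
    eps   <- a * delta
    dens  <- dt(t, df = nu, ncp = eps) * (1 - p_sp * (t <= t_cut))
    -sum(log(dens)) + sum(log(1 - p_sp * pt(t_cut, df = nu, ncp = eps)))
  }
  fit <- optim(c(0, 0), negLL)                      # Nelder-Mead by default
  c(delta = fit$par[1], p_sp = plogis(fit$par[2]))
}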
It is also instructive to examine the mixture model’s predictions regarding funnel plots. For example, Figure 9 depicts various data sets generated by the model with psp equal to 0.4, 0.6, 0.8, and 1.0. The filled and open circles represent observations from publications that followed the SP-path and the PE-path, respectively. Such data sets are traditionally checked for evidence of publication bias with Egger’s method but the present mixture model provides an alternative approach. The Monte-Carlo simulations contained in Appendix B in the Electronic Supplementary Material, ESM 3, systematically evaluate not only whether the MLE procedure provides reasonable estimates of δ and psp but also the model’s power to detect publication bias compared to Egger’s method.
Application

In a previous section we analyzed the data of the Shanks et al. (2015) meta-analysis with the model assuming that only significant positive results are published. It is also possible, however, to analyze this data set with the mixture model. For this data set, we obtained estimates of d = 0.44, SE = 0.07, and p̂sp = 1.00, SE = 0.00. Bootstrap confidence intervals for the two parameters, based on 3,000 bootstrap samples, were 95% CI [0.30, 0.59] and 95% CI [1.00, 1.00], respectively. In addition, we obtained a highly significant likelihood ratio test, χ² = 25.3, df = 1, p < .001, which is clearly consistent with the conclusion of Shanks et al. (2015) that these data reflect a strong publication bias. Readers interested in an extension of the mixture model to two-sided tests and in the predicted proportion
Figure 9. Funnel plots for data (two-sample t-test) simulated under the assumptions of the mixture model. Each panel differs in the degree of publication bias psp = (0.4, 0.6, 0.8, 1.0). The y-axis represents the sample size n for each group. The x-axis represents the estimate d for each study. The filled and open circles depict these estimates for studies that follow the SP- and PE-paths, respectively. The gray curve within each panel indicates the cutoff value for statistical significance. Values of d larger than this cutoff are significant at α = 0.05. The true effect size was δ = 0.2 for all four simulated meta-analyses, and there were k = 30 studies per meta-analysis.
of significant results are referred to Appendices G and F, respectively, in the Electronic Supplementary Material, ESM 3.
Meta-Analysis With Underreported Significant Results

The models in the previous sections assume that negative results are underreported and significant results are overreported (Sterling et al., 1995). However, under certain circumstances it is also possible that positive results are underreported or even suppressed completely. For example, the tobacco industry has been criticized for actively suppressing its own research that had demonstrated the damaging effects of smoking on health (Bero, 2003; Hirschhorn, 2000). Similarly, developers of a new drug might tend to suppress results suggesting the presence of toxic side effects, thereby erroneously creating the impression that
the drug is safe (e.g., Noury et al., 2015). In the most extreme cases of suppression, there might be no significant results available for meta-analysis. In this section we analyze the publication bias that emerges from underreporting instead of overreporting significant effects. In addition, we show how maximum likelihood procedures can again be employed to estimate the true overall effect in the presence of such reporting biases. Analogously to the previous sections, we first consider the case when researchers would only report negative results and suppress all positive results. After this, we show how the previous mixture model can be adapted to the scenario of underreporting significant results.
Meta-Analysis With Only Nonsignificant Results

The scenario considered in this subsection assumes that the t-values entering a meta-analysis are all statistically
Figure 10. Expected estimated effect size E(d | T ≤ tα) of a one-sample t-test as a function of true effect size δ, sample size n, and significance level α (A: α = 0.05 and B: α = 0.01). Each graph in this plot stops at a certain level of true effect size for numerical reasons. Specifically, numerical integration was not performed when the probability of observing a nonsignificant t-value was less than or equal to β = 0.05.
nonsignificant, that is, less than the critical value tα. Similar computation steps as for the previous analysis apply in this case. In particular, the conditional PDF of T for T ≤ tα is
$$
f_T(t \mid T \le t_\alpha; \varepsilon, \nu) = \frac{f_T(t \mid \varepsilon, \nu)}{F_T(t_\alpha \mid \varepsilon, \nu)}, \qquad t \le t_\alpha. \tag{29}
$$
As before, this conditional distribution can be employed to evaluate the expected effect size. For example, for one-sample t-tests this conditional expected mean of d is computed via
$$
E(d \mid T \le t_\alpha) = \frac{1}{\sqrt{n}} \int_{-\infty}^{t_\alpha} t \, f_T(t \mid T \le t_\alpha; \varepsilon, \nu) \, dt, \tag{30}
$$
as illustrated in Figure 10. As one anticipates, this figure reveals that the true effect size tends to be underestimated when the average effect size is estimated by merely averaging the observed effect sizes. Furthermore, the amount of bias increases considerably with true effect size. Finally, and as one would expect, the bias is somewhat less pronounced when a smaller significance level is used. A similar picture emerges for the two-sample t-test. The maximum likelihood method can again be used to estimate the true effect size when only nonsignificant results are available. In this case the log-likelihood function is given by
$$
\log L(\tilde{t} \mid \delta, \tilde{n}_1, \tilde{n}_2, \alpha) = \sum_{i=1}^{k} \log f_T(t_i \mid \varepsilon_i, \nu_i) - \sum_{i=1}^{k} \log F_T(t_\alpha \mid \varepsilon_i, \nu_i), \tag{31}
$$
Table 3. Illustrative example. n1 and n2 denote the number of participants in the experimental group and control group, respectively. Column t contains the observed t-value and column d the estimated effect size of each study (see Equation 10 in Hedges & Olkin, 1985, p. 81)

Study   n1   n2     t      d
1       34   41   1.20   0.27
2       55   50   0.94   0.18
3       28   19   0.68   0.20
4       32   25   0.19   0.05
5       88   90   0.37   0.06
6       70   40   0.35   0.07
7       33   45   0.69   0.16
8       35   22   0.17   0.05
9       81   93   0.11   0.02
10      62   48   0.67   0.13
11      40   50   0.13   0.03
12      80   80   1.53   0.24
13      25   38   1.57   0.40
14      42   44   0.60   0.13
15      18   22   1.04   0.32
and this likelihood again has to be maximized numerically in order to compute d. For example, assume the nonsignificant t-values of k = 15 studies provided in Table 3. All studies used two-sample tests, and it is assumed that t-values significant at a level of α = 0.05 would be suppressed. For this example, the MLE is d = 0.15, SE = 0.07, and 95% CI [0.02, 0.29]. The likelihood ratio test is significant, χ² = 5.42, df = 1,
p = 0.020, indicating that the true effect size δ is reliably larger than zero. By contrast, the traditional fixed-effect meta-analysis (Hedges & Olkin, 1985, Chapter 6) yields a weighted estimate of 0.08 for δ with SE = 0.05 and 95% CI [−0.02, 0.19]. Note that the MLE is distinctly larger than this traditional estimate. Clearly, if significant t-values are suppressed, the resulting observed d-values estimated from the reported t-values will tend to underestimate the true effect size, since d is monotonically related to t. The larger SE of the MLE is presumably due to the generally greater uncertainty involved in estimating δ when only nonsignificant t-values are available, because in that situation there is uncertainty about the number and values of significant t-values that have been excluded. We again performed Monte-Carlo simulations to assess the efficiency of this MLE procedure, which again revealed good statistical properties. For example, the estimates were virtually unbiased, and the estimated confidence interval covered the true parameter in 95% of all cases.
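Estimation for this scenario differs from the earlier sketch for Equation 18 only in the direction of the truncation: the likelihood now conditions on T ≤ tα (Equation 31). A minimal R sketch of the changed term (ours, not the ESM code):

# Illustrative R sketch (ours): negative log-likelihood of Equation 31, where
# only nonsignificant two-sample t-values are available.
negLL_nonsig_only <- function(delta, t, n1, n2, alpha = 0.05) {
  nu    <- n1 + n2 - 2
  eps   <- sqrt(n1 * n2 / (n1 + n2)) * delta
  t_cut <- qt(1 - alpha, df = nu)
  -sum(dt(t, df = nu, ncp = eps, log = TRUE)) +
    sum(pt(t_cut, df = nu, ncp = eps, log.p = TRUE))   # log F_T(t_alpha | ...)
}
# Minimize over delta, e.g., with optimize(), to obtain the MLE.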
Meta-Analysis With Significant and Nonsignificant Results

As in the previous mixture model, we consider two paths to scientific publishing, SP and PE. An analogous derivation as for the previous mixture model shows that the unconditional PDF of the t-values for this model is given by
Figure 11. Probability density functions predicted by the random mixture model for underreporting significant results. Each panel depicts the density function for various values of the mixture probability psp = (0, 0.25, 0.50, 0.75, 1). A: δ = 0.20. B: δ = 0.40. The underlying t-distribution is associated with a one-sample t-test and n = 40.
$$
f_T(t) = \frac{f_T(t \mid \varepsilon, \nu) \, \left[1 - p_{sp} \, I(t > t_\alpha)\right]}{1 - p_{sp} \, [1 - F_T(t_\alpha \mid \varepsilon, \nu)]}, \tag{32}
$$
with I = 1 for t > tα and otherwise 0. This function is illustrated in Figure 11; it shows that the distorting effects of selective publishing are larger when the true effect size increases. Estimates of δ and psp are obtained by numerically maximizing the log-likelihood function of this mixture model, which is given by
$$
\log L(\tilde{t} \mid \delta, p_{sp}, \tilde{n}_1, \tilde{n}_2, \alpha) = \sum_{i=1}^{k} \log\!\left( f_T(t_i \mid \varepsilon_i, \nu_i) \left[1 - p_{sp} \, I(t_i > t_\alpha)\right] \right) - \sum_{i=1}^{k} \log\!\left\{1 - p_{sp} \, [1 - F_T(t_\alpha \mid \varepsilon_i, \nu_i)]\right\}. \tag{33}
$$
We applied this model to the data from a recent meta-analysis by Vadillo, Konstantinidis, and Shanks (2015).9 These authors were concerned about possible underreporting of significant effects in research investigating subliminal perception. In this area, researchers typically infer the existence of subliminal perception from a combination of significant effects on a nonconscious measure (e.g., priming or psychophysiological responses) together with conscious discrimination performance that is not significantly above chance (e.g., hit rate does not significantly differ from the false alarm rate). Thus, an obvious question is whether
9 We thank these authors for sharing their data with us and for providing much helpful information.
researchers tend to underreport significant results for conscious discrimination performance, perhaps adjusting display conditions until performance is not significantly different from chance. If so, the estimate of psp should be larger than 0. From the data set of this meta-analysis, we included all independent studies (k = 79) and used Equation 33 to estimate δ and psp. For δ the results were d = 0.28, SE = 0.04, 95% CI [0.20, 0.35], χ² = 97.9, df = 1, and p < 0.001. Furthermore, for psp the results were p̂sp = 0.00, SE = 0.33, 95% CI [0.00, 1.00], χ² = 0.0, df = 1, and p = 1. As mentioned above, when p̂sp is very close to 0 or 1, the percentile-bootstrap procedure should be used to estimate confidence intervals with at least B = 3,000 bootstrap samples (Hogg, McKean, & Craig, 2005). Thus, for each bootstrap sample, the corresponding estimates of δ and psp were computed with Equation 33. For effect size δ, the average bootstrap estimate was 0.28 and 95% CI [0.21, 0.35] and thus virtually identical to the above results of the MLE procedure. This result supports Vadillo et al.'s (2015) conclusions that true discrimination performance in these experiments is above chance and that nonsignificant results are probably false negatives due to insufficient statistical power. Furthermore, the mean bootstrap estimate for psp was 0.04 and 95% CI [0.00, 0.36]. This outcome is consistent with the notion that researchers did not, or at least not on a large scale, suppress significant results in order to promote the hypothesis that their data demonstrate subliminal processing. The fact that many (i.e., 25) of these k = 79 studies reported significant effects strengthens this conclusion.10 As in the previous section, we again conducted extensive simulations to study the statistical properties of p̂sp and d. The maximum likelihood estimator d again revealed excellent properties, with no indication of bias and a correct confidence interval coverage probability. For estimation of psp, it is recommended that k should be at least 20.
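The percentile bootstrap mentioned above is straightforward to implement by resampling studies with replacement and refitting the model to each bootstrap sample. A minimal R sketch (ours; it assumes a fitting function such as the mle_mixture() sketch given earlier, or its Equation-33 analog):

# Illustrative R sketch (ours): percentile-bootstrap confidence intervals for
# the two model parameters, obtained by resampling studies with replacement.
bootstrap_ci <- function(t, n1, n2, fit_fun, B = 3000, probs = c(0.025, 0.975)) {
  k   <- length(t)
  est <- replicate(B, {
    idx <- sample.int(k, replace = TRUE)   # resample studies with replacement
    fit_fun(t[idx], n1[idx], n2[idx])
  })
  apply(est, 1, quantile, probs = probs)   # one percentile CI per parameter
}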
Discussion

Meta-analysis can yield an especially reliable estimate of an effect size by combining the results of many relevant studies. However, in many research areas, studies that report positive effects (i.e., relatively large and thus statistically significant effects in the expected direction) are more likely to find their way into the literature and thus to be included in a meta-analysis than studies that report nonsignificant effects or effects in the wrong direction (Dickersin, 2005).
If a meta-analysis is based on such a selective subset from all relevant studies, the resulting estimate will tend to overestimate the true effect (publication bias). In the extreme case, such a selective subset may mimic a real effect even when such an effect is not real. It has been recognized for a long time that the result of a meta-analysis can be biased (for a review, see Dickersin, 2005), although methods for dealing with this bias have been developed more recently. As reviewed in the Introduction, an early approach was Rosenthal's Fail-Safe N (Rosenthal, 1979), followed by other nonparametric methods such as the funnel plot and cumulative meta-analysis. Model-based approaches have supplemented these nonparametric methods (see Hedges & Vevea, 2005; Jin et al., 2015) and incorporated assumptions about how a subset of all relevant studies is selected for inclusion in a meta-analysis. Based on these assumptions, it is then possible to estimate the true effect size from the selected subset of studies and to use these models for sensitivity analyses.

In this paper, we pursued this model-based approach further, extending the formal work of various researchers (e.g., Hedges, 1984, 1992; Hedges & Olkin, 1985; Iyengar & Greenhouse, 1988; Rust et al., 1990). First, we assessed the bias when only studies with significant positive results enter a meta-analysis. Our analysis revealed that the bias under this scenario can be very large, which to our knowledge has so far been examined exclusively by computer simulations (Simonsohn et al., 2014b). The bias is especially large when the sample sizes are small and the true effect sizes are less than about δ = 0.75. An overestimation of 200% is quite possible under this scenario. We then described a maximum likelihood procedure to estimate the true effect size for one-sided t-tests. This procedure also allows one to compute a standard error estimate and thus a confidence interval for the true effect size. Furthermore, we also provided a likelihood ratio test for evaluating the null hypothesis δ = 0. Monte-Carlo simulations were employed to evaluate this procedure, and the results of these simulations verified its good accuracy.

Second, we also modeled a scenario with two paths to publication – an approach that is similar to the one suggested by Rust et al. (1990). Along one path (SP, selective publishing), only studies with positive results are published, whereas negative results disappear in the file drawer. Along a second path (PE, publish everything), all results are published, regardless of statistical significance. Thus, the simple model is a special case of this mixture model in which all studies would take the SP-path, that is, psp = 1. We derived explicit mathematical expressions for computing the resulting bias, and it is also possible to estimate
10 We also performed a standard random-effects meta-analysis for these k = 79 studies. This analysis yielded a mean d of 0.27 and 95% CI [0.20, 0.33], a result that is very similar to the one reported by Vadillo et al. (2015, p. 91).
Figure 12. Lower parts (0.0 ≤ p ≤ 0.1) of prototypical p-curves computed from the assumptions of the mixture model. The computations were based on psp = 0.6, α = 0.05, and one-sided two-sample t-tests with n = 20 participants per group. Note the discontinuous drop at α = 0.05.
the proportion of studies in the file drawer within the framework of this model. The bias under this scenario tends to be smaller than the bias under the earlier scenario in which only positive results enter a meta-analysis. Moreover, the bias increases with the probability that the publication process follows the SP-path. Again, the bias is largest for small sample sizes and when the true effect is small. A maximum likelihood procedure for this model enables one to estimate the true effect size δ and in addition the probability psp for taking the SP-path, along with the corresponding standard errors of these estimates. Monte-Carlo simulations were again employed to check the accuracy of this procedure. It turned out that the estimates for the true effect size were surprisingly good and the estimates for psp were often acceptable even for meta-analyses with only k = 20 studies. Moreover, the model appeared to perform well when it was applied to the real meta-analysis data of Shanks et al. (2015). Like Rust et al. (1990), we developed a likelihood ratio test to examine the null hypothesis psp = 0, that is, to evaluate the presence of a publication bias. The simulations showed that this test is quite powerful when compared to traditional methods for examining the presence of a publication bias.

Finally, we considered scenarios where positive instead of negative results are underreported and adapted the models to these scenarios. The biases that are involved in underreporting positive results were considered, and the data of a real meta-analysis (Vadillo et al., 2015) were used to illustrate this approach.

The mixture model proposed in this paper is consistent with the observed shapes of p-curves, which have recently received considerable attention (Bruns & Ioannidis, 2016; Head, Holman, Lanfear, Kahn, & Jennions, 2015; Kelley
The mixture model proposed in this paper is consistent with the observed shapes of p-curves, which have recently received considerable attention (Bruns & Ioannidis, 2016; Head, Holman, Lanfear, Kahn, & Jennions, 2015; Kelley & Kelley, 2015). These curves focus on the distribution across studies of the reported p-values (Simonsohn et al., 2014a; Ulrich & Miller, 2017). Under the null hypothesis, p-values are uniformly distributed in the interval between zero and one (Hung, O'Neill, Bauer, & Köhne, 2012). When studies assess true effects, however, the curve is skewed to the right, and the skewness increases with the power of the test (Simonsohn et al., 2014a). Looking at distributions of p-values reported in the literature, Masicampo and Lalande (2012) and de Winter and Dodou (2015) observed a prominent drop in the frequency of p-values just beyond the critical significance level of 0.05. Specifically, p-values between 0.051 and 0.059 were clearly less frequent than p-values between 0.041 and 0.049 (e.g., see Figure 23 in de Winter & Dodou, 2015). This abrupt step has been attributed to questionable research practices including publication bias (Masicampo & Lalande, 2012; de Winter & Dodou, 2015), although it appears that questionable research practices such as data-monitoring cannot explain this drop in the p-curve (see Lakens, 2015b); as noted by Lakens (2015b), the drop appears to be in line with publication bias. Figure 12 shows prototypical p-curves computed from the assumptions of the present mixture model with psp = 0.6 and two different values of δ. This figure demonstrates that the mixture model can account for the observed drop in empirical p-curves and thus supports Lakens' (2015a, 2015b) claim that this drop is a sign of publication bias rather than of p-hacking.
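The computation behind Figure 12 can be reproduced in a few lines. For a one-sided two-sample t-test, the density of the p-value under a true effect δ is the ratio of the noncentral to the central t density evaluated at the t value corresponding to p, and the mixture model reweights this density by the publication rule, which produces the discontinuity at α. The following R sketch is an illustrative reconstruction of such a prototypical p-curve and is not taken from the ESM scripts:

# Density of the reported p-value under the mixture model for a one-sided
# two-sample t-test with n participants per group (cf. Figure 12).
p_curve_mixture <- function(p, delta, n, psp, alpha = 0.05) {
  df  <- 2 * n - 2
  ncp <- delta * sqrt(n / 2)
  t_p <- qt(1 - p, df)                      # t value corresponding to p
  f_p <- dt(t_p, df, ncp) / dt(t_p, df)     # p-value density under delta
  w   <- ifelse(p < alpha, 1, 1 - psp)      # publication weight: 1 or 1 - psp
  power <- pt(qt(1 - alpha, df), df, ncp, lower.tail = FALSE)
  f_p * w / (power + (1 - psp) * (1 - power))
}

p <- seq(0.001, 0.10, by = 0.001)
plot(p, p_curve_mixture(p, delta = 0.3, n = 20, psp = 0.6), type = "l",
     xlab = "p-value", ylab = "Probability density of p-value")
lines(p, p_curve_mixture(p, delta = 0.1, n = 20, psp = 0.6), lty = 2)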
Figure 13. Expected estimated correlation E(r̂ | R > r_α) as a function of the true correlation ρ, the sample size n, and the significance level α (A: α = 5%; B: α = 1%). Both panels plot the expected estimated correlation coefficient against the true correlation coefficient for several sample sizes.
Like previous work using the model-based approach to publication bias (e.g., Hedges, 1984, 1992; Iyengar & Greenhouse, 1988), we focused on t-tests. Although meta-analyses are frequently based on results from t-tests, other statistics such as correlation coefficients and odds ratios are also often summarized in such analyses (Borenstein et al., 2009; Rothstein et al., 2005). The present model framework, however, can be extended to other tests if the PDF of the relevant test statistic is available. For example, the exact PDF f_R(r | ρ, n) of the observed Pearson correlation coefficient r as a function of sample size n and true correlation ρ is known (Fisher, 1915; Johnson, Kotz, & Balakrishnan, 1994). A computation analogous to that of Equation 8 may be used to obtain the expected correlation coefficient for situations in which only significant correlations are reported. In this case, the expected observed correlation r̂ is computed as

$$
E(\hat{r} \mid R > r_\alpha) \;=\; \frac{\int_{r_\alpha}^{1} r \, f_R(r \mid \rho, n)\, dr}{\int_{r_\alpha}^{1} f_R(r \mid \rho, n)\, dr}, \tag{34}
$$
where r_α is the critical correlation coefficient of the statistical test under the null hypothesis with α = 0.05. Figure 13 shows the results of these computations and again demonstrates the large bias that can emerge when researchers only report significant results. In a similar fashion, the remaining equations could be adapted to compute the bias under the mixture model and also to estimate the true correlation from a sample of results limited by publication bias. Thus, the present analytical framework for the T statistic can in principle be extended to other statistics, although the computational burden may become heavy and extreme caution may be required to ensure that numerical problems do not distort the results.
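The exact PDF of r in Equation 34 is somewhat cumbersome to evaluate, but the selection effect it describes can also be approximated by simulation: draw bivariate normal samples with a given true correlation ρ, keep only the correlations that reach one-sided significance, and average them. The following R sketch is an illustrative approximation of E(r̂ | R > r_α); the function name and simulation settings are assumptions rather than part of the original analysis:

# Monte-Carlo approximation of E(r_hat | R > r_alpha): the expected reported
# correlation when only significant positive correlations are published
# (one-sided test, alpha = .05).
expected_r_given_sig <- function(rho, n, alpha = 0.05, reps = 5e4) {
  t_crit <- qt(1 - alpha, df = n - 2)
  r_crit <- t_crit / sqrt(t_crit^2 + n - 2)   # critical correlation r_alpha
  r <- replicate(reps, {
    x <- rnorm(n)
    y <- rho * x + sqrt(1 - rho^2) * rnorm(n) # bivariate normal with cor = rho
    cor(x, y)
  })
  mean(r[r > r_crit])                         # average over "published" r values
}

set.seed(1)
expected_r_given_sig(rho = 0.1, n = 20)  # far above the true rho = 0.1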
The MLE procedures considered in this paper assume that the true effect does not vary across the studies that enter into a meta-analysis, a major assumption of all fixed-effect models (Borenstein et al., 2009; Hedges & Olkin, 1985). This assumption, however, may be unrealistic in many cases (see Borenstein et al., 2009; McShane & Böckenholt, 2014; Miller & Schwarz, 2011). By contrast, random-effects models assume that the effect size varies randomly across studies (Borenstein et al., 2009; Hedges & Olkin, 1985). In order to assess the robustness of the present MLE procedures, we conducted additional Monte-Carlo simulation studies in which the true effect size for each single study was randomly drawn from a normal distribution with mean μδ and variance τ². These simulations revealed that the model is quite robust in many cases unless τ² is relatively large (see Appendix C in the Electronic Supplementary Material, ESM 3). Hence, further work is needed to generalize the present models to meta-analyses that involve considerable between-study variability. The present work provides the fundamental concepts for such a generalization.

The present models also assume that positive significant results are always published, and this assumption has also been embodied in previous selection models for correcting and detecting publication bias with two-sided tests (e.g., Hedges, 1984; Iyengar & Greenhouse, 1988). As mentioned in the Introduction, however, reviewers and authors may consider results as more conclusive when these results are strongly significant (Rosenthal & Gaito, 1963; Ulrich & Miller, 2017). Accordingly, it seems reasonable to assign larger weights to smaller levels of significance (e.g., Hedges & Vevea, 2005).
For example, in an extended version of the mixture model, the probability of publication, P(publish | p), may depend on the significance level p in the following way (cf. Hedges & Vevea, 2005, p. 164):
$$
P(\mathrm{publish} \mid p) \;=\;
\begin{cases}
1.00, & 0.000 \le p < 0.005\\
0.99, & 0.005 \le p < 0.010\\
0.90, & 0.010 \le p < 0.050\\
0.50, & 0.050 \le p < 0.100\\
0.30, & 0.100 \le p \le 1.000
\end{cases} \tag{35}
$$
rather than being independent of the significance level as in the basic version of the mixture model, that is,
$$
P(\mathrm{publish} \mid p) \;=\;
\begin{cases}
1.00, & 0.000 \le p < 0.050\\
1 - p_{\mathrm{sp}}, & 0.050 \le p \le 1.000
\end{cases} \tag{36}
$$
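To make the contrast between the two publication rules explicit, the following R sketch (illustrative only; the plotting choices are not from the article) implements the graded weight function of Equation 35 and the step function of Equation 36:

# Publication probability as a function of the p-value:
# graded weights (Equation 35) versus the basic step rule (Equation 36).
p_publish_graded <- function(p) {              # Equation 35
  ifelse(p < 0.005, 1.00,
  ifelse(p < 0.010, 0.99,
  ifelse(p < 0.050, 0.90,
  ifelse(p < 0.100, 0.50, 0.30))))
}

p_publish_step <- function(p, psp = 0.6) {     # Equation 36
  ifelse(p < 0.05, 1, 1 - psp)
}

p <- seq(0, 1, by = 0.001)
plot(p, p_publish_graded(p), type = "s", ylim = c(0, 1),
     xlab = "p-value", ylab = "P(publish | p)")
lines(p, p_publish_step(p), type = "s", lty = 2)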
If the probability of publication varied gradually as in Equation 35 but the basic version of the mixture model were used to estimate psp, the estimates would no longer be based on a realistic model of the actual situation. Nevertheless, the model might still indicate the presence of publication bias and produce realistic estimates of the effect size δ. In order to examine the behavior of the basic mixture model in such a scenario, we conducted additional simulations (Appendix D in the Electronic Supplementary Material, ESM 3). These simulations indicate that the basic mixture model is quite robust even under this scenario.

In sum, the present work built on previous model-based approaches to the problem of estimating effect sizes in the presence of publication bias (e.g., Hedges, 1984; Iyengar & Greenhouse, 1988; Rust et al., 1990). Two major scenarios were considered. The first scenario addressed the publication bias that arises when researchers mostly tend to publish significant positive results. In one variant of this scenario only significant positive results are published, and in the second variant some nonsignificant results are also published. In either variant, the true effect size tends to be strongly overestimated, which can create the false impression of a large effect, but the MLE approach developed here successfully counteracts that overestimation. The second scenario addressed the situation in which researchers mostly tend to suppress significant positive results. In one variant of this scenario, all significant positive results are suppressed, and in the second variant some significant results are also published. This scenario tends to underestimate effects and can thus create the false impression of a small or absent effect, but the MLE approach counteracts this underestimation.
Naturally, this model-based approach will provide the most accurate estimates in situations where its assumptions are met, but additional simulations provide reassurance that the technique is robust to moderate violations of two key assumptions that are unlikely to be met perfectly in practice (i.e., if the publication probability depends on the p-value or on the sample size in a graded manner). Furthermore, even in situations where it is difficult to tell whether the assumptions are met, the model can still be useful for sensitivity analyses (Borenstein et al., 2009; Duval, 2005). Although procedures to adjust for publication bias are useful tools, they of course cannot protect against p-hacking (Simonsohn et al., 2014a) or fraud.

Acknowledgments
We thank the Alexander von Humboldt Foundation for a research award supporting the second author during the preparation of this manuscript and the German Research Foundation (Graduate School GRK 2277; Statistical Modeling in Psychology, SMiP) for supporting this work. We would also like to thank the anonymous reviewers for their comments on a previous version of this manuscript. The action editor for this article was Michael Bošnjak.

Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at https://doi.org/10.1027/2151-2604/a000319
ESM 1. Text (.txt). R scripts.
ESM 2. Text (.txt). MATLAB code.
ESM 3. Text (.pdf). Appendices.
References
Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28, 1547–1562. https://doi.org/10.1177/0956797617723724
Becker, B. J. (2005). Fail-safe N or file-drawer number. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 111–125). Chichester, UK: Wiley. https://doi.org/10.1002/0470870168.ch7
Begg, C. B., & Berlin, J. A. (1988). Publication bias: A problem in interpreting medical data. Journal of the Royal Statistical Society. Series A (Statistics in Society), 151, 419–463. https://doi.org/10.2307/2982993
Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics, 50, 1088–1101. Retrieved from http://www.jstor.org/stable/2533446
Bero, L. (2003). Implications of the tobacco industry documents for public health and policy. Annual Review of Public Health, 24, 267–288. https://doi.org/10.1146/annurev.publhealth.24.100901.140813
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. Chichester, UK: Wiley. https://doi.org/10.1002/9780470743386
Bruns, S. B., & Ioannidis, J. P. A. (2016). p-curve and p-hacking in observational research. PLoS One, 11, e0149144. https://doi.org/10.1371/journal.pone.0149144
Citkowicz, M., & Vevea, J. L. (2017). A parsimonious weight function for modeling publication bias. Psychological Methods, 22, 28–41. https://doi.org/10.1037/met0000119
Coburn, K. M., & Vevea, J. L. (2015). Publication bias as a function of study characteristics. Psychological Methods, 20, 310–330. https://doi.org/10.1037/met0000046
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. https://doi.org/10.1037/0033-2909.112.1.155
Cooper, H., DeNeve, K., & Charlton, K. (1997). Finding the missing science: The fate of studies submitted for review by a human subjects committee. Psychological Methods, 2, 447–452. https://doi.org/10.1037/1082-989X.2.4.447
Coursol, A., & Wagner, E. E. (1986). Effect of positive findings on submission and acceptance rates: A note on meta-analysis bias. Professional Psychology: Research and Practice, 17, 136–137. https://doi.org/10.1037/0735-7028.17.2.136
de Winter, J. C., & Dodou, D. (2015). A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ, 3, e733. https://doi.org/10.7717/peerj.733
Dickersin, K. (2005). Publication bias: Recognizing the problem, understanding its origins and scope, and preventing harm. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 9–33). Chichester, UK: Wiley.
Driessen, E., Hollon, S. D., Bockting, C. L. H., Cuijpers, P., & Turner, E. H. (2015). Does publication bias inflate the apparent efficacy of psychological treatment for major depressive disorder? A systematic review and meta-analysis of US National Institutes of Health-funded trials. PLoS One, 10, e0137864. https://doi.org/10.1371/journal.pone.0137864
Duval, S. (2005). The trim and fill method. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 127–144). Chichester, UK: Wiley.
Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56, 455–463. https://doi.org/10.1111/j.0006-341x.2000.00455.x
Egger, M., Davey Smith, G., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. British Medical Journal, 315, 629–634. https://doi.org/10.1136/bmj.315.7109.629
Etz, A., & Vandekerckhove, J. (2016). A Bayesian perspective on the reproducibility project: Psychology. PLoS One, 11, 1–13. https://doi.org/10.1371/journal.pone.0149794
Fanelli, D. (2010a). Do pressures to publish increase scientists' bias? An empirical support from US states data. PLoS One, 5, 1–7. https://doi.org/10.1371/journal.pone.0010271
Fanelli, D. (2010b). "Positive" results increase down the hierarchy of the sciences. PLoS One, 5, 1–10. https://doi.org/10.1371/journal.pone.0010068
Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891–904. https://doi.org/10.1007/s11192-011-0494-7
Fisher, R. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10, 507–521. https://doi.org/10.2307/2331838
Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180–1187. https://doi.org/10.3758/s13423-014-0601-x
Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345, 1502–1505. https://doi.org/10.1126/science.1255484
Frazier, T. W., Demaree, H. A., & Youngstrom, E. A. (2004). Meta-analysis of intellectual and neuropsychological test performance in attention-deficit/hyperactivity disorder. Neuropsychology, 18, 543–555. https://doi.org/10.1037/0894-4105.18.3.543
Ginsel, B., Aggarwal, A., Xuan, W., & Harris, I. (2015). The distribution of probability values in medical abstracts: An observational study. BMC Research Notes, 8, 721. https://doi.org/10.1186/s13104-015-1691-x
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1–20. https://doi.org/10.1037/h0076157
Guan, M., & Vandekerckhove, J. (2016). A Bayesian approach to mitigation of publication bias. Psychonomic Bulletin & Review, 23, 74–86. https://doi.org/10.3758/s13423-015-0868-6
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13, e1002106. https://doi.org/10.1371/journal.pbio.1002106
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational and Behavioral Statistics, 6, 107–128. https://doi.org/10.3102/10769986006002107
Hedges, L. V. (1984). Estimation of effect size under nonrandom sampling: The effects of censoring studies yielding statistically insignificant mean differences. Journal of Educational and Behavioral Statistics, 9, 61–85. https://doi.org/10.3102/10769986009001061
Hedges, L. V. (1992). Modeling publication selection effects in meta-analysis. Statistical Science, 7, 246–255. https://doi.org/10.1214/ss/1177011364
Hedges, L. V. (2017). Plausibility and influence in selection models: A comment on Citkowicz and Vevea (2017). Psychological Methods, 22, 42–46. https://doi.org/10.1037/met0000108
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Hedges, L. V., & Vevea, J. L. (1996). Estimating effect size under publication bias: Small sample properties and robustness of a random effects selection model. Journal of Educational and Behavioral Statistics, 21, 299–332. https://doi.org/10.3102/10769986021004299
Hedges, L. V., & Vevea, J. L. (2005). Selection method approaches. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 145–174). Chichester, UK: Wiley.
Hirschhorn, N. (2000). Shameful science: Four decades of the German tobacco industry's hidden research on smoking and health. Tobacco Control, 9, 242–248. https://doi.org/10.1136/tc.9.2.242
Hogg, R. V., McKean, J., & Craig, A. T. (2005). Introduction to mathematical statistics (6th ed.). New Delhi, India: Pearson Prentice Hall.
Hung, J. H. M., O'Neill, R. T., Bauer, P., & Köhne, K. (2012). The behavior of the p-value when the alternative hypothesis is true. Biometrics, 53, 11–22. Retrieved from http://www.jstor.org/stable/2533093
Hyde, J. S., Fennema, E., & Lamon, S. J. (1990). Gender differences in mathematics performance – A meta-analysis. Psychological Bulletin, 107, 139–155. https://doi.org/10.1037//0033-2909.107.2.139
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124. https://doi.org/10.1371/journal.pmed.0020124
Ioannidis, J. P. A. (2014). How to make more published research true. PLoS Medicine, 11, e1001747. https://doi.org/10.1371/journal.pmed.1001747
Ioannidis, J. P. A. (2015). Failure to replicate: Sound the alarm. Cerebrum, cer-12-a-1 (Nov–Dec), 1–12.
Iyengar, S., & Greenhouse, J. B. (1988). Selection models and the file drawer problem. Statistical Science, 3, 133–135. https://doi.org/10.1214/ss/1177013019
Jin, Z. C., Zhou, X. H., & He, J. (2015). Statistical methods for dealing with publication bias in meta-analysis. Statistics in Medicine, 34, 343–360. https://doi.org/10.1002/sim.6342
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. https://doi.org/10.1177/0956797611430953
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous univariate distributions (Vol. 2). New York, NY: Wiley.
Kelley, G. A., & Kelley, K. S. (2015). Evidential value that exercise improves BMI z-score in overweight and obese children and adolescents. BioMed Research International, 2015, 1–5. https://doi.org/10.1155/2015/151985
Kroesbergen, E. H., & Van Luit, J. E. H. (2003). Mathematics interventions for children with special educational needs. Remedial and Special Education, 24, 97–114.
Lakens, D. (2015a). On the challenges of drawing conclusions from p-values just below 0.05. PeerJ, 3, e1142. https://doi.org/10.7717/peerj.1142
Lakens, D. (2015b). What p-hacking really looks like: A comment on Masicampo and LaLande (2012). The Quarterly Journal of Experimental Psychology, 68, 829–832. https://doi.org/10.1080/17470218.2014.982664
Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology, 31, 107–112.
Lau, J., Ioannidis, J. P. A., Terrin, N., Schmid, C. H., & Olkin, I. (2006). Evidence based medicine: The case of the misleading funnel plot. British Medical Journal, 333, 597–600. https://doi.org/10.1136/bmj.333.7568.597
Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.
Lodewijkx, H., Brouwer, B., Kuipers, H., & van Hezewijk, R. (2013). Overestimated effect of Epo administration on aerobic exercise capacity: A meta-analysis. American Journal of Sports Science and Medicine, 1, 17–27. https://doi.org/10.12691/ajssm-1-2-2
Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology, 65, 2271–2279. https://doi.org/10.1080/17470218.2012.711335
McShane, B. B., & Böckenholt, U. (2014). You cannot step into the same river twice: When power analyses are optimistic. Perspectives on Psychological Science, 9, 612–625. https://doi.org/10.1177/1745691614548513
McShane, B. B., Böckenholt, U., & Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11, 730–749. https://doi.org/10.1177/1745691616662243
Miller, J., & Schwarz, W. (2011). Aggregate and individual replication probability within an explicit model of the research process. Psychological Methods, 16, 337–360. https://doi.org/10.1037/a0023347
Molenberghs, P., Cunnington, R., & Mattingley, J. B. (2009). Is the mirror neuron system involved in imitation? A short review and meta-analysis. Neuroscience and Biobehavioral Reviews, 33, 975–980. https://doi.org/10.1016/j.neubiorev.2009.03.010
Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7, 308–313.
Noury, J. L., Nardo, J. M., Healy, D., Jureidini, J., Raven, M., Tufanaru, C., & Abi-jaoude, E. (2015). Restoring Study 329: Efficacy and harms of paroxetine and imipramine in treatment of major depression in adolescence. British Medical Journal, 351, h4320. https://doi.org/10.1136/bmj.h4320
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. https://doi.org/10.1126/science.aac4716
Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7, 531–536. https://doi.org/10.1177/1745691612463401
Peters, J. L., Sutton, A. J., Jones, D. R., Abrams, K. R., & Rushton, L. (2007). Performance of the trim and fill method in the presence of publication bias and between-study heterogeneity. Statistics in Medicine, 26, 4544–4562. https://doi.org/10.1002/sim.2889
Richard, F. D., Bond, C. F., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331–363. https://doi.org/10.1037/1089-2680.7.4.331
Richardson, J. T. E. (1996). Measures of effect size. Behavior Research Methods, Instruments, & Computers, 28, 12–22. https://doi.org/10.3758/BF03203631
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641. https://doi.org/10.1037/0033-2909.86.3.638
Rosenthal, R., & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers. The Journal of Psychology, 55, 33–38. https://doi.org/10.1080/00223980.1963.9916596
Ross, S. M. (1980). Introduction to probability models (2nd ed.). New York, NY: Academic Press.
Rothstein, H. R., Sutton, A. J., & Borenstein, M. (2005). Publication bias in meta-analysis: Prevention, assessment and adjustments. Chichester, UK: Wiley.
Rust, R. T., Lehmann, D. R., & Farley, J. U. (1990). Estimating publication bias in meta-analysis. Journal of Marketing Research, 27, 220–226. Retrieved from http://www.jstor.org/stable/3172848
Shadish, W. R., Navarro, A. M., Matt, G. E., & Phillips, G. (2000). The effects of psychological therapies under clinically representative conditions: A meta-analysis. Psychological Bulletin, 126, 512–529. https://doi.org/10.1037//0033-2909.126.4.512
Shampine, L. F. (2008). Vectorized adaptive quadrature in MATLAB. Journal of Computational and Applied Mathematics, 211, 131–140. https://doi.org/10.1016/j.cam.2006.11.021
Shanks, D. R., Vadillo, M. A., Riedel, B., Clymo, A., Govind, S., Hickin, N., . . . Puhlmann, L. M. C. (2015). Romance, risk, and replication: Can consumer choices and risk-taking be primed by mating motives? Journal of Experimental Psychology: General, 144, 142–158. https://doi.org/10.1037/xge0000116
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). p-Curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143, 534–547. https://doi.org/10.1037/a0033242
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). p-Curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681. https://doi.org/10.1177/1745691614553988
Smart, R. G. (1964). The importance of negative results in psychological research. The Canadian Psychologist, 5, 225–232. https://doi.org/10.1037/h0083036
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance – or vice versa. Journal of the American Statistical Association, 54, 30–34. https://doi.org/10.1080/01621459.1959.10501497
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112. https://doi.org/10.1080/00031305.1995.10476125
Sterne, J. A. C., Becker, B. J., & Egger, M. (2005). The funnel plot. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 75–98). Chichester, UK: Wiley.
Sterne, J. A. C., & Egger, M. (2005). Regression methods to detect publication and other bias in meta-analysis. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 99–110). Chichester, UK: Wiley.
Sutton, A. J. (2005). Evidence concerning the consequences of publication and related biases. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 175–192). Chichester, UK: Wiley.
Sutton, A. J., Song, F., Gilbody, S. M., & Abrams, K. R. (2000). Modelling publication bias in meta-analysis: A review. Statistical Methods in Medical Research, 9, 421–445. https://doi.org/10.1191/096228000701555244
Szucs, D., & Ioannidis, J. P. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15, 1–18. https://doi.org/10.1371/journal.pbio.2000797
Thornton, A., & Lee, P. (2000). Publication bias in meta-analysis: Its causes and consequences. Journal of Clinical Epidemiology, 53, 207–216. https://doi.org/10.1016/S0895-4356(99)00161-4
Ulrich, R., & Miller, J. (2017). Some properties of p-curves, with an application to gradual publication bias. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000125
Vadillo, M. A., Konstantinidis, E., & Shanks, D. R. (2015). Underpowered samples, false negatives, and unconscious learning. Psychonomic Bulletin & Review, 23, 87–102. https://doi.org/10.3758/s13423-015-0892-6
van Assen, M. A. L. M., van Aert, R. C. M., Nuijten, M. B., & Wicherts, J. M. (2014). Why publishing everything is more effective than selective publishing of statistically significant results. PLoS One, 9, e84896. https://doi.org/10.1371/journal.pone.0084896
van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293–309. https://doi.org/10.1037/met0000025
Vevea, J. L., & Hedges, L. V. (1995). A general linear model for estimating effect size in the presence of publication bias. Psychometrika, 60, 419–435. https://doi.org/10.1007/BF02294384
Vevea, J. L., & Woods, C. M. (2005). Publication bias in research synthesis: Sensitivity analysis using a priori weight functions. Psychological Methods, 10, 428–443. https://doi.org/10.1037/1082-989X.10.4.428

Received July 14, 2017
Revision received October 31, 2017
Accepted November 2, 2017
Published online February 2, 2018

Rolf Ulrich
Department of Psychology
University of Tübingen
Schleichstr. 4
72076 Tübingen
Germany
ulrich@uni-tuebingen.de
Call for Papers
“Sustainable Human Development: Challenges and Solutions for Implementing the United Nations’ Goals”
A Topical Issue of the Zeitschrift für Psychologie
Guest Editors: Suman Verma (Panjab University, Chandigarh, India), Anne C. Petersen (University of Michigan, Ann Arbor, MI, USA), and Jennifer E. Lansford (Duke University, Durham, NC, USA)
The 17 Sustainable Development Goals (SDGs) enshrined in the 2030 Agenda for Sustainable Development, adopted at the United Nations in September 2015 by the 193 member states, represent a new global development initiative. Encompassing the three core dimensions of economic, social, and environmental development, the Agenda has become the center of a renewed development framework for countries to address the changing development priorities and the development gaps that previous strategies have been unable to close. This topical issue of the Zeitschrift für Psychologie aims to focus on the major challenges to achieving sustainable human development as well as solutions for articulating development strategies for the achievement of the SDGs. Adopting a holistic approach requires science and research, technology and innovation, policy and action, and international scientific cooperation in designing and enhancing systems that provide solutions to sustainable development challenges, benefit the most vulnerable and marginalized, and leave no one behind.

Psychology plays an important role in sustainable development, from informing evidence-based targets and indicators to assessing progress, testing solutions, and identifying emerging risks and opportunities. The 2030 Agenda provides an opportunity for securing a voice for psychologists in the policy framework. Psychological and social science research demonstrates that social inequalities prevent people from developing their capacities and contributing as productive members of society. Research in psychology indicates that being engaged in decent work promotes psychosocial empowerment by reducing marginalization and poverty. Empowering people to be productive by providing training
about entrepreneurship and income-generating activities, life skills development, and equal access to education and lifelong learning, particularly for youth, is an important pathway to decent work and the achievement of sustainable development. Overall, the multi-stakeholder approach to crafting the global agenda on sustainable development has created multiple opportunities for psychologists to contribute to the deliberations and advocate for the inclusion of psychological principles and solutions to complex global problems.

We invite original or review articles and meta-analyses, shorter research notes, and opinion papers on the following key dimensions of the 2030 Agenda and SDGs. We especially welcome work that advances our current knowledge by addressing new perspectives, theoretical frameworks, and assessment methods related to sustainable human development. The topics covered might include (but are not limited to):
– Poverty and social protection;
– Inequality and development;
– Health, lifelong learning, and global citizenship;
– Conflict and development;
– The role of psychologists in the realization of SDGs: generating scientific knowledge, developing indicators, tracking development and well-being indicators, capacity building.

Interested authors are invited to submit their abstracts on potential papers electronically to guest editor Suman Verma (E-mail suman992003@yahoo.com).

How to submit: Interested authors should submit a letter of intent including: (1) a working title for the manuscript,
(2) names, affiliations, and contact information for all authors, and (3) an abstract of no more than 500 words detailing the content of the proposed manuscript. There is a two-stage submissions process. Initially, interested authors are requested to submit only abstracts of their proposed papers. Authors of the selected abstracts will then be invited to submit full papers. All papers will undergo blind peer review. Deadline for submission of abstracts is April 15, 2018. Deadline for submission of full papers is August 15, 2018. The journal seeks to maintain a short turnaround time, with the final version of accepted papers being due by
November 15, 2018. The topical issue will be published as issue 2 (2019). For additional information, please contact the guest editor.
About the Journal
The Zeitschrift für Psychologie, founded in 1890, is the oldest psychology journal in Europe and the second oldest in the world. One of the founding editors was Hermann Ebbinghaus. Since 2007 it has been published in English and is devoted to publishing topical issues that provide state-of-the-art reviews of current research in psychology. For detailed author guidelines, please see the journal’s website at www.hogrefe.com/j/zfp
Call for Papers
“Advances in HEXACO Personality Research”
A Topical Issue of the Zeitschrift für Psychologie
Guest Editors: Reinout E. de Vries (VU Amsterdam and University of Twente, The Netherlands), Michael C. Ashton (Brock University, St. Catharines, Canada), and Kibeom Lee (University of Calgary, Canada)
Nearly two decades ago, a six-factor model of personality structure – now called the HEXACO model of personality, comprising Honesty-Humility (H), Emotionality (E), Extraversion (X), Agreeableness (A), Conscientiousness (C), and Openness to Experience (O) – was proposed as an alternative to the well-known Big Five or Five-Factor personality model. In the last 15 years, more than 500 studies have been conducted using the HEXACO personality inventory (see the hexaco.org website for an overview). These studies have shown, for instance, that (1) the maximum cross-culturally replicable personality space is best described using six instead of five personality dimensions, (2) the HEXACO model, through its inclusion of the extra Honesty-Humility dimension, is able to explain a range of important criterion variables better than is the Big Five model, (3) the Dark Triad traits, which have often been considered as a supplement to the Big Five traits, are essentially equivalent to the (reversed) HEXACO Honesty-Humility factor, and (4) Honesty-Humility and Openness to Experience stand out as personality dimensions because of their unique relations with two important sociopolitical value dimensions and because people are more likely to assume close friends or partners to be similar on these two dimensions.

In this call for papers, we reach out to researchers who use the HEXACO model, and even to those who may be critical of it, to submit a manuscript to the Zeitschrift für Psychologie to further extend – or to discuss potential gaps in – our knowledge about the HEXACO model and its correlates. Submissions can consist of an original research article, a review article, a short research note (“Research Spotlight”), or an opinion article. All submissions will be judged on relevance to the topic, originality, and scientific rigor by expert peer reviewers.

How to submit: Interested authors should submit a letter of intent including: (1) a working title for the manuscript,
(2) names, affiliations, and contact information for all authors, and (3) an abstract of no more than 500 words detailing the content of the proposed manuscript to one of the guest editors: Reinout E. de Vries (re.de.vries@vu.nl), Michael C. Ashton (mashton@brocku.ca), or Kibeom Lee (kibeom@ucalgary.ca). Authors of the selected abstracts will then be invited to submit full papers. All those papers will undergo blind peer review. Deadline for submission of abstracts is July 15, 2018. Deadline for submission of full papers is November 15, 2018. The journal seeks to maintain a short turnaround time, with the final version of the accepted papers being due by February 28, 2019. The topical issue will be published as issue 3 (2019). For additional information, please contact the guest editors.
About the Journal
The Zeitschrift für Psychologie, founded in 1890, is the oldest psychology journal in Europe and the second oldest in the world. One of the founding editors was Hermann Ebbinghaus. Since 2007 it has been published in English and is devoted to publishing topical issues that provide state-of-the-art reviews of current research in psychology. For detailed author guidelines, please see the journal’s website at www.hogrefe.com/j/zfp/
Instructions to Authors
The Zeitschrift für Psychologie publishes high-quality research from all branches of empirical psychology that is clearly of international interest and relevance, and does so in four topical issues per year. Each topical issue is carefully compiled by guest editors. The subjects being covered are determined by the editorial team after consultation within the scientific community, thus ensuring topicality. The Zeitschrift für Psychologie thus brings convenient, cutting-edge compilations of the best of modern psychological science, each covering an area of current interest. Zeitschrift für Psychologie publishes the following types of articles: Review Articles, Original Articles, Research Spotlights, Horizons, and Opinions.

Manuscript submission: A call for papers is issued for each topical issue. Current calls are available on the journal’s website at www.hogrefe.com/j/zfp. Manuscripts should be submitted as Word or RTF documents by e-mail to the responsible guest editor(s). An article can only be considered for publication in the Zeitschrift für Psychologie if it can be assigned to one of the topical issues that have been announced. The journal does not accept general submissions. Detailed instructions to authors are provided at http://www.hogrefe.com/j/zfp

Copyright Agreement: By submitting an article, the author confirms and guarantees on behalf of him-/herself and any coauthors that he or she holds all copyright in and titles to the submitted contribution, including any figures, photographs, line drawings, plans, maps, sketches and tables, and that the article and its contents do not infringe in any way on the rights of third parties. The author indemnifies and holds harmless the publisher from any third-party claims. The author agrees, upon acceptance of the article for publication, to transfer to the publisher on behalf of him-/herself and any coauthors the exclusive right to reproduce and distribute the article and its contents, both physically and in nonphysical, electronic, and other form, in the journal to which it
has been submitted and in other independent publications, with no limits on the number of copies or on the form or the extent of the distribution. These rights are transferred for the duration of copyright as defined by international law. Furthermore, the author transfers to the publisher the following exclusive rights to the article and its contents:
1. The rights to produce advance copies, reprints, or offprints of the article, in full or in part, to undertake or allow translations into other languages, to distribute other forms or modified versions of the article, and to produce and distribute summaries or abstracts.
2. The rights to microfilm and microfiche editions or similar, to the use of the article and its contents in videotext, teletext, and similar systems, to recordings or reproduction using other media, digital or analog, including electronic, magnetic, and optical media, and in multimedia form, as well as for public broadcasting in radio, television, or other forms of broadcast.
3. The rights to store the article and its content in machine-readable or electronic form on all media (such as computer disks, compact disks, magnetic tape), to store the article and its contents in online databases belonging to the publisher or third parties for viewing or downloading by third parties, and to present or reproduce the article or its contents on visual display screens, monitors, and similar devices, either directly or via data transmission.
4. The rights to reproduce and distribute the article and its contents by all other means, including photomechanical and similar processes (such as photocopying or facsimile), and as part of so-called document delivery services.
5. The right to transfer any or all rights mentioned in this agreement, as well as rights retained by the relevant copyright clearing centers, including royalty rights to third parties.

Online Rights for Journal Articles: Guidelines on authors’ rights to archive electronic versions of their manuscripts online are given in the document “Guidelines on sharing and use of articles in Hogrefe journals” on the journal’s web page at www.hogrefe.com/j/zfp

July 2017
Test development and construction: Current practices and advances “This book is indispensable for all who want an up-to-date resource about constructing valid tests.” Prof. Dr. Johnny R. J. Fontaine, President of the European Association of Psychological Assessment, Faculty of Psychology and Educational Sciences, Ghent University, Belgium
Karl Schweizer / Christine DiStefano (Editors)
Principles and Methods of Test Construction Standards and Recent Advances
(Series: Psychological Assessment – Science and Practice – Vol. 3). 2016, vi + 336 pp. US $69.00 / € 49.95. ISBN 978-0-88937-449-2. Also available as eBook.

This latest volume in the series Psychological Assessment – Science and Practice describes the current state-of-the-art in test development and construction. The past 10–20 years have seen substantial advances in the methods used to develop and administer tests. In this volume many of the world’s leading authorities collate these advances and provide information about current practices, thus equipping researchers and students to successfully construct new tests using the best modern standards and
techniques. The first section explains the benefits of considering the underlying theory when designing tests, such as factor analysis and item response theory. The second section looks at item format and test presentation. The third discusses model testing and selection, while the fourth goes into statistical methods that can find group-specific bias. The final section discusses topics of special relevance, such as multitraitmultimethod analyses and development of screening instruments.
This Hotspots issue showcases some of the vibrant meta-analytic research activity that is underway in psychology, including not only applications of state-of-the-art meta-analytic methods to a steadily increasing number of substantive research fields but also methodological innovations to address challenges of meta-analytic research. It includes a ground-breaking approach that uses latent Dirichlet allocation to identify hotspot research topics of particular significance over time, a meta-analysis investigating the cross-cultural comparability of a well-established psychometric instrument capturing self-esteem, as well as replications and extensions of previous meta-analyses and a methodological contribution on effect size estimation.
Contents include:
How to Identify Hot Topics in Psychology Using Topic Modeling – André Bittermann and Andreas Fischer
The Structure of the Rosenberg Self-Esteem Scale: A Cross-Cultural Meta-Analysis – Timo Gnambs, Anna Scharl, and Ulrich Schroeders
An Update of a Meta-Analysis on the Clinical Outcomes of Deep Transcranial Magnetic Stimulation (DTMS) in Major Depressive Disorder (MDD) – Helena M. Gellersen and Karina Karolina Kedzior
A Meta-Analytic Re-Appraisal of the Framing Effect – Alexander Steiger and Anton Kühberger
Effect Size Estimation From t-Statistics in the Presence of Publication Bias: A Brief Review of Existing Approaches With Some Extensions – Rolf Ulrich, Jeff Miller, and Edgar Erdfelder
Hogrefe Publishing Group Göttingen · Berne · Vienna · Oxford · Paris Boston · Amsterdam · Prague · Florence Copenhagen · Stockholm · Helsinki · Oslo Madrid · Barcelona · Seville · Bilbao Zaragoza · São Paulo · Lisbon www.hogrefe.com
ISBN 978-0-88937-547-5