Michael Bošnjak Nadine Wedderhoff (Editors)
Zeitschrift für Psychologie Founded in 1890 Volume 228 / Number 1 / 2020 Editor-in-Chief Edgar Erdfelder Associate Editors Michael Bošnjak Benjamin E. Hilbig Andrea Kiesel Iris-Tatjana Kolassa Bernd Leplow Steffi Pohl Barbara Schober Birgit Schyns Christiane Spiel Elsbeth Stern
Hotspots in Psychology 2020
Hogrefe OpenMind Open Access Publishing? It’s Your Choice! Your Road to Open Access Authors of papers accepted for publication in any Hogrefe journal can now choose to have their paper published as an open access article as part of the Hogrefe OpenMind program. This means that anyone, anywhere in the world will – without charge – be able to read, search, link, send, and use the article for noncommercial purposes, in accordance with the internationally recognized Creative Commons licensing standards.
The Choice Is Yours 1. Open Access Publication: The final “version of record” of the article is published online with full open access. It is freely available online to anyone in electronic form. (It will also be published in the print version of the journal.) 2. Traditional Publishing Model: Your article is published in the traditional manner, available worldwide to journal subscribers online and in print and to anyone by “pay per view.” Whichever you choose, your article will be peer-reviewed, professionally produced, and published both in print and in electronic versions of the journal. Every article will be given a DOI and registered with CrossRef.
www.hogrefe.com
How Does Hogrefe’s Open Access Program Work? After submission to the journal, your article will undergo exactly the same steps, no matter which publishing option you choose: peer-review, copy-editing, typesetting, data preparation, online reference linking, printing, hosting, and archiving. In the traditional publishing model, the publication process (including all the services that ensure the scientific and formal quality of your paper) is financed via subscriptions to the journal. Open access publication, by contrast, is financed by means of a one-time article fee (€ 2,500 or US $3,000) payable by you the author, or by your research institute or funding body. Once the article has been accepted for publication, it’s your choice – open access publication or the traditional model. We have an open mind!
Michael Bošnjak Nadine Wedderhoff (Editors)
Hotspots in Psychology 2020
Zeitschrift für Psychologie Volume 228 / Number 1 / 2020
© 2020 Hogrefe Publishing Hogrefe Publishing Incorporated and registered in the Commonwealth of Massachusetts, USA, and in Göttingen, Lower Saxony, Germany No part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the publisher. Cover image © istockphoto.com/TommL Printed and bound in Germany ISBN 978-0-88937-574-1 The Zeitschrift für Psychologie, founded by Hermann Ebbinghaus and Arthur König in 1890, is the oldest psychology journal in Europe and the second oldest in the world. Since 2007, it has appeared in English and is devoted to publishing topical issues that provide convenient state-of-the-art compilations of research in psychology, each covering an area of current interest. The Zeitschrift für Psychologie is available as a journal in print and online by annual subscription, and the different topical compendia are also available as individual titles by ISBN.
Editor-in-Chief
Edgar Erdfelder, University of Mannheim, Psychology III, Schloss, Ehrenhof-Ost, 68131 Mannheim, Germany, Tel. +49 621 181-2146, Fax +49 621 181-3997, erdfelder@psychologie.uni-mannheim.de
Associate Editors
Michael Bošnjak, Trier, Germany Benjamin E. Hilbig, Landau, Germany Andrea Kiesel, Freiburg, Germany
Iris-Tatjana Kolassa, Ulm, Germany Bernd Leplow, Halle, Germany Steffi Pohl, Berlin, Germany Barbara Schober, Vienna, Austria
Birgit Schyns, Reims, France Christiane Spiel, Vienna, Austria Elsbeth Stern, Zurich, Switzerland
Editorial Board
Michael Ashton, St. Catharine’s, Canada Daniel M. Bernstein, Surrey, Canada Tom Buchanan, London, UK Mike W.-L. Cheung, Singapore Reinout E. de Vries, Amsterdam, The Netherlands Andrew Gloster, Basel, Switzerland Timo Gnambs, Bamberg, Germany Rainer Greifeneder, Basel, Switzerland Vered Halamish, Ramat Gan, Israel
Moritz Heene, Munich, Germany Suzanne Jak, Amsterdam, The Netherlands Jennifer E. Lansford, Durham, NC, USA Kibeom Lee, Calgary, Canada Tania Lincoln, Hamburg, Germany Alexandra Martin, Wuppertal, Germany Anne C. Petersen, Ann Arbor, MI, USA Ulf-Dietrich Reips, Konstanz, Germany Frank Renkewitz, Erfurt, Germany
Anna Sagana, Maastricht, The Netherlands Melanie Sauerland, Maastricht, The Netherlands Holger Steinmetz, Trier, Germany Monika Undorf, Mannheim, Germany Omer van den Bergh, Leuven, Belgium Suman Verma, Chandigarh, India Nadine Wedderhoff, Trier, Germany
Publisher
Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 999 50 0, Fax +49 551 999 50 425, publishing@hogrefe.com North America: Hogrefe Publishing, 361 Newbury Street, 5th Floor, Boston, MA 02115, USA, Tel. +1 (866) 823 4726, Fax +1 (617) 354 6875, customerservice@hogrefe-publishing.com
Production
Christina Sarembe, Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 999 50 424, Fax +49 551 999 50 425, production@hogrefe.com
Subscriptions
Hogrefe Publishing, Herbert-Quandt-Str. 4, 37081 Göttingen, Germany, Tel. +49 551 999 50 900, Fax +49 551 999 50 998
Advertising/Inserts
Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 999 50 423, Fax +49 551 999 50 425, marketing@hogrefe.com
ISSN
ISSN-L 2151-2604, ISSN-Print 2190-8370, ISSN-Online 2151-2604
Copyright Information
© 2020 Hogrefe Publishing. This journal as well as the individual contributions and illustrations contained within it are protected under international copyright law. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without prior written permission from the publisher. All rights, including translation rights, reserved.
Publication
Published in 4 topical issues per annual volume.
Subscription Prices
Calendar year subscriptions only. Rates for 2020: Institutions - from US $353.00 / € 272.00 (print only; pricing for online access can be found in the journals catalog at hgf.io/journalscatalog); Individuals US $195.00 / €139.00 (all plus US $16.00 / €12.00 shipping & handling; € 6.80 in Germany). Single issue US $49.00 / € 34.95 (plus shipping & handling).
Payment
Payment may be made by check, international money order, or credit card, to Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany. US and Canadian subscriptions can also be ordered from Hogrefe Publishing, 361 Newbury Street, 5th Floor, Boston, MA 02115, USA
Electronic Full Text
The full text of Zeitschrift für Psychologie is available online at www.econtent.hogrefe.com and in PsycARTICLES™.
Abstracting Services
Abstracted/indexed in Current Contents/Social and Behavioral Sciences (CC/S&BS), Social Sciences Citation Index (SSCI), Research Alert, PsycINFO, PASCAL, PsycLit, IBZ, IBR, ERIH, and PSYNDEX. Impact Factor (2018): 1.183
Contents

Editorial
Hotspots in Psychology – 2020 Edition Michael Bošnjak and Nadine Wedderhoff
1
Review Articles
Systematic Review and Network Meta-Analysis of Anodal tDCS Effects on Verbal Episodic Memory: Modeling Heterogeneity of Stimulation Locations Gergely Janos Bartl, Emily Blackshaw, Margot Crossman, Paul Allen, and Marco Sandrini
3
Original Articles
Response Rates in Online Surveys With Affective Disorder Participants: A Meta-Analysis of Study Design and Time Effects Between 2008 and 2019 Tanja Burgard, Michael Bošnjak, and Nadine Wedderhoff
14
Dealing With Artificially Dichotomized Variables in Meta-Analytic Structural Equation Modeling Hannelies de Jonge, Suzanne Jak, and Kees-Jan Kan
25
Assessing the Quality of Systematic Reviews in Healthcare Using AMSTAR and AMSTAR2: A Comparison of Scores on Both Scales Karina Karolina De Santis and Ilkay Kaplan
36
Power-Enhanced Funnel Plots for Meta-Analysis: The Sunset Funnel Plot Michael Kossmeier, Ulrich S. Tran, and Martin Voracek
43
Addressing Publication Bias in Meta-Analysis: Empirical Findings From Community-Augmented Meta-Analyses of Infant Language Development Sho Tsuji, Alejandrina Cristia, Michael C. Frank, and Christina Bergmann
50
Call for Papers

"Web-Based Research in Psychology": A Topical Open Access Issue of the Zeitschrift für Psychologie Guest Editors: Ulf-Dietrich Reips and Tom Buchanan
62
"Dark Personality in the Workplace": A Topical Issue of the Zeitschrift für Psychologie Guest Editors: Birgit Schyns, Susanne Braun, and Pedro Neves
63
Editorial

Hotspots in Psychology – 2020 Edition

Michael Bošnjak¹,² and Nadine Wedderhoff²

¹ ZPID – Leibniz Institute for Psychology Information, Germany
² Department of Psychology, University of Trier, Germany

Abstract: This editorial gives a brief introduction to the six articles included in the fourth "Hotspots in Psychology" topical issue of the Zeitschrift für Psychologie. The format is devoted to systematic reviews and meta-analyses in research-active fields that have generated a considerable number of primary studies. The common denominator is the research synthesis nature of the included articles, not a specific psychological topic or theme that all articles have to address. Moreover, methodological advances in research synthesis methods relevant for any subfield of psychology are addressed. Comprehensive supplemental material to the articles can be found in PsychArchives (https://www.psycharchives.org).
This fourth edition of the "Hotspots in Psychology" format is devoted to systematic reviews and meta-analyses in research-active (i.e., hotspot) fields that have generated a considerable number of primary studies. In line with the first three "Hotspots in Psychology" topical issues (Bošnjak & Erdfelder, 2018; Bošnjak & Gnambs, 2019; Erdfelder & Bošnjak, 2016), the common denominator is the research synthesis nature of the articles, not a specific psychological topic or theme addressed by all of them. Moreover, these articles address methodological advances in research synthesis methods relevant for any subfield of psychology.

Substantive issues using meta-analysis are addressed by the following two papers. Bartl, Blackshaw, Crossman, Allen, and Sandrini (2020) applied a network meta-analytic approach to analyze the effect of transcranial direct current stimulation (tDCS) on memory; the focal point of the study was the analysis of different electrode locations. Although memory enhancement effects of tDCS have frequently been suggested in the empirical literature, the authors analyzed results from 23 experiments and concluded that there is no conclusive evidence of a modulation of verbal memory retrieval based on electrode location. In view of declining response rates in public opinion surveys, Burgard, Bošnjak, and Wedderhoff (2020) meta-analytically explored whether a similar trend also applies to online surveys in clinical psychology, specifically in studies on depression or general anxiety disorders. The authors estimated a mean response rate of approximately 43%, which is indeed declining over time. Additional moderators of the response rate were survey length and the type of invitation used. As expected on the basis of reinforcement sensitivity theory, no effect of incentives on survey participation in this specific group (scoring high on neuroticism) could be observed.

Methodological advances in the area of research synthesis methods are the focus of the following four papers. De Jonge, Jak, and Kan (2020) analyzed the performance of a meta-analytic structural equation modeling (MASEM) approach when variables in primary studies have been artificially dichotomized. More specifically, they report the results of two simulation studies comparing the impact of including point-biserial versus biserial correlations in a full and a partial mediation model. Overall, the biserial correlation performs better, and the authors recommend its use. Assessing the quality of a systematic review is a crucial step for researchers as well as for practitioners and social and political decision makers. Although many instruments have been developed to assess quality-related aspects of systematic reviews, two commonly applied critical appraisal instruments are AMSTAR and its revised version, AMSTAR2. To analyze differences between these two instruments, quality estimates for ten systematic reviews are assessed and compared in De Santis and Kaplan (2020). Although ratings on AMSTAR and AMSTAR2 are similar for most of the items, the quality of the items on the revised version has substantially improved. Thus, compared to its previous version, AMSTAR2 may be seen as the preferred instrument for quality assessment. Kossmeier, Tran, and Voracek (2020) have developed a graphical approach to depict study-level statistical power in the context of meta-analysis. Sunset (power-enhanced) funnel plots are recommended to visualize statistical power for assessing the inferential value of a set of studies. First applications of the sunset funnel plot to two published
meta-analyses from medicine and psychology are presented, and software to create this variation of the funnel plot is provided via a tailored R function. Tsuji, Cristia, Frank, and Bergmann (2020) addressed two questions about unpublished data using MetaLab, a collection of community-augmented meta-analyses focused on developmental psychology. First, the authors assessed to what extent MetaLab datasets include gray literature and by what search strategies it is unearthed; they find that, on average, 11% of data points come from unpublished literature. Second, the authors analyzed the effect of including versus excluding unpublished literature on estimates of effect size and publication bias, and found that this decision does not affect the outcomes. Overall, we very much hope that the contributions to this fourth "Hotspots in Psychology" issue stimulate further research and contribute to new and ongoing scientific discussions. We would like to point readers to the comprehensive sets of supplemental material facilitating reproduction and replication, which can be found in PsychArchives (https://www.psycharchives.org).
References

Bartl, G. J., Blackshaw, E., Crossman, M., Allen, P., & Sandrini, M. (2020). Systematic review and network meta-analysis of anodal tDCS effects on verbal episodic memory: Modelling heterogeneity of stimulation locations. Zeitschrift für Psychologie, 228(1), 3–13. https://doi.org/10.1027/2151-2604/a000396
Bošnjak, M., & Erdfelder, E. (2018). Hotspots in Psychology – 2018 Edition. Zeitschrift für Psychologie, 226(1), 1–2. https://doi.org/10.1027/2151-2604/a000323
Bošnjak, M., & Gnambs, T. (2019). Hotspots in Psychology – 2019 Edition. Zeitschrift für Psychologie, 227(1), 1–3. https://doi.org/10.1027/2151-2604/a000350
Burgard, T., Bošnjak, M., & Wedderhoff, N. (2020). Response rates in online surveys with affective disorder participants: A meta-analysis of study design and time effects between 2008 and 2019. Zeitschrift für Psychologie, 228(1), 14–24. https://doi.org/10.1027/2151-2604/a000394
De Jonge, H., Jak, S., & Kan, K. J. (2020). Dealing with artificially dichotomized variables in meta-analytic structural equation modeling. Zeitschrift für Psychologie, 228(1), 25–35. https://doi.org/10.1027/2151-2604/a000395
De Santis, K. K., & Kaplan, I. (2020). Assessing the quality of systematic reviews in healthcare using AMSTAR and AMSTAR2: A comparison of scores on both scales. Zeitschrift für Psychologie, 228(1), 36–42. https://doi.org/10.1027/2151-2604/a000397
Erdfelder, E., & Bošnjak, M. (2016). Hotspots in Psychology: A new format for special issues of the Zeitschrift für Psychologie. Zeitschrift für Psychologie, 224(3), 141–144. https://doi.org/10.1027/2151-2604/a000249
Kossmeier, M., Tran, U. S., & Voracek, M. (2020). Power-enhanced funnel plots for meta-analysis: The sunset funnel plot. Zeitschrift für Psychologie, 228(1), 43–49. https://doi.org/10.1027/2151-2604/a000392
Tsuji, S., Cristia, A., Frank, M. C., & Bergmann, C. (2020). Addressing publication bias in meta-analysis: Empirical findings from community-augmented meta-analyses of infant language development. Zeitschrift für Psychologie, 228(1), 50–61. https://doi.org/10.1027/2151-2604/a000393

Published online March 31, 2020

Michael Bošnjak
ZPID – Leibniz Institute for Psychology Information
Universitätsring 15
54296 Trier
Germany
mb@leibniz-psychology.org
Review Article
Systematic Review and Network Meta-Analysis of Anodal tDCS Effects on Verbal Episodic Memory: Modeling Heterogeneity of Stimulation Locations

Gergely Janos Bartl¹, Emily Blackshaw¹, Margot Crossman¹, Paul Allen¹,², and Marco Sandrini¹

¹ Department of Psychology, University of Roehampton, London, UK
² Department of Psychosis Studies, Institute of Psychiatry, Psychology and Neuroscience, King's College London, UK
Abstract: There is growing interest in transcranial direct current stimulation (tDCS), a non-invasive brain stimulation technique, as an intervention to improve memory. To evaluate the relative efficacy of tDCS based on the location of the anodal electrode, we conducted a systematic review examining the effect of stimulation applied during encoding on subsequent verbal episodic memory in healthy adults. We performed a network meta-analysis of 20 studies (23 experiments) with N = 978 participants. Left ventrolateral prefrontal and temporo-parietal sites appeared most likely to enhance episodic memory, although the significant effects were based on findings from single studies only. We did not find evidence of verbal retrieval enhancement for tDCS versus sham stimulation in comparisons informed by more than one experimental paper. More frequent replication efforts and stricter reporting standards may improve the quality of evidence and allow more precise estimation of population-level effects of tDCS. Keywords: verbal memory, tDCS, prefrontal cortex, parietal cortex, network meta-analysis
Episodic memory is the long-term memory for specific events or episodes (Tulving, 1983). In addition to medial temporal lobe structures of the brain, it has been shown that lateral prefrontal and temporo-parietal cortices (PFC and TPC, respectively) also contribute to episodic memory function (Spaniol et al., 2009; Dickerson & Eichenbaum, 2010; Manenti, Cotelli, Robertson, & Miniussi, 2012; Szczepanski & Knight, 2014; Rugg & King, 2018). This type of long-term memory declines with age (Rönnlund, Nyberg, Bäckman, & Nilsson, 2005), a process accelerated in pathological conditions such as amnestic mild cognitive impairment (aMCI) and Alzheimer's disease (AD). Over the last decade, there has been growing interest in the use of non-invasive brain stimulation techniques as a tool to enhance memory (Sandrini & Cohen, 2013, 2014), with a view to possible future applications in pathological aging. Among these techniques is transcranial direct current stimulation (tDCS), a safe and well-tolerated neuromodulation approach (Dayan, Censor, Buch, Sandrini, & Cohen, 2013). tDCS uses a portable device to deliver a constant low-intensity current (1–2 mA) to the cortex through the cranium via surface electrode pads (anode and cathode), typically for up to 30 minutes (Dayan et al., 2013). Anodal tDCS applied to the primary motor cortex is considered to increase cortical excitability, whereas cathodal tDCS is considered to decrease it (Nitsche & Paulus, 2000). Potential positive effects of tDCS on long-term memory have been suggested using a variety of electrode locations. Javadi and Walsh (2012) reported improved performance in a verbal recognition task following stimulation with the anode over the dorsolateral prefrontal cortex (DLPFC), as opposed to motor cortex or sham stimulation. Jones, Gözenman, and Berryhill (2014) observed similar effects following tDCS with the anode over the left, but not the right, posterior parietal cortex (PPC). In a more recent example, Medvedeva et al. (2018) found that stimulation with the anode over the ventrolateral prefrontal cortex (VLPFC)
during intentional encoding improved delayed memory performance. These examples are indicative of the range of electrode locations that have been proposed as beneficial for enhancing episodic memory in these tasks.

In their recent work, Galli, Vadillo, Sirota, Feurra, and Medvedeva (2018) conducted a systematic review and meta-analysis of tDCS studies targeting long-term episodic memory, addressing the occasionally conflicting results emerging in this field. The authors reported a lack of overall significant effects of tDCS on memory, despite the number of significant results in the original studies. In contrast to Galli et al. (2018), our interest specifically concerned the application of tDCS during learning to enhance long-term verbal episodic memory. We find this question particularly relevant for evaluating the potential clinical application of tDCS in pathological aging.

Research synthesis efforts in the field, while beneficial, encounter the problem of how to model the wide variety of electrode locations within the constraints of traditional pairwise meta-analyses. A number of solutions have been attempted, for example, pooling effect sizes independently (Horvath, Forte, & Carter, 2015), grouping very different electrode placements as similar (Hsu, Ku, Zanto, & Gazzaley, 2015), or including electrode location as a moderator variable (Galli et al., 2018). These solutions may entail the loss of direct statistical comparisons between the treatments of interest, or of statistical power, when evaluating the relative effectiveness of electrode configurations.

In our current project, we took a new approach and conducted a network meta-analysis (NMA; see, e.g., Lumley, 2002; Lu & Ades, 2006) to address this research question. NMA has been used increasingly as a research synthesis method over the last decade, predominantly in the field of clinical trials (see Cipriani et al., 2018; Riley et al., 2017), to make use of a combination of available direct and indirect comparisons of interventions. For example, there may be studies comparing stimulation type A with B, and B with C. In this case, pooling effect sizes yields two direct comparisons with their respective variances: the mean difference between A and B (M_AB, with variance var_AB) and between B and C (M_BC, var_BC). Although there are no studies testing A against C directly, their relative difference can be estimated as an indirect comparison, M_AC = M_AB + M_BC, with variance var_AC = var_AB + var_BC. Simply put, the difference between the effects of two stimulation types (e.g., different electrode locations) can be estimated from how they perform against a common comparator, although the certainty of such indirect estimates is lower than that of a pairwise, direct comparison. In the presence of common comparators, NMA enables researchers to evaluate the relative efficacy of interventions in a single analysis based on a network combining direct and indirect evidence, providing additional information compared to pairwise meta-analysis methods (see, e.g., Higgins & Welton, 2015, or Riley et al., 2017). In an example from the field of tDCS, Elsner, Kwakkel, Kugler, and Mehrholz (2017) carried out an NMA to evaluate the relative efficacy of tDCS in stroke recovery. They reported that cathodal, but not anodal or dual stimulation, improved capacity in activities of daily living relative to sham.

The aim of our paper was to synthesize evidence regarding differences in episodic memory enhancement as a function of tDCS stimulation site. This may enable researchers to evaluate the potential benefits of using particular electrode locations, for example, in the field of neuromodulation interventions for memory decline. We reviewed evidence regarding the effects of tDCS on verbal episodic memory in order to evaluate potential behavioral effects beyond the stimulation session itself. For learning and memory, synchronizing the learning task with stimulation-induced plasticity may be critical; in this case, tDCS may enhance task-related activity (Martin, Liu, Alonzo, Green, & Loo, 2014; Shin, Foerster, & Nitsche, 2015). Therefore, our focus was on studies applying tDCS during encoding. Thus, two main questions were addressed. If a single session of tDCS is applied during the learning/encoding phase, (1) which anode placement location is the most effective in enhancing delayed verbal memory retrieval, and (2) what degree of enhancement is likely to occur in setups with the most effective electrode locations? We also intended to explore the feasibility of using NMA to synthesize evidence from non-invasive brain stimulation experiments, particularly for comparing the relative efficacy of stimulation sites.
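To make the indirect-comparison arithmetic above concrete, here is a minimal R sketch (R being the software used for the analyses reported later in this article) that combines two hypothetical direct estimates into an indirect one; all numbers are illustrative and not taken from any included study.

```r
# Indirect comparison of stimulation types A and C via the common comparator B.
# All values are hypothetical Hedges' g estimates and variances.
m_ab   <- 0.40   # direct estimate: A vs. B
var_ab <- 0.04
m_bc   <- 0.25   # direct estimate: B vs. C
var_bc <- 0.09

m_ac   <- m_ab + m_bc       # indirect estimate of A vs. C
var_ac <- var_ab + var_bc   # variances add, so the indirect estimate is less precise
ci_ac  <- m_ac + c(-1, 1) * 1.96 * sqrt(var_ac)

m_ac             # 0.65
var_ac           # 0.13
round(ci_ac, 2)  # approximately -0.06 to 1.36
```

The widened confidence interval illustrates why, as noted above, indirect estimates carry less certainty than direct head-to-head comparisons.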
Methods

Literature Search

We followed the guidelines and checklist for conducting and reporting systematic reviews and network meta-analyses in the extension of the PRISMA Statement (Hutton et al., 2015). Our aim was to include studies of adult human participants based on the following primary criteria: (1) randomized controlled trials or within-subjects designs, (2) applying a single session of tDCS during encoding, (3) with a subsequent verbal retrieval task. We carried out a systematic review of English-language publications up until 31st March 2019, using the databases MEDLINE, PsycINFO, PsycARTICLES, and OpenDissertations via the search platform provided by EBSCO Information Services. The search terms used were "tdcs OR transcranial direct current stimulation OR transcranial electric* OR tes OR
non-invasive brain stimulation”, combined with “memory OR recall OR recognition OR retriev*”, and “verbal OR word OR declar* OR episod* OR associat*”. We evaluated the sensitivity of our search strategy by comparing results to a predefined list of potentially eligible studies already known to us. We included all randomized studies which adopted a single- or double-blind design and contained at least two different tDCS stimulation conditions. Titles and abstracts of items identified during the search were screened independently by two authors (Gergely Janos Bartl and Emily Blackshaw) for inclusion based on criteria 1–3. Potential clashes were resolved by agreement and consulting a third reviewer (Marco Sandrini). Full-text versions of identified hits were read and assessed in-depth for eligibility criteria by Gergely Janos Bartl and Marco Sandrini. Clashes were resolved by further review until agreement was reached. Corresponding authors of identified eligible articles were approached for full-text versions or data where this was not openly accessible to the reviewers.
Outcome Variable

We were interested in the effect of tDCS applied during encoding on later episodic memory retrieval, tested using both recall and recognition tasks, and in the modulation of this effect by anodal electrode placement. We included both traditional tDCS studies using one anode and one cathode and high-definition (HD) tDCS montages in which one polarity is represented by multiple electrodes (Datta et al., 2009). For instance, in the case of one anode and multiple cathodes, it may be plausible to assume that current density is more focalized in regions proximal to the anodal stimulation site (Villamar et al., 2013). Our key inclusion criteria were that studies conducted an encoding session of material having at least one verbal element (words, pseudo-words, sentences, names) with concurrent application of tDCS, and employed a subsequent, delayed retrieval task. Studies testing performance from concurrent and immediate retrieval tasks were not included in our analysis, given our focus on delayed episodic memory effects. Data eligible for inclusion were screened and identified by two authors (Gergely Janos Bartl and Marco Sandrini). Data were extracted from the original publications where possible; this included values reported in the manuscript text, tables, and figures. Authors were contacted when relevant data were not available. In the case of studies with multiple memory measures, a pooled effect size was calculated; for example, (a) Day 1 and Day 2 scores, (b) performance of different participant groups (e.g., young and old adults), or (c) test results from recall and recognition tasks were combined into a summary measure. Corrections were made as necessary in the case of within-subject studies, using the correlation value r = .5. This was selected as a
plausible value based on an earlier similar review (Galli et al., 2018) and on data from similar experiments available from our own lab. Hedges' g was calculated as a standardized mean difference (SMD) measure for all comparisons using the metacont() function of the meta package in R (Schwarzer, 2007).
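As a rough illustration of the effect size computations described in this section, the sketch below shows a Hedges' g (SMD) and its standard error obtained with the metacont() function of the meta package, together with the standard formula for the standard deviation of difference scores under an assumed correlation of r = .5; all numerical values are hypothetical, and the exact pooling rules applied to multi-measure studies are not reproduced here.

```r
# Illustrative effect size computation (hypothetical values, not study data)
library(meta)

# Between-subjects comparison: Hedges' g (SMD) and its standard error
es <- metacont(n.e = 20, mean.e = 14.2, sd.e = 3.1,
               n.c = 20, mean.c = 12.5, sd.c = 3.4,
               sm = "SMD")
es$TE    # standardized mean difference (Hedges' g)
es$seTE  # standard error of the SMD

# Within-subjects designs: one common correction derives the standard deviation
# of the difference scores under an assumed correlation between conditions
r       <- 0.5
sd_diff <- sqrt(3.1^2 + 3.4^2 - 2 * r * 3.1 * 3.4)
sd_diff  # about 3.26
```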
NMA Method

We performed a network meta-analysis using the netmeta package (Rücker & Schwarzer, 2015) in the statistical software R. Connectedness of the network was evaluated using a network diagram and by visual exploration of the experimental conditions listed in the summary data file. We evaluated the suitability of both fixed- and random-effects models during our analysis. Consistency of direct and indirect effects was evaluated using a node-splitting procedure (Dias, Welton, Caldwell, & Ades, 2010). Potential effect modifier parameters (e.g., task type, stimulation intensity) were extracted for all included studies and are summarized in Table 1. Following the selection of the final model, we ranked tDCS conditions, defined by anodal electrode location, by their efficacy against a common comparator using P-scores (Rücker & Schwarzer, 2015). P-scores measure the extent of certainty that a treatment performs better than another treatment, based on the network meta-analysis point estimates and standard errors. They take values from 0 to 1, with higher values indicating a greater likelihood of being the more effective treatment (Rücker & Schwarzer, 2015). The contrast of each electrode placement location versus sham stimulation was evaluated using 95% confidence intervals (CIs) and visualized using a forest plot. Moderation analyses were run using the metafor package (Viechtbauer, 2010).
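The sketch below illustrates, under assumptions, the netmeta workflow described in this section: the data frame and its values are hypothetical (one row per pairwise comparison, with the SMD, its standard error, the two anode-location labels, and a study label), and the settings shown are illustrative rather than a reproduction of the reported analysis.

```r
library(netmeta)

# Hypothetical comparison-level data (not extracted from the included studies)
dat <- data.frame(
  TE      = c(0.45, 0.10, 0.30),        # SMD of each pairwise comparison
  seTE    = c(0.21, 0.18, 0.25),        # standard error of each SMD
  treat1  = c("F7", "F3", "F7"),        # anode location of the first condition
  treat2  = c("sham", "sham", "F3"),    # comparator condition
  studlab = c("Study 1", "Study 2", "Study 3")
)

# Network meta-analysis with sham stimulation as the reference group
nma <- netmeta(TE, seTE, treat1, treat2, studlab, data = dat,
               sm = "SMD", reference.group = "sham")

netrank(nma, small.values = "bad")     # P-scores: higher = more likely effective
netsplit(nma)                          # compare direct and indirect estimates
forest(nma, reference.group = "sham")  # forest plot of each location vs. sham
netgraph(nma)                          # network diagram of the comparisons
```

The moderation analyses mentioned above (blinding, current density) would be run separately on the pairwise sham contrasts, for example with metafor's rma() function and a moderator formula, although that exact model specification is an assumption rather than a detail reported by the authors.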
Evaluation of Bias

We rated studies for potential bias using the Cochrane Collaboration's Risk of Bias Tool (Higgins et al., 2011). Gergely Janos Bartl, Emily Blackshaw, and Marco Sandrini accessed full-text versions of the included studies and rated the items independently. Differences between bias ratings were resolved by agreement.
Results

Search Results

A total of 612 records were identified through the initial database search (588) and additional sources, for example, items indexed in other reviews (24). The database search itself identified 20 out of 23 previously known potentially eligible
studies, demonstrating a sensitivity of 86.95%. Three hundred ninety-seven records remained after the removal of duplicates. Three hundred seventy items were excluded during abstract screening because they did not meet the eligibility criteria (1–3) set out above. This left 27 records for full-text review. Two studies were removed because they described overlapping experiments (both were PhD theses with experimental data also published in peer-reviewed articles). Four studies were removed following full-text review because their design did not meet the inclusion criteria (tDCS applied during learning, followed by a verbal episodic retrieval test). One study was not included in the analysis, as the relevant data were not available. This resulted in the inclusion of 20 separate publications (describing 23 experiments) in our final quantitative and qualitative synthesis. The total sample consisted of N = 978 participants. Sixteen studies used healthy young adult samples, three studies tested elderly groups, and one study included both a young and an elderly group. The mean age of the total sample was 29.34 years (SD = 7.25), with a female-to-male ratio of 58.76% to 41.24%. The majority of studies reported encoding tasks using explicit, intentional learning instructions. The timing of the delayed retrieval task ranged from a few minutes to one week. Testing of previously learned material was conducted using free or cued recall and recognition tasks and, in some studies, a combination of the above. The selection flowchart is displayed in Figure 1.
Figure 1. Flowchart of study screening and selection process.

Data Analysis

We coded the tDCS conditions used in the studies into 16 different categories based on the location of the anodal electrode, according to the International 10–20 EEG System for electrode placement (Klem, Lüders, Jasper, & Elger, 1999). Key parameters of the included studies, including electrode placement, stimulation intensity, retrieval task, and sample characteristics, are reported in Table 1. Figure 2 displays the network of comparisons included in our analysis. All but one of the included studies reported using a sham stimulation condition as a comparator, which we chose as the primary comparison when evaluating relative efficacy. There was evidence of significant total heterogeneity, Q(17) = 32.44, p = .013, τ² = 0.0935, I² = 47.6%. Therefore, comparisons and effect sizes are reported below based on our random-effects NMA. There were no inconsistencies indicated between direct and indirect effect size estimates.

Efficacy of anodal electrode placement locations versus sham stimulation is displayed in Figure 3. tDCS with the anode over F7 (left ventrolateral PFC) had a high certainty of being more successful in inducing memory enhancement than the other electrode locations (P-score = .9553). The effect of this electrode configuration, based on data from three direct comparisons (total N = 71) reported in Medvedeva et al. (2018), was significantly positive: g = 1.21, 95% CI [0.63, 1.79]. One other contrast showed a significant effect of tDCS, with the anode over CP5-TP7 (left inferior parietal lobule/temporo-parietal region): P-score = .9367, g = 1.22, 95% CI [0.23, 2.21], although this sham contrast is based on a single direct comparison (N = 30) only (Rivera-Urbina, Mendez Joya, Nitsche, & Molero-Chamizo, 2019). tDCS with the anode over F3 (left dorsolateral PFC), a frequently used stimulation site in the field with 10 head-to-head comparisons (total N = 352), had a small, statistically nonsignificant effect versus sham stimulation, g = 0.16, 95% CI [-0.10, 0.43], P-score = .5792.

We evaluated the effect of two potential moderators, blinding and current density (at the anodal site), on our pooled outcome variable using pairwise meta-analysis of the data available from sham-controlled studies. Blinding was selected as a moderator variable in order to evaluate whether effect sizes extracted from the relatively high proportion of single-blind studies (ca. 50%) differ from those with double-blind designs. Given the suggestion (Vöröslakos et al., 2018) that the cortical penetration of tDCS is relatively weak, the possibility of higher current densities leading to greater enhancement was also examined as a potential moderator. Our random-effects model suggested evidence for significant heterogeneity, Q(31) = 72.23, p < .001, τ² = 0.1421, 95% CI [0.018, 0.266], I² = 58.55%. Including the two moderators resulted in a nonsignificant decrease in heterogeneity, Q(28) = 58.49, p < .001, τ² = 0.111, 95% CI [0.00, 0.225], I² = 51.5%, with neither blinding (g = 0.251, 95% CI [-0.113, 0.615]) nor current density (g = -0.058, 95% CI [-0.262, 0.146]) having a significant effect.
Table 1. Key parameters of studies included in the analyses. For each study (Study ID, Experiment), the table reports the sample size (N), mean age, design (between- or within-subjects), applied current, electrode surface or diameter, stimulation duration, duration of the learning task, experimental conditions (anode-to-cathode montage), and the learning and retrieval task. Included studies: 1. Boggio et al. (2009); 2. Brunyé, Smith, Horner, and Thomas (2018); 3. de Lara et al. (2017)*; 4. Diez, Gomez-Ariza, Diez-Alamo, Alonso, and Fernandez (2017); 5. Gaynor and Chua (2017); 6. Habich et al. (2017); 7. Jacobson, Goren, Lavidor, and Levy (2012)***; 8. Javadi and Walsh (2012); 9. Jones et al. (2014); 10. Leach, McCurdy, Trumbo, Matzen, and Leshikar (2016); 11. Leach, McCurdy, Trumbo, Matzen, and Leshikar (2018); 12. Leshikar et al. (2017); 13. Manuel and Schnider (2016); 14. Matzen, Trumbo, Leach, and Leshikar (2015); 15. Medvedeva et al. (2018); 16. Meier and Sauter (2018); 17. Nikolin, Loo, Bai, Dokos, and Martin (2015)**; 18. Perceval, Martin, Copland, Laine, and Meinzer (2017)**; 19. Rivera-Urbina et al. (2019); 20. Sandrini et al. (2016).
Note. *Stimulation during encoding included only; **High Definition tDCS setup; ***Control arm of study not included due to potential lack of randomized allocation across conditions.
Figure 2. Network diagram of the final random-effects NMA model. Circles refer to the different placement locations of the anodal electrode according to the International 10–20 EEG System for electrode placement. Area of circles is proportional to included sample size per stimulation condition (range: 16–364). Width of lines is proportional to number of studies with direct comparisons (range: 1–10).
Figure 3. Forest plot of NMA results. Efficacy of each active (anodal placement) versus sham condition is listed (Hedges’ g).
Details of the bias evaluation are summarized in Table 2. The majority of studies (16 out of 20) described using randomization during allocation to stimulation conditions, although this process was not typically described in detail.
Table 2. Evaluation of bias according to the Cochrane Risk of Bias Tool. Ratings are listed for each study in the following order: random sequence generation (selection bias), allocation concealment (selection bias), selective reporting (reporting bias), other sources of bias (other bias), blinding (performance bias), blinding (detection bias), incomplete outcome data (attrition bias).

Boggio et al. (2009): L, U, L, L, L, L, L
Brunyé et al. (2018): L, U, L, L, H, H, L
de Lara et al. (2017): L, L, L, L, L, L, L
Diez et al. (2017): L, U, L, L, H, H, L
Gaynor and Chua (2017): L, U, L, L, H, H, L
Habich et al. (2017): L, U, L, L, L, L, L
Jacobson et al. (2012): H, L, L, L, L, L, L
Javadi and Walsh (2012): L, L, L, L, H, H, L
Jones et al. (2014) Experiment 1: U, L, L, L, H, H, L
Jones et al. (2014) Experiment 3: U, L, L, L, H, H, L
Leach et al. (2016): L, U, L, L, L, L, L
Leach et al. (2018): U, U, L, L, L, L, L
Leshikar et al. (2017): L, L, L, L, L, L, L
Manuel and Schnider (2016): L, U, L, L, H, H, L
Matzen et al. (2015): U, U, L, L, L, L, L
Medvedeva et al. (2018) Experiment 1: L, U, L, L, H, H, L
Medvedeva et al. (2018) Experiment 3: L, L, L, L, H, H, L
Medvedeva et al. (2018) Experiment 4: L, U, L, L, H, H, L
Meier and Sauter (2018): L, U, L, L, H, H, L
Nikolin et al. (2015): L, L, L, L, H, H, L
Perceval et al. (2017): L, L, L, L, L, L, L
Rivera-Urbina et al. (2019): L, U, L, L, L, L, L
Sandrini et al. (2016): L, L, L, L, L, L, L

Note. L = low; U = unclear; H = high risk of bias.
All publications reported some form of blinding procedure, 11 of them double-blind. Blinding success was rarely evaluated. Drop-out rates from the experiments were generally low.
Discussion

In our review and analyses, we summarized data on verbal episodic memory performance following tDCS applied during encoding from 23 experiments, with a total sample size of N = 978. We deemed our search successful in identifying relevant studies based on previously set criteria. Our random-effects NMA results indicated that tDCS applied with the anode over frontal (F7) and temporo-parietal (CP5/TP7) regions may have a higher probability of enhancing delayed retrieval when applied during encoding. These two significant effects stem from single studies or publications, and in our analyses they could be compared to the primary reference (i.e., the sham condition) based only on direct evidence. The identified sites showing potential for enhancing learning are proximal to some a priori regions of interest, for example, the left ventral TPC and left ventrolateral PFC, areas known to be involved in episodic memory processes (Manenti, Cotelli, Robertson, & Miniussi, 2012; Spaniol et al., 2009). However, there is considerable uncertainty in the mean effect size estimates in these cases. Further replications across different experimental designs and research groups would allow better estimation of these effects, while also increasing their generalizability.

Importantly, we did not find any statistically significant episodic memory-enhancing effects where direct and indirect evidence from multiple studies was available. For example, the frequently employed stimulation with the anode over F3 (i.e., DLPFC) ranked relatively high against other electrode locations, but we did not find evidence for a significant benefit over sham. Thus, earlier findings reporting long-term memory enhancement following stimulation during encoding, for example, with the anode over the DLPFC (Javadi & Walsh, 2012) or the PPC (Jones et al., 2014), are not corroborated by our network meta-analysis estimates. One possible reason may be differences in the exact testing procedures: it has been suggested that tDCS may have differential effects during intentional versus incidental learning (Medvedeva et al., 2018), or may enhance recall but not recognition (Leshikar et al., 2017). Our choice of using pooled effect sizes where multiple memory tests were available,
while reducing potential bias due to selective inclusion of outcomes, meant that we were not able to address these suggestions regarding potential effect moderators. However, one plausible explanation for the failure to find significant effects in these cases remains the possibility that tDCS lacks robust long-term effects on cognition, an important aspect to consider when evaluating the need for clinical trials in the field.

Stimulation intensity, anode and cathode electrode locations, and the employed episodic memory tasks varied across studies, resulting in large heterogeneity in the included study designs. Direct replication attempts were not often reported. Quantifying the true effects of tDCS on verbal episodic memory with greater certainty would benefit from such efforts in the future. We also note a lack of pre-registered studies in our analyzed sample. Pre-registration has been recommended as one of the tools that could improve replicability in science in general (Munafò et al., 2017) and in our cognitive neuroscience subfield in particular (Szucs & Ioannidis, 2017). All included studies adopted some form of blinding procedure, with an approximately 50% split between single- and double-blind designs. Differences in blinding method did not appear to affect reported scores according to our moderation analyses. However, exact blinding procedures were not always described in sufficient detail. Attrition rates, where described, were generally low, corresponding with the conclusions of reviews dedicated to evaluating the safety and tolerability of tDCS methods (Bikson et al., 2016; Woods et al., 2016). Allocation concealment was not addressed in detail in the majority of studies with between-subjects designs. In future studies exploring potential therapeutic benefits of non-invasive brain stimulation methods, the adoption of commonly used clinical trial reporting guidelines, for example, the principles of the CONSORT Statement (Schulz, Altman, & Moher, 2010), could be considered in order to control for potential sources of bias.

As described earlier, the synthesis of direct and indirect effects is not new in the field of tDCS (Elsner et al., 2017). In our case, the application of NMA is a novel approach to modeling heterogeneity of effects on cognitive performance due to differences in electrode placements and targeted areas. Previous reviews of non-invasive stimulation studies have reached different conclusions: some argued for evidence of enhancement of cognitive functions (Dedoncker, Brunoni, Baeken, & Vanderhasselt, 2016; Hsu et al., 2015; Summers, Kang, & Cauraugh, 2016), while others found no evidence (Horvath, Forte, & Carter, 2015; Tremblay et al., 2014). Given the strong assumption in the field that stimulation of different sites should lead to differential neural and behavioral effects, modeling heterogeneity due to differences in stimulation sites is an important aspect of any research synthesis effort. Accounting for this variability with the right statistical synthesis method may reduce the contradictions between reviews on similar topics. We argue that NMA may be one of the most promising currently available methods to account for these differences, retaining the benefits of synthesizing evidence in a single analysis and making use of a wider (direct and indirect) evidence base.

In terms of limitations of our review, we recognize that some of the main findings (e.g., the most efficient stimulation sites for memory enhancement) were tested by a single study or group only and would therefore benefit from replication. A larger number of studies addressing each head-to-head comparison may also enable a meaningful evaluation of publication bias, for example, using comparison-adjusted funnel plots as in Chaimani and Salanti (2012). Our primary interest in conducting this review was in exploring potential beneficial effects of tDCS applied during encoding in pathological aging. However, the final sample mainly consisted of young, healthy adults, reducing the generalizability of our results to key populations of interest. In addition, while the placement of the anodal electrode (the basis of classification in our analysis) is often used in the literature in conjunction with descriptions of the targeted cortical areas, there may be significant differences in intracranial current distribution depending on cathodal placement (Woods et al., 2016). Incorporating the distribution and strength of tDCS-induced electric fields for given montages in an analysis, instead of electrode locations, may be a useful alternative in the future. However, such estimates derived using currently available modeling software have not yet been extensively tested in vivo (see, e.g., Jog et al., 2016). Finally, while our analysis looked at memory performance enhanced by tDCS, a topic with potential therapeutic relevance, clinical outcomes were not the key focus of this review. Previous reviews provide some information on this topic (e.g., Hsu et al., 2015; Summers, Kang, & Cauraugh, 2016), with at least one other related systematic review ongoing in the field (Zhang, Liu, Li, Zhang, & Qu, 2018).

In summary, we adopted an NMA approach to conduct a novel synthesis of direct and indirect evidence on tDCS effects on verbal memory retrieval when applied during encoding. Focusing on behavioral results, our analysis addressed the question, currently unanswered using neuroimaging means, of whether there is a differential behavioral effect of tDCS when applied at different stimulation sites. Our current results do not suggest a conclusive modulation of memory based on the location of the anode electrode, and further replications of studies reporting potentially effective stimulation locations would be necessary to allow more precise evaluation of these findings. At the same time, we suggest that the NMA framework is a useful approach for comparing the efficacy of non-invasive brain stimulation techniques (e.g., tDCS, transcranial alternating
current stimulation, or repetitive transcranial magnetic stimulation; Dayan et al., 2013) in a variety of cognitive domains while accounting for heterogeneity in the location of the targeted cortical areas.
References Bikson, M., Grossman, P., Thomas, C., Zannou, A. L., Jiang, J., Adnan, T., & Brunoni, A. R. (2016). Safety of transcranial direct current stimulation: Evidence based update 2016. Brain Stimulation, 9, 641–661. https://doi.org/10.1016/j.brs.2016.06.004 Boggio, P. S., Fregni, F., Valasek, C., Ellwood, S., Chi, R., Gallate, J., . . . Snyder, A. (2009). Temporal lobe cortical electrical stimulation during the encoding and retrieval phase reduces false memories. PLoS One, 4, e4959. https://doi.org/10.1371/journal. pone.0004959 Brunyé, T. T., Smith, A. M., Horner, C. B., & Thomas, A. K. (2018). Verbal long-term memory is enhanced by retrieval practice but impaired by prefrontal direct current stimulation. Brain and Cognition, 128, 80–88. https://doi.org/10.1016/j.bandc.2018. 09.008 Cipriani, A., Furukawa, T. A., Salanti, G., Chaimani, A., Atkinson, L. Z., Ogawa, Y., . . . Egger, M. (2018). Comparative efficacy and acceptability of 21 antidepressant drugs for the acute treatment of adults with major depressive disorder: A systematic review and network meta-analysis. Lancet, 391, 1357–1366. https://doi.org/10.1016/S0140-6736(17)32802-7 Chaimani, A., & Salanti, G. (2012). Using network meta-analysis to evaluate the existence of small-study effects in a network of interventions. Research Synthesis Methods, 3, 161–176. https://doi.org/10.1002/jrsm.57 Datta, A., Bansal, V., Diaz, J., Patel, J., Reato, D., & Bikson, M. (2009). Gyri-precise head model of transcranial direct current stimulation: Improved spatial focality using a ring electrode versus conventional rectangular pad. Brain Stimulation, 2, 201– 207. https://doi.org/10.1016/j.brs.2009.03.005 Dayan, E., Censor, N., Buch, E. R., Sandrini, M., & Cohen, L. G. (2013). Non-invasive brain stimulation: From physiological mechanisms to network dynamics and back. Nature Neuroscience, 16, 838–844. https://doi.org/10.1038/nn.3422 Dedoncker, J., Brunoni, A. R., Baeken, C., & Vanderhasselt, M. A. (2016). A systematic review and meta-analysis of the effects of transcranial direct current stimulation (tDCS) over the dorsolateral prefrontal cortex in healthy and neuropsychiatric samples: Influence of stimulation parameters. Brain Stimulation, 9, 501–517. https://doi.org/10.1016/j.brs.2016.04.006 Dias, S., Welton, N. J., Caldwell, D. M., & Ades, A. E. (2010). Checking consistency in mixed treatment comparison metaanalysis. Statistics in Medicine, 29, 932–944. https://doi.org/ 10.1002/sim.3767 Diez, E., Gomez-Ariza, C. J., Diez-Alamo, A. M., Alonso, M. A., & Fernandez, A. (2017). The processing of semantic relatedness in the brain: Evidence from associative and categorical false recognition effects following transcranial direct current stimulation of the left anterior temporal lobe. Cortex, 93, 133–145. https://doi.org/10.1016/j.cortex.2017.05.004 Dickerson, B. C., & Eichenbaum, H. (2010). The episodic memory system: Neurocircuitry and disorders. Neuropsychopharmacology, 35, 86–104. https://doi.org/10.1038/npp.2009.126 Elsner, B., Kwakkel, G., Kugler, J., & Mehrholz, J. (2017). Transcranial direct current stimulation (tDCS) for improving capacity in activities and arm function after stroke: A network
meta-analysis of randomised controlled trials. Journal of Neuroengineering and Rehabilitation, 14, 95. https://doi.org/ 10.1186/s12984-017-0301-7 Galli, G., Vadillo, M. A., Sirota, M., Feurra, M., & Medvedeva, A. (2018). A systematic review and meta-analysis of the effects of transcranial direct current stimulation (tDCS) on episodic memory. Brain Stimulation, 12, 231–241. https://doi.org/ 10.1016/j.brs.2018.11.008 Gaynor, A. M., & Chua, E. F. (2017). tDCS over the prefrontal cortex alters objective but not subjective encoding. Cognitive Neuroscience, 8, 156–161. https://doi.org/10.1080/17588928.2016. 1213713 Habich, A., Klöppel, S., Abdulkadir, A., Scheller, E., Nissen, C., & Peter, J. (2017). Anodal tDCS enhances verbal episodic memory in initially low performers. Frontiers in Human Neuroscience, 11, 542. https://doi.org/10.3389/fnhum.2017.00542 Higgins, J. P., Altman, D. G., Gøtzsche, P. C., Jüni, P., Moher, D., Oxman, A. D., . . . Sterne, J. A. (2011). The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. British Medical Journal, 343, d5928. https://doi.org/10.1136/bmj.d5928 Higgins, J. P., & Welton, N. J. (2015). Network meta-analysis: A norm for comparative effectiveness? The Lancet, 386, 628–630. https://doi.org/10.1016/S0140-6736(15)61478-7 Horvath, J. C., Forte, J. D., & Carter, O. (2015). Quantitative review finds no evidence of cognitive effects in healthy populations from single-session transcranial direct current stimulation (tDCS). Brain Stimulation, 8, 535–550. https://doi.org/10.1016/ j.brs.2015.01.400 Hutton, B., Salanti, G., Caldwell, D. M., Chaimani, A., Schmid, C. H., Cameron, C., . . . Mulrow, C. (2015). The PRISMA extension statement for reporting of systematic reviews incorporating network meta-analyses of health care interventions: Checklist and explanations. Annals of Internal Medicine, 162, 777–784. https://doi.org/10.7326/M14-2385 Hsu, W. Y., Ku, Y., Zanto, T. P., & Gazzaley, A. (2015). Effects of noninvasive brain stimulation on cognitive function in healthy aging and Alzheimer’s disease: A systematic review and metaanalysis. Neurobiology of Aging, 36, 2348–2359. https://doi.org/ 10.1016/j.neurobiolaging.2015.04.016 Jacobson, L., Goren, N., Lavidor, M., & Levy, D. A. (2012). Oppositional transcranial direct current stimulation (tDCS) of parietal substrates of attention during encoding modulates episodic memory. Brain Research, 1439, 66–72. https://doi.org/ 10.1016/j.brainres.2011.12.036 Javadi, A. H., & Walsh, V. (2012). Transcranial direct current stimulation (tDCS) of the left dorsolateral prefrontal cortex modulates declarative memory. Brain Stimulation, 5, 231–241. https://doi.org/10.1016/j.brs.2011.06.007 Jog, M. V., Smith, R. X., Jann, K., Dunn, W., Lafon, B., Truong, D., & Wang, D. J. (2016). In-vivo imaging of magnetic fields induced by transcranial direct current stimulation (tDCS) in human brain using MRI. Scientific Reports, 6, 34385. https://doi.org/ 10.1038/srep34385 Jones, K. T., Gözenman, F., & Berryhill, M. E. (2014). Enhanced long-term memory encoding after parietal neurostimulation. Experimental Brain Research, 232, 4043–4054. https://doi.org/ 10.1007/s00221-014-4090-y Klem, G. H., Lüders, H. O., Jasper, H. H., & Elger, C. (1999). The ten-twenty electrode system of the International Federation. Electroencephalography and Clinical Neurophysiology, 52, 3–6. Lara, G. A. D., Knechtges, P. N., Paulus, W., & Antal, A. (2017). 
Anodal tDCS over the left DLPFC did not affect the encoding and retrieval of verbal declarative information. Frontiers in Neuroscience, 11, 452. https://doi.org/10.3389/fnins.2017.00452
Leach, R. C., McCurdy, M. P., Trumbo, M. C., Matzen, L. E., & Leshikar, E. D. (2016). Transcranial stimulation over the left inferior frontal gyrus increases false alarms in an associative memory task in older adults. Healthy Aging Research, 5, 1–6. https://doi.org/10.1097/01.HXR.0000511878.91386.f8 Leach, R. C., McCurdy, M. P., Trumbo, M. C., Matzen, L. E., & Leshikar, E. D. (2018). Differential age effects of transcranial direct current stimulation on associative memory. The Journals of Gerontology: Series B, 74(7), 1163–1173. https://doi.org/ 10.1093/geronb/gby003 Leshikar, E. D., Leach, R. C., McCurdy, M. P., Trumbo, M. C., Sklenar, A. M., Frankenstein, A. N., & Matzen, L. E. (2017). Transcranial direct current stimulation of dorsolateral prefrontal cortex during encoding improves recall but not recognition memory. Neuropsychologia, 106, 390–397. https://doi. org/10.1016/j.neuropsychologia.2017.10.022 Lu, G., & Ades, A. E. (2006). Assessing evidence inconsistency in mixed treatment comparisons. Journal of the American Statistical Association, 101, 447–459. https://doi.org/10.1198/ 016214505000001302 Lumley, T. (2002). Network meta-analysis for indirect treatment comparisons. Statistics in Medicine, 21, 2313–2324. https:// doi.org/10.1002/sim.1201 Manenti, R., Cotelli, M., Robertson, I. H., & Miniussi, C. (2012). Transcranial brain stimulation studies of episodic memory in young adults, elderly adults and individuals with memory dysfunction: A review. Brain Stimulation, 5, 103–109. https:// doi.org/10.1016/j.brs.2012.03.004 Manuel, A. L., & Schnider, A. (2016). Effect of prefrontal and parietal tDCS on learning and recognition of verbal and nonverbal material. Clinical Neurophysiology, 127, 2592–2598. https://doi.org/10.1016/j.clinph.2016.04.015 Martin, D. M., Liu, R., Alonzo, A., Green, M., & Loo, C. K. (2014). Use of transcranial direct current stimulation (tDCS) to enhance cognitive training: Effect of timing of stimulation. Experimental Brain Research, 232, 3345–3351. https://doi.org/10.1007/ s00221-014-4022-x Matzen, L. E., Trumbo, M. C., Leach, R. C., & Leshikar, E. D. (2015). Effects of non-invasive brain stimulation on associative memory. Brain Research, 1624, 286–296. https://doi.org/10.1016/j. brainres.2015.07.036 Medvedeva, A., Materassi, M., Neacsu, V., Beresford-Webb, J., Hussin, A., Khan, N., . . . Galli, G. (2018). Effects of anodal transcranial direct current stimulation over the ventrolateral prefrontal cortex on episodic memory formation and retrieval. Cerebral Cortex, 29, 657–665. https://doi.org/10.1093/cercor/ bhx347 Meier, B., & Sauter, P. (2018). Boosting memory by tDCS to frontal or parietal brain regions? A study of the enactment effect shows no effects for immediate and delayed recognition. Frontiers in Psychology, 9, 867. https://doi.org/10.3389/ fpsyg.2018.00867 Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., Du Sert, N. P., . . . Ioannidis, J. P. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 21. https://doi.org/10.1038/s41562-016-0021 Nitsche, M. A., & Paulus, W. (2000). Excitability changes induced in the human motor cortex by weak transcranial direct current stimulation. The Journal of Physiology, 527, 633–639. https:// doi.org/10.1111/j.1469-7793.2000.t01-1-00633.x Nikolin, S., Loo, C. K., Bai, S., Dokos, S., & Martin, D. M. (2015). Focalised stimulation using high definition transcranial direct current stimulation (HD-tDCS) to investigate declarative verbal learning and memory functioning. 
NeuroImage, 117, 11–19. https://doi.org/10.1016/j.neuroimage.2015.05.019
Perceval, G., Martin, A. K., Copland, D. A., Laine, M., & Meinzer, M. (2017). High-definition tDCS of the temporo-parietal cortex enhances access to newly learned words. Scientific Reports, 7, 17023. https://doi.org/10.1038/s41598-017-17279-0 Riley, R. D., Jackson, D., Salanti, G., Burke, D. L., Price, M., Kirkham, J., & White, I. R. (2017). Multivariate and network meta-analysis of multiple outcomes and multiple treatments: rationale, concepts, and examples. British Medical Journal, 358, j3932. https://doi.org/10.1136/bmj.j3932 Rivera-Urbina, G. N., Mendez Joya, M. F., Nitsche, M. A., & Molero-Chamizo, A. (2019). Anodal tDCS over Wernicke’s area improves verbal memory and prevents the interference effect during words learning. Neuropsychology, 33, 263. https://doi. org/10.1037/neu0000514 Ronnlund, M., Nyberg, L., Backman, L., & Nilsson, L. G. (2005). Stability, growth, and decline in adult life span development of declarative memory: Cross-sectional and longitudinal data from a population-based study. Psychological Aging, 20, 3–18. https://doi.org/10.1037/0882-7974.20.1.3 Rücker, G., & Schwarzer, G. (2015). Ranking treatments in frequentist network meta-analysis works without resampling methods. BMC Medical Research Methodology, 15, 58. https:// doi.org/10.1186/s12874-015-0060-8 Rugg, M. D., & King, D. R. (2018). Ventral lateral parietal cortex and episodic memory retrieval. Cortex, 107, 238–250. https:// doi.org/10.1016/j.cortex.2017.07.012 Sandrini, M., & Cohen, L. G. (2013). Noninvasive brain stimulation in neurorehabilitation. Handbook of Clinical Neurology, 116, 499–524. https://doi.org/10.3389/fnhum.2014.00378 Sandrini, M., & Cohen, L. G. (2014). Effects of brain stimulation on declarative and procedural memories. In R. Cohen Kadosh (Ed.), The stimulated brain (pp. 237–263). London, UK: Academic Press. Sandrini, M., Manenti, R., Brambilla, M., Cobelli, C., Cohen, L. G., & Cotelli, M. (2016). Older adults get episodic memory boosting from noninvasive stimulation of prefrontal cortex during learning. Neurobiology of Aging, 39, 210–216. https://doi.org/ 10.1016/j.neurobiolaging.2015.12.010 Schulz, K. F., Altman, D. G., & Moher, D. (2010). CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. BMC Medicine, 8, 18. https://doi.org/ 10.1186/1741-7015-8-18 Schwarzer, G. (2007). meta: An R package for meta-analysis. R News, 7, 40–45. https://doi.org/10.1136/bmj.c332 Shin, Y. I., Foerster, A., & Nitsche, M. A. (2015). Transcranial direct current stimulation (tDCS) – application in neuropsychology. Neuropsychologia, 69, 154–175. https://doi.org/10.1016/j. neuropsychologia.2015.02.002 Spaniol, J., Davidson, P. S., Kim, A. S., Han, H., Moscovitch, M., & Grady, C. L. (2009). Event-related fMRI studies of episodic encoding and retrieval: Meta-analyses using activation likelihood estimation. Neuropsychologia, 47, 1765–1779. https://doi. org/10.1016/j.neuropsychologia.2009.02.028 Summers, J. J., Kang, N., & Cauraugh, J. H. (2016). Does transcranial direct current stimulation enhance cognitive and motor functions in the ageing brain? A systematic review and metaanalysis. Ageing Research Reviews, 25, 42–54. https://doi.org/ 10.1016/j.arr.2015.11.004 Szczepanski, S. M., & Knight, R. T. (2014). Insights into human behavior from lesions to the prefrontal cortex. Neuron, 83, 1002–1018. https://doi.org/10.1016/j.neuron.2014.08.011 Szucs, D., & Ioannidis, J. P. (2017). 
Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15, e2000797. https://doi.org/10.1371/journal.pbio.2000797
Tremblay, S., Lepage, J. F., Latulipe-Loiselle, A., Fregni, F., Pascual-Leone, A., & Théoret, H. (2014). The uncertain outcome of prefrontal tDCS. Brain Stimulation, 7, 773–783. https://doi. org/10.1016/j.brs.2014.10.003 Tulving, E. (1983). Elements of episodic memory. London, UK: Oxford University Press. Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36, 1–48. https://doi.org/10.18637/jss.v036.i03 Villamar, M. F., Volz, M. S., Bikson, M., Datta, A., DaSilva, A. F., & Fregni, F. (2013). Technique and considerations in the use of 4 1 ring high-definition transcranial direct current stimulation (HD-tDCS). Journal of Visualized Experiments, 77, e50309. https://doi.org/10.3791/50309 Vöröslakos, M., Takeuchi, Y., Brinyiczki, K., Zombori, T., Oliva, A., Fernández-Ruiz, A., . . . Berényi, A. (2018). Direct effects of transcranial electric stimulation on brain circuits in rats and humans. Nature Communications, 9, 483. https://doi.org/ 10.1038/s41467-018-02928-3 Woods, A. J., Antal, A., Bikson, M., Boggio, P. S., Brunoni, A. R., Celnik, P., . . . Knotkova, H. (2016). A technical guide to tDCS, and related non-invasive brain stimulation tools. Clinical Neurophysiology, 127, 1031–1048. https://doi.org/10.1016/ j.clinph.2015.11.012 Zhang, J., Liu, J., Li, J., Zhang, C., & Qu, M. (2018). Non-invasive brain stimulation for improving cognitive function in people with dementia and mild cognitive impairment. Cochrane Database of Systematic Reviews, (7). https://doi.org/10.1002/ 14651858.CD013065 History Received May 14, 2019 Revision received October 14, 2019 Accepted November 16, 2019 Published online March 31, 2020
Acknowledgments
The authors thank Diego Vitali (University of Roehampton) for comments on the original project concept and analyses reported in this article.
Open Data
Supporting materials are available through the online repositories PsychArchives (DOIs: 10.23668/psycharchives.2619; 10.23668/psycharchives.2620) and the Open Science Framework (https://osf.io/cfyvk/). Our choice of the analytic software netmeta (Rücker & Schwarzer, 2015) was motivated by enabling free and open-access reproducibility of our analysis.
Funding
Gergely Janos Bartl received project-specific funding from Santander UK. Gergely Janos Bartl and Emily Blackshaw are supported by the Vice-Chancellor's Scholarship from the University of Roehampton.
ORCID
Gergely Janos Bartl, https://orcid.org/0000-0002-5947-1304
Gergely Janos Bartl, Department of Psychology, University of Roehampton, Whitelands College, Holybourne Avenue, London SW15 4JD, UK; bartlg@roehampton.ac.uk
Review Article
Response Rates in Online Surveys With Affective Disorder Participants: A Meta-Analysis of Study Design and Time Effects Between 2008 and 2019
Tanja Burgard¹, Michael Bošnjak¹,², and Nadine Wedderhoff²
¹ Research Synthesis Unit, Leibniz Institute for Psychology Information (ZPID), Trier, Germany
² Department of Psychology, University of Trier, Germany
Abstract: A meta-analysis was performed to determine whether response rates to online psychology surveys have decreased over time and the effect of specific design characteristics (contact mode, burden of participation, and incentives) on response rates. The meta-analysis is restricted to samples of adults with depression or general anxiety disorder. Time and study design effects are tested using mixed-effects meta-regressions as implemented in the metafor package in R. The mean response rate of the 20 studies fulfilling our meta-analytic inclusion criteria is approximately 43%. Response rates are lower in more recently conducted surveys and in surveys employing longer questionnaires. Furthermore, we found that personal invitations, for example, via telephone or face-to-face contacts, yielded higher response rates compared to e-mail invitations. As predicted by reinforcement sensitivity theory, no effect of incentives on survey participation in this specific group (scoring high on neuroticism) could be observed.
Keywords: response rates, online survey, meta-analysis, affective disorders
Declining Survey Response Rates and Oversurveying
Nonresponse is one of the most severe problems in social and behavioral research, challenging both the internal and the external validity of surveys (Hox & de Leeuw, 1994). There are different forms of nonresponse. The dependent outcome of this meta-analysis is the response rate, which is defined here as the number of complete interviews divided by the number of interview attempts (interviews plus the number of refusals and breakoffs plus all cases of unknown eligibility; American Association for Public Opinion Research, 2016). If the causes of missingness are independent of any other (observed or unobserved) parameter (i.e., data are missing completely at random; Little & Rubin, 2019), nonresponse reduces the amount of data collected. A smaller sample size leads to a larger sampling variance, resulting in less precise estimates and lower statistical power. However, if the reason for nonresponse is nonrandom, missing data can cause biased results and invalid conclusions, as
the final respondents are no longer representative of the population of interest (Groves & Peytcheva, 2008). There is ample evidence of declining response rates to household surveys in the social and political sciences (Brick & Williams, 2013; Krosnick, 1999), and to surveys in counseling and clinical psychology (Van Horn, Green, & Martinussen, 2009). This trend can aggravate the possible bias due to nonresponse. To explain this decline, participation in a scientific study can be regarded as a culturally shaped decision problem (Haunberger, 2011a). Evidence indicating cultural differences in response patterns, including the extent of nonresponse, is found in cross-cultural survey methodology (Baur, 2014). In cultures emphasizing individualism, individuals are mainly responsible for themselves and decisions tend to be based on an individual cost-benefit analysis. In this context, value-expectancy theories (Esser, 2001), such as the theory of planned behavior (Ajzen, 1991), are especially suitable for explaining participation in surveys (Bošnjak, Tuten, & Wittmann, 2005; Haunberger, 2011b).
In Western societies, a shift from collectivist values toward individualism has been observed (Greenfield, 2013; Hofstede, 2001). For instance, substantial increases in individualistic tendencies in word use and naming of children have been detected (for China between 1975 and 2015: Zeng & Greenfield, 2015; for Japan: Ogihara, 2017; for the United States between 2004 and 2015: Twenge, Dawson, & Campbell, 2016). There is also evidence of changes in relational and cultural practices in both the United States and Japan across several decades until 2015 (Grossmann & Varnum, 2015; Hamamura, 2011; Ogihara, 2018). Furthermore, increases in individualistic behavioral choices, practices, and values were observed by Santos, Varnum, and Grossmann (2017) for 37 of the 51 countries they examined. Taken together, these findings substantiate a global shift toward individualistic values and behaviors. This shift serves as a rationale for the overall decline in participation rates, as participants feel less socially obliged to help the interviewer, for instance, if the survey does not provide a benefit for themselves. In contrast, it also nurtures the assumption that characteristics of the study design related to the individual costs and benefits of participants, such as incentives and interest in the topic, might have gained in importance for motivating people to participate in a survey (Esser, 1986). As survey participation is interrelated with culture and communication (Schwarz, 2003), another factor that might have caused changes in response rates is the increase of Internet usage in recent years. In the European Union, the share of individuals using the Internet has increased from 60% in 2007 to 84% in 2018 (World Bank, 2018). The Internet has also become a more popular platform for conducting surveys in recent years, due to the fast and easy implementation and low costs of online surveys. Yet they are thought to suffer even more from issues of nonresponse and a lack of representativeness (Cook, Heath, & Thompson, 2000). In their meta-analysis, Lozar Manfreda, Bošnjak, Berzelak, Haas, and Vehovar (2008) concluded that web surveys yield lower response rates than other survey modes. Interestingly, the ever-increasing growth of the Internet and the increase in web surveys in general have not changed the willingness to participate in these types of surveys relative to other survey modes (Daikeler, Bošnjak, & Lozar Manfreda, 2019). More than a decade ago, Shi and Fan’s (2008) meta-analytic comparison revealed that web surveys yielded an average response rate of 34% in contrast to 45% for paper surveys. Following the general trend of increased nonresponse rates, the absolute level of these response rates might have decreased in the 11 years since this meta-analysis. A decrease in the participation in online psychology surveys may be a result of oversurveying (Groves et al., 2004; Weiner & Dalessio, 2006), reflecting the research
trends in other scientific branches and for other modes of data collection. This is because single communication requests receive less attention, given the amount of information that has to be processed. As a consequence, potential survey participants may not be interested in taking part in single studies (Groves, Cialdini, & Couper, 1992). Oversurveying may also influence the perception of social exchange, in the sense of giving participants the feeling of having done their part after participating in a few studies, which reduces their willingness to participate in subsequent ones (Groves & Magilavy, 1981). Given the severe consequences of nonresponse on external validity, it should be the ambition of every scientist to keep survey nonresponse to a minimum. Therefore, it is essential to know the possible effects of a study's design on people's willingness to participate in the study. This knowledge may serve as a guide when determining, for example, the use of incentives or the contact mode of the invitation. In this meta-analysis, we will examine if the trend of declining response rates holds for online surveys in psychology, specifically focusing on participants with depression or anxiety disorders. From an epidemiological perspective, this is an important population that may be hard to reach and difficult to motivate to participate in studies. The moderating effects of time and survey design will be tested using study characteristics (contact mode, number of items, and use of incentives). The results of the meta-analysis should guide researchers in how to optimally implement online psychology surveys that yield high response rates. Thus, our first hypothesis focuses on the time effect:
Hypothesis 1 (H1): The response rates in online psychology surveys have decreased over time.
Effects of Study Design Characteristics on Response Rates
In times of oversurveying, one method to draw attention to studies in order to achieve higher response rates is contacting the potential participants personally. Participants can be invited to access online surveys in various ways that differ in the extent of personal contact. For example, contacting potential respondents by phone is a more personal invitation than sending an e-mail invitation to participate via a mailing list. Schaefer and Dillman (1998) stress the importance of personal contact with potential respondents, an act which conveys their importance to the survey institution. In a study of student engagement in a university survey (Nair, Adams, & Mertova, 2008), about half of the nonrespondents recontacted by telephone were convinced by the personal contact to complete the online survey. The meta-analysis of Cook et al. (2000) also shows that more
personalized contacts yield higher response rates in online surveys. Examining the type of contact to deliver the invitation to participate in a survey, we assume:
Hypothesis 2 (H2): Personal or phone contact as an invitation mode yields higher response rates in online psychology surveys than e-mail invitations.
The influence of survey length on response rates was examined meta-analytically by Rolstad, Adler, and Rydén (2011): They found a clear association between questionnaire length and response rates. Yet it is not clear whether the difference in response rates is directly attributable to the length of the questionnaires. For the association between questionnaire length and experienced response burden, only weak support is found. In Mercer, Caporaso, Cantor, and Townsend's (2015) meta-analysis, multiple criteria were used to classify surveys as burdensome, and findings indicated that a survey classified as burdensome led to response rates more than 20% lower than for low-burden surveys. Galesic and Bošnjak (2009) conducted an experiment in which the announced length of the survey, incentives, and the order of thematic blocks were randomly assigned to participants. Findings revealed that the respondents were more likely to start the survey when the stated length was shorter. However, many surveys do not provide information on the length of the survey, with the consequence that a longer survey may lead to higher breakoff rates (e.g., Mavletova & Couper, 2015, in their meta-analysis of mobile web surveys) and thus incomplete datasets. As we are interested in the response rate as the share of completed interviews related to all interview attempts, breakoff during the survey also means lower survey response in this case. In the context of the higher importance of the cost-benefit analysis due to cultural individualization, over time it can be expected that longer studies suffer more from the decrease in participation than shorter ones. Thus:
Hypothesis 3 (H3): The higher the number of items in an online survey questionnaire, the lower is the response rate.
An intensively researched topic in the area of survey participation is the effect of incentives. An early meta-analysis showed that prepaid monetary incentives were the most effective, with an average increase in participation of 19.1 percentage points (Church, 1993). The meta-analysis moreover revealed that only initial incentives had an effect on response rates. Incentives contingent on the return of the questionnaire did not provide significant benefits, independent of the type of incentive. In general, cash incentives have a stronger effect on response rates than lottery tickets or other nonmonetary incentives (Pforr et al., 2015). This difference between prepaid and promised incentives was also corroborated by Mercer et al. (2015), but only for telephone and mail surveys. For in-person interviews, the timing of the incentive had no significant impact on the response rates. These findings from cross-sectional research indicate that incentives, under certain conditions, may have an effect on response rates. However, in the present research we are considering a special population, namely samples with a considerable share of respondents suffering from depressive or anxiety disorders. Following the reinforcement sensitivity theory (Corr, 2002), we can expect that this population, scoring high on neuroticism, will be less sensitive to rewards (Beevers & Meyer, 2002; Bijttebier, Beck, Claes, & Vandereycken, 2009; Pinto-Meza et al., 2006). This would also imply that the effect of incentives for survey participation will be lower than expected for the general population. Thus, we hypothesize:
Hypothesis 4 (H4): Response rates in online psychology surveys in a group scoring high on neuroticism are not affected by incentives awarded for participation.
Method
Inclusion and Exclusion Criteria
This review has been reported in accordance with the PRISMA statement¹ (Moher, Liberati, Tetzlaff, Altman, & The PRISMA Group, 2009). Of interest are psychological studies reporting response rates from online surveys. Studies reporting on mixed survey types (e.g., online with telephone reminders) that do not report online survey-only rates, or studies where the type of survey is not explicitly reported, were excluded. Moreover, to be useful for hypothesis testing, at least one of the study design characteristics of interest has to be reported: number of items in the questionnaire, use of incentives, or contact mode of the invitation. Student samples were excluded due to their differing motivation structure and incentives. Psychology students, especially, are often obliged to take part in psychology surveys as part of their studies. Their motivation therefore is not comparable to other populations participating voluntarily. Moreover, the meta-analysis is restricted to samples of adults with general anxiety disorder or depression, as this population is of growing epidemiological importance (WHO, 2017) and, therefore, of special interest in the domain of psychology. Beyond this, the restriction to a focus population kept the number of primary studies manageable. In the case of panel studies, only the first wave is of interest due to panel mortality in later waves. In longitudinal studies with multiple cross-sectional samples, there is a new sample for each wave and, thus, all samples are coded separately. There is no restriction concerning the year or language of publication. Relevant systematic reviews identified during screening were tagged for reference checking. An overview of the inclusion and exclusion criteria for study selection is presented in Table S1 (available at: https://doi.org/10.23668/psycharchives.2626).
¹ The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) Statement consists of a checklist with 27 items that have to be reported in a research synthesis and a diagram for reporting the flow of information through the four phases of the literature selection. It helps authors to improve their reporting.
Search Strategies
To search for relevant records, 10 databases were used: PsycInfo, Embase, Medline and In-Process, Medline Ahead of Print/Daily Update, the Campbell Library, Science Citation Index, SocINDEX and the Cochrane Central Register of Controlled Trials (CENTRAL), PubPsych containing PSYNDEX, and ReStore. The results of these database searches are shown in Table S2 (available at: https://doi.org/10.23668/psycharchives.2626). In addition, the following conference proceedings were searched manually for potentially relevant records: European Survey Research Association (conference year 2017) and American Association for Public Opinion Research (conference years 2016, 2017, 2018). The main search terms utilized for the literature search were: (participat* OR respons* OR respond*) AND ("online" OR "internet" OR "web" OR "electronic" OR "world wide web" OR "computer" OR "email") AND ("interview" OR "survey" OR "questionnaire") AND (depress* OR anxi*). The exact search strategies differed slightly between the respective databases and are reported in detail in Supplement S5 (available at: https://doi.org/10.23668/psycharchives.2626). The retrieved records were screened for eligibility by three independent coders. In the first step, literature that definitely did not meet the inclusion criteria was identified via abstract screening by two screeners. An agreement of 92% was reached for this initial screening. To achieve full agreement, the first two screeners discussed all disagreements and a third screener was consulted concerning the remaining discrepancies. In the end, full agreement was reached. Potentially relevant records were then assessed in a full-text screening to conclusively identify the eligible literature. The full-text screening was conducted by two of the screeners and full consensus was achieved.
Moderator Analyses
In the meta-analysis, we test moderating effects on survey response rates. As potential moderators, basic information from the report, such as the name of the first author, publication year, type of report, and funding, is coded, as well as information on the sample and potentially relevant study design characteristics. In the present meta-analysis, we focus on a specific population, that is, adults with depression or anxiety disorders; thus, relevant descriptive information includes the specific diagnosed disorders in the population and the percentages of participants diagnosed with each disorder. Moreover, the mean age of the population, the percentage of women in the sample, the year, and the country of data collection are of interest. Characteristics of the study design that could have an effect on the willingness to participate in an online survey are the contact mode of the invitation (mail, e-mail, phone, or personal contact), the burden of the survey (measured by the number of items), the use of incentives, and the topic of the survey. Finally, a response rate is either given in the report or calculated using the formula defined by the American Association for Public Opinion Research (2016): the number of complete interviews divided by the number of interview attempts (interviews plus the number of refusals and breakoffs plus all cases of unknown eligibility).
Coding Procedures
Half of the included studies were coded by two coders to detect possible discrepancies or sources of misunderstanding in the coding guide. An interrater agreement rate of more than 80% was found between coders for the majority of the coded moderators. All discrepancies could be resolved by discussion. Information on the flow of participants, which is crucial for the calculation of the response rate, differed slightly in some cases, such that initial agreement was only 55%. These discrepancies were reevaluated and discussed until consensus was finally achieved.
Statistical Methods
The outcome is the response rate for each treatment. The treatment is an invitation to participate in an online survey. The response rate is a relative measure and thus restricted to values between 0 and 1. It is calculated by dividing the number of returned, usable questionnaires (equivalent to the sample size of the study) by the number of potential respondents contacted who would have been eligible or for whom eligibility is unknown.
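As a simple numerical illustration, this outcome can be computed directly from a study's disposition counts. The following R snippet is a minimal sketch following the AAPOR-style definition given above; the counts and variable names are invented for illustration and are not taken from any of the included studies.

  completes <- 430           # returned, usable questionnaires (sample size of the study)
  refusals_breakoffs <- 520  # refusals and breakoffs
  unknown_eligibility <- 50  # contacted cases of unknown eligibility

  # Response rate = complete interviews / all interview attempts
  response_rate <- completes / (completes + refusals_breakoffs + unknown_eligibility)
  response_rate  # 0.43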
Our data were collected on several levels (report, study, sample, outcomes), but as we only have one response rate per study and sample, and in each report there is only one usable sample reported, we actually do not have a multilevel data structure. Using the metafor package in R (Viechtbauer, 2010), the time effect is assessed by calculating mixed-effects meta-regressions to test the influence of the year of data collection and the characteristics of the survey design on the response rate. In the mixed-effects model we assume that the true effect sizes may vary between studies. The observed variance in effect sizes is then composed of the variance of the true effect sizes (the heterogeneity) and the random error. Since we want to examine what proportion of the observed variance reflects real differences in effect sizes, we will calculate the I² statistic. To assess the proportion of true variance in response rates explained by the model used, we will calculate an index analogous to the R² index for primary studies. This index, defined as true variance explained (Borenstein, Hedges, Higgins, & Rothstein, 2009), computes the percent reduction in true variance by comparing T² of the model including the moderators of interest versus the null model without moderators. The index is restricted to values between 0 and 1. The higher the percent reduction, the more variance in response rates is explained by the corresponding model.
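A minimal R sketch of this analysis strategy with metafor is shown below. The data frame dat, its column names, and the logit transformation of the response rates are our own assumptions for illustration; they are not taken from the article's analysis scripts.

  library(metafor)

  # Assumed data frame `dat`, one row per study, with columns:
  #   rr        observed response rate
  #   n         number of persons contacted
  #   year      year of data collection
  #   items     number of items in the questionnaire
  #   email     1 = e-mail invitation, 0 = other contact mode
  #   incentive 1 = incentive offered, 0 = none

  # Logit-transformed proportions and their sampling variances
  dat$xi <- round(dat$rr * dat$n)   # approximate number of complete interviews
  dat <- escalc(measure = "PLO", xi = xi, ni = n, data = dat)

  # Null model without moderators: overall effect and heterogeneity (I^2)
  m0 <- rma(yi, vi, data = dat, method = "REML")
  m0$I2

  # Mixed-effects meta-regression with the design moderators
  m1 <- rma(yi, vi, mods = ~ year + items + email + incentive,
            data = dat, method = "REML")
  summary(m1)

  # True variance explained: percent reduction in tau^2 relative to the null model
  (m0$tau2 - m1$tau2) / m0$tau2   # metafor reports the same quantity as m1$R2 (in %)

Back-transformed summaries, for example the pooled response rate with its confidence interval, can be obtained with predict(m0, transf = transf.ilogit).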
Publication Bias and Selective Reporting
Assessment of publication bias is crucial in this context because studies with a low response rate may be less likely to be published. Therefore, the response rate will be plotted against the standard error, following Sterne et al.'s (2011) recommendations. The symmetry of the resulting funnel plot will provide a qualitative indication of the existence of a publication bias. Moreover, the performance of Egger's test will provide a p-value for a formal test of publication bias.
Results
The literature search yielded 2,874 potentially relevant records for screening. Of these, 2,769 records could be excluded due to obviously not meeting inclusion criteria. 105 articles were screened as full text. Of these records, 20 were found to be relevant for coding. The main reason for exclusion of full-text articles was missing information on the flow of participants, such that no response rate could be computed. Figure S3 (available at https://doi.org/10.23668/psycharchives.2627) shows the selection process of literature in detail. Table S4 (available at https://doi.org/10.23668/psycharchives.2626) gives an overview of the characteristics of the studies finally included. The major drawback resulting from the small sample of included studies is the lack of studies published before 2008. This was not intended, but a result of the restriction to samples with anxiety disorder or depression and the requirement of information for the calculation of the response rate.
Table 1 reports the means and standard deviations of the variables examined in our meta-analysis. For example, the mean publication year is 2016. Eighty-five percent of the samples (n = 17) were invited to participate via e-mail. The mean number of items in the studies was 54. Only four studies reported the use of incentives. Therefore, the timing and kind of incentives, both characteristics potentially highly relevant for the effectiveness of incentives, could not be distinguished in this meta-analysis. The mean response rate over the 20 studies is 43%. Table 1 also reports the interrelations between the variables. As expected, we found lower response rates for newer studies and for questionnaires containing more items.
Figure 1 displays the cumulative forest plot of the 20 studies included in the meta-analysis. The studies are sorted chronologically, and the evidence is summed up from study to study. At the beginning (studies published in 2008), the confidence intervals are broad and the overall mean response rates are volatile. From about 2018 on, the overall effect remains stable and is hardly affected by new evidence. This suggests that our evidence to estimate the mean response rate for the studies in this meta-analysis is satisfactory at this point.
The mixed-effects meta-analysis conducted in R reveals an overall response rate of 42.8% for the 20 studies, with a 95% confidence interval between 31.7% and 53.9%. Almost all of the variance is between the studies (I² = 99.92%). The funnel plot in Figure 2 depicts a relationship between the response rates and the standard errors. It seems that the response rates in smaller studies are higher than in larger
Table 1. Univariate and bivariate distributions of study characteristics (n = 20 studies)
Variable          | Mean | SD    | E-mail invitation | Number of items | Incentive | Response rate
Publication year  | 2016 | 3.44  | 0.313             | 0.176           | 0.037     | -0.444
E-mail invitation | 0.85 | 0.37  | –                 | 0.485           | 0.210     | 0.018
Number of items   | 54   | 42.92 | –                 | –               | 0.192     | -0.191
Incentive         | 0.20 | 0.41  | –                 | –               | –         | 0.059
Response rate     | 0.43 | 0.25  | –                 | –               | –         | –
Figure 1. Cumulative forest plot.
Figure 2. Funnel plot.
Figure 3. Meta-regression plot of response rates over time.
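Figures of this kind can be reproduced with metafor. The sketch below assumes the data frame dat (with effect sizes yi and sampling variances vi) from the earlier snippet; it shows only one way such plots and tests might be generated and is not the authors' original plotting code.

  library(metafor)

  # Cumulative forest plot (cf. Figure 1): refit on chronologically ordered data
  # so that evidence is accumulated study by study in order of publication year
  dat_sorted <- dat[order(dat$year), ]
  m_sorted <- rma(yi, vi, data = dat_sorted, method = "REML")
  forest(cumul(m_sorted))

  # Funnel plot of effect sizes against standard errors (cf. Figure 2)
  funnel(m_sorted)

  # Egger-type regression test for funnel plot asymmetry,
  # using the standard error as predictor
  regtest(m_sorted, model = "rma", predictor = "sei")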
studies. The result of Egger's test confirms that this relationship is significant (z = 2.29, p = .0219). As the response rate is not the outcome of interest in the primary studies, publication bias is not the most plausible explanation for this finding. Taking into account the assumed positive effect of personal contact with potential participants, an alternative
rationale for this relationship might be that the participants in smaller studies were more likely to be contacted personally and that this contact might have resulted in higher response rates. In Figure 3, the bivariate distribution of publication year and response rate is plotted. The linear regression line
Table 2. Results of meta-regressions
Moderator | Full model of study design characteristics | Full model study design + additional controls
Intercept | 0.729*** [0.426; 1.033], p < .001 | 0.698** [0.271; 1.124], p = .001
Publication year (H1) | -0.177** [-0.287; -0.068], p = .002 | -0.172** [-0.301; -0.043], p = .009
Contact mode of invitation (H2) (e-mail vs. other) | -0.342* [-0.683; 0.000], p = .050 | -0.172** [-0.301; -0.043], p = .009
Number of items (H3) | -0.144* [-0.264; -0.024], p = .018 | -0.137† [-0.296; 0.022], p = .092
Incentives (H4) | -0.055 [-0.295; 0.185], p = .654 | -0.052 [-0.313; 0.201], p = .695
Funds | – | 0.030 [-0.226; 0.286], p = .819
Mean age sample | – | 0.001 [-0.120; 0.121], p = .991
I² | 99.72% | 99.59%
R² | 28.79% | 18.08%
Note. N = 20 studies. Significance levels: ***p < .001, **p < .01, *p < .05, †p < .1.
Table 3. Selected predictions for response rates from the full model
Publication year | Number of items | Incentives | Contact mode: e-mail | Actual response rate | Predicted response rate | Difference actual vs. predicted response rates
2008 | 187 | 0 | 1 | 0.3966 | 0.3315 | 0.07
2011 | 18  | 0 | 1 | 0.9839 | 0.7219 | 0.26
2018 | 52  | 1 | 1 | 0.1904 | 0.2117 | 0.02
2018 | 54  | 0 | 1 | 0.0872 | 0.2865 | 0.2
2017 | 56  | 0 | 1 | 0.4457 | 0.2993 | 0.15
2019 | 75  | 0 | 0 | 0.6602 | 0.4862 | 0.17
2008 | 34  | 1 | 1 | 0.7496 | 0.7973 | 0.05
2018 | 38  | 1 | 1 | 0.2119 | 0.2544 | 0.04
Note. Bold values highlight the relevant characteristic in the respective comparison and the corresponding results.
shows the negative relationship between both variables. This relationship is also significant, and the corresponding R² is 15.75%. That means that almost 16% of the variance in the response rates of the studies in the meta-analysis can be explained by the publication year. As can be seen in Figure 3, we obviously have one study without nonresponse. Omitting this study from the analysis, as its effect size deviates significantly from the other studies, does not change the conclusions presented in Table 2. Neither the size nor the significance level of effects is affected. This may be due to the small sample size (n = 30) of this outlier study and speaks for the robustness of our results. In Table 2, the results of the meta-regressions conducted are reported. The inclusion of additional information about the funding and the mean age of the sample does not change the overall conclusions of the hypothesis testing. There is evidence for an overall decrease in response rates over time. The mode of contacting participants is also relevant. The least personal contact mode was via e-mail. Samples contacted this way showed less willingness to participate in an online survey than samples approached personally, by phone, or mail. A higher number of items in the questionnaire is also related to lower response rates. Corroborating our expectations, an effect of incentives is not supported for the population considered in this
meta-analysis. With only four studies reporting incentives in our sample, we were unable to distinguish different types of incentives or take into account the timing of incentives, yet this finding does not necessarily mean that incentives have no effect at all. Rather, it might also be possible, as previous research indicates, that incentives only have an effect on response rates under certain conditions. To illustrate the influence of the moderators investigated in this meta-analysis, Table 3 shows predicted response rates depending on the values of the study design characteristics. For each relevant characteristic, two similar studies were matched that differed substantially only on this characteristic. The first comparison of this kind is at the top of the table: two somewhat dated studies (from the years 2008 and 2011) with samples contacted via e-mail and not given incentives for participation are compared. The burden of participation, measured by the number of items, is extremely high in the first study and very low in the second study. The difference in predicted response rates is about 40%; the actual response rates differ even more. The third study and the fourth study (both from 2018) only differ with respect to incentives. This difference hardly influences the prediction of response rates from the model. Moreover, the model does not predict the actual response
rate well. This is plausible because, as the meta-regression demonstrated, incentives are not useful for the explanation of the response rates. The third comparison is between two rather similar studies (from the years 2017 and 2019) differing in mode of contact, with one study contacting participants via e-mail and the other study utilizing a different mode of contact. For these two studies, the actual as well as the predicted response rate values were approximately 20% higher when participants were contacted by means of a more personal invitation to participate. Finally, a clear difference in response rates is also found for the two similar studies from 2018 and 2008. The response rate in the more recent study is about 50% lower than that in the older study, and this finding is also predicted by the model.
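Predictions of this kind can be generated from a fitted meta-regression by supplying new moderator values. The sketch below continues the hypothetical model m1 from the earlier snippet (same assumed moderator order: year, items, email, incentive); the two rows of design values are taken from the first comparison in Table 3 purely for illustration, and this is not the authors' own prediction code.

  # Design characteristics for which predicted response rates are wanted
  # (columns in the same order as the moderators in m1: year, items, email, incentive)
  new_designs <- rbind(
    c(2008, 187, 1, 0),   # long questionnaire, e-mail invitation, no incentive
    c(2011,  18, 1, 0))   # short questionnaire, e-mail invitation, no incentive

  # Predicted effects on the logit scale, back-transformed to response rates
  predict(m1, newmods = new_designs, transf = transf.ilogit)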
Discussion
To conclude, the hypothesized influences on response rates were mainly confirmed. The mean response rate of 43% is rather high compared to the mean response rates of 34% and 39.6% for online samples found in the meta-analyses of Shi and Fan (2008) and Cook et al. (2000), respectively. This may, however, be due to our restriction to samples of respondents with depression or anxiety disorder: first, because many samples were recruited personally from patient lists in hospitals, and second, because of the personal relevance of the survey topics, which all concerned the affective disorders the participants were suffering from. These restrictions to our study sample, as well as the information required to compute the response rates, resulted in a small pool of studies available for our meta-analysis. Moreover, the lack of studies published before 2008 was another factor contributing to the low number of studies available for the analyses. Given these limitations to the generalizability of the results, existing evidence on response rates in online surveys of other populations should be meta-analyzed in the future.
There are several conclusions that can be drawn from the meta-analysis to guide researchers in implementing online psychology surveys that achieve high response rates. First of all, we found clear evidence for the expected decrease in response rates, despite our small sample of studies and the short time interval examined. This result corroborates the findings of numerous existing studies on response rates (Brick & Williams, 2013; Krosnick, 1999; Van Horn et al., 2009). Second, an increasing number of items in a survey significantly reduces the response rate. Thus, researchers should strive to keep the burden of the survey rather small.
This is in line with previous research showing an effect of survey length on the initial response rate (Galesic & Bošnjak, 2009). Moreover, Mavletova and Couper's (2015) meta-analysis revealed a similar relationship for survey length and breakoff during the survey. Thus, to keep the burden for the respondent low, researchers should aim to design their surveys to be as brief as possible. If the survey can be completed within a few minutes, it might be helpful to mention this in the invitation to the survey. Third, when sending invitations to participate in online surveys, the meta-regressions clearly indicate that it is more effective to approach potential participants using more personal forms of contact, such as face-to-face or phone contact. Cook et al.'s (2000) earlier meta-analysis provided evidence for the importance of personal forms of contact to achieve higher response rates. The more recent studies investigated in this meta-analysis support Cook et al.'s (2000) finding. Thus, to attain high response rates in surveys conducted online, we recommend contacting and personally inviting respondents to participate in an offline mode before sending them the survey or the link to the survey.
A potentially highly relevant moderator is the use of incentives. In the present study, our search strategy uncovered only a small number (i.e., four) of studies utilizing incentives to include in the meta-analysis, and no effect of incentives was found. Previous research has shown, however, that the effectiveness of incentives for increasing response rates depends on the timing and type of incentive (Church, 1993; Pforr et al., 2015). Here, we were unable to differentiate the type or timing of incentives of the four studies reporting the use of incentives in our sample; consequently, the effectiveness of incentives could not be evaluated in detail. More evidence on the use of incentives in online surveys is needed. A potential strategy could be to broaden the population of interest in the meta-analysis to include more diverse groups, with the consequence of an expanded evidence base allowing for more detailed analyses. Moreover, experimental primary studies examining this effect in detail, for example, by varying timing and type of incentives, would also be of interest. Further study design factors that could be included in a future meta-analysis are contact protocols, such as the use of prenotifications and reminders (Bošnjak, Neubarth, Couper, Bandilla, & Kaczmirek, 2008; Cook et al., 2000), or the use and design of an advance letter or e-mail (for a meta-analysis on advance letters in telephone surveys, see De Leeuw, Callegaro, Hox, Korendijk, & Lensvelt-Mulders, 2007). These study characteristics are reported less frequently than, for example, the use of incentives or the contact mode for invitation. Hence, the small number of studies (i.e., 20) in this meta-analysis did not allow us
to examine these characteristics as moderators. Nonetheless, they may be highly relevant for achieving high response rates and should be included in studies of online surveys in the future. The results of our meta-analysis do not allow conclusions to be made concerning nonresponse bias, although we know that this is crucial for drawing conclusions on the generalizability of survey results (Groves & Peytcheva, 2008). Depending on the target outcome of the respective study, we could argue that patients or former patients participating in a survey are also more willing to deal with their psychological problems. This may result in differences between responders and nonresponders, and thus in nonresponse bias, if the treatment of and dealing with a disorder is the topic of the study. Focusing on online surveys, to examine nonresponse bias and update Groves and Peytcheva's meta-analysis (2008), we need research that empirically examines the characteristics of nonrespondents and their reasons for declining to participate (as, e.g., in the study of Sax, Gilmartin, & Bryant, 2003). This can be accomplished, for example, by reviewing administrative records, performing screening interviews before the main interview, or conducting follow-up surveys with nonrespondents. Sample characteristics known to influence response decisions, such as gender or education, can then be compared between responders and nonresponders. A larger difference between the groups of responders and nonresponders would suggest more nonresponse bias. A more recent research trend that also requires further examination in the context of web surveys is the increase in mobile web surveys. Findings suggest that their breakoff rates are significantly higher than those rates found in surveys that are completed via PC (Mavletova & Couper, 2015). Since the future of web surveys appears to be moving toward implementation via mobile devices, it is vital that research focuses on the optimization of web response rates by investigating the effects of design factors on survey participation and breakoff.
References
*Studies included in the meta-analysis.
Ajzen, I. (1991). The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50(2), 171–211. https://doi.org/10.1016/0749-5978(91)90020-T *Al Atassi, H., Shapiro, M. C., Rao, S. R., Dean, J., & Salama, A. (2018). Oral and maxillofacial surgery resident perception of personal achievement and anxiety: A cross-sectional analysis. Journal of Oral and Maxillofacial Surgery, 76, 2532–2539. https://doi.org/10.1016/j.joms.2018.06.018 American Association for Public Opinion Research. (2016). Standard definitions: Final dispositions of case codes and outcome rates for surveys (9th edition). AAPOR. *Axisa, C., Nash, L., Kelly, P., & Willcock, S. (2019). Psychiatric morbidity, burnout and distress in Australian physician trainees. Australian Health Review. https://doi.org/10.1071/AH18076 Baur, N. (2014). Comparing societies and cultures: Challenges of cross-cultural survey research as an approach to spatial analysis. Historical Social Research, 39(2), 257–291. https://doi.org/10.12759/hsr.39.2014.2.257-291 Beevers, C., & Meyer, B. (2002). Lack of positive experiences and positive expectancies mediate the relationship between BAS responsiveness and depression. Cognition and Emotion, 16, 549–564. https://doi.org/10.1080/02699930143000365 Bijttebier, P., Beck, I., Claes, L., & Vandereycken, W. (2009). Gray's reinforcement sensitivity theory as a framework for research on personality-psychopathology associations. Clinical Psychology Review, 29, 421–430. https://doi.org/10.1016/j.cpr.2009.04.002 Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. (2009). Introduction to meta-analysis. Chichester, UK: Wiley. Brick, J. M., & Williams, D. (2013). Explaining rising nonresponse rates in cross-sectional surveys. The Annals of the American Academy of Political and Social Science, 645, 36–59. https://doi.org/10.1177/0002716212456834 Bošnjak, M., Tuten, T. L., & Wittmann, W. W. (2005). Unit (non)response in web-based access panel surveys: An extended planned-behavior approach. Psychology and Marketing, 22, 489–505. https://doi.org/10.1002/mar.20070 Bošnjak, M., Neubarth, W., Couper, M. P., Bandilla, W., & Kaczmirek, L. (2008). Prenotification in Web-based access panel surveys – The influence of mobile text messaging versus e-mail on response rates and sample composition. Social Science Computer Review, 26, 213–223. https://doi.org/10.1177/0894439307305895 Church, A. H. (1993). Estimating the effect of incentives on mail survey response rates. Public Opinion Quarterly, 57, 62–79. https://doi.org/10.1086/269355 Cook, C., Heath, F., & Thompson, R. L. (2000). A meta-analysis of response rates in web- or Internet-based surveys. Educational and Psychological Measurement, 60, 821–836. https://doi.org/10.1177/00131640021970934 Corr, P. J. (2002). J. A. Gray's reinforcement sensitivity theory: Tests of the joint subsystems hypothesis of anxiety and impulsivity. Personality and Individual Differences, 33, 511–532. https://doi.org/10.1016/S0191-8869(01)00170-2 *Crawford, N. M., Hoff, H. S., & Mersereau, J. E. (2017). Infertile women who screen positive for depression are less likely to initiate fertility treatments. Human Reproduction, 32, 582–587. https://doi.org/10.1093/humrep/dew351 Daikeler, J., Bošnjak, M., & Lozar Manfreda, K. (2019). Web versus other survey modes: An updated and extended meta-analysis comparing response rates. Journal of Survey Statistics and Methodology, smz008, 1–27. https://doi.org/10.1093/jssam/smz008 *De Graaff, A. A., Van Lankveld, J., Smits, L. J., Van Beek, J. J., & Dunselman, G. A. (2016). Dyspareunia and depressive symptoms are associated with impaired sexual functioning in women with endometriosis, whereas sexual functioning in their male partners is not affected. Human Reproduction, 31, 2577–2586. https://doi.org/10.1093/humrep/dew215 De Leeuw, E., Callegaro, M., Hox, J., Korendijk, E., & Lensvelt-Mulders, G. (2007). The influence of advance letters on response in telephone surveys. Public Opinion Quarterly, 71, 413–443. https://doi.org/10.1093/poq/nfm014 Esser, H. (1986). Über die Teilnahme an Befragungen [Participation in opinion polls]. ZUMA Nachrichten, 10, 38–47. SSOAR – Social Science Open Access Repository. Esser, H. (2001). Soziologie. Sinn und Kultur (Vol. 6). Frankfurt am Main, New York: Campus.
History
Received August 13, 2019
Revision received December 4, 2019
Accepted December 4, 2019
Published online March 31, 2020

Open Data
Table S1: Inclusion and exclusion criteria for study selection
Table S2: Results from database searches
Table S4: Descriptive information for each included study
Supplement S5: Search strategies
Available at https://doi.org/10.23668/psycharchives.2626
Figure S3: Flow chart for the selection process of records for the meta-analysis
Available at https://doi.org/10.23668/psycharchives.2627

ORCID
Tanja Burgard https://orcid.org/0000-0001-9194-4821
Michael Bošnjak https://orcid.org/0000-0002-1431-8461
Nadine Wedderhoff https://orcid.org/0000-0002-4460-4995

Tanja Burgard
Leibniz Institute for Psychology Information (ZPID)
Universitätsring 15
54296 Trier
Germany
burgard@leibniz-psychology.org
Original Article
Dealing With Artificially Dichotomized Variables in Meta-Analytic Structural Equation Modeling Hannelies de Jonge, Suzanne Jak, and Kees-Jan Kan Department of Child Development and Education, University of Amsterdam, The Netherlands
Abstract: Meta-analytic structural equation modeling (MASEM) is a relatively new method in which effect sizes between multiple variables from different independent studies are typically first pooled into a matrix and next analyzed using structural equation modeling. While its popularity is increasing, there are issues still to be resolved, such as how to deal with primary studies in which variables have been artificially dichotomized. To be able to advise researchers who apply MASEM and need to deal with this issue, we performed two simulation studies using random-effects two-stage structural equation modeling. We simulated data according to a full and a partial mediation model and systematically varied the size of one (standardized) path coefficient (βMX = .16, βMX = .23, βMX = .33), the percentage of primary studies in which the predictor was artificially dichotomized (25%, 75%, 100%), and the cut-off point of dichotomization (.5, .1). We analyzed the simulated datasets in two different ways, namely, by using (1) the point-biserial and (2) the biserial correlation as the effect size between the artificially dichotomized predictor and the continuous variables. The results of these simulation studies indicate that the biserial correlation is the most appropriate effect size to use, as it provides unbiased estimates of the path coefficients in the population. Keywords: meta-analytic structural equation modeling, artificially dichotomized variables, point-biserial correlation, biserial correlation
Meta-analysis (Glass, 1976) is a commonly used statistical technique to aggregate sample effect sizes of different independent primary studies in order to draw inferences concerning population effects. The most commonly used form of meta-analysis is one in which the relationship between two variables is tested (i.e., “univariate meta-analysis”). To extend the range of research questions that can be answered, new meta-analytic models have been developed, such as meta-analytic structural equation modeling (MASEM; Becker, 1992, 1995; Cheung, 2014, 2015a; Cheung & Chan, 2005; Jak, 2015; Viswesvaran & Ones, 1995). This increasingly popular method tests several relations between various variables simultaneously. For example, Jansen, Elffers, and Jak (2019) applied MASEM to evaluate the mediating effect of socioeconomic status (SES) on achievement by private tutoring in a path model. In primary studies, an effect size may represent the strength and direction of the association between any two variables of interest. Such an effect size can be expressed in different ways, for example as a Pearson product-moment correlation, Cohen’s d, biserial correlation, and
point-biserial correlation. How an effect size is expressed depends on the nature of the variables (e.g., continuous or dichotomous), but also on the way the variables are measured or analyzed. If one of the two continuous variables is artificially dichotomized, one may express the effect size as a point-biserial correlation. However, this typically provides a negatively biased estimate of the true underlying Pearson product-moment correlation (e.g., Cohen, 1983; MacCallum, Zhang, Preacher, & Rucker, 2002). The biserial correlation, on the other hand, should generally provide an unbiased estimate (Soper, 1914; Tate, 1955). Bias in the effect size of any primary study may affect meta-analytic results in the same direction (Jacobs & Viechtbauer, 2017). In the current study, we will evaluate how using point-biserial correlations versus biserial correlations from primary studies may affect path coefficients, their standard errors/confidence intervals, and model fit in MASEM, using population models that represent full and partial mediation. Based on the results, we expect to be able to inform researchers about which of the two investigated effect sizes is the most appropriate to use in MASEM applications and under which conditions.
Meta-Analytic Structural Equation Modeling

MASEM is a statistical technique in which meta-analytic procedures are combined with structural equation modeling (SEM) in order to study aggregated results (Becker, 1992, 1995; Viswesvaran & Ones, 1995). There is a growing interest in MASEM in methodological research (e.g., Cheung & Hafdahl, 2016; Jak & Cheung, 2018a, 2018b; Ke, Zhang, & Tong, 2018) as well as in many different substantive research fields (e.g., Hagger & Chatzisarantis, 2016; Montazemi & Qahri-Saremi, 2015; Rich, Brandes, Mullan, & Hagger, 2015). In contrast to univariate meta-analysis, with MASEM it is possible to test complete hypothesized models including more than two variables, mediational effects, and possibly even latent variables. MASEM provides not only individual parameter estimates but also the overall model fit, which indicates whether the observed data agree with the hypothesized model. In MASEM, a hypothesized model will typically be fitted to a pooled correlation matrix using SEM (see Cheung, 2015a; Jak, 2015). To estimate a pooled correlation matrix, one needs to express the bivariate effect sizes in the primary studies as correlation coefficients. Since primary studies may report different kinds of effect sizes, depending on the nature of the variables and the way the variables are measured or analyzed, these effect sizes first need to be converted into correlation coefficients.
Relationship Between a Dichotomous and Continuous Variable

Many meta-analyses in educational research express their effect size as Cohen’s d (i.e., the standardized mean difference) or the related unbiased estimate Hedges’ g (Ahn, Ames, & Myers, 2012; De Jonge & Jak, 2018), since they are interested in investigating a relationship between an independent dichotomous variable and a dependent continuous variable. A variable can be dichotomous by nature or artificially dichotomized. An example of a variable that can be seen as dichotomous by nature is experimental condition (e.g., intervention group or control group). When a variable is artificially dichotomized, the actual variable is continuous but dichotomized by researchers, often for practical purposes. For example, the continuous variable SES is often artificially dichotomized into high SES and low SES. The dichotomization of continuous variables leads to a loss of information, typically leading to an underestimation of the true underlying correlation between the artificially dichotomized and continuous variable (e.g., Cohen, 1983; MacCallum et al., 2002). However, in models with multiple predictors, one can also obtain an overestimation of effects
if the continuous predictor(s) are artificially dichotomized (see Maxwell & Delaney, 1993; Vargha, Rudas, Delaney, & Maxwell, 1996). Therefore, the dichotomization of continuous variables is not recommended. Nevertheless, researchers still often artificially dichotomize variables. As a result, meta-analysts frequently have to deal with primary studies in which variables have been artificially dichotomized. Since raw data are difficult to obtain from researchers (see Wicherts, Borsboom, Kats, & Molenaar, 2006), for MASEM, a correlation coefficient has to be calculated from the information provided in the primary study. Researchers often provide the sample size, means, and standard deviations of the continuous variable per group. Based on these summary statistics, one could calculate the point-biserial correlation (Lev, 1949; Tate, 1954) and the biserial correlation (Pearson, 1909). However, meta-analysts may not be aware of the difference between these correlation coefficients and will presumably transform the provided effect size into the point-biserial correlation.
The (Point-)Biserial Correlation

The point-biserial correlation is a special case of the Pearson product-moment correlation and is intended to express the association between a naturally dichotomous and a continuous variable (Lev, 1949; Tate, 1954). It can be obtained by applying the equation for the Pearson product-moment correlation coefficient to the dichotomous variable and the continuous variable. However, for the relationship between an artificially dichotomized and continuous variable, the point-biserial correlation does not provide an accurate estimate of the true underlying population correlation (e.g., Cohen, 1983; MacCallum et al., 2002). The point-biserial correlation does not take into account that one of the variables is artificially dichotomized and that interest lies in the underlying continuous variables, leading to biased (pooled) effect sizes. Jacobs and Viechtbauer (2017) showed that the bias increases with larger population correlation coefficients and with larger imbalance of the groups. Using larger samples does not reduce this bias. Only when the population correlation is zero does the point-biserial correlation provide an unbiased estimate of the population correlation. In contrast to the point-biserial correlation, the biserial correlation assumes a continuous, normally distributed variable underlying the dichotomous variable (Tate, 1950). At a fixed threshold, the observations are assigned a 1 if they fall on the right side of this threshold and a 0 if they fall on the left. For example, if one has to respond to a dichotomous item on a questionnaire used to indicate a specific neurodevelopmental disorder, it is assumed that there exists a continuous normal distribution of attitudes towards answering this item, but a specific threshold
determines which of the two answer options one chooses. It has been shown mathematically that for the relationship between an artificially dichotomized and continuous variable, the estimated biserial correlation provides an unbiased estimate of the relationship between the two underlying continuous variables (Soper, 1914; Tate, 1955). As may be expected, Jacobs and Viechtbauer (2017) found that pooling biserial correlations in a meta-analysis typically provides unbiased estimates of the correlation between the two underlying continuous variables. If the sample sizes were small and the population correlation was large, the population correlation was marginally under- or overestimated, depending on the exact dichotomization method (i.e., adaptive or hard cut-off). However, all bias could be considered negligible when the sample sizes were larger than or equal to 60. Hence, the biserial correlation accounts for the artificial dichotomization of a variable and is, therefore, preferred over the point-biserial correlation if the correlation between an artificially dichotomized and continuous variable is included in a meta-analysis (see also Hunter & Schmidt, 1990).
Calculating the (Point-)Biserial Correlation for a Meta-Analysis

The point-biserial and biserial correlation can be computed in different ways depending on which summary statistics are reported in a primary study (Jacobs & Viechtbauer, 2017). If the sample means of the two groups (defined as ȳ1 and ȳ0), the group sizes (denoted as n1 and n0), and the sample standard deviations (defined as s1 and s0) are provided in the primary study, Cohen's d can first be computed with

$$d = \frac{\bar{y}_1 - \bar{y}_0}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_0 - 1)s_0^2}{n_1 + n_0 - 2}}}. \quad (1)$$

Next, one can convert this Cohen's d into the point-biserial correlation coefficient using

$$r_{pb} = \frac{d}{\sqrt{d^2 + h}}, \quad (2)$$

where h = m/n1 + m/n0 and m = n1 + n0 − 2. If the standard deviations are not provided separately for both groups, but only of the y scores (denoted as sy), the point-biserial correlation coefficient can be computed by

$$r_{pb} = \frac{\bar{y}_1 - \bar{y}_0}{s_y}\sqrt{\frac{npq}{n - 1}}, \quad (3)$$

where p = n1/n and q = 1 − p = n0/n. Alternatively, if only the test statistic of the independent sample t-test (with test statistic t) and the two group sizes are reported, one can transform this into the point-biserial correlation coefficient with

$$r_{pb} = \frac{t}{\sqrt{t^2 + m}}. \quad (4)$$

Once a point-biserial correlation coefficient is obtained, one can convert this into the biserial correlation coefficient using the following equation

$$r_b = r_{pb}\,\frac{\sqrt{pq}}{f(z_p)}. \quad (5)$$

In this equation, f(zp) indicates the density of the standard normal distribution at value zp. The value zp is the point for which P(Z > zp) = p, where Z is a standard normally distributed random variable. The density of the standard normal distribution can be computed using

$$f(z_p) = \frac{1}{\sqrt{2\pi}}\,e^{-z_p^2/2}. \quad (6)$$
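As a concrete illustration, the following minimal R sketch (base R only; the function and variable names are ours, not taken from any package) implements Equations 1, 2, 5, and 6 for a primary study that reports group means, standard deviations, and group sizes.

# Minimal sketch of Equations 1, 2, 5, and 6: converting the summary statistics
# of a primary study into a point-biserial and a biserial correlation.
summary_to_biserial <- function(ybar1, ybar0, s1, s0, n1, n0) {
  n <- n1 + n0
  m <- n - 2
  # Equation 1: Cohen's d from group means, standard deviations, and sizes
  s_pooled <- sqrt(((n1 - 1) * s1^2 + (n0 - 1) * s0^2) / m)
  d <- (ybar1 - ybar0) / s_pooled
  # Equation 2: point-biserial correlation from d, with h = m/n1 + m/n0
  h <- m / n1 + m / n0
  r_pb <- d / sqrt(d^2 + h)
  # Equations 5 and 6: biserial correlation from the point-biserial correlation
  p <- n1 / n
  q <- 1 - p
  z_p <- qnorm(p, lower.tail = FALSE)  # point with P(Z > z_p) = p
  f_zp <- dnorm(z_p)                   # standard normal density at z_p
  r_b <- r_pb * sqrt(p * q) / f_zp
  c(d = d, r_pb = r_pb, r_b = r_b)
}

# Example with illustrative summary statistics (two groups of 100 and 300):
summary_to_biserial(ybar1 = 0.40, ybar0 = 0.00, s1 = 1, s0 = 1, n1 = 100, n0 = 300)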
Aim and Expectations

Our aim is to investigate the effects of using (1) the point-biserial correlation and (2) the biserial correlation for the relationship between an artificially dichotomized variable and a continuous variable on MASEM parameters and model fit. Specifically, our interest lies in path coefficients, standard errors/confidence intervals of these coefficients, and model fit. We expect that the use of the point-biserial correlation for the relation between an artificially dichotomized and continuous variable biases the path coefficients in MASEM. In contrast, we expect unbiased path coefficients if one uses the biserial correlation instead.

Method
We performed two simulation studies to evaluate the effects of using the point-biserial and biserial correlation in a full and partial mediation model.

Simulation Study 1 – Full Mediation
In the first simulation study, we simulated meta-analytic data according to a full mediation (hence overidentified) population model (see Figure 1), with a continuous predictor variable X, continuous mediator M, and a continuous variable Y as outcome. Depending on the condition, the predictor variable X is artificially dichotomized in all or a given percentage of the primary studies. For comparison, we also analyzed the original datasets in which we did not dichotomize the predictor variable X at all. We chose this population model because in educational research the
median number of variables in a “typical” meta-analysis is three (De Jonge & Jak, 2018) and because mediation is a popular research topic.

Figure 1. Population model with fixed parameter values.

Under this population model, random meta-analytic datasets were generated under different conditions. We systematically varied the following: (1) the size of the (standardized) path coefficient between X and M (βMX = .16, βMX = .23, βMX = .33), (2) the percentage of primary studies in which X was artificially dichotomized (25%, 75%, 100%), and (3) the cut-off point at which X was artificially dichotomized (at the median value, so a proportion of .5, or when groups become unbalanced, at a proportion of .1). These choices were mainly based on typical situations in educational research. The size of the path coefficient reflects the minimum, mean/median, and maximum pooled Pearson product-moment correlations in a “typical” meta-analysis in educational research (De Jonge & Jak, 2018). The 75% of primary studies that artificially dichotomize the variable X is based on a comparable example of a meta-analysis in educational research (Jansen et al., 2019). We chose a cut-off proportion of .5 because a median split, leading to balanced groups, is a common way to dichotomize variables. The cut-off proportion of .1 leads to unbalanced groups, and is representative of clinical cut-offs for attributes like dyslexia or depression. To ensure that the model-implied covariance matrix was a correlation matrix, we fixed the residual variances of all three variables at values that lead to total variances of one. We used between-study variances of .01. The number of primary studies in a meta-analysis was fixed at the median number of a “typical” meta-analysis, which is 44 (De Jonge & Jak, 2018). Because we use a random-effects MASEM method, the assumption is thus that the population comprises 44 subpopulations from which the 44 samples are drawn, and that the weighted mean of the subpopulation parameters equals the population parameter. Given a specific condition and the fixed number of 44 primary studies, we randomly sampled the within-primary-study sample sizes from a positively skewed distribution as used in Hafdahl (2007) with a mean of 421.75, yielding “typical” sample sizes (De Jonge & Jak, 2018) for every iteration. We imposed 39% missing correlations (Sheng, Kong, Cortina, & Hou, 2016) by (pseudo)randomly deleting either variable M or Y from 26 of the 44 studies.
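To illustrate the consequence of this design choice for a single simulated primary study, the sketch below (base R; the chosen correlation, sample size, and cut-off are illustrative, and this is not the authors' simulation code) dichotomizes X at an unbalanced cut-off and compares the point-biserial and biserial estimates with the underlying correlation.

# Illustrative sketch: dichotomize a continuous predictor X at a cut-off
# proportion and compare point-biserial and biserial estimates with the
# true correlation (values chosen for illustration only).
set.seed(1)
rho <- .33          # true correlation between X and M
n   <- 422          # roughly a "typical" primary-study sample size
cut <- .10          # cut-off proportion (.5 = median split, .1 = unbalanced)

x <- rnorm(n)
m <- rho * x + sqrt(1 - rho^2) * rnorm(n)        # M correlated rho with X
x_dich <- as.numeric(x > quantile(x, 1 - cut))   # upper 10% coded 1, rest 0

r_pb <- cor(x_dich, m)                           # point-biserial correlation
p <- mean(x_dich); q <- 1 - p
r_b <- r_pb * sqrt(p * q) / dnorm(qnorm(p, lower.tail = FALSE))  # Equation 5

c(true = rho, point_biserial = r_pb, biserial = r_b)
# The point-biserial estimate is markedly attenuated, whereas the biserial
# estimate should be close to the true correlation apart from sampling error.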
Simulation Study 2 – Partial Mediation
In the second simulation study, we included an extra fixed direct effect of .16 between the predictor variable X and outcome variable Y (βYX) in the population model. This value represents a small effect in a “typical” meta-analysis in educational research (De Jonge & Jak, 2018). We used the same conditions as in our first simulation.
Analyses

In each condition, we generated 2,000 meta-analytic datasets drawn from the 44 subpopulations, which we analyzed using (1) the point-biserial and (2) the biserial correlation as effect size between the artificially dichotomized predictor X and the continuous variables (M or Y). To simulate the data and to carry out the analyses we used R (version 3.5.3; R Core Team, 2019). The R scripts to reproduce the results can be found in De Jonge, Jak, and Kan (2019a). The full or partial mediation models were fitted using random-effects two-stage structural equation modeling (TSSEM; Cheung, 2014). Unless stated otherwise, we used default settings of the tssem1() and tssem2() functions within the R package metaSEM (Cheung, 2015b). In the first stage of the random-effects TSSEM, a pooled correlation matrix is estimated using maximum likelihood estimation. In the second stage, the hypothesized structural equation model is fitted to this pooled correlation matrix using weighted least squares estimation. As recommended (Becker, 2009; Hafdahl, 2007), we used the weighted mean correlation across the included primary studies to estimate the sampling variances and covariances of the correlation coefficients in the primary studies. To make sure that the possible differences in outcomes when analyzing point-biserial versus biserial correlations are due to the use of a different kind of correlation coefficient, these analyses were performed using the same datasets.
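For readers unfamiliar with this workflow, the sketch below outlines how such a random-effects TSSEM analysis of the full mediation model can be set up with metaSEM. The data objects (cor_list, n_vec) and the parameter labels are illustrative placeholders, and the settings shown follow the general pattern of metaSEM examples rather than the authors' exact scripts, which are available in De Jonge, Jak, and Kan (2019a).

# Hedged sketch of the random-effects TSSEM workflow with metaSEM (Cheung, 2015b).
library(metaSEM)

# cor_list: a list of 3 x 3 correlation matrices (X, M, Y), one per primary
#           study, with NA rows/columns for variables a study did not report.
# n_vec:    the corresponding vector of primary-study sample sizes.

# Stage 1: pool the correlation matrices under a random-effects model.
stage1 <- tssem1(cor_list, n_vec, method = "REM", RE.type = "Diag")

# Full mediation model X -> M -> Y in RAM notation (A = paths, S = variances).
A <- create.mxMatrix(c(0,            0,            0,
                       "0.1*betaMX", 0,            0,
                       0,            "0.1*betaYM", 0),
                     type = "Full", nrow = 3, ncol = 3, byrow = TRUE)
S <- create.mxMatrix(c(1, "0.1*varM", "0.1*varY"), type = "Diag")

# Stage 2: fit the structural model to the pooled correlation matrix,
# requesting likelihood-based confidence intervals.
stage2 <- tssem2(stage1, Amatrix = A, Smatrix = S, intervals.type = "LB")
summary(stage2)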
Evaluation Criteria

We estimated the relative percentage bias in all path coefficients, calculated as $100 \times (\hat{\beta} - \beta)/\beta$. If the estimation bias was less than 5%, we considered it as negligible (Hoogland & Boomsma, 1998). Additionally, we estimated the bias in the standard errors of the direct effects. We calculated the relative percentage bias of the standard errors as $100 \times [\overline{SE}(\hat{\beta}) - SD(\hat{\beta})]/SD(\hat{\beta})$. In this equation, $\overline{SE}(\hat{\beta})$ is the average standard error of the parameter estimate across replications and $SD(\hat{\beta})$ is the standard deviation of the parameter estimates across replications. If the relative percentage bias of the standard errors was less than 10%, we considered it as acceptable (Hoogland & Boomsma, 1998). We tested the significance of the indirect effects using 95% likelihood-based confidence intervals, which are recommended over the Wald confidence intervals in this case (Cheung, 2009). For the 95% likelihood-based confidence intervals we calculated the coverage percentages, that is, the percentage of confidence intervals that include the population parameter. For comparison, we also calculated the coverage percentages of the 95% Wald confidence intervals and likelihood-based confidence intervals of the direct effects. To evaluate model fit, we calculated the rejection rates (i.e., the proportion of significant test results) of the chi-square statistic of the full mediation model of Stage 2 (df = 1, α = .05). We tested whether the rejection rate significantly differed from the nominal α-level with the proportion test (using α = .05). By means of QQ-plots and the Kolmogorov-Smirnov test (using α = .05), we compared the theoretical chi-square distribution with one degree of freedom with the empirical chi-square distributions.
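As a small worked illustration of these criteria (with made-up replication results and Wald-type intervals for simplicity), the relative percentage bias of the estimates and standard errors and the coverage percentage can be computed as follows.

# Illustrative computation of the evaluation criteria (made-up numbers,
# not results from the study).
beta_true <- .33                          # population path coefficient
est <- c(.31, .35, .32, .34, .33)         # estimates across replications
se  <- c(.040, .042, .039, .041, .040)    # their estimated standard errors
ci_lower <- est - 1.96 * se
ci_upper <- est + 1.96 * se

# Relative percentage bias in the point estimate: 100 * (mean(est) - beta) / beta
bias_est <- 100 * (mean(est) - beta_true) / beta_true

# Relative percentage bias in the standard error:
# 100 * (mean(SE) - SD(estimates)) / SD(estimates)
bias_se <- 100 * (mean(se) - sd(est)) / sd(est)

# Coverage: percentage of intervals that contain the population value
coverage <- 100 * mean(ci_lower <= beta_true & beta_true <= ci_upper)

c(bias_est = bias_est, bias_se = bias_se, coverage = coverage)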
Results

The simulation results are based on the datasets that converged in Stage 1 and Stage 2 of the random-effects TSSEM. In most conditions, there were no convergence problems, as noted in Tables 1–4.
Simulation Study 1 – Full Mediation

Direct Effects
Table 1 shows that when the point-biserial correlation was used, the relative percentage bias in the estimated path coefficient between the predictor and mediator (βMX) was between −41.70% and −5.05%. The relative percentage bias in this path coefficient exceeded the set boundary of 5% in all conditions, representing substantial bias. This bias was negative in all conditions. When the percentage of primary studies in which the predictor variable X was artificially dichotomized increased, the bias in βMX also increased. In the conditions with a cut-off proportion of .1, the relative percentage bias in βMX was always larger compared to the conditions in which the cut-off proportion was .5. There was no systematic difference in the relative
percentage bias in this path coefficient between conditions with different population values of βMX. The relative percentage bias in the path coefficient between the continuous mediator and continuous outcome variable (βYM) was below 5% in all conditions, which can be considered negligible. The relative percentage bias in the standard errors of both path coefficients was below the set boundary of 10% in all conditions. However, note that the bias in the standard error of βYM was negative in all conditions. The coverage percentages of the 95% likelihood-based confidence interval of βMX were between 0.00% and 92.10% and of βYM between 91.15% and 93.38% (see Table 3/SIM1 in De Jonge, Jak, & Kan, 2019b). The coverage percentages of the 95% Wald confidence intervals were roughly the same for both effects (see Table 4/SIM1 in De Jonge et al., 2019b). The low coverage percentages of the confidence intervals in some conditions are not surprising given the bias in the point estimates caused by using the point-biserial correlation coefficient.
When the biserial correlation was used instead of the point-biserial correlation, the relative percentage bias in both path coefficients in the model (βMX and βYM) was below 5% in all conditions (see Table 1). The relative percentage bias in the standard errors of both path coefficients was less than the set boundary of 10% in all conditions. Therefore, there was no substantial bias in the parameter estimates or the standard errors according to the criteria that were applied. However, note that the relative percentage bias in the standard errors of both path coefficients was negative in all conditions. In accordance, the coverage percentages of the 95% likelihood-based as well as the Wald confidence intervals were slightly below 95% for both path coefficients (see Tables 3–4/SIM1 in De Jonge et al., 2019b).

Indirect Effect
Since indirect effects are the product of two direct effects, any bias in the direct effects will induce bias in the indirect effect. Indeed, Table 2 shows that if the point-biserial correlation was used, the bias in the indirect effect was always negative and above the set boundary of 5%. The bias in the indirect effect followed the same patterns as the bias in βMX. The coverage percentages of the 95% likelihood-based confidence intervals of the indirect effect were between 0.05% and 92.50%. If the biserial correlation was used instead, the bias in the indirect effect was always below 5%. The coverage percentages were between 92.55% and 95.05%.
Table 1. Simulation Study 1 (full mediation): Percentages estimation bias in the direct effects and their standard errors

DICH  CO  ES    Converged        Bias in βMX      Bias in SE of βMX   Bias in βYM      Bias in SE of βYM
                rpb     rb       rpb      rb      rpb      rb         rpb      rb      rpb      rb
25    .1  .16   2,000   2,000    10.656   0.190   2.093    3.538      0.305    0.294   6.815    6.809
25    .1  .23   2,000   2,000    10.513   0.118   0.426    4.790      0.491    0.457   5.715    5.553
25    .1  .33   2,000   2,000    10.619   0.217   4.126    3.512      0.289    0.391   5.131    5.581
25    .5  .16   2,000   1,999    5.046    0.066   1.814    2.328      0.065    0.068   6.535    6.629
25    .5  .23   2,000   2,000    5.197    0.159   0.762    2.259      0.046    0.044   8.897    8.837
25    .5  .33   2,000   2,000    5.375    0.311   0.382    2.189      0.109    0.102   7.459    7.472
75    .1  .16   1,999   2,000    31.412   0.259   3.576    6.438      0.114    0.113   4.621    4.373
75    .1  .23   2,000   1,999    31.382   0.164   1.123    3.948      0.354    0.384   3.723    3.790
75    .1  .33   2,000   2,000    31.224   0.010   8.956    3.507      0.137    0.240   5.210    5.303
75    .5  .16   2,000   1,999    14.976   0.211   1.497    2.176      0.283    0.276   6.588    6.634
75    .5  .23   2,000   2,000    15.164   0.032   2.383    3.944      0.161    0.157   3.516    3.509
75    .5  .33   2,000   2,000    15.175   0.026   0.769    2.368      0.393    0.403   6.381    6.368
100   .1  .16   1,994   2,000    41.290   0.350   3.651    4.248      0.508    0.513   5.186    5.245
100   .1  .23   1,995   2,000    41.509   0.121   4.528    4.674      0.293    0.293   4.608    4.817
100   .1  .33   1,990   1,999    41.702   0.355   1.463    2.131      0.028    0.034   5.601    5.518
100   .5  .16   2,000   2,000    20.091   0.129   2.645    2.627      0.076    0.083   9.569    9.580
100   .5  .23   1,999   1,999    20.251   0.058   6.689    6.500      0.254    0.247   7.018    6.953
100   .5  .33   2,000   2,000    20.473   0.340   3.935    4.009      0.111    0.116   6.086    6.067
Note. DICH = percentage of primary studies in which X was artificially dichotomized; CO = cut-off point at which X was artificially dichotomized; ES = size of the systematically varied (standardized) path coefficient between X and M; Converged = number of datasets that converged in Stage 1 and Stage 2 of the random-effects TSSEM; Bias in βMX = relative percentage bias in the path coefficient between X and M; Bias in SE of βMX = relative percentage bias in the standard error of the path coefficient between X and M; Bias in βYM = relative percentage bias in the path coefficient between M and Y; Bias in SE of βYM = relative percentage bias in the standard error of the path coefficient between M and Y; rpb = point-biserial correlation; rb = biserial correlation.
Table 2. Simulation Study 1 (full mediation): Percentages estimation bias in the indirect effect and the coverage percentages of the 95% likelihood-based confidence interval

DICH  CO  ES    Converged        Bias in indirect    Coverage
                rpb     rb       rpb      rb         rpb      rb
25    .1  .16   2,000   2,000    10.956   0.506      86.550   93.150
25    .1  .23   2,000   2,000    10.972   0.588      83.100   92.800
25    .1  .33   2,000   2,000    10.402   0.140      82.100   92.750
25    .5  .16   2,000   1,999    5.046    0.078      92.500   94.097
25    .5  .23   2,000   2,000    5.199    0.158      90.400   92.550
25    .5  .33   2,000   2,000    5.529    0.463      89.950   93.400
75    .1  .16   1,999   2,000    31.416   0.230      35.568   93.150
75    .1  .23   2,000   1,999    31.184   0.160      19.450   93.647
75    .1  .33   2,000   2,000    31.178   0.171      8.500    93.550
75    .5  .16   2,000   1,999    15.232   0.086      78.750   94.147
75    .5  .23   2,000   2,000    15.116   0.020      74.300   95.050
75    .5  .33   2,000   2,000    15.558   0.433      66.000   93.050
100   .1  .16   1,994   2,000    41.597   0.178      6.770    93.750
100   .1  .23   1,995   2,000    41.725   0.489      1.303    93.350
100   .1  .33   1,990   1,999    41.750   0.446      0.050    93.747
100   .5  .16   2,000   2,000    20.061   0.176      67.500   92.800
100   .5  .23   1,999   1,999    20.530   0.403      54.527   93.647
100   .5  .33   2,000   2,000    20.428   0.279      43.100   92.650

Note. DICH = percentage of primary studies in which X was artificially dichotomized; CO = cut-off point at which X was artificially dichotomized; ES = size of the systematically varied (standardized) path coefficient between X and M; Converged = number of datasets that converged in Stage 1 and Stage 2 of the random-effects TSSEM; Bias in indirect = relative percentage bias in the indirect effect of X on Y (βMX × βYM); Coverage = percentage of confidence intervals that includes the population parameter of the indirect effect of X on Y; rpb = point-biserial correlation; rb = biserial correlation.
Table 3. Simulation Study 2 (partial mediation): Percentages estimation bias in the direct effects and their standard errors Condition
Converged
Bias in βMX
rpb
rpb
0.967 0.193
2.039 0.188 5.573 5.611 11.338 0.544 0.539 3.152
.23 2,000 2,000 10.502 0.172
0.715 4.349
2.915 0.360 4.026 4.063 11.621 0.255 0.047 4.018
.33 2,000 2,000 10.660 0.299
5.625 2.604
5.035 0.016 4.300 4.313 12.083
100
.1
rb
rpb
rb
rpb
0.798
rpb
rb
1.260 2.738
.16 2,000 2,000
5.093 0.015 0.537 1.059
1.169
0.053 6.179 6.083
5.299 0.057 2.234 3.302
5.249 0.209 0.237 1.699
1.713
0.072 6.874 6.832
5.679 0.147 2.956 3.630
.33 2,000 2,000
5.417 0.361
1.479 0.373
2.437 0.070 5.324 5.320
6.320 0.131
0.788 0.513
.16 1,999 1,999 31.490 0.424 2.142 4.435
5.991
0.093 4.080 4.165 31.758
.23 1,999 2,000 31.300 0.163
9.072
0.465 3.479 3.443 33.262 0.106
5.826
1.534
0.095 2.392 1.913 35.412
8.289
2.875
.16 2,000 2,000 15.078
2.734 2.228
11.733 1.323 12.831
0.060 1.447 1.982
2.825 0.345 6.084 5.900 15.689
0.320 0.049
2.850 1.643
0.117 3.038 3.367
.23 2,000 2,000 15.200 0.068 1.002 2.605
4.703
0.061 2.979 3.198 16.388
0.043
2.174
.33 1,999 2,000 15.184
0.002 0.056 1.666
6.505 0.467 4.998 4.967 17.718
0.161
0.031 1.366
.16 1,990 2,000 41.399
0.207 2.694 2.929
6.839 0.523 4.376 3.900 42.399
0.163 1.220 1.010
.23 1,993 2,000 41.450
0.008 3.123 2.784 10.413 0.243 4.545 4.803 43.786 0.357 2.264 1.560
.33 1,992 2,000 41.680 0.326 0.554 0.987 15.556 0.077 5.389 6.112 45.845 .5
rb
.23 2,000 2,000
.33 1,999 2,000 31.200 0.069 .5
rpb
Bias in SE of βYX
ES
.1
rb
Bias in βYX
.16 2,000 2,000 10.552 0.134
75
rpb
Bias in SE of βYM
25
.5
rb
Bias in βYM
DICH CO .1
rb
Bias in SE of βMX
0.154
0.077 5.270 4.569
.16 2,000 2,000 19.999
0.247 1.204 1.007
4.261
.23 1,998 1,998 20.201
0.010 5.508 5.169
5.914 0.076 6.728 6.947 22.039 0.295 4.616 4.555
.33 1,999 2,000 20.472 0.336 2.937 2.883
8.994
0.163 7.452 6.997 21.231 0.317 0.061 4.362 4.140 23.464
0.019
0.044
0.146 0.146 0.169
Note. DICH = percentage of primary studies in which X was artificially dichotomized; CO = cut-off point at which X was artificially dichotomized; ES = size of the systematically varied (standardized) path coefficient between X and M; Converged = number of datasets that converged in Stage 1 and Stage 2 of the random-effects TSSEM; Bias in βMX = relative percentage bias in the path coefficient between X and M; Bias in SE of βMX = relative percentage bias in the standard error of the path coefficient between X and M; Bias in βYM = relative percentage bias in the path coefficient between M and Y; Bias in SE of βYM = relative percentage bias in the standard error of the path coefficient between M and Y; Bias in βYX = relative percentage bias in the path coefficient between X and Y; Bias in SE of βYX = relative percentage bias in the standard error of the path coefficient between X and Y; rpb = point-biserial correlation; rb = biserial correlation.
Table 4. Simulation Study 2 (partial mediation): Percentages estimation bias in the indirect effect and the coverage percentages of the 95% likelihood-based confidence interval

DICH  CO  ES    Converged        Bias in indirect    Coverage
                rpb     rb       rpb      rb         rpb      rb
25    .1  .16   2,000   2,000    8.824    0.461      90.400   94.600
25    .1  .23   2,000   2,000    7.895    0.558      88.400   92.750
25    .1  .33   2,000   2,000    6.173    0.344      90.300   92.450
25    .5  .16   2,000   2,000    4.093    0.088      93.250   93.850
25    .5  .23   2,000   2,000    3.674    0.195      92.250   93.100
25    .5  .33   2,000   2,000    3.163    0.493      93.000   93.450
75    .1  .16   1,999   1,999    27.458   0.530      50.875   93.747
75    .1  .23   1,999   2,000    25.093   0.186      45.323   94.700
75    .1  .33   1,999   2,000    22.394   0.053      46.573   94.300
75    .5  .16   2,000   2,000    12.696   0.357      84.000   92.850
75    .5  .23   2,000   2,000    11.298   0.137      84.750   94.800
75    .5  .33   1,999   2,000    9.691    0.518      84.592   93.350
100   .1  .16   1,990   2,000    37.398   0.481      16.382   93.650
100   .1  .23   1,993   2,000    35.375   0.376      8.028    93.900
100   .1  .33   1,992   2,000    32.622   0.498      5.422    93.500
100   .5  .16   2,000   2,000    16.645   0.279      76.850   93.100
100   .5  .23   1,998   1,998    15.543   0.190      74.224   93.644
100   .5  .33   1,999   2,000    13.342   0.335      76.438   93.750

Note. DICH = percentage of primary studies in which X was artificially dichotomized; CO = cut-off point at which X was artificially dichotomized; ES = size of the systematically varied (standardized) path coefficient between X and M; Converged = number of datasets that converged in Stage 1 and Stage 2 of the random-effects TSSEM; Bias in indirect = relative percentage bias in the indirect effect of X on Y (βMX × βYM); Coverage = percentage of confidence intervals that includes the population parameter of the indirect effect of X on Y; rpb = point-biserial correlation; rb = biserial correlation.
Model Fit
The rejection rates of the chi-square test of model fit at Stage 2 of the random-effects TSSEM (df = 1, α = .05) are provided in Table 5/SIM1 in De Jonge et al. (2019b). The rejection rates of the chi-square test were slightly above the nominal α-level (.05) in almost all conditions, no matter if the biserial or point-biserial correlation was used. For both types of correlations, the difference between the rejection rate and the nominal α-level was statistically significant in 5 of the 18 conditions. Based on chance (with α = .05), we would expect a significant difference in one condition. The results of the Kolmogorov-Smirnov test provided in Table 6/SIM1 in De Jonge et al. (2019b) and the QQ-plots in De Jonge et al. (2019b) show that when the biserial correlation was used, there was a statistically significant difference between the empirical chi-square distribution and the theoretical chi-square distribution with one degree of freedom in 5 of the 18 conditions. When the point-biserial correlation was used, there was a significant difference in the same 5 conditions plus in 3 other conditions. There seems to be no systematic pattern in the conditions for which the distributions differed significantly.
Simulation Study 2 – Partial Mediation

Direct Effects
Table 3 shows that when the point-biserial correlation was used, the relative percentage bias in βMX was between −41.68% and −5.09%. The bias was always negative and exceeded the set boundary of 5% in all conditions. The pattern of the bias across conditions was very similar to the findings in simulation Study 1. The relative percentage bias in βYM was between 1.17% and 15.56%. This bias was positive in all conditions and exceeded 5% in 10 of 18 conditions. When the percentage of primary studies in which the predictor variable X was artificially dichotomized increased, the bias in βYM also increased. The relative percentage bias in βYM was always larger in conditions in which the cut-off proportion was .1 compared to conditions with a cut-off proportion of .5. If the population value of βMX increased, the overestimation of βYM also increased. The relative percentage bias in the path coefficient between the predictor X and outcome variable Y (βYX) was between −45.85% and −5.30%. The bias was always negative and exceeded 5% in all conditions. The bias in βYX increased according to the same patterns as the relative percentage bias in βYM. Except for one condition, the relative percentage bias in the standard errors of all path coefficients was less than 10%, representing insubstantial bias. However, note that the bias in the standard error of βYM was always negative. The coverage percentages of the 95% likelihood-based confidence intervals of βMX were between 0.00% and 92.15%, of βYM between 70.13% and 92.10%, and of βYX between 0.10% and 92.55% (see Table 3/SIM2 in De Jonge et al., 2019b). The coverage percentages of the 95% Wald confidence
intervals were very similar to the results from the likelihood-based confidence intervals (see Table 4/SIM2 in De Jonge et al., 2019b). When the biserial correlation was used instead, the relative percentage bias in all path coefficients (βMX, βYM, and βYX) was below 5% (see Table 3). This bias could be considered as negligible according to the criteria that were applied. The relative percentage bias in the standard errors of all path coefficients was less than 10%, representing no substantial bias. However, note that the bias in the standard errors of βMX and βYM was always negative. The coverage percentages of the 95% likelihood-based confidence intervals of βMX and βYM were slightly below 95% and of βYX between 92.85% and 95.10% (see Table 3/SIM2 in De Jonge et al., 2019b). The coverage percentages of the 95% Wald confidence intervals were roughly the same as the likelihood-based confidence intervals (see Table 4/SIM2 in De Jonge et al., 2019b).

Indirect Effect
Table 4 shows that the relative percentage bias in the indirect effect of X on Y (βMX × βYM) was between 37.40% and 3.16% if the point-biserial correlation was used. In 15 of the 18 conditions this bias was above the set boundary of 5%, representing substantial bias. The bias was below 5% only in the conditions in which 25% of the primary studies artificially dichotomized variable X with a cut-off proportion of .5. The bias in the indirect effect increased when the percentage of primary studies that artificially dichotomized variable X increased. The bias in the indirect effect was always larger in conditions with a cut-off proportion of .1, compared with conditions in which the cut-off proportion was .5. When the population value of βMX increased, the bias in the indirect effect decreased. The coverage percentages of the 95% likelihood-based confidence intervals of this indirect effect were between 5.42% and 93.25%. If the biserial correlation was used instead, the relative percentage bias in the indirect effect of X on Y was below 5% in all conditions (see Table 4). The coverage percentages of the 95% likelihood-based confidence intervals of the indirect effect were slightly below 95%.
Discussion

We performed two Monte Carlo simulation studies in order to advise researchers how to deal with primary studies with artificially dichotomized predictor variables in MASEM. When the point-biserial correlation was used for the relation between the artificially dichotomized predictor X and the continuous variables, the path coefficient between the predictor and mediator (βMX) was systematically underestimated
no matter if a full or partial mediation model was fitted. When the biserial correlation instead of the point-biserial correlation was used, this path coefficient seems unbiased in each condition. If a full mediation model was fitted, the estimated path coefficient between the two continuous variables (βYM) could be considered unbiased in all conditions, no matter if the biserial or point-biserial correlation was used. However, if the partial mediation model was fitted, βYM seems overestimated if the point-biserial correlation was used, while this path coefficient seems unbiased if the biserial correlation was used instead. The extra direct effect in the partial mediation model (βYX) was systematically underestimated if the point-biserial correlation was used, but seems unbiased if the biserial correlation was used. The indirect effect of X on Y seems underestimated if the point-biserial correlation was used, regardless of whether the full or partial mediation model was fitted. If the biserial correlation was used instead, the indirect effect seems unbiased. Hence, using the point-biserial correlation often leads to bias in the path coefficients, while the biserial correlation leads to unbiased path coefficients.
Underestimation of the Standard Errors

The relative percentage bias in the standard errors of all path coefficients could typically be considered as insubstantial. For the MASEM analyses, the sampling variances and covariances of the correlation coefficients in the primary studies are estimated in order to meta-analyze the correlation matrices. To estimate the sampling variance of the biserial correlation, Jacobs and Viechtbauer (2017) showed that the so-called Soper's (1914) exact and approximate methods have the best overall performance for practical use in meta-analysis. However, to our knowledge, there are no existing formulas to estimate the sampling covariance between two biserial correlations, between a biserial and point-biserial correlation, and between a biserial and Pearson product-moment correlation. In our study, we therefore used the existing formulas for the sampling (co)variance of Pearson product-moment correlations, as implemented in the metaSEM package, and plugged in the biserial correlation coefficients instead. Our results suggest that using the formulas for Pearson product-moment correlations for the biserial correlation has, under the investigated conditions, no serious consequences in meta-analytic practice, since the relative percentage bias in the point estimates and standard errors of the direct effects could be considered negligible according to the criteria that were applied. However, we noticed that the relative percentage bias in the standard error of the path coefficient between the predictor and mediator (βMX) seems systematically negative when the biserial correlation was used. In accordance, the coverage percentages of the 95% likelihood-based and
33
Wald confidence intervals were always slightly underestimated. This is in accordance with the study of Jacobs and Viechtbauer (2017), who showed that if the biserial correlation is plugged into the formula for the sampling variance of the Pearson product-moment correlation, this generally leads to an underestimation of the true sampling variance. In contrast, when the point-biserial correlation was used in our simulation studies, the relative percentage bias in the standard error of βMX did not seem systematically negative. Additional simulations, in which we did not dichotomize the predictor variable X at all so that we could use Pearson product-moment correlations, also showed that the bias in the standard error of βMX does not seem systematically negative (for results, see Table 1/SIM1 and Table 1/SIM2 in De Jonge et al., 2019b). Future research is needed to further investigate this issue and to develop formulas to calculate the sampling covariance between two biserial correlations, between a biserial and point-biserial correlation, and between a biserial and Pearson product-moment correlation. Possibly, using an "incorrect" formula for the sampling (co)variances is not the only possible explanation of the negative bias in standard errors with the biserial correlation. By inspecting our results, we found that the relative percentage bias in the standard error of the path coefficient between the continuous variables M and Y (βYM) also seems systematically negative, regardless of whether the point-biserial or biserial correlation was used. In accordance, the coverage percentages of the 95% likelihood-based as well as the Wald confidence intervals were always underestimated. When the data were not dichotomized at all and the Pearson product-moment correlation was used, the relative percentage bias in the standard error of βYM also seems systematically negative (for results, see Table 1/SIM1 and Table 1/SIM2 in De Jonge et al., 2019b), and accordingly the coverage percentages were also always slightly below 95% (for results, see Tables 3–4/SIM1 and Tables 3–4/SIM2 in De Jonge et al., 2019b). We found the same patterns in the relative percentage bias in the standard errors of the pooled correlation coefficients between the continuous variables M and Y in Stage 1 (for results, see Table 8/SIM1 and Table 6/SIM2 in De Jonge et al., 2019b). One tentative cause of the underestimated standard errors and confidence intervals could be that the sampling (co)variances from the primary studies are treated as known in MASEM (similar to V-known models in univariate meta-analysis), while they are actually estimated. A similar underestimation of standard errors is found in univariate random-effects meta-analysis as a consequence of not taking into account the uncertainty due to estimating the between-study and sampling variance (Sánchez-Meca & Marín-Martínez, 2008; Viechtbauer, 2005). Note, however, that the bias that we found was within the limit of 10%, and often even below 5%, in all conditions. Future research
would be needed to verify the robustness of our results and to further investigate this issue.
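To make the plug-in approach described above concrete, the following sketch shows the familiar large-sample variance approximation for a Pearson correlation evaluated at a biserial estimate; the numbers are illustrative, and this is not necessarily the exact formula implemented in the metaSEM package.

# Illustrative "plug-in" computation (not metaSEM's internal code): the
# large-sample sampling variance of a Pearson correlation, evaluated at a
# biserial correlation estimate.
r_b <- .33   # biserial correlation from a primary study (illustrative)
n   <- 200   # its sample size (illustrative)
var_plugin <- (1 - r_b^2)^2 / (n - 1)   # Pearson-based sampling variance
var_plugin
# As discussed above, such a plug-in value will generally understate the true
# sampling variance of a biserial correlation, which is consistent with the
# slightly negative bias observed in the standard errors of βMX.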
Model Fit

In most conditions, the rejection rate of the chi-square test of model fit at Stage 2 of the random-effects TSSEM was slightly above the nominal α-level, no matter if the point-biserial or biserial correlation was used. When the predictor was not dichotomized at all and the Pearson product-moment correlation was used, the rejection rate was also slightly higher than .05 in almost all conditions (for results, see Table 5/SIM1 in De Jonge et al., 2019b). This finding is in accordance with simulation studies about fixed-effects TSSEM (Jak & Cheung, 2018a; Oort & Jak, 2016). Further research is needed to investigate why the chi-square test in TSSEM may provide rejection rates slightly above the nominal α-level.
Strengths, Limitations, and Recommendations

This is the first simulation study in which the effect of using the point-biserial correlation versus the biserial correlation for the relation between an artificially dichotomized variable and a continuous variable on MASEM parameters and model fit is investigated. We chose two realistic population models and 18 different realistic conditions that can occur in educational research. This provides a clear first impression of the effect of using the point-biserial correlation versus the biserial correlation in mediation models using MASEM. However, further research is needed to investigate the effect in other research settings in which MASEM will be applied, for example, in more complex models that include more than three variables or moderation effects. Future research is also needed with regard to further assumptions of the models. Maximum likelihood, for instance, assumes that the population distributions of the endogenous variables are multivariate normally distributed (Kline, 2015). In these simulation studies, we artificially dichotomized the exogenous variable (i.e., predictor variable X), which does not lead to a violation of this assumption. Therefore, further research is needed to investigate the effect of using the point-biserial versus the biserial correlation if an endogenous variable is artificially dichotomized, so that the normality assumption is violated.
Conclusion
We advise researchers who want to apply MASEM to investigate mediation to convert the effect size between any artificially dichotomized predictor and a continuous variable to a biserial correlation, not to a point-biserial correlation.
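As a pointer for applying this advice, a minimal R sketch of the standard point-biserial-to-biserial conversion is shown below. The function name and example values are ours, not the authors'; for the corresponding sampling variance, readers should consult Jacobs and Viechtbauer (2017).

```r
# Sketch: converting a point-biserial correlation to a biserial correlation,
# given the proportion p of cases in one group of the dichotomized variable.
pointbiserial_to_biserial <- function(r_pb, p) {
  h <- dnorm(qnorm(p))          # standard normal density at the dichotomization threshold
  r_pb * sqrt(p * (1 - p)) / h  # classic conversion (cf. Jacobs & Viechtbauer, 2017)
}

# Hypothetical example: point-biserial r = .30 from a median split (p = .50)
pointbiserial_to_biserial(r_pb = .30, p = .50)  # approximately .38
```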
References
Ahn, S., Ames, A. J., & Myers, N. D. (2012). A review of meta-analyses in education: Methodological strengths and weaknesses. Review of Educational Research, 82, 436–476. https://doi.org/10.3102/0034654312458162
Becker, B. J. (2009). Model-based meta-analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 377–395). New York, NY: Russell Sage Foundation.
Becker, B. J. (1992). Using results from replicated studies to estimate linear models. Journal of Educational Statistics, 17, 341–362. https://doi.org/10.2307/1165128
Becker, B. J. (1995). Corrections to "Using results from replicated studies to estimate linear models". Journal of Educational and Behavioral Statistics, 20, 100–102. https://doi.org/10.2307/1165390
Cheung, M. W.-L. (2009). Comparison of methods for constructing confidence intervals of standardized indirect effects. Behavior Research Methods, 41, 425–438. https://doi.org/10.3758/BRM.41.2.425
Cheung, M. W.-L. (2014). Fixed- and random-effects meta-analytic structural equation modeling: Examples and analyses in R. Behavior Research Methods, 46, 29–40. https://doi.org/10.3758/s13428-013-0361-y
Cheung, M. W.-L. (2015a). Meta-analysis: A structural equation modeling approach. Chichester, UK: Wiley.
Cheung, M. W.-L. (2015b). metaSEM: An R package for meta-analysis using structural equation modeling. Frontiers in Psychology, 5, 1521. https://doi.org/10.3389/fpsyg.2014.01521
Cheung, M. W.-L., & Chan, W. (2005). Meta-analytic structural equation modeling: A two-stage approach. Psychological Methods, 10, 40–64. https://doi.org/10.1037/1082-989X.10.1.40
Cheung, M. W.-L., & Hafdahl, A. R. (2016). Special issue on meta-analytic structural equation modeling: Introduction from the guest editors. Research Synthesis Methods, 7, 112–120. https://doi.org/10.1002/jrsm.1212
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7, 249–253. https://doi.org/10.1177/014662168300700301
De Jonge, H., & Jak, S. (2018, June). A meta-meta-analysis: Identifying typical conditions of meta-analyses in educational research. Paper presented at Research Synthesis 2018, Trier, Germany. http://dx.doi.org/10.23668/psycharchives.853
De Jonge, H., Jak, S., & Kan, K. J. (2019a). Electronic supplementary materials belonging to the paper "Dealing with artificially dichotomized variables in meta-analytic structural equation modeling" [R scripts]. PsychArchives. https://doi.org/10.23668/psycharchives.2618
De Jonge, H., Jak, S., & Kan, K. J. (2019b). Electronic supplementary materials belonging to the paper "Dealing with artificially dichotomized variables in meta-analytic structural equation modeling" [Supplementary materials]. PsychArchives. https://doi.org/10.23668/psycharchives.2617
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. The Educational Researcher, 10, 3–8. https://doi.org/10.3102/0013189X005010003
Hafdahl, A. R. (2007). Combining correlation matrices: Simulation analysis of improved fixed-effects methods. Journal of Educational and Behavioral Statistics, 32, 180–205. https://doi.org/10.3102/1076998606298041
Hagger, M. S., & Chatzisarantis, N. L. (2016). The trans-contextual model of autonomous motivation in education: Conceptual and empirical issues and meta-analysis. Review of Educational Research, 86, 360–407. https://doi.org/10.3102/0034654315585005
Hoogland, J. J., & Boomsma, A. (1998). Robustness studies in covariance structure modeling: An overview and a meta-analysis. Sociological Methods & Research, 26, 329–367. https://doi.org/10.1177/0049124198026003003
Hunter, J. E., & Schmidt, F. L. (1990). Dichotomization of continuous variables: The implications for meta-analysis. Journal of Applied Psychology, 75, 334–349. https://doi.org/10.1037/0021-9010.75.3.334
Jacobs, P., & Viechtbauer, W. (2017). Estimation of the biserial correlation and its sampling variance for use in meta-analysis. Research Synthesis Methods, 8, 161–180. https://doi.org/10.1002/jrsm.1218
Jak, S. (2015). Meta-analytic structural equation modelling. Springer.
Jak, S., & Cheung, M. W.-L. (2018a). Accounting for missing correlation coefficients in fixed-effects MASEM. Multivariate Behavioral Research, 53, 1–14. https://doi.org/10.1080/00273171.2017.1375886
Jak, S., & Cheung, M. W.-L. (2018b). Testing moderator hypotheses in meta-analytic structural equation modeling using subgroup analysis. Behavior Research Methods, 50, 1359–1373. https://doi.org/10.3758/s13428-018-1046-3
Jansen, D., Elffers, L., & Jak, S. (2019). The functions of shadow education in school careers: A systematic review. Manuscript submitted for publication.
Ke, Z., Zhang, Q., & Tong, X. (2018). Bayesian meta-analytic SEM: A one-stage approach to modeling between-studies heterogeneity in structural parameters. Structural Equation Modeling, 26, 348–370. https://doi.org/10.1080/10705511.2018.1530059
Kline, R. B. (2015). Principles and practice of structural equation modeling. New York, NY: The Guilford Press.
Lev, J. (1949). The point biserial coefficient of correlation. The Annals of Mathematical Statistics, 20, 125–126. https://doi.org/10.1214/aoms/1177730103
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40. https://doi.org/10.1037/1082-989X.7.1.19
Maxwell, S. E., & Delaney, H. D. (1993). Bivariate median splits and spurious statistical significance. Psychological Bulletin, 113, 181. https://doi.org/10.1037/0033-2909.113.1.181
Montazemi, A. R., & Qahri-Saremi, H. (2015). Factors affecting adoption of online banking: A meta-analytic structural equation modeling study. Information & Management, 52, 210–226. https://doi.org/10.1016/j.im.2014.11.002
Oort, F. J., & Jak, S. (2016). Maximum likelihood estimation in meta-analytic structural equation modeling. Research Synthesis Methods, 7, 156–167. https://doi.org/10.1002/jrsm.1203
Pearson, K. (1909). On a new method of determining correlation between a measured character a, and a character b, of which only the percentage of cases wherein b exceeds (or falls short of) a given intensity is recorded for each grade of a. Biometrika, 7, 96–105. https://doi.org/10.2307/2345365
R Core Team. (2019). R: A language and environment for statistical computing. Retrieved from http://www.R-project.org
Rich, A., Brandes, K., Mullan, B., & Hagger, M. S. (2015). Theory of planned behavior and adherence in chronic illness: A meta-analysis. Journal of Behavioral Medicine, 38, 673–688. https://doi.org/10.1007/s10865-015-9644-3
Sánchez-Meca, J., & Marín-Martínez, F. (2008). Confidence intervals for the overall effect size in random-effects meta-analysis. Psychological Methods, 13, 31. https://doi.org/10.1037/1082-989X.13.1.31
Tate, R. F. (1950). The biserial and point correlation coefficients (Naval Research Project NR 042031). Retrieved from https://www4.stat.ncsu.edu/boos/library/mimeo.archive/ISMS__14.pdf
Tate, R. F. (1954). Correlation between a discrete and a continuous variable. Point-biserial correlation. The Annals of Mathematical Statistics, 25, 603–607.
Tate, R. F. (1955). The theory of correlation between two continuous variables when one is dichotomized. Biometrika, 42, 205–216. https://doi.org/10.2307/2333437
Sheng, Z., Kong, W., Cortina, J. M., & Hou, S. (2016). Analyzing matrices of meta-analytic correlations: Current practices and recommendations. Research Synthesis Methods, 7, 187–208. https://doi.org/10.1002/jrsm.1206
Soper, H. E. (1914). On the probable error of the bi-serial expression for the correlation coefficient. Biometrika, 10, 384–390. https://doi.org/10.2307/2331789
Vargha, A., Rudas, T., Delaney, H. D., & Maxwell, S. E. (1996). Dichotomization, partial correlation, and conditional independence. Journal of Educational and Behavioral Statistics, 21, 264–282. https://doi.org/10.3102/10769986021003264
Viechtbauer, W. (2005). Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics, 30, 261–293.
Viswesvaran, C., & Ones, D. (1995). Theory testing: Combining psychometric meta-analysis and structural equations modeling. Personnel Psychology, 48, 865–885. https://doi.org/10.1111/j.1744-6570.1995.tb01784.x
Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61, 726–728. https://doi.org/10.1037/0003-066X.61.7.726

History
Received May 14, 2019
Revision received October 13, 2019
Accepted November 16, 2019
Published online March 31, 2020

Open Data
R scripts for simulation study 1 (full mediation) and simulation study 2 (partial mediation) are available in De Jonge, H., Jak, S., & Kan, K. J. (2019a). Tables with additional results of both simulation studies as well as QQ-plots of simulation study 1 are available in De Jonge, H., Jak, S., & Kan, K. J. (2019b).

Hannelies de Jonge
Department of Child Development and Education
University of Amsterdam
PO Box 15776
1001 NG Amsterdam
The Netherlands
H.deJonge@uva.nl
Original Article
Assessing the Quality of Systematic Reviews in Healthcare Using AMSTAR and AMSTAR2: A Comparison of Scores on Both Scales
Karina Karolina De Santis¹ and Ilkay Kaplan²
¹Faculty 3 Social Sciences, City University of Applied Sciences, Bremen, Germany
²Faculty 11 Human and Health Sciences, Institute of Psychology, University of Bremen, Germany
Abstract: The current study assessed the consistency between A Measurement Tool to Assess Systematic Reviews (AMSTAR) and its updated version (AMSTAR2) applied to the same systematic reviews in healthcare. Data from k = 10 systematic reviews were coded by two raters using AMSTAR and AMSTAR2. AMSTAR and AMSTAR2 perfectly agreed on a subset of nine individual items and strongly correlated based on the total scores (percentage scores: ρ = .84, p = .002, k = 10; absolute scores: Spearman ρ = .83, p = .003, k = 10). The overall review quality was medium to high on AMSTAR, while the overall confidence in the results was low to critically low on AMSTAR2 for the same systematic reviews. AMSTAR2 can identify the sources of strengths, weaknesses, and biases in systematic reviews. However, the interpretation of the overall confidence in results of systematic reviews requires additional guidelines for users. Keywords: AMSTAR, AMSTAR2, systematic review, quality appraisal
Background
Systematic reviews are frequently used in healthcare to guide evidence-based decision-making and future research. However, the quality of such reviews is often inadequate, leading to poor replicability of their conclusions (Kedzior & Seehoff, 2018; Lakens, Hilgard, & Staaks, 2016; Matthias et al., 2019). It has even been suggested that some 60% of the currently available systematic reviews in healthcare are misleading, redundant, or flawed (Ioannidis, 2016). Therefore, quality assessment of systematic reviews is necessary to determine the confidence in the available scientific evidence.
One method of assessing the quality of systematic reviews is A Measurement Tool to Assess Systematic Reviews (AMSTAR; Shea, Grimshaw, et al., 2007). AMSTAR is an 11-item scale designed to evaluate the quality of systematic reviews of randomized controlled trials (RCTs) in terms of literature searches, data coding, data synthesis, and quality of primary data. The items are scored as Yes, No, Can't Answer, or Not Applicable (Shea, Grimshaw, et al., 2007). The total AMSTAR score is typically based on the sum of Yes ratings (Pieper, Koensgen, Breuing, Ge, & Wegewitz, 2018) and ranges from 0 to 11 points, indicating minimum to maximum quality. Although the psychometric properties of AMSTAR were shown to be acceptable by its developers (Shea, Bouter, et al., 2007; Shea, Grimshaw, et al., 2007; Shea et al., 2009), the scale was criticized by users (Burda, Holmer, & Norris, 2016; Faggion, 2015; Pollock, Fernandes, & Hartling, 2017; Wegewitz, Weikert, Fishta, Jacobs, & Pieper, 2016). In general, users noted that AMSTAR should focus on the methodological rather than the reporting quality, that item wording and scoring instructions should be revised, that the total quality score should not be computed, and that the scale should be updated and extended. The limited guidance regarding the use of AMSTAR in terms of reporting, interpreting, and applying the quality scores (Pollock, Fernandes, Becker, Featherstone, & Hartling, 2016) probably contributed to the poor reporting of AMSTAR assessments (Pieper et al., 2018) and to the need for additional decision rules (Pollock et al., 2017).
In order to address these limitations, a revised version of AMSTAR (AMSTAR2) was developed 10 years after the original scale (Shea et al., 2017). AMSTAR2 consists of 16 items (reworded original items and new items) and new guidelines for scoring the items and for rating the overall quality of systematic reviews. The items are scored as Yes, Partial Yes, No, or No Meta-analysis Conducted (Shea et al., 2017). Unlike AMSTAR, AMSTAR2 assesses the quality of systematic reviews not only of RCTs but also of non-randomized studies of healthcare interventions. The new items on AMSTAR2 reflect current developments designed to improve the reporting and quality of research. For example, one item tests whether research questions and inclusion criteria include the Population, Intervention, Control, Outcome (PICO) components required for healthcare interventions. Other new items address issues such as the presence of a review protocol, the justification for excluding studies, appropriate risk of bias assessment, and the sources of funding for the primary studies. In terms of its psychometric properties, AMSTAR2 has moderate to high interrater reliability for most items (Cohen's κ of .31–.92), although the kappa values vary depending on the pair of assessors and the intervention type (Shea et al., 2017). Another independent group reported similar results (Fleiss κ of .31–.80) for all items except Item 1 (components of PICO included in the research questions and inclusion criteria), which had a low κ of .15 (Lorenz et al., 2019). The convergent validity of AMSTAR2 is high, based on strong correlations with the Yes ratings on AMSTAR and on another tool, the Risk of Bias in Systematic Reviews (ROBIS), and on a high agreement between AMSTAR2 overall confidence ratings and the risk of bias assessment on ROBIS (Lorenz et al., 2019).
AMSTAR2 has some advantages relative to AMSTAR. First, the wording of items has improved, and the double-barrelled items on AMSTAR are included as separate items on AMSTAR2. Second, the new scoring criteria include the Partial Yes rating for partial adherence to a specific domain, exclude the Can't Answer rating, and replace the Not Applicable rating with the No Meta-Analysis Conducted rating. Thus, the new scoring criteria (Yes, Partial Yes, No, No Meta-Analysis Conducted) are more precise relative to AMSTAR. Furthermore, if no information is provided or the information is inadequate, the item receives the No rating on AMSTAR2 rather than the ambiguous Can't Answer rating on AMSTAR. Third, AMSTAR2 is not designed to focus on the total score but rather on fulfilling the critical requirements for the overall confidence in the results of the systematic review (Shea et al., 2017).
Since AMSTAR and AMSTAR2 differ substantially, it remains unclear whether they produce similar quality ratings for systematic reviews in healthcare. The comparison of such ratings is necessary because AMSTAR is likely to remain in use: AMSTAR2 is still relatively new (published in late 2017), and the Cochrane Collaboration guidelines for conducting overviews of systematic reviews still mention AMSTAR as one of the available tools for the quality assessment of systematic reviews (Becker & Oxman, 2008). Therefore, the aim of the current study was to assess the consistency between AMSTAR and AMSTAR2 scores applied to the same systematic reviews of a complementary intervention (Tai Chi) for Parkinson's disease. While the current study was under peer review, another group published a study comparing AMSTAR and AMSTAR2 scores in systematic reviews of pharmacological or psychological interventions for major depression (Lorenz et al., 2019). Thus, a further aim of the current study was to compare our results with the findings of that study (Lorenz et al., 2019).
Method
The current study utilizes the data from the overview of systematic reviews of a complementary intervention (Tai Chi) for Parkinson's disease (Kedzior & Kaplan, 2019).
Data Source
A search of the electronic databases PubMed and PsycInfo for studies with the terms "Parkinson's Disease" and "Tai Chi" and "review" in titles/abstracts, conducted up to February 2018 (Kedzior & Kaplan, 2019), identified k = 10 relevant systematic reviews (available at http://dx.doi.org/10.23668/psycharchives.2650, Table S1). Both authors conducted the search and selected the systematic reviews for the current study. All of the k = 10 systematic reviews included mostly RCTs, and half additionally included one to four non-RCTs (observational primary studies) (Kedzior & Kaplan, 2019).
Instruments
AMSTAR consists of 11 items scored as Yes (1 point), No, Can't Answer, or Not Applicable (all assigned 0 points) (Shea, Grimshaw, et al., 2007). The total AMSTAR score for each systematic review was expressed as an absolute score (sum of all Yes ratings) and as a percentage score (sum of all Yes ratings/11 × 100). The overall quality was rated based on the absolute AMSTAR score as high (8–11 points), medium (4–7 points), or low (< 4 points), according to the classification system used by others (Lorenz et al., 2019).
AMSTAR2 consists of 16 items (Shea et al., 2017). Eight items are scored as Yes or No, and eight items are scored as Yes, No, Partial Yes, or No Meta-analysis Conducted (Shea et al., 2017). We scored the items according to the classification system used by others (Lorenz et al., 2019): Yes = 1 point, Partial Yes = .5 point, No or No Meta-analysis Conducted = 0 points. The total AMSTAR2 score for each systematic review was expressed as an absolute score (sum of all Yes and Partial Yes ratings) and as a percentage score (sum of all Yes and Partial Yes ratings/13 × 100 for systematic reviews without meta-analysis, or /16 × 100 for systematic reviews with meta-analysis). The interpretation of AMSTAR2 is based on the non-critical and critical items necessary for rating the overall confidence in the results of the systematic review and ranges from "high confidence" (one non-critical weakness) and "moderate confidence" (more than one non-critical weakness) to "low confidence" (one critical weakness) and "critically low confidence" (more than one critical weakness) (Shea et al., 2017).
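To make the scoring rules above concrete, the following R sketch (our illustration, not part of the authors' workflow, which used SPSS) computes the absolute and percentage AMSTAR2 score from a vector of item ratings; the example ratings are hypothetical.

```r
# Sketch: absolute and percentage AMSTAR2 scores from item ratings,
# following the scoring described in the text (Yes = 1, Partial Yes = .5, otherwise 0).
score_amstar2 <- function(ratings, has_meta_analysis) {
  points <- c("Yes" = 1, "Partial Yes" = 0.5, "No" = 0, "No Meta-analysis Conducted" = 0)
  absolute <- sum(points[ratings])
  n_items <- if (has_meta_analysis) 16 else 13   # applicable items per review type
  c(absolute = absolute, percentage = 100 * absolute / n_items)
}

# Hypothetical review without meta-analysis, rated on 13 applicable items
score_amstar2(ratings = c("Yes", "Partial Yes", "No", rep("Yes", 10)),
              has_meta_analysis = FALSE)
```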
Procedure
All k = 10 systematic reviews were rated with AMSTAR in March 2018 and with AMSTAR2 in June 2018 by both authors independently. Item scores were compared between both authors during two discussion sessions, one for each scale (AMSTAR in March and AMSTAR2 in June). Any inconsistencies in item scoring were resolved by consensus; there were only minor disagreements on AMSTAR or AMSTAR2 scores between the authors. The data in all systematic reviews (including their location in the review) were coded into self-developed tables consisting of the AMSTAR and AMSTAR2 items. AMSTAR2 scores for the k = 10 systematic reviews have already been reported elsewhere (Kedzior & Kaplan, 2019, Table 3, p. 148) for other purposes (to assess the scientific quality of evidence presented in the reviews). The individual scores for each item in the k = 10 systematic reviews are reported in Tables S1–S3 for AMSTAR and Tables S4–S7 for AMSTAR2 (available at http://dx.doi.org/10.23668/psycharchives).
AMSTAR and AMSTAR2 were compared on two levels: (1) ratings of the individual items and (2) ratings of the overall quality of each systematic review. The ratings of the individual items were compared for nine items with the same or similar content, because the number and the wording of items have changed substantially from AMSTAR to AMSTAR2 (the list of the nine items is shown in Table S8, available at http://dx.doi.org/10.23668/psycharchives.2650). This was done by descriptively comparing the percentage of systematic reviews (out of k = 10) that received Yes ratings on AMSTAR and Yes and Partial Yes ratings on AMSTAR2 on each of the nine items individually. The ratings of the overall quality were compared by correlating the total AMSTAR scores (sum of Yes ratings) and the total AMSTAR2 scores (sum of Yes and Partial Yes ratings) for each systematic review using bivariate, two-tailed Pearson and Spearman correlations in IBM SPSS 25. We report the outcomes of the Spearman correlations to make the current results comparable to those reported in the recent study (Lorenz et al., 2019). Furthermore, the overall quality ratings on AMSTAR (high, medium, or low) and the overall confidence ratings on AMSTAR2 (high, moderate, low, or critically low) were compared descriptively for all systematic reviews.
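As an illustration in R (the original analysis was run in SPSS), the Spearman correlation of the absolute scores can be reproduced from the values reported in Table 1:

```r
# Total AMSTAR scores (sum of Yes) and total AMSTAR2 scores (sum of Yes + Partial Yes)
# for Reviews 1-10, taken from Table 1.
amstar  <- c(8, 4, 7, 4, 9, 7, 8, 8, 7, 7)
amstar2 <- c(8.5, 4, 8.5, 6, 10.5, 8.5, 9.5, 11, 7.5, 9.5)

# Spearman's rho (ties make the p-value approximate); yields rho of about .83
cor.test(amstar, amstar2, method = "spearman")
```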
Results
AMSTAR Versus AMSTAR2: Individual Items
There was perfect agreement between AMSTAR and AMSTAR2 on all nine items with the same or similar content (available at http://dx.doi.org/10.23668/psycharchives.2650, Table S8). Both scales agreed on the following aspects of the systematic reviews: (1) presence of a review protocol, (2) comprehensive literature search strategy, (3) duplicate study selection and data coding, (4) excluded studies reported, (5) adequate study details reported, (6) quality of data/risk of bias assessed, (7) quality of data/risk of bias discussed, and (8) conflict of interest in the review reported.
AMSTAR Versus AMSTAR2: Overall Quality
AMSTAR and AMSTAR2 were compared based on the total scores and the interpretation of the overall quality in the same k = 10 systematic reviews (Table 1). The k = 10 systematic reviews obtained total AMSTAR scores (sum of Yes ratings) of 4–9 points. The Yes ratings were assigned to more than half of all 11 AMSTAR items in 80% of the systematic reviews. The total AMSTAR scores tended to be higher for systematic reviews with meta-analysis (mode: 7; range: 7–9) than for systematic reviews without meta-analysis (mode: 4; range: 4–8). The same k = 10 systematic reviews obtained total AMSTAR2 scores (sum of Yes and Partial Yes ratings) of 4–11 points. The Yes and Partial Yes ratings were assigned to more than half of all 16 AMSTAR2 items in 70% of the systematic reviews.
The agreement between the total AMSTAR and AMSTAR2 scores assigned to the same k = 10 systematic reviews was high. There were strong positive correlations between the total AMSTAR scores (sum of Yes ratings) and the total AMSTAR2 scores (sum of Yes and Partial Yes ratings): percentage scores, ρ = .84, p = .002, k = 10, and absolute scores, Spearman ρ = .83, p = .003, k = 10 (available at http://dx.doi.org/10.23668/psycharchives.2650, Table S9 and Figures S1 and S2, respectively).
There was a disagreement in the interpretation of the overall quality of the same k = 10 systematic reviews between AMSTAR and AMSTAR2. The overall quality was either high (8–9 points) in 40% or medium (4–7 points) in 60% of the systematic reviews on AMSTAR. AMSTAR2 showed that all k = 10 systematic reviews had between one and five critical weaknesses (Table 1). The following critical weaknesses were observed: (1) no review protocol (in 9/10 reviews), (2) no list of excluded studies (in 9/10 reviews), (3) inadequately reported methods of meta-analysis (in 6/6 reviews with meta-analysis), (4) no discussion of the risk of bias (in 3/10 reviews), and (5) no publication bias assessment if meta-analysis was conducted (in 6/6 reviews with meta-analysis; Figure 1). Based on the critical weaknesses, the overall confidence in the results was low in 10% or critically low in 90% of the systematic reviews on AMSTAR2.
Table 1. Overall quality of k = 10 systematic reviews on AMSTAR and AMSTAR2

| Systematic review^a | Total AMSTAR score^b (%) | Overall quality rating (AMSTAR) | Yes (AMSTAR2)^c | Partial Yes (AMSTAR2)^c | Total AMSTAR2 score (%) | Total critical weaknesses | Overall confidence rating (AMSTAR2) | Critical weakness items |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No meta-analysis (AMSTAR % of 11; AMSTAR2 % of 13) | | | | | | | | |
| Review 1 | 8 (73%) | High | 8 | 1 | 8.5 (65%) | 1 | Low | 2 |
| Review 2 | 4 (36%) | Medium | 3 | 2 | 4 (31%) | 3 | Critically low | 2, 7, 13 |
| Review 3 | 7 (64%) | Medium | 8 | 1 | 8.5 (65%) | 2 | Critically low | 2, 7 |
| Review 4 | 4 (36%) | Medium | 5 | 2 | 6 (46%) | 3 | Critically low | 2, 7, 13 |
| With meta-analysis (AMSTAR % of 11; AMSTAR2 % of 16) | | | | | | | | |
| Review 5 | 9 (82%) | High | 10 | 1 | 10.5 (66%) | 3 | Critically low | 7, 11, 15 |
| Review 6 | 7 (64%) | Medium | 8 | 1 | 8.5 (53%) | 5 | Critically low | 2, 7, 11, 13, 15 |
| Review 7 | 8 (73%) | High | 9 | 1 | 9.5 (59%) | 4 | Critically low | 2, 7, 11, 15 |
| Review 8 | 8 (73%) | High | 11 | 0 | 11 (69%) | 4 | Critically low | 2, 7, 11, 15 |
| Review 9 | 7 (64%) | Medium | 7 | 1 | 7.5 (47%) | 4 | Critically low | 2, 7, 11, 15 |
| Review 10 | 7 (64%) | Medium | 9 | 1 | 9.5 (59%) | 4 | Critically low | 2, 7, 11, 15 |

Note. ^aThe list of k = 10 systematic reviews is reported in Table S1 (available at http://dx.doi.org/10.23668/psycharchives.2650). ^bAMSTAR: Yes = 1 point; total AMSTAR score = sum of Yes (0–11 points indicating minimum–maximum quality); overall quality rating: high 8–11 points, medium 4–7 points, low < 4 points. ^cAMSTAR2: Yes = 1 point; Partial Yes = .5 point; total AMSTAR2 score = sum of Yes + Partial Yes; overall confidence rating: high (one non-critical weakness), moderate (> one non-critical weakness), low (one critical weakness), critically low (> one critical weakness). AMSTAR = A Measurement Tool to Assess Systematic Reviews (original and revised version 2); k = Number of Systematic Reviews.
Figure 1. Critical item scores on AMSTAR2 in k = 10 systematic reviews. Numbers in brackets show the percentage of systematic reviews (out of k = 10) that fulfilled each critical item: Item 2, review protocol available (10%); Item 4, comprehensive literature search conducted (100%); Item 7, list of excluded studies provided (10%); Item 9, risk of bias assessed (100%); Item 11, appropriate (meta-analytical) methods applied (0%); Item 13, risk of bias discussed (70%); Item 15, publication bias assessed (0%). AMSTAR2 = A Measurement Tool to Assess Systematic Reviews (revised version 2); k = Number of Systematic Reviews.
Comparison Between Current Results and Another Study (Lorenz et al., 2019)
There was a remarkably high concordance in results between the current study and another recent (independent) study (Lorenz et al., 2019) using two samples of systematic reviews of different healthcare interventions (Table 2). Despite the different sample sizes (k = 60, or a subset of 30, systematic reviews in Lorenz et al., 2019, and k = 10 systematic reviews in our study), both studies reported high agreement between the total AMSTAR and AMSTAR2 scores, medium to high quality of 80–100% of the systematic reviews on AMSTAR, and low to critically low overall confidence in the results of 95–100% of the systematic reviews on AMSTAR2.

Discussion
The current study shows that AMSTAR and AMSTAR2 perfectly agreed on a subset of individual items testing similar content and highly agreed on their total scores in the same k = 10 systematic reviews. However, while all systematic reviews had medium to high quality on AMSTAR, the confidence in the results of the same reviews was rated as low to critically low on AMSTAR2. Interestingly, the results of our study with a small sample of systematic reviews closely resemble those of another (independent) study with 60 systematic reviews of different healthcare interventions (Lorenz et al., 2019).
Table 2. Comparison to the study by Lorenz et al. (2019)

| | Lorenz et al. (2019)^c | Current study |
| --- | --- | --- |
| Systematic reviews | k = 60 | k = 10 |
| Field | Pharmacological or psychological interventions for the treatment of major depression | Complementary intervention (Tai Chi) in the treatment of Parkinson's disease |
| Correlation AMSTAR vs. AMSTAR2 (absolute Yes + Partial Yes) | ρ = .91, p < .001, k = 30 | ρ = .83, p = .003, k = 10 |
| Overall quality rating (AMSTAR)^a | | |
| Low | 6/30 (20%) | 0/10 (0%) |
| Medium | 15/30 (50%) | 6/10 (60%) |
| High | 9/30 (30%) | 4/10 (40%) |
| Overall confidence rating (AMSTAR2)^b | | |
| Critically low | 53/60 (88%) | 9/10 (90%) |
| Low | 4/60 (7%) | 1/10 (10%) |
| Moderate | 1/60 (2%) | 0/10 (0%) |
| High | 2/60 (3%) | 0/10 (0%) |

Note. ^aScoring: Yes = 1 point, Partial Yes = .5 point; overall quality rating (total AMSTAR score = sum of Yes ratings): high 8–11 points, medium 4–7 points, low < 4 points. ^bOverall confidence rating (AMSTAR2): high (one non-critical weakness), moderate (> one non-critical weakness), low (one critical weakness), critically low (> one critical weakness). ^cOnly a subset of 30 systematic reviews was scored using AMSTAR. AMSTAR = A Measurement Tool to Assess Systematic Reviews (original and revised version 2); k = Number of Systematic Reviews; ρ = Spearman's Correlation Coefficient.
Since the quality of the items on AMSTAR2 has substantially improved relative to AMSTAR, AMSTAR2 may be the preferred instrument to assess the quality of systematic reviews. However, additional decision rules and further revisions to item wording, content, and scoring guidelines may be required to improve the interpretation of the overall confidence in systematic reviews on AMSTAR2.
Due to the improvements in item wording and the inclusion of new items, AMSTAR2 is probably better than AMSTAR at assessing the quality of some aspects of systematic reviews. These include the review preparation, literature search and study selection, data coding and reporting, risk of bias assessment methods and interpretation, and the conflict of interest statement. Although items addressing the presence of the review protocol and the formulation of the research questions along the PICO criteria are necessary to determine whether the review was well planned and designed, the selection of study designs is often not explicitly stated in systematic reviews and is thus difficult to rate on AMSTAR2 (Matthias et al., 2019). A clear advantage of AMSTAR2 over AMSTAR is its recognition that the empirical evidence regarding some healthcare interventions relies on real-world observational designs. Thus, the extensive guidelines regarding the assessment of the risk of bias in non-randomized intervention studies are especially helpful, although it is acknowledged that multiple instruments are available for such purposes (Shea et al., 2017). Furthermore, AMSTAR2 assesses not only the conflict of interest in the systematic reviews but also the potential carryover of a conflict from the primary studies to the systematic review. Thus, AMSTAR2 is a useful tool to assess some methodological aspects of past systematic reviews and to guide future systematic reviews.
One major limitation of AMSTAR2 is that it does not appear to adequately assess the quality of data synthesis in systematic reviews, for three reasons. First, AMSTAR2 focuses on the quality of the quantitative synthesis (meta-analysis) alone (Item 11). However, the qualitative data synthesis should also be assessed as one critical domain for determining review quality. The qualitative data synthesis in systematic reviews typically involves reporting the number of primary studies with statistically significant versus non-significant outcomes and assuming that the statistical tests conducted in the primary studies were appropriately chosen. However, the presence of statistical significance often depends on the sample size, statistical power, and the type of statistical test used in the primary studies. Thus, qualitatively pooling primary studies with similar effect sizes or assessing trends in descriptive data for specific patterns may be more useful than focusing on statistical significance alone in a qualitative data synthesis. Therefore, a systematic assessment of how the results of primary studies were qualitatively synthesized is necessary to decide about the quality of, or the confidence in, the results of systematic reviews without meta-analysis. Second, AMSTAR2 does not investigate in detail how the data were synthesized quantitatively (Item 11). Although some rating guidelines for Item 11 are provided, the rating of this item relies on subjective decisions that do not adequately reflect the complexity of a statistical meta-analysis. For example, a correct meta-analytic model with an appropriate weighting technique can be applied to pool incorrectly computed study effect sizes. The choice and the computation of effect sizes are particularly heterogeneous in studies of healthcare interventions that include multiple treatment groups compared to the same control group and multiple (pre vs. post vs. follow-up) measurements (Kedzior & Seehoff, 2018). Third, AMSTAR2 focuses on the assessment of publication bias in a meta-analysis (Item 15), although such bias is likely to equally affect quantitative and qualitative data synthesis. There is currently no consensus on how to correctly assess publication bias in either qualitative data synthesis or quantitative meta-analyses. Taken together, AMSTAR2 is similar to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist (Moher, Liberati, Tetzlaff, Altman, & PRISMA Group, 2009), which assesses the quality of reporting but not the methodological quality of data synthesis in systematic reviews. Thus, AMSTAR2 requires additional items and/or additional guidelines for rating the quality of descriptive (qualitative) data synthesis, quantitative data synthesis (meta-analysis), and publication bias assessment.
The results of the current study and the study by Lorenz et al. (2019) suggest that the interpretation of the overall quality ratings of systematic reviews is not consistent between AMSTAR and AMSTAR2. On the one hand, there is high agreement between the total scores on AMSTAR and AMSTAR2, indicating that both scales agree on items of sufficient quality. On the other hand, there is high disagreement on the overall quality ratings based on items not fulfilling sufficient quality. One explanation for such a discrepancy in the interpretation of the overall ratings could be that "quality" and "confidence" are different constructs that should not be compared in the first place. Furthermore, it appears that AMSTAR is too lenient, while AMSTAR2 is too conservative, at rating the quality of systematic reviews. AMSTAR assesses the degree of quality (from low to high) (Shea, Grimshaw, et al., 2007), and the presence of No scores on AMSTAR reduces the quality of systematic reviews without downgrading them to the lowest quality category (Lorenz et al., 2019). In contrast, AMSTAR2 assesses the overall confidence in the results of the systematic review (from critically low to high) based on weaknesses in critical or non-critical item domains (Shea et al., 2017). High confidence is assigned for an appropriate literature search, risk of bias assessment and interpretation, quantitative data synthesis, and publication bias analysis. However, one critical weakness is sufficient to downgrade a systematic review to the lowest confidence category (critically low). In the case of our study, all k = 10 systematic reviews had the same two critical weaknesses: no review protocol and/or no list of excluded studies. Since preregistering the review protocol and reporting all coding decisions are measures designed to reduce questionable research practices, both issues should be included in the journal guidelines for authors to enhance the quality of systematic
reviews and to prevent the downgrading of such reviews to the lowest confidence category on AMSTAR2. One solution to prevent the disagreement between AMSTAR and AMSTAR2 is to avoid computing the overall quality ratings, as recommended by the users (Burda et al., 2016; Lorenz et al., 2019; Pieper et al., 2018) and by the developers of AMSTAR2 (Shea et al., 2017). Although the interpretation of weaknesses can be applied flexibly by assessors on AMSTAR2 (Shea et al., 2017), it remains unclear how to choose the appropriate critical items for the quality assessment and whether confidence and quality are the same or different constructs. The individual item ratings are helpful to systematically identify the sources of strengths, weaknesses, and possible biases in systematic reviews (Kedzior & Kaplan, 2019). In contrast, assigning the overall quality ratings appears less helpful and requires additional guidelines for users.
The current study has a number of limitations. First, the quality assessment was conducted using a small number of systematic reviews on one topic in one clinical field. Therefore, the current study has poor generalizability to other clinical fields and/or to the Cochrane reviews that were absent from our study. Interestingly, our results are in line with other studies that appraised the quality of 60 systematic reviews of interventions for major depression (Lorenz et al., 2019; Matthias et al., 2019). Furthermore, our study further supports the argument that systematic reviews are mass-produced and have little clinical relevance (Ioannidis, 2016). This is because there is little scientific need for k = 9 systematic reviews published within only three years (2014–2017) in a field with relatively slow knowledge progress and based on a low volume of highly overlapping primary data (Kedzior & Kaplan, 2019). Second, the rating of the same systematic reviews was performed only 3 months apart by the same authors. Thus, the perfect agreement on the nine items could be explained by a memory effect. Third, we did not test interrater reliability but instead reached a consensus between two independent raters on AMSTAR and AMSTAR2 during discussion. Therefore, the perfect agreement on nine items between AMSTAR and AMSTAR2 is probably due to our methods and reflects the high similarity of these items on both scales. Finally, similarly to others (Lorenz et al., 2019), we applied AMSTAR to systematic reviews including mostly RCTs but also some observational studies (Kedzior & Kaplan, 2019). This procedure might have contributed to some discrepancy between ratings on the two scales, since AMSTAR was designed to appraise the quality of systematic reviews of RCTs only.
In conclusion, since the quality of the items on AMSTAR2 has substantially improved relative to AMSTAR, AMSTAR2 may be the preferred instrument to assess the quality (or the confidence in the results) of systematic reviews.
The individual items on AMSTAR2 are useful to systematically identify the sources of strengths, weaknesses, and possible biases in systematic reviews. However, the interpretation of the overall confidence in the results of systematic reviews requires additional guidelines for users. AMSTAR2 is also a useful checklist for conducting future systematic reviews. Thus, AMSTAR2 could be incorporated into journal guidelines for authors to improve the transparency regarding the methods of systematic reviews for authors, journal editors, and readers.
References
Becker, L. A., & Oxman, A. D. (2008). Chapter 22: Overviews of reviews. In J. P. T. Higgins & S. Green (Eds.), Cochrane handbook for systematic reviews of interventions (pp. 607–631). Hoboken, NJ: Wiley.
Burda, B. U., Holmer, H. K., & Norris, S. L. (2016). Limitations of A Measurement Tool to Assess Systematic Reviews (AMSTAR) and suggestions for improvement. Systematic Reviews, 5, 58. https://doi.org/10.1186/s13643-016-0237-1
Faggion, C. M. Jr. (2015). Critical appraisal of AMSTAR: Challenges, limitations, and potential solutions from the perspective of an assessor. BMC Medical Research Methodology, 15, 63. https://doi.org/10.1186/s12874-015-0062-6
Ioannidis, J. P. A. (2016). The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. The Milbank Quarterly, 94, 485–514. https://doi.org/10.1111/1468-0009.12210
Kedzior, K., & Kaplan, I. (2019). Tai Chi and Parkinson's disease (PD): A systematic overview of the scientific quality of the past systematic reviews. Complementary Therapies in Medicine, 46, 144–152. https://doi.org/10.1016/j.ctim.2019.08.008
Kedzior, K. K., & Seehoff, H. (2018, June). Common problems with meta-analysis in published reviews on major depressive disorders (MDD): A systematic review. Paper presented at the Research Synthesis Conference, Trier, Germany.
Lakens, D., Hilgard, J., & Staaks, J. (2016). On the reproducibility of meta-analyses: Six practical recommendations. BMC Psychology, 4, 24. https://doi.org/10.1186/s40359-016-0126-3
Lorenz, R. C., Matthias, K., Pieper, D., Wegewitz, U., Morche, J., Nocon, M., . . . Jacobs, A. (2019). A psychometric study found AMSTAR 2 to be a valid and moderately reliable appraisal tool. Journal of Clinical Epidemiology, 114, 133–140. https://doi.org/10.1016/j.jclinepi.2019.05.028
Matthias, K., Rissling, O., Nocon, M., Jacobs, A., Morche, J., Pieper, D., Wegewitz, U., Schirm, J., & Lorenz, R. (2019, May). Appraisal of the methodological quality of systematic reviews on pharmacological and psychological interventions for major depression in adults using the AMSTAR 2. Paper presented at the Research Synthesis Conference, Dubrovnik, Croatia.
Moher, D., Liberati, A., Tetzlaff, J., & Altman, D., PRISMA Group. (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Medicine, 6, e1000097. https://doi.org/10.1371/journal.pmed.1000097
Pieper, D., Koensgen, N., Breuing, J., Ge, L., & Wegewitz, U. (2018). How is AMSTAR applied by authors – a call for better reporting. BMC Medical Research Methodology, 18, 56. https://doi.org/10.1186/s12874-018-0520-z
Pollock, M., Fernandes, R. M., Becker, L. A., Featherstone, R., & Hartling, L. (2016). What guidance is available for researchers conducting overviews of reviews of healthcare interventions? A scoping review and qualitative metasummary. Systematic Reviews, 5, 190. https://doi.org/10.1186/s13643-016-0367-5
Pollock, M., Fernandes, R. M., & Hartling, L. (2017). Evaluation of AMSTAR to assess the methodological quality of systematic reviews in overviews of reviews of healthcare interventions. BMC Medical Research Methodology, 17, 48–48. https://doi.org/10.1186/s12874-017-0325-5
Shea, B. J., Bouter, L. M., Peterson, J., Boers, M., Andersson, N., Ortiz, Z., . . . Grimshaw, J. M. (2007). External validation of A Measurement Tool to Assess Systematic Reviews (AMSTAR). PLoS One, 2, e1350. https://doi.org/10.1371/journal.pone.0001350
Shea, B. J., Grimshaw, J. M., Wells, G. A., Boers, M., Andersson, N., Hamel, C., . . . (2007). Development of AMSTAR: A measurement tool to assess the methodological quality of systematic reviews. BMC Medical Research Methodology, 7, 1–7. https://doi.org/10.1186/1471-2288-7-10
Shea, B. J., Hamel, C., Wells, G. A., Bouter, L. M., Kristjansson, E., Grimshaw, J., . . . Boers, M. (2009). AMSTAR is a reliable and valid measurement tool to assess the methodological quality of systematic reviews. Journal of Clinical Epidemiology, 62, 1013–1020. https://doi.org/10.1016/j.jclinepi.2008.10.009
Shea, B. J., Reeves, B. C., Wells, G., Thuku, M., Hamel, C., Moran, J., . . . Henry, D. A. (2017). AMSTAR 2: A critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. British Medical Journal, 358, j4008. https://doi.org/10.1136/bmj.j4008
Wegewitz, U., Weikert, B., Fishta, A., Jacobs, A., & Pieper, D. (2016). Resuming the discussion of AMSTAR: What can (should) be made better? BMC Medical Research Methodology, 16, 111. https://doi.org/10.1186/s12874-016-0183-6

History
Received May 7, 2019
Revision received November 19, 2019
Accepted December 8, 2019
Published online March 31, 2020

Conflict of Interest
The authors declare no conflicts of interest.

Authorship
Some results reported in this manuscript were presented by the first author at the Research Synthesis Conference, Dubrovnik, Croatia (May 2019).

Open Data
All data are available at http://dx.doi.org/10.23668/psycharchives.2650.

Funding
There was no external funding for this study.

ORCID
Karina Karolina De Santis, https://orcid.org/0000-0001-7647-6767

Karina Karolina De Santis
Faculty 3 Social Sciences
City University of Applied Sciences
Neustadtswall 30
28199 Bremen
Germany
Karina-Karolina.De-Santis@hs-bremen.de
Original Article
Power-Enhanced Funnel Plots for Meta-Analysis: The Sunset Funnel Plot
Michael Kossmeier, Ulrich S. Tran, and Martin Voracek
Department of Basic Psychological Research and Research Methods, School of Psychology, University of Vienna, Austria
Abstract: Currently, dedicated graphical displays to depict study-level statistical power in the context of meta-analysis are unavailable. Here, we introduce the sunset (power-enhanced) funnel plot to visualize this relevant information for assessing the credibility, or evidential value, of a set of studies. The sunset funnel plot highlights the statistical power of primary studies to detect an underlying true effect of interest in the well-known funnel display with color-coded power regions and a second power axis. This graphical display allows meta-analysts to incorporate power considerations into classic funnel plot assessments of small-study effects. Nominally significant, but low-powered, studies might be seen as less credible and as more likely being affected by selective reporting. We exemplify the application of the sunset funnel plot with two published meta-analyses from medicine and psychology. Software to create this variation of the funnel plot is provided via a tailored R function. In conclusion, the sunset (power-enhanced) funnel plot is a novel and useful graphical display to critically examine and to present study-level power in the context of meta-analysis. Keywords: funnel plot, statistical power, meta-analysis, publication bias, small-study effects
The Funnel Plot in Meta-Analysis
The funnel plot is the most widely used diagnostic plot for meta-analytic data (Schild & Voracek, 2013). In essence, the funnel plot is a scatter plot, showing the effect-size estimates of all studies in the meta-analysis on the abscissa and a measure of their precision on the ordinate (Figure 1). The most common choices for the ordinate are either study precision, defined as one divided by the standard error, or the standard error itself on an inverted scale, such that studies with smaller standard errors (higher precision) are at the top of the funnel plot (Sterne & Egger, 2001). The principal idea is that studies should scatter randomly and symmetrically around the meta-analytic summary effect, with less precise effect-size estimates (those with large standard errors) at the bottom of the funnel plot showing higher variability than more precise studies (those with smaller standard errors) at the top of the funnel plot. For a set of unbiased studies, the plot is expected to be characteristically funnel-shaped, hence its name. Certain deviations from this expected shape have been proposed as indicative of studies missing due to publication bias, that is, due to a tendency of negative and nonsignificant results not getting published (Light & Pillemer, 1984). Smaller studies (i.e., those with larger standard errors) need to obtain larger effect sizes in order to reach statistical significance. Therefore, if smaller studies (those with larger standard errors) at the bottom of the funnel plot seem to report effect sizes of larger magnitude, this leads to an asymmetric funnel plot and might indicate studies missing due to publication bias. Such small-study effects can be examined by checking for positive associations of effect sizes with standard errors in the funnel plot. A number of formal statistical tests have been developed to check for funnel plot asymmetry (e.g., Begg & Mazumdar, 1994; Duval & Tweedie, 2000; Egger, Smith, Schneider, & Minder, 1997; Rücker, Schwarzer, & Carpenter, 2008). However, for meta-analyses of a typical (i.e., limited) number of studies, these tests tend to show low power to detect publication bias (Renkewitz & Keiner, 2019; Sterne, Gavaghan, & Egger, 2000). In addition, while small-study effects can be indicative of studies missing due to publication bias, they might also arise due to true effect heterogeneity or chance alone (Sterne et al., 2011). Therefore, visual and explorative examinations of the funnel plot remain important to assess patterns of asymmetry and other peculiarities in meta-analyses. In particular, contours showing conventional limits of statistical significance in the funnel plot allow visualizing which studies have reached nominal statistical significance.
Figure 1. Classic funnel plot showing data from a published meta-analysis (Mathie et al., 2017), comparing homeopathic treatment with placebo. Confidence contours are shown by black lines symmetric around the meta-analytic summary effect, which is marked by a black vertical reference line (fixed-effect model). For a fixed-effect model and assuming the meta-analytic summary effect and standard errors as fixed, 95% of all studies are expected to fall within these confidence contours. Significance contours at the .05 and .01 levels are indicated through dark shaded areas symmetric around the null effect. Studies lying within the dark shaded area show nominal statistical significance at the .05 level, while studies lying outside the dark shaded area show significance at the .01 level (based on a two-sided Wald test). R code to reproduce the figure is available at https://osf.io/wy29f/
These significance contours have been proposed to check whether small-study effects are evidently driven by conventional criteria of statistical significance. If potential gaps in a funnel plot yielding an asymmetric impression correspond to regions of nonsignificance, then publication bias driven by statistical significance should be regarded as the likely explanation for this observation, and preferred to other explanations (Peters, Sutton, Jones, Abrams, & Rushton, 2008). A typical classic funnel plot including significance contours is exemplified in Figure 1. Shown are data from a published meta-analysis (Mathie et al., 2017), comparing homeopathic treatment with placebo (see figure caption for further details).
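As a quick way to explore such a display, the sketch below draws a contour-enhanced funnel plot with the metafor package; it is our illustration with made-up effect sizes, not the authors' OSF code for Figure 1.

```r
# Sketch: a classic funnel plot with significance contours, using metafor.
# The effect sizes (yi) and standard errors (sei) below are made up.
library(metafor)

yi  <- c(-0.45, -0.30, -0.21, -0.38, -0.10, -0.05)
sei <- c(0.25, 0.20, 0.12, 0.22, 0.09, 0.06)

res <- rma(yi = yi, sei = sei, method = "FE")   # fixed-effect summary estimate

# Contour-enhanced funnel plot: shaded regions mark conventional significance levels
funnel(res, level = c(90, 95, 99),
       shade = c("white", "gray55", "gray75"),
       refline = 0)
```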
Study-Level Power in Meta-Analysis and Meta-Science
The statistical power of studies to detect an effect of interest is commonly recognized as valuable information for assessing the credibility, or evidential value, of a set of empirical findings in the context of meta-analysis and meta-science. The test for excess significance (Ioannidis & Trikalinos, 2007) is a widely used exploratory test to check whether there is a larger number of statistically significant individual studies than expected, considering their study-level power to detect an effect of interest of a presumed certain magnitude. Such an excess of significant findings indicates bias in the set of studies under consideration and weakens the credibility of the corresponding set of study findings. Furthermore, some authors have argued that only appropriately powered studies should be combined in a meta-analysis, because low-powered studies are more likely to bias the meta-analytic summary effect (Muncer, Craigie, & Holmes, 2003). Statistical significance of study findings is still widely seen as an important criterion for being worth publishing, and thus low-powered studies might be especially prone to publication bias, selective reporting, or other evidence-distorting practices. In addition, significant effects observed in low-powered studies are more likely to be false-positive findings, as compared to significant effects observed in studies with higher power (Forstmeier, Wagenmakers, & Parker, 2017). Power can therefore be seen as an indicator of the replicability of research findings. Indeed, for a set of studies, the gap between the proportion of actually observed significant studies and twice the median study power has been proposed as the R-index of replicability (Schimmack, 2016). All in all, study-level power is one useful piece of information for assessing the credibility, or evidential value, of a set of studies included in a meta-analysis. Consequently, a power-enhanced funnel plot is one means to visualize and communicate information on study-level power in the well-known, classic funnel-plot display.
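As a rough illustration of the R-index mentioned above (our sketch, not the authors' code), it can be computed from per-study power estimates and significance outcomes as follows; the function name and example values are hypothetical.

```r
# Sketch: R-index of replicability from study-level power and significance status,
# taken here as 2 * median power - observed proportion of significant studies.
r_index <- function(power, significant) {
  2 * median(power) - mean(significant)
}

# Hypothetical example: five studies with given power and significance status
r_index(power = c(.15, .25, .40, .55, .80),
        significant = c(TRUE, TRUE, FALSE, TRUE, TRUE))
```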
Figure 2. Sunset (power-enhanced) funnel plot, using data from a published meta-analysis (Mathie et al., 2017), comparing homeopathic treatment with placebo. The 95% confidence contours are shown, with the black vertical reference line marking the observed summary effect (fixed-effect model) used for power analysis. Significance contours at the .05 and .01 levels are indicated through dark shaded areas. Power estimates are computed for a two-tailed test with significance level .05 and shown as discrete color-coded power regions in the color version of this figure available with the online version of this article. R code to reproduce the figure is available at https://osf.io/t2he9/
A Power-Enhanced Funnel Plot: The Sunset Funnel Plot
Numerous variants of the funnel plot have been proposed to visualize small-study effects, heterogeneity, and the sensitivity of the meta-analytic summary estimates to new evidence (Chevance, Schuster, Steele, Ternès, & Platt, 2015; Langan, Higgins, Gregory, & Sutton, 2012). What is missing is a variant of the funnel plot which incorporates information on study-level statistical power to detect an effect of interest. To fill this gap, we propose the sunset funnel plot, which, in essence, is a power-enhanced funnel plot (Figure 2). Showing study-level power in the funnel plot allows one to assess the plausibility of the number of significant studies, considering their individual statistical power. For a set of underpowered studies, a large number of statistically significant results is less plausible under the given assumptions and therefore might indicate bias. Similar to the type of study-outcome information captured by the test of excess significance (TES; Ioannidis & Trikalinos, 2007), an implausibly large number of significant, but at the same time underpowered, studies may potentially drive small-study effects in the funnel plot and may weaken the credibility of the meta-analytic results. In addition, significant, but low-powered, individual study results might be seen as more likely to be affected by selective reporting and thus judged as less credible.
The sunset funnel plot assumes normally distributed effect sizes and regards the variances of these effect sizes as fixed. These assumptions also underlie classic funnel plots with statistical significance contours (Figure 1). Therefore, all effect-size measures suitable for standard funnel plots in meta-analysis are suitable for the sunset funnel plot as well. This includes all approximately normally distributed effect-size measures for which an effect size of zero represents a null effect, such as standardized mean differences (e.g., Cohen d, Hedges g), appropriately transformed effect sizes for binary data (e.g., log OR, log RR), correlations (Fisher's z-transformed r), and effect sizes for survival data (log HR). To compute the power of each study under these assumptions, the standard error, the significance level, and the underlying effect of interest have to be determined.
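For example, correlations can be brought onto such an approximately normal scale with Fisher's z transformation before power values or funnel plots are computed; the sketch below is our illustration with made-up values.

```r
# Sketch: Fisher's z transformation of correlations and the corresponding
# approximate standard error, as preparation for a (sunset) funnel plot.
r <- c(.10, .25, .40)      # observed correlations (made up)
n <- c(40, 90, 150)        # sample sizes (made up)

z  <- atanh(r)             # Fisher's z-transformed correlations
se <- 1 / sqrt(n - 3)      # approximate standard error of Fisher's z

dat <- data.frame(z = z, se = se)
dat
```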
For a (common) true population effect size δ, the power of a two-sided Wald test with significance level α testing the null hypothesis δ = 0 is given by
\[ \mathrm{Power} = 1 - \Phi\left(z_{1-\alpha/2} - \frac{\delta}{SE(d)}\right) + \Phi\left(-z_{1-\alpha/2} - \frac{\delta}{SE(d)}\right), \]
with Φ the cumulative distribution function of the standard normal distribution, z_{1−α/2} the 1 − α/2 quantile of the standard normal distribution, and SE(d) the standard error of the study effect size d. The sunset funnel plot visualizes these power estimates, corresponding to specific standard errors, on a second ordinate and with color-coded power regions (Figure 2). A second ordinate on the right-hand side of the plot displays power values directly corresponding to the standard errors (or precisions) shown on the traditional ordinate on the left-hand side of the funnel plot. Color regions range from an alarming dark red for highly underpowered studies to a relaxing dark green for studies appropriately powered to detect the underlying true effect of interest. The color palette used in the graphic display is reminiscent of a colorful sunset; hence the name sunset funnel plot. Alternative color palettes for the visualization of power regions within the sunset funnel plot might prove useful as well. The underlying true population effect size can be determined either theoretically (e.g., by assuming a smallest effect of interest) or empirically, using meta-analytic estimates of the summary effect. For the latter, the fixed-effect model estimator is one natural default choice, giving less weight (and therefore being less sensitive) to small, potentially biased studies, as compared to random-effects meta-analytic modeling. Using a meta-analytic estimate of the summary effect (which may well be exaggerated by publication bias) for power analysis arguably leads to optimistic estimates of study-power values. This is not necessarily the case for user-specified underlying true population effect sizes; there, the validity of the corresponding power values, and of interpretations based on them, directly depends on the meaningfulness of the user-specified underlying true effect. A number of related power-based statistics can be presented alongside the power-enhanced funnel plot and support its evaluation. These include: (1) the median power of the studies, (2) the true underlying effect size necessary for achieving certain levels of median power (e.g., 33% or 66%), (3) the results of the test of excess significance (Ioannidis & Trikalinos, 2007), and (4) the R-index as a measure of the expected replicability of findings (Schimmack, 2016).
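For illustration, the following minimal R sketch (independent of the metaviz implementation; the function name power_wald is ours) computes this power value from a study's standard error, an assumed true effect δ, and the significance level:

# Two-sided Wald-test power for a study with standard error se,
# assuming a true effect delta and significance level alpha
power_wald <- function(se, delta, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  1 - pnorm(z_crit - delta / se) + pnorm(-z_crit - delta / se)
}

# Example: a study with SE = 0.20 and an assumed true effect of delta = 0.25
power_wald(se = 0.20, delta = 0.25)  # roughly 0.24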
Software to Create Sunset (Power-Enhanced) Funnel Plots

To create sunset funnel plots and to compute related statistics, we provide the tailored function viz_sunset in the package metaviz (Kossmeier, Tran, & Voracek, 2018) for the statistical software R (R Core Team, 2018), as well as a corresponding online application, available at https://metaviz.shinyapps.io/sunset/. The function viz_sunset was developed using ggplot2 (Wickham, 2016), a widely used R package for plotting statistical data. The main input of viz_sunset is a data frame with the effect sizes and corresponding standard errors of all studies. For a data frame named homeopath with a column d containing all effect sizes and a column se containing all standard errors, a sunset funnel plot with sensible default options can be readily created with the following example R code: viz_sunset(homeopath[, c("d", "se")]). Key features of the function viz_sunset include: (1) the choice of an arbitrary underlying true effect and significance level alpha to compute power values; (2) the computation of power-related statistics (including the results of the test of excess significance and the R-index of replicability); and (3) numerous options to customize the appearance of the funnel plot (including significance and confidence contours, different choices for the ordinate, and the display of discrete vs. continuous power regions).
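As a usage illustration of the default call quoted above, the following sketch uses invented toy data (not the Mathie et al. dataset); only the call shown in the text is used, with no additional arguments assumed:

library(metaviz)

# Toy data frame with effect sizes (d) and standard errors (se); values are invented
homeopath <- data.frame(d  = c(-0.60, -0.35, -0.10, 0.05, -0.45),
                        se = c(0.40, 0.25, 0.10, 0.15, 0.30))

# Sunset funnel plot with default options
viz_sunset(homeopath[, c("d", "se")])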
Example Applications of Sunset (Power-Enhanced) Funnel Plots

For the first illustrative example, we use data from a recently published meta-analysis on the effect of homeopathic treatment versus placebo for numerous medical conditions (Mathie et al., 2017). In this systematic review and meta-analysis, bias assessment suggested a high risk of bias for the majority of the 54 randomized controlled trials (RCTs) considered for meta-analysis; only three RCTs were judged to provide reliable evidence. For illustration purposes, and in the sense of a sensitivity check, we use the sunset funnel plot to examine the entirety of these 54 effect sizes (standardized mean differences), despite the increased risk of bias and potentially low quality of evidence. Visual examination of the corresponding funnel plot shows clear small-study effects, such that imprecise, smaller studies (those with larger standard errors) report larger effects in favor of homeopathy than more precise, larger studies (those with smaller standard errors). This association seems to be particularly driven by studies reporting imprecise, but nominally significant, estimates. Incorporating
power information into these considerations (with the fixed-effect estimate δ = 0.25 in favor of homeopathy) additionally reveals that a non-trivial, implausibly high, and thus worrisome number of the significant studies are drastically underpowered (with power values lower than 10%) to detect this effect of interest, further suggesting bias (Figure 2). Accordingly, there is an excess of significant findings among the primary studies included in this meta-analysis (15 nominally significant studies observed, but, under these circumstances, only 9.5 significant studies expected; pTES = .047). The median power of this set of primary studies amounts to merely 14.3% (IQR: 11.1–20.6%), and the true effects needed to reach typical (i.e., median) power levels of 33% or 66% would be substantial (absolute δ values of 0.43 or 0.67, respectively). The expected replicability of findings, as quantified with the R-index, is extremely low (0.8%). For the second illustrative example, we use data from a published meta-analysis on the association between human brain volume and intelligence (measured as full-scale IQ) in samples of healthy participants (Pietschnig, Penke, Wicherts, Zeiler, & Voracek, 2015). Correlation coefficients
of 83 studies were included in this meta-analysis and are shown in the respective funnel plot (Figure 3). Instead of the meta-analytic fixed-effect summary estimate (r = .24), we might also apply a user-specified underlying true effect for power analysis. For instance, one might suspect that using the meta-analytic summary effect as the underlying true effect leads to too optimistic power values, due to potential overestimation driven by publication bias. We might therefore decide to specify a more conservative true underlying effect of r = .2 as a smallest effect of interest. In this scenario, the sunset funnel plot reveals a large number of individual studies insufficiently powered to detect this smallest effect of interest (Figure 3). One half of the primary studies have 22.4% power or less (IQR: 14.1–35.6%). Among all studies, an implausibly large number report (just) significant results, and this is particularly true for a large number of low-powered studies. This is also formally confirmed by the test of excess significance, with 23.5 significant studies expected, given their assumed power, but 41 significant studies observed (pTES < .001). Looking at the more optimistic scenario using the meta-analytic fixed-effect summary estimate as the underlying
Figure 3. Sunset (power-enhanced) funnel plot, using data from a published meta-analysis (Pietschnig, Penke, Wicherts, Zeiler, & Voracek, 2015) of associations between brain volume and IQ. Correlation coefficients are Fisher z-transformed for meta-analysis, but labeled on their original scale on the abscissa. Significance contours at the .05 and .01 levels are indicated through dark shaded areas, and the meta-analytic fixed-effect summary estimate is indicated by a solid vertical line. The user-specified smallest effect of interest of r = .2 is used for power analysis and is indicated by a dotted line. Power estimates are computed for a two-tailed test with significance level .05 and shown as continuous color-coded power regions in the color version of this figure available with the online version of this article. R code to reproduce the figure is available at https://osf.io/5r9yf/
true effect does not substantially change these results (median power 30.1%; 29.6 significant studies expected, but 41 significant studies observed, pTES = .009; the sunset funnel plot with the corresponding power values is not additionally shown). Combined, these results might raise concerns about the credibility and replicability of the study findings (as also indicated by R-indices of essentially 0% and 10.8% for assumed true effects of r = .2 and .24, respectively). Accordingly, only cautious interpretations of the meta-analytic results based on these studies might be warranted.
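To illustrate the logic behind the excess-significance comparisons reported above, the following R sketch contrasts the observed number of significant studies with the number expected from the studies' power values. It uses a simple binomial approximation and placeholder power values, and is not the exact test of Ioannidis and Trikalinos (2007); the helper name excess_significance is ours.

# Excess-significance check: the expected number of significant studies is the sum
# of the study-level power values; compare this with the observed count.
excess_significance <- function(power, n_significant) {
  n <- length(power)
  expected <- sum(power)
  p_value <- binom.test(n_significant, n, p = expected / n,
                        alternative = "greater")$p.value
  c(observed = n_significant, expected = expected, p = p_value)
}

# Illustrative input (invented, uniform power values, not the actual study-level data):
# 54 studies with power 0.175 each, of which 15 are nominally significant
excess_significance(power = rep(0.175, 54), n_significant = 15)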
Conclusions and Implications

Statistical power (of primary studies) has traditionally been widely ignored in standard applications of meta-analytic methodology (Muncer et al., 2003). Indeed, until now, no statistical data visualization display has been proposed to depict study-level power in the context of meta-analysis and meta-scientific research. This is surprising, given the potential of study-level power information to support the critical assessment of the credibility of study findings within a meta-analysis. In particular, nominally significant, but low-powered, studies might be seen as less credible and as more likely to be affected by selective reporting. We have introduced the sunset funnel plot as a dedicated visual display to depict study-level power within meta-analyses. The sunset funnel plot conveys several kinds of potentially useful information for meta-analysts. First, the sunset funnel plot allows incorporating power considerations into classic funnel-plot assessments of small-study effects. In the same spirit as testing for an excess prevalence of significant findings among primary studies (Ioannidis & Trikalinos, 2007), the credibility of findings can further be critically examined by checking whether small-study effects are especially driven by an implausibly large number of significant, but at the same time underpowered, studies. Second, the display allows visually exploring and communicating the distribution and typical values of study power for an effect of interest. This visualization is informative not only for meta-analyses, but also in the broader context of meta-scientific investigations into the power of studies of whole scientific fields (e.g., see Ioannidis, Stanley, & Doucouliagos, 2017; Szucs & Ioannidis, 2017). Third, changes of power values for a set of studies can be visually examined by varying the true underlying effect. This directly corresponds to the vital question of how large the true effect size would need to be for the power of individual or typical studies to reach desired levels. One intrinsic limitation of the sunset funnel plot is the assumption of a common (shared) mean effect for all studies when computing the study-level power for the effect of interest. Therefore, when displaying heterogeneous sets of
studies potentially powered for drastically different true effects, the sunset funnel plot should be interpreted cautiously. However, even for a heterogeneous set of studies, the distribution of power levels for a typical or optimistic effect size arguably is informative from a meta-scientific perspective. Visual funnel-plot examinations to assess small-study effects have been described as lacking objectivity, and the validity of visual examinations of classic funnel plots has repeatedly been questioned (Lau, Ioannidis, Terrin, Schmid, & Olkin, 2006; Simmonds, 2015; Terrin, Schmid, & Lau, 2005). These points of critique also apply to the sunset funnel plot when it is used to assess small-study effects. Newly proposed applications of visual inference with funnel plots may well have merit to further increase the objectivity and validity of funnel-plot examinations in general and of the sunset funnel plot in particular (Kossmeier, Tran, & Voracek, 2019). Lastly, experimental studies testing the ability of users to detect bias in meta-analytic data with the help of sunset funnel plots, as well as surveys querying user experience with the sunset funnel plot, might be promising future research endeavors. In summary, the sunset funnel plot is a new, useful display for the meta-analytic visualization toolbox. The sunset funnel plot is the first dedicated display for meta-analysis depicting study-level power and has the potential to support the assessment of the credibility of study findings within a meta-analysis. Software to create sunset funnel plots is available to meta-analysts in the form of a tailored function within the R package metaviz (Kossmeier et al., 2018).
References

Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics, 50, 1088–1101. https://doi.org/10.2307/2533446
Chevance, A., Schuster, T., Steele, R., Ternès, N., & Platt, R. W. (2015). Contour plot assessment of existing meta-analyses confirms robust association of statin use and acute kidney injury risk. Journal of Clinical Epidemiology, 68, 1138–1143. https://doi.org/10.1016/j.jclinepi.2015.05.030
Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56, 455–463. https://doi.org/10.1111/j.0006-341X.2000.00455.x
Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. British Medical Journal, 315, 629–634. https://doi.org/10.1136/bmj.315.7109.629
Forstmeier, W., Wagenmakers, E. J., & Parker, T. H. (2017). Detecting and avoiding likely false-positive findings: A practical guide. Biological Reviews, 92, 1941–1968. https://doi.org/10.1111/brv.12315
Ioannidis, J. P., Stanley, T. D., & Doucouliagos, H. (2017). The power of bias in economics research. The Economic Journal, 127, F236–F265. https://doi.org/10.1111/ecoj.12461
Ioannidis, J. P., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245–253. https://doi.org/10.1177/1740774507079441
Kossmeier, M., Tran, U. S., & Voracek, M. (2018). metaviz [R software package]. Retrieved from https://CRAN.R-project.org/package=metaviz
Kossmeier, M., Tran, U. S., & Voracek, M. (2019). Visual inference for the funnel plot in meta-analysis. Zeitschrift für Psychologie, 227, 83–89. https://doi.org/10.1027/2151-2604/a000358
Langan, D., Higgins, J. P., Gregory, W., & Sutton, A. J. (2012). Graphical augmentations to the funnel plot assess the impact of additional evidence on a meta-analysis. Journal of Clinical Epidemiology, 65, 511–519. https://doi.org/10.1016/j.jclinepi.2011.10.009
Lau, J., Ioannidis, J. P., Terrin, N., Schmid, C. H., & Olkin, I. (2006). Evidence based medicine: The case of the misleading funnel plot. British Medical Journal, 333, 597–600. https://doi.org/10.1136/bmj.333.7568.597
Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.
Mathie, R. T., Ramparsad, N., Legg, L. A., Clausen, J., Moss, S., Davidson, J. R., . . . McConnachie, A. (2017). Randomised, double-blind, placebo-controlled trials of non-individualised homeopathic treatment: Systematic review and meta-analysis. Systematic Reviews, 6, 63. https://doi.org/10.1186/s13643-017-0445-3
Muncer, S. J., Craigie, M., & Holmes, J. (2003). Meta-analysis and power: Some suggestions for the use of power in research synthesis. Understanding Statistics, 2, 1–12. https://doi.org/10.1207/S15328031US0201_01
Peters, J. L., Sutton, A. J., Jones, D. R., Abrams, K. R., & Rushton, L. (2008). Contour-enhanced meta-analysis funnel plots help distinguish publication bias from other causes of asymmetry. Journal of Clinical Epidemiology, 61, 991–996. https://doi.org/10.1016/j.jclinepi.2007.11.010
Pietschnig, J., Penke, L., Wicherts, J. M., Zeiler, M., & Voracek, M. (2015). Meta-analysis of associations between human brain volume and intelligence differences: How strong are they and what do they mean? Neuroscience & Biobehavioral Reviews, 57, 411–432. https://doi.org/10.1016/j.neubiorev.2015.09.017
R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Renkewitz, F., & Keiner, M. (2019). How to detect publication bias in psychological research: A comparative evaluation of six statistical methods. Zeitschrift für Psychologie, 227, 261–279. https://doi.org/10.1027/2151-2604/a000386
Rücker, G., Schwarzer, G., & Carpenter, J. (2008). Arcsine test for publication bias in meta-analyses with binary outcomes. Statistics in Medicine, 27, 746–763. https://doi.org/10.1002/sim.2971
Schild, A. H. E., & Voracek, M. (2013). Less is less: A systematic review of graph use in meta-analyses. Research Synthesis Methods, 4, 209–219. https://doi.org/10.1002/jrsm.1076
Schimmack, U. (2016). The replicability-index: Quantifying statistical research integrity. Retrieved from https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index/
Simmonds, M. (2015). Quantifying the risk of error when interpreting funnel plots. Systematic Reviews, 4, 24. https://doi.org/10.1186/s13643-015-0004-8
Sterne, J. A., & Egger, M. (2001). Funnel plots for detecting bias in meta-analysis: Guidelines on choice of axis. Journal of Clinical Epidemiology, 54, 1046–1055. https://doi.org/10.1016/S0895-4356(01)00377-8
Sterne, J. A., Gavaghan, D., & Egger, M. (2000). Publication and related bias in meta-analysis: Power of statistical tests and prevalence in the literature. Journal of Clinical Epidemiology, 53, 1119–1129. https://doi.org/10.1016/S0895-4356(00)00242-0
Sterne, J. A., Sutton, A. J., Ioannidis, J. P., Terrin, N., Jones, D. R., Lau, J., . . . Tetzlaff, J. (2011). Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials. British Medical Journal, 343, d4002. https://doi.org/10.1136/bmj.d4002
Szucs, D., & Ioannidis, J. P. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15, e2000797. https://doi.org/10.1371/journal.pbio.2000797
Terrin, N., Schmid, C. H., & Lau, J. (2005). In an empirical evaluation of the funnel plot, researchers could not visually identify publication bias. Journal of Clinical Epidemiology, 58, 894–901. https://doi.org/10.1016/j.jclinepi.2005.01.006
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Cham, Switzerland: Springer. https://doi.org/10.1007/978-3-319-24277-4

History
Received May 30, 2019
Revision received August 5, 2019
Accepted August 15, 2019
Published online March 31, 2020

Conflicts of Interest
The authors have no conflict of interest to report.

Open Data
The software to create sunset (power-enhanced) funnel plots is available at https://metaviz.shinyapps.io/sunset/.
R code to reproduce Figure 1: https://osf.io/wy29f/
R code to reproduce Figure 2: https://osf.io/t2he9/
R code to reproduce Figure 3: https://osf.io/5r9yf/

Michael Kossmeier
Department of Basic Psychological Research and Research Methods
School of Psychology
University of Vienna
Liebiggasse 5
A-1010 Vienna
Austria
michael.kossmeier@univie.ac.at
Original Article
Addressing Publication Bias in Meta-Analysis: Empirical Findings From Community-Augmented Meta-Analyses of Infant Language Development

Sho Tsuji1,2, Alejandrina Cristia2, Michael C. Frank3, and Christina Bergmann4

1 International Research Center for Neurointelligence, Institutes for Advanced Studies, The University of Tokyo, Japan
2 Ecole Normale Supérieure, Laboratoire de sciences cognitives et de psycholinguistique, Département d'études cognitives, ENS, EHESS, CNRS, PSL University, Paris, France
3 The Stanford Language and Cognition Lab, Department of Psychology, Stanford University, Stanford, CA, USA
4 Language Development Department, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
Abstract: Meta-analyses are an indispensable research synthesis tool for characterizing bodies of literature and advancing theories. One important open question concerns the inclusion of unpublished data into meta-analyses. Finding such studies can be effortful, but their exclusion potentially leads to consequential biases like overestimation of a literature’s mean effect. We address two questions about unpublished data using MetaLab, a collection of community-augmented meta-analyses focused on developmental psychology. First, we assess to what extent MetaLab datasets include gray literature, and by what search strategies they are unearthed. We find that an average of 11% of datapoints are from unpublished literature; standard search strategies like database searches, complemented with individualized approaches like including authors’ own data, contribute the majority of this literature. Second, we analyze the effect of including versus excluding unpublished literature on estimates of effect size and publication bias, and find this decision does not affect outcomes. We discuss lessons learned and implications. Keywords: meta-analysis, developmental psychology, effect sizes, gray literature
Meta-analyses are an indispensable research synthesis tool for characterizing bodies of literature and advancing theories. In typical meta-analyses, noisy measurements from multiple independent samples are normalized onto a single scale (typically a measure of effect size) and combined statistically to produce a more accurate measurement. Effects for meta-analysis can come from the published literature, unpublished data, or even the author's own work, but different strategies for identifying datapoints for inclusion can have major consequences for the interpretation of the meta-analytic estimate. In particular, the exclusion of unpublished work can lead to a bias for positive findings and hence compromise validity. Thus, it is important to assess the utility – and impact – of strategies for including unpublished data. In the present article, we describe our successes and failures with gathering unpublished data for meta-analyses within developmental psychology, and assess how the addition of these datapoints changes the conclusions from our sample of meta-analyses.
Community-Augmented Meta-Analyses and MetaLab

Community-augmented meta-analyses (CAMAs, Tsuji, Bergmann, & Cristia, 2014) are a tool for countering some problems faced by traditional meta-analyses. In the original proposal, CAMAs were imagined as open-access, online meta-analyses: living documents that can be openly accessed, updated, and augmented (Tsuji et al., 2014). Their dynamic nature avoids a key problem of traditional meta-analyses, which are crystallized at the time of publication and quickly become outdated. Additionally, CAMAs were set up to allow the addition of unpublished datapoints. Although we initially aimed for authors and others to add studies to extant meta-analyses, we now favor a system where a single curator is responsible for updating a given meta-analysis. This preserves the original goal of having up-to-date meta-analyses, and further ensures internal consistency in all meta-analyses. This change in the concept of curation (from crowd-sourcing to centralized), however,
does not affect the topics that are broached in this paper, and thus will not be discussed further. MetaLab is a database and browsable web interface that instantiates the CAMA idea (http://metalab.stanford.edu/; Bergmann et al., 2018). The database's focus is Developmental Psychology, and the goal is to eventually cover all subfields on which there are experimental results bearing on infant and child cognition. At present, MetaLab hosts 20 meta-analyses (containing a total of 1,686 effect sizes), covering diverse topics ranging from sensitivity to vowel contrasts (e.g., the sound difference between "ship" and "sheep"; Tsuji & Cristia, 2014) to children's preference for prosocial over anti-social agents (Margoni & Surian, 2018). Most meta-analyses, however, bear on language development, and focus on children aged 5 years or younger. In the present paper, we analyze 12 meta-analyses in MetaLab for which efforts like search strategy and contact with authors were well documented and accessible to us (containing a total of 1,232 effect sizes; Bergmann & Cristia, 2016; Black & Bergmann, 2017; Carbajal, 2018; Cristia, 2018; Fort et al., 2018; Rabagliati, Ferguson, & Lew-Williams, 2019; Tsuji & Cristia, 2014; Tsui, Byers-Heinlein, & Fennell, 2019; Von Holzen & Bergmann, 2018). Some of these meta-analyses were co-authored by authors of the present article. We discuss below to what extent our results may generalize to other meta-analyses and fields of psychology.
Unpublished Literature in Meta-Analyses

Since meta-analyses largely build on publicly accessible literature, they face some of the same challenges as the primary literature in the context of the replication crisis (Lakens et al., 2017). One key issue concerns the inclusion of unpublished data, that is, results that do not appear in the published literature (and hence may not be indexed by all libraries and academic search engines), but are either reported in theses, dissertations, conference abstracts, white papers, or internal reports, or not reported publicly at all (i.e., studies that are "file-drawered"). Attempting to access unpublished data is difficult and time-consuming. To begin with, reports on these data, if they exist, tend not to be indexed as carefully as published data and thus are harder to discover. For instance, a search on PubMed would not reveal theses or dissertations, whereas Google Scholar does index some (but not all) thesis archives. Even if a meta-analyst uses Google Scholar, conference abstracts and proceedings in many fields are not indexed, and thus need to be searched manually. In some cases, for instance when conferences
in a field favor very short abstracts, one may discover the existence of a study, but be unable to integrate it because insufficient information is reported. In this case, as well as in the case of studies for which no reports exist, author contact is the only way to secure the information needed to integrate a study into a quantitative analysis. One may try to write to all authors who have published on the topic and ask for data in their file drawers. This is likely a biased approach, however, since authors of file-drawer studies who have never published on the topic cannot be reached by this strategy. Those authors might, however, be the very ones who have collected, and failed to successfully publish, data that go against the main direction of findings in the field. To work against such biased collection, and to also access data collected by others, one can publicly post a call for data, for example, via field-specific mailing lists. Thus, meta-analysts who intend to include gray literature may have to make a significant investment of time to be able to discover and integrate such results. To our knowledge, there is no previous research documenting the effectiveness of these modes of gray-literature integration for psychological research. Therefore, in Part one below, we have undertaken to document the efficacy of these diverse methods: (i) database searches, (ii) citation searches, (iii) mailing list calls, (iv) cases where authors' work was known, and (v) inclusion of own data. Relatedly, we also document the success rate of gathering data by emailing authors with a request for information. Although discovering and integrating unpublished data is costly, it is often part of standard meta-analytic practice recommendations (e.g., White, 1994), in the hope that it will reduce publication bias. Indeed, the published literature is widely assumed not to constitute an unbiased sample of the data, in turn yielding an overestimation of effect sizes in meta-analyses that only include published literature (e.g., Guyatt et al., 2011). Ferguson and Heene (2012) note that at least 25% – and possibly as many as 80% – of meta-analyses in psychology suffer from significant bias. A vast body of evidence confirms that this is a concern for psychological science in particular. For instance, Bakker, van Dijk, and Wicherts (2012) show convincingly that researchers in psychology typically use small sample sizes (with an inordinate proportion of statistically significant results), rather than larger sample sizes (whose higher precision reduces the likelihood of false positives and negatives). The problem is so widespread that item 15 of the PRISMA checklist specifically asks meta-analysts to "Specify any assessment of risk of bias that may affect the cumulative evidence (e.g., publication bias...)" (Moher, Liberati, Tetzlaff, Altman, & The PRISMA Group, 2009), and a systematic review of systematic reviews on the effects of all sources of bias
identifies the inclusion of gray literature among its key recommendations (Tricco et al., 2008). Most meta-analyses, and therefore most recommendations for meta-analyses, are based on the medical intervention literature, however. Studies of publication bias from this field may or may not generalize to psychological research. One previous study investigated bias and the inclusion of unpublished data for 91 meta-analyses published in psychological journals (Ferguson & Brannick, 2012). Surprisingly, they concluded that meta-analyses including unpublished data were more, rather than less, biased than meta-analyses based purely on published data. These authors recognize the validity of the gray-literature inclusion approach for medical meta-analyses, where registries allow for unbiased discovery of studies, and mandatory preregistration of studies further precludes analyses that favor specific results (Huić, Marušić, & Marušić, 2011). Since neither of these factors exists for psychology, it may be unwise for psychological meta-analyses to include gray literature because (1) the effort will be too large for the number of effect sizes that can ultimately be included (with a median of fewer than 5% of effect sizes stemming from unpublished data; Ferguson & Brannick, 2012); and (2) unpublished data will be biased because they are discovered mainly via a biased network: the meta-analysis' authors and close colleagues, and prominent authors in the field, all of whom may contribute data that favor a given outcome. In view of these contrasting results between the psychological literature and the body of meta-analytic best-practices research, we revisit the question of what the effects of adding unpublished data are, based on our CAMAs. In Part two, we follow previous literature (e.g., Tricco et al., 2008) and report: (i) effect size estimates for samples with and without unpublished literature; (ii) bias estimates with and without unpublished literature; and (iii) potential correlates of study quality. We note that study quality is much harder to measure objectively in basic psychological research than in intervention research. In interventions, randomized controlled trials with a double-blind procedure undoubtedly provide better-quality evidence for causal links than correlational research. Such a hierarchy can be harder to establish for some types of laboratory experiments, where procedures like experimenter blinding or randomization exist, but might be implemented much less systematically and consistently than in intervention studies. However, we can at least inspect some general features that may correlate with data quality, for instance a study's sample size. Some previous work suggests that unpublished data are of lower quality, being based on smaller samples (e.g., Tricco et al., 2008). Finally, we dedicate a third part to in-depth case studies and summaries of lessons learned.
Methods

All data reported in the Results section are based on a subset of the meta-analyses openly available on MetaLab. We include those meta-analyses that are based on a systematic literature search, made efforts to include unpublished data, and documented their data-gathering efforts systematically. We define as unpublished anything that is not in a peer-reviewed journal, including work that has appeared only in theses, proceedings, and books or book chapters. Cohen's d is a standardized effect size based on sample means and their variance. We rely here on d values computed in the MetaLab pipeline, which uses standard formulae to convert the measurements reported in papers to d values (details are reported elsewhere; Bergmann et al., 2018; see also http://metalab.stanford.edu/). Data and analysis scripts are shared on our Open Science Framework project site (https://osf.io/g6abn/) and on PsychArchives (data: https://www.psycharchives.org/handle/20.500.12034/2185, code: https://www.psycharchives.org/handle/20.500.12034/2186). Analyses have been conducted with the tidyverse (Wickham, 2017) and metafor (Viechtbauer, 2010) packages in R (R Core Team, 2019).
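As an illustration of the kind of conversion such a pipeline performs, here is a minimal R sketch (ours, not the MetaLab code) computing Cohen's d and its approximate sampling variance from two independent group summaries using the standard formulae; note that many MetaLab designs are within-participant and use different formulae.

# Cohen's d and its approximate sampling variance for two independent groups
cohens_d <- function(m1, m2, sd1, sd2, n1, n2) {
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  d <- (m1 - m2) / sd_pooled
  d_var <- (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
  list(d = d, d_var = d_var)
}

# Example with invented summary statistics
cohens_d(m1 = 10.2, m2 = 9.5, sd1 = 2.1, sd2 = 2.3, n1 = 24, n2 = 22)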
Results

Of the 20 meta-analyses included in MetaLab at the time of writing, 14 meta-analyses (70%) include unpublished data. This proportion is comparable to previous reports, where 63% of recent meta-analyses in psychology made efforts to include gray literature (Ferguson & Brannick, 2012). Of those meta-analyses that did not restrict their search to published data, 12 fit our additional criteria for inclusion in the present analysis, namely being based on a systematic literature search and systematically documenting data-gathering efforts and/or making those efforts accessible to us. Concretely, the meta-analyses included in our final sample needed to have made their search procedure available in a document and/or provided it to us for the purpose of the present study. A literature search was deemed systematic if it included and documented a keyword or seed search and details on the databases searched and search dates. Authors further needed to have documented the number of records found and included, their inclusion and exclusion criteria, and an exhaustive list of other sources consulted to gather information and data. For the purpose of the present study, we aggregate two pairs of meta-analyses into single meta-analyses, since the systematic literature review in both cases had originally been conducted on the pair, and the datasets were only
Table 1. Overview of meta-analyses and number of published and unpublished effect sizes

Citation | Meta-analysis | N papers | N effect sizes | N unpublished effect sizes: Overall | Database search | Citation search | Author known | Own | N data obtained through author email
Carbajal (2018) | Familiar word recognition | 15 | 33 | 12 | 2 | 4 | 4 | 2 | 4
Bergmann and Cristia (2016) | Natural word segmentation | 73 | 315 | 25 | 23 | 0 | 0 | 2 | 12
Von Holzen and Bergmann (2018) | Mispronunciation sensitivity | 32 | 251 | 32 | 0 | 0 | 33 | 0 | 14
Black and Bergmann (2017) | StatSeg | 26 | 91 | 10 | 10 | NA | 0 | 0 | 2
Tsuji and Cristia (2014) | Vowel discrimination | 38 | 194 | 11 | 4 | NA | 6 | 1 | 6
Fort et al. (2018) | Sound symbolism | 11 | 44 | 20 | 2 | NA | 10 | 8 | 6
Rabagliati et al. (2019) | Abstract rule learning | 20 | 94 | 4 | 4 | NA | 0 | NA | 3
Cristia (2018) | Phonotactic learning | 15 | 47 | 11 | 0 | 0 | 1 | 10 | 2
Cristia (2018) | Statistical sound category learning | 11 | 20 | 9 | 0 | 0 | 5 | 2 | 2
Tsui et al. (2019) | Switch task | 47 | 143 | 11 | 2 | 0 | 0 | 9 | 3
Mean | | 28.8 | 123.2 | 14.4 | 4.7 | 0.4 | 5.9 | 3.4 |
Mean % of unpublished effect sizes | | | | | 37.7 | 3.3 | 31.8 | 27.5 |
later thematically separated in MetaLab.1 The resulting 10 meta-analyses ranged in size from 11 to 73 papers (Mdn = 23), with 20–315 datapoints included (Mdn = 92). Table 1 gives an overview of the meta-analyses, including citations, descriptive statistics on the number of effect sizes by publication status, and how this unpublished literature was found.
Evaluation of Data-Gathering Efforts

Overall, of the total of 1,232 effect sizes contained in our MetaLab subset, 144 effect sizes, or 11.7% of the data, were based on unpublished literature. Similar to previous reports (Ferguson & Brannick, 2012), the distribution of unpublished-study percentages shows a positive skew, with most meta-analyses in our sample having a low percentage of unpublished studies – around 10% or less (Figure 1).

Database Searches

All database searches followed standard meta-analytic practice, wherein a set of pre-determined keywords was entered into a search engine, and the titles of hits and the abstracts of potentially relevant papers were scanned to arrive at the sample of papers eligible for inclusion in the respective meta-analysis. In all meta-analyses included in our study, Google Scholar was the search engine of choice, which performs equivalently to a combined search of multiple databases (Gehanno, Rollin, & Darmoni, 2013), although exact replicability of Google Scholar searches might be compromised since it saves users' history.
Figure 1. Percentage of unpublished studies included per meta-analysis.
Crucially for us, Google Scholar includes unpublished work (i.e., pre-prints, conference proceedings, or unpublished manuscripts) in its search results as long as it is available online and indexed. We found that an average of 4.7 datapoints, or 37.3% (SD = 42.9) of unpublished data, was found by Google Scholar searches (Figure 2).
1 "Word segmentation" and "Function word segmentation" are aggregated into "Natural word segmentation"; "Native vowel discrimination" and "Non-native vowel discrimination" are aggregated into "Vowel discrimination".
Figure 2. Number of unpublished datapoints obtained through different sources of recovering gray literature, by meta-analysis.
Citation Searches

Some unpublished studies might not be available online and thus not be detectable by search engines. They might, however, be discoverable by searching the reference lists of available studies. Citation searches yielded an average of 0.4 unpublished datapoints, or 3.3% (SD = 10.5).

Mailing List Requests

In order to reach a relevant audience to recover potential gray literature, the authors of six of the included meta-analyses requested contributions via professional email lists. Strikingly, these attempts did not lead to a single reply with information that could be added to the meta-analyses.

Author's Work Known

In addition to the more formal routes described above, a meta-analyst can get to know an author's eligible work at a conference, or via informal communication with experts in the field. Our estimate of datapoints added via this route is an average of 5.9 datapoints, or 31.8% (SD = 35.9). Since a meta-analyst is often an expert in the topic of their meta-analysis, this can be a very fruitful route – one of our case studies below will illustrate how helpful this strategy can be for the data-gathering process.

Own Data

A meta-analyst might also contribute their own unpublished data to their meta-analysis. In MetaLab, unpublished data from meta-analysts' own research accounted for an average of 3.4 datapoints, or 27.5% (SD = 33.7), of total unpublished
datapoints. Previous reports have documented a difference between own published and unpublished contributions, with an average of 5.89% of total published datapoints, and 12.94% of total unpublished datapoints, being based on meta-analysts' own data (Ferguson & Brannick, 2012). If we also assess published datapoints with this metric, we find that an average of 9.6% of published datapoints in MetaLab are based on own data. Although this ratio suggests a greater addition of unpublished own datapoints, note that there are many more total published than unpublished datapoints in the MetaLab datasets. If we look at the absolute number of datapoints added, meta-analysts added more published (an average of 6.1) than unpublished (an average of 3.4) own datapoints to their datasets. Given that the absolute number of published versus unpublished datapoints in the previous literature (Ferguson & Brannick, 2012) is comparable to or lower than what is found in the present study, we can conclude that MetaLab contains a relatively high proportion of authors' own unpublished data.

Emails to Authors

Meta-analysts can choose to contact the authors of papers eligible for inclusion in their meta-analysis with requests for additional information from eligible literature (whether published or unpublished), and to ask whether they are aware of any gray literature. It was impossible to recover the number of effect sizes added based on these requests, since meta-analysis authors did not consistently document how many datapoints of a given study were affected by their request (e.g., only one experiment of a study or all experiments could have been affected), and whether the information
gathered was necessary to compute published or unpublished effect sizes. We therefore instead counted the number and proportion of authors contacted, and how many of these contacts were responsive and led to data that could be added to the meta-analysis. An average of 12.7 authors were contacted. Of all authors contacted, an average of 9.8, or 85.1% (SD = 19.5), were responsive, and 5.4, or 49.6% (SD = 28.1), provided data that could be added to the meta-analysis.

Community Contributions

Although the original CAMA idea entailed that, ultimately, the research community would take over the curation of datasets and infrastructure in a bottom-up fashion, this model has not proven feasible. Two issues were a general lack of community contributions and the difficulty of curating entries that contained errors. Instead, MetaLab now has a governing board that, aided by external funding,2 maintains and expands the general infrastructure. Dedicated curators for each dataset are then in charge of updating individual meta-analyses.
Meta-Analyses With Published and Unpublished Results

The second part of our analysis evaluated the effect of including gray literature on publication bias. We evaluate the effect of including or excluding unpublished literature on effect size and bias estimates in our samples. We also assess potential correlates of study quality.

Effect Size Estimates by Publication Status

The mean Cohen's d effect size across meta-analyses is d = 0.22 (range: −0.34 to 0.77). The mean is d = 0.24 (range: 0.17 to 0.66) for published and d = 0.15 (range: −0.34 to 0.77) for unpublished studies, consistent with publication bias (greater effects for journal-published studies) as well as with a difference in data quality (lower for studies not published in journals). This analysis ignores factors that are known to vary across studies, however. Therefore, in order to more specifically assess the effect of the inclusion of unpublished datapoints in a meta-analysis, we constructed meta-analytic regression models for the full dataset as well as for datasets including only the published or the unpublished datapoints (see, e.g., Tricco et al., 2008). The model for each meta-analysis included infant age as a predictor, since this factor has consistently been found to explain variance in effect sizes (Bergmann et al., 2018). While testing method is another such factor, we refrained from including it in our analyses, since the data subsets differ
in the number of testing methods included, with some subsets comprising data stemming from only one testing method. Our random-effects structure allowed shared variance for datapoints stemming from the same paper, and accounted for the dependence between datapoints stemming from the same infant participants contributing multiple effect sizes (see Konstantopoulos, 2011; R model: rma.mv(d, d_var, mods = age, random = ~ 1 | paper/same_infant/data_point)). We had to exclude one data subset for which the regression model did not converge. Note that meta-analytic regression estimates become imprecise with small datasets, and we could have addressed this issue by excluding those data subsets with small numbers of datapoints. However, since any such cut-off would be arbitrary, we opted to include regression analyses for all datasets that did converge. Figure 3 shows the resulting effect size estimates and associated confidence intervals. There is no clear pattern of higher meta-analytic effect size estimates for published datasets, consistent with previous reports (e.g., Chow & Ekholm, 2018; Guyatt et al., 2011), and the confidence intervals for the respective sets mostly overlap. Thus, when known factors structuring variance (age, method, meta-analysis) are accounted for, there does not seem to be a clear pattern as to the direction in which the inclusion of gray literature affects meta-analytic conclusions. Note, however, that effect sizes do change based on the inclusion of gray literature for the majority of datasets, and that including or excluding these studies would likely affect the overall conclusions of any given meta-analysis. Given the overall small sample sizes, it is impossible to estimate to what extent the fact that we include specifically gray literature – as opposed to adding literature in general – affects these effect size estimates.

Bias Estimates With and Without Unpublished Literature

In order to evaluate the impact of the inclusion or exclusion of unpublished literature on bias estimates, we assessed each individual meta-analysis by means of funnel plot asymmetry, a classic diagnostic for identifying potential publication bias (Egger, Smith, Schneider, & Minder, 1997). We included infant age as a moderator, a factor that explains variance in most meta-analyses in MetaLab. The distribution of test statistics for Egger's test for funnel plot asymmetry did not differ whether we assessed the datasets under exclusion of unpublished studies (z_min = 1.89, z_mean = 3.33, z_max = 13.41) or with gray literature included (z_min = 1.90, z_mean = 3.20, z_max = 12.44) (see Figure 4). In both subsets, the same 4 out of 10 datasets showed significant funnel plot asymmetry, suggesting that adding gray
2 https://www.bitss.org/projects/metalab-paving-the-way-for-easy-to-use-dynamic-crowdsourced-meta-analyses/
Figure 3. Mean meta-analytic effect size estimates and associated confidence intervals. Values are based on meta-analytic regression models for each dataset. Shapes represent different subsets per meta-analysis based on their publication status, lines indicate the 95% CI. Numbers in italics on the right side indicate the number of effect sizes going into each regression model.
literature did not improve this indicator of publication bias. A more quantitative evaluation of this difference proved difficult. For instance, a non-parametric bootstrapping approach is unreliable for meta-analyses with fewer than about 50–100 studies, which is the case for the majority of the meta-analyses included in the present analysis. Although funnel plot asymmetry is a classic diagnostic for assessing publication bias, it tends to have low power and can fail to detect significant publication bias (e.g., Lau, Ioannidis, Terrin, Schmid, & Olkin, 2006; Macaskill, Walter, & Irwig, 2001). In order to evaluate the effect of studies' publication status more directly, we ran a random-effects meta-analytic regression for each meta-analysis, using publication status (published, unpublished) as a moderator. We again included infant age and testing method as additional moderators. We assessed whether publication status had a significant effect in each of these meta-analyses, and found that this was the case for two of them. In both cases, gray literature had a negative impact on effect sizes (see Table 2). Thus, even though including gray literature did not cause a significant change in the results of Egger's test for publication bias, it did explain a significant proportion of variance in several datasets. Further, gray-literature inclusion may also be relevant in other cases; it is plausible that we failed to detect such an effect because the small samples included in our meta-analyses lead to low statistical power.
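A sketch of such a moderator analysis with metafor, using simulated toy data and placeholder column names (d, d_var, age_months, published, paper, same_infant, data_point are ours, not necessarily the MetaLab variables), might look as follows; for brevity, only age and publication status are included as moderators.

library(metafor)

set.seed(1)
# Simulated stand-in for one MetaLab dataset
dataset <- data.frame(
  d           = rnorm(40, mean = 0.2, sd = 0.3),
  d_var       = runif(40, 0.01, 0.09),
  age_months  = runif(40, 3, 24),
  published   = sample(c("published", "unpublished"), 40, replace = TRUE),
  paper       = rep(1:10, each = 4),
  same_infant = rep(1:20, each = 2),
  data_point  = 1:40
)

# Multilevel meta-regression with publication status as a moderator and the
# nested random-effects structure quoted in the Results section
mod_pub <- rma.mv(yi = d, V = d_var,
                  mods = ~ age_months + published,
                  random = ~ 1 | paper / same_infant / data_point,
                  data = dataset)
summary(mod_pub)  # the publication-status coefficient tests the moderator effect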
Correlates of Study Quality

There are no uniformly agreed-upon markers of study quality in experimental psychology research that can easily be assessed. Since unpublished effect sizes are often based on smaller samples (Tricco et al., 2008), we assessed whether sample size could serve as a proxy for quality. However, a descriptive assessment revealed no difference in sample size between published (M = 21.7, SD = 9.9) and unpublished studies (M = 22.5, SD = 10.3) in our sample of meta-analyses. In addition to sample sizes, we attempted to assess one other potential indicator of study quality, namely, internal correlations in within-participant designs. Since weighted meta-analytic regression requires an estimate of these correlations, some meta-analysts have gathered this measure. A higher internal correlation might suggest less noise in the measure, thus potentially indicating higher study quality. We first checked whether the degree of internal correlation correlated positively with child age (since measurement precision tends to improve as children age). However, our assessment showed that internal correlation and child age were significantly negatively correlated [r = −0.28, t(417) = −5.88, p < .001]. Since this result contradicted our initial assumption, we did not pursue this possibility further. Overall, therefore, we were not able to show any relationship between potential measures of study quality and
Figure 4. Funnel plots by dataset. Effect size estimates for published data are represented by black points, and estimates for unpublished data by yellow points in the color version of this figure available with the online version of this article. The mean effect size for the full dataset is shown as a red dashed line (invisible when the means for full and published data sets overlap), and the gray shaded funnel corresponds to a 95% CI around this mean. The mean effect size for the subset of published studies is shown as an orange dashed line, and the transparent dashed funnel corresponds to a 95% CI around this mean. The gray dashed line shows an effect size of zero. In the absence of bias, we should expect all points to fall symmetrically inside the funnel.
publication status, likely due in part to the lack of objective criteria for study quality in experimental psychology.
Case Studies

Case Study 1: Expanding the Pool of Meta-Analyses

In addition to gathering unpublished data and missing datapoints for extant CAMAs, we also sought to expand the pool of CAMAs available on MetaLab. For this purpose, a call for contributions to MetaLab was issued, with a $1,000 cash prize for the top three most extensive
meta-analyses submitted. To advertise this challenge, announcements were sent to professional mailing lists, the literature was searched for extant meta-analyses fitting the scope, and their authors were contacted. Information on ongoing meta-analysis efforts was informally gathered and distributed. These efforts resulted in six eligible submissions for the challenge, four of which are already integrated in MetaLab. Considering that these will eventually constitute 27.2% (6/22) of the meta-analyses on MetaLab, this strategy substantially expanded the database at relatively low cost compared with the cost of performing new meta-analyses.
Table 2. Meta-analytic regression coefficients for the effect of publication status by dataset

Dataset | Effect sizes | β | SE | z | p | ci.lb | ci.ub
Abstract rule learning | 94 | −0.051 | 0.178 | −0.284 | .776 | −0.399 | 0.298
Familiar word recognition | 33 | 0.055 | 0.088 | 0.626 | .532 | −0.118 | 0.228
Mispronunciation sensitivity | 251 | −0.476 | 0.144 | −3.297 | .001* | −0.758 | −0.193
Natural word segmentation | 315 | −0.071 | 0.044 | −1.638 | .101 | −0.157 | 0.014
Phonotactic learning | 47 | 0.068 | 0.09 | 0.762 | .446 | −0.107 | 0.244
Sound symbolism | 44 | −0.127 | 0.062 | −2.032 | .042* | −0.249 | −0.005
Statistical sound learning | 20 | −0.01 | 0.171 | −0.056 | .955 | −0.345 | 0.326
Statistical word segmentation | 91 | 0.02 | 0.111 | 0.185 | .854 | −0.197 | 0.237
Switch task | 143 | 0.036 | 0.089 | 0.402 | .688 | −0.138 | 0.209
Vowel discrimination | 181 | 0.177 | 0.209 | 0.848 | .397 | −0.232 | 0.586
Note. SE = standard error; ci.lb = lower boundary of confidence interval; ci.ub = upper boundary of confidence interval. Asterisk indicates statistical significance.
Case Study 2: Gathering Gray Literature Through In-Person Author Contact

One of the MetaLab datasets, on sound symbolism in infancy (Fort et al., 2018), assesses the development of the bouba-kiki effect, whereby humans associate pseudowords like "bouba" with round objects and pseudowords like "kiki" with spiky objects. Experiments examining this effect have had mixed results in infant populations. Anecdotally, several researchers have failed to find a consistent bouba-kiki effect, but have faced problems publishing these null results. Encountering other researchers with similar results through conference presentations, they encouraged each other to share their unpublished data and decided to conduct a meta-analysis on the phenomenon with both published and unpublished results. This meta-analysis reveals that there is overall evidence for a bouba-kiki effect in infants; however, it is smaller than suggested by the published literature alone. This small effect size, combined with the habitually small sample sizes in infant studies, likely explains the divergence in findings across attempts to elicit the bouba-kiki effect. In the case of this meta-analysis, presenting null results at a relevant conference, and thus making others aware of their existence, proved a highly effective way to assemble gray literature.
Discussion

Publication bias is considered a key problem of the meta-analytic literature, and the exclusion of gray literature from meta-analyses is a potential cause. Since the difficulty of accessing such gray literature is one reason for the lack of unpublished studies in meta-analyses, we assessed the amount of gray literature gained through various strategies in the datasets assembled in the open-access database MetaLab (Bergmann et al., 2018). In the following, we will discuss
these strategies from the viewpoint of lessons learned and recommendations for future meta-analysts. We further assessed the impact of including such gray literature on publication bias. These analyses show that our efforts had only a moderate impact on publication bias in our datasets, a result we will discuss in light of previous literature and the nature of gray literature gathered in our datasets.
Lessons Learned From Efforts Gathering Missing Data

If an article does not report all the data necessary for estimating effect sizes, contacting the original authors of the article is the only way to possibly obtain these missing data. Although this is an effortful endeavor, our analysis of data-gathering efforts shows that it is a successful strategy for gathering missing information. Authors contacted individually by email were highly responsive and sent data useful for computing effect sizes in almost half of the cases. Of course, no reply was forthcoming in the other half of cases, and the fact that this outcome should still be considered a success illustrates the difficulty of data gathering during meta-analyses, especially considering that our mailing list calls failed to point us to any missing data. While we do not have comparative data for other approaches, the successful author contacts by meta-analysts in MetaLab are based on highly individualized emails to the respective first and/or last authors of an article. Along with outlining the general aim of the meta-analysis, meta-analysis authors would mention both the authors' and their article's name, and explain in detail the nature of the data needed from them. Typically, we would send one follow-up email if we did not get a reply. In addition to these efforts, it has been shown that adding data-sharing agreements to requests for primary data can improve responsiveness (Polanin & Terzian, 2019).
Lessons Learned From Efforts Gathering Gray Literature
On average, 11.4% of effect sizes were based on unpublished studies in the present dataset. Standard tools like database searches, which arguably take less time and effort than more personalized strategies such as knowing of an author's work, can already contribute an important amount of gray literature. Although it is difficult to estimate the exact time investment required, some gray literature is included in the search automatically depending on the engine chosen, and thus the search itself does not require additional time. Another standard tool, citation searches, only led to the discovery of unpublished studies in one database. Although such searches are an added effort, a meta-analyst is in any case well advised to carefully read the included literature, which is not too many steps away from the citation search itself. Knowing the author of a study (whether in person or not) proved to be a fruitful strategy for gathering data. Although this strategy might depend more on an individual meta-analyst's network and on the availability of unpublished studies at conferences and other venues enabling personal contact, our analyses and the case study on sound symbolism illustrate that this is a promising way to include unpublished studies, and it should be on the meta-analyst's mind as a possible way to gather data. Ideally, though, there would be better indexing of unpublished literature (by authors and conference organizers uploading their unpublished work to searchable archives). A more reliable index of gray literature would reduce the individual meta-analyst's dependence on high-effort strategies and enable discovery via standard database search strategies, which would at the same time increase transparency. Finally, even to a greater extent than in the previous literature (Ferguson & Brannick, 2012), the meta-analyses included here contained a relatively high proportion of effect sizes stemming from the meta-analysts' own data. A meta-analyst's own unpublished data, or such data in their network, might induce bias, since owning such data might serve as a motivator to conduct a meta-analysis, and own data might be more likely to be included in a meta-analysis before peer review. Thus, such potential bias might counteract the otherwise bias-reducing effect of adding unpublished literature to a meta-analysis.
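The kind of breakdown reported above can be reproduced from a MetaLab-style table with a few lines of R, the language used for the shared analysis scripts. The following is only a minimal sketch, not the actual script: the data frame effect_sizes and the columns dataset, source, and unpublished are hypothetical placeholders, assuming one row per coded effect size.

library(dplyr)

# Hypothetical table: one row per effect size, with a dataset label,
# a `source` column describing how the record was obtained
# (e.g., "database search", "author contact"), and a logical
# `unpublished` flag marking gray literature.
effect_sizes %>%
  group_by(dataset) %>%
  summarise(
    n_effects  = n(),
    prop_unpub = mean(unpublished),  # share of effect sizes from gray literature
    .groups    = "drop"
  ) %>%
  arrange(desc(prop_unpub))

# The same idea, broken down by gathering strategy:
effect_sizes %>%
  count(source, unpublished) %>%
  group_by(source) %>%
  mutate(prop = n / sum(n))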
The Effect of Adding Unpublished Literature on Effect Sizes and Publication Bias
Previous literature has reported higher effect sizes among published studies, potentially leading to an
overestimation of effect sizes if unpublished literature is not included. In contrast, other authors have warned against the inclusion of gray literature because it would increase bias (Ferguson & Brannick, 2012). We addressed this question with two types of analyses. Our funnel plot analyses suggested that these problems only played a minor role in our data, showing no differences in publication bias by publication status. Our second analysis for addressing this issue, a meta-analytic regression with publication status as a moderator, showed that, for two datasets, adding unpublished literature had a significant or marginally significant impact on effect sizes. However, even the lack of statistical bias in one of our analyses does not necessarily mean that the unpublished literature we included is unbiased, especially considering the relatively low power of our sample. As mentioned in the Introduction, detectable gray literature might itself be biased. Indeed, recent large-scale analyses of effect sizes in published versus unpublished studies included in meta-analyses indicated systematically larger effect sizes for the published literature (Polanin, Tanner-Smith, & Hennessy, 2016), and the fact that we do not find such a difference might indicate that our sample of unpublished studies is biased upwards. Similarly, study preregistration has been shown to be strongly associated with more null results (Kaplan & Irvin, 2015), suggesting that, without preregistration, the tendency to publish null findings is lower. Overall, our results thus suggest that unpublished studies should be added to a meta-analysis with great care and transparency, allowing the reader to gain insight into the effect these datapoints have on the overall estimates. Adding gray literature might not be equally illuminating, necessary, or damaging in every field of psychology. With regard to the advantages and disadvantages of including gray literature in the case of the infant literature, we suggest that such literature will improve the overall quality of a database, for at least three reasons. First, the infant literature is tremendously underpowered and benefits from a larger body of studies to better estimate true effect sizes. Second, in order to conduct infant experiments, a researcher typically needs to undergo training and make use of a specialized laboratory, making it unlikely that unpublished data are especially prone to being badly designed or executed. Third, if gray literature is added the way we suggest in the context of CAMAs, such that each study is coded by publication status, meta-analysts and database users can decide for themselves whether or not to include unpublished datapoints in their assessments. A relatively large amount of unpublished data in the present dataset was based on potentially biased sources, namely the meta-analysts' own data or data obtained through direct contact with study authors. Potentially, reducing this bias in
the future might change the differences in effect sizes and in measures of publication bias between published and unpublished studies in the present dataset.
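The two checks described in this subsection can be sketched with the metafor package (Viechtbauer, 2010) that was used for the analyses. The code below is a simplified illustration rather than the exact shared analysis scripts available on OSF; dat, yi, vi, and unpublished are placeholder names, assuming one row per effect size with its sampling variance and a publication-status flag.

library(metafor)

# Overall random-effects model for one dataset.
res_overall <- rma(yi, vi, data = dat)

# Funnel plot plus a regression-based asymmetry test (Egger-type),
# run separately for published and unpublished records to compare them.
funnel(res_overall)
regtest(rma(yi, vi, data = subset(dat, unpublished)))
regtest(rma(yi, vi, data = subset(dat, !unpublished)))

# Meta-analytic regression with publication status as a moderator:
# the coefficient for unpublishedTRUE estimates how much effect sizes
# from gray literature differ from those in the published literature.
res_mod <- rma(yi, vi, mods = ~ unpublished, data = dat)
summary(res_mod)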
Limitations and Opportunities
Meta-analyses in MetaLab are not a random or representative selection of meta-analyses in psychology. Some of their characteristics are atypical; for instance, the number of studies included (Mdn = 23) is larger than the median of 12 reported in larger samples (Van Erp, Verhagen, Grasman, & Wagenmakers, 2017), and the included meta-analyses contain a larger number of effects per study. In fact, two of the included meta-analyses have not yet been published themselves (Carbajal, 2018; Tsui et al., 2019), and two more have been peer-reviewed as proceedings papers (Black & Bergmann, 2017; Von Holzen & Bergmann, 2018), so they have not, or only to a moderate degree, been affected by the review process. The meta-analysis authors were often students conducting a systematic review of a literature to which they were planning to contribute, and thus may not have had as much of a vested interest in supporting one theory or another as more established researchers might. On the other hand, a relatively high proportion of unpublished data stemmed from the meta-analysis authors themselves, which might indicate a comparatively high interest in supporting a specific theory. Finally, most of the included meta-analyses come from a cluster of researchers (including the authors of this paper) who strove to follow best-practice guidelines such as the PRISMA statement (Moher et al., 2009). While these characteristics might limit comparability with other attempts, the transparent data gathering process by relatively unbiased meta-analysts allows us to assume that the biases found in the datasets can be attributed to the literature itself more than to the meta-analytic process. Considering, then, that the meta-analysts of all datasets included in the present sample attempted to gather gray literature, but publication bias was still prevalent, the MetaLab subset re-emphasizes a broader problem of the field, namely the lack of publicly available indexing of gray literature. Another characteristic of MetaLab is its basis in the CAMA approach, which from the outset was meant to function as a natural home for file-drawer studies in addition to published studies. Opening the file drawer in this way can also help us estimate in the future how many studies are filtered out by the peer-review process. MetaLab's growth to currently 20 datasets indicates its success. The website's visibility has led to numerous conference presentations on the included meta-analyses as well as invitations to provide tutorials, which in turn have inspired others to start their
own meta-analyses and become curators, or to make extant meta-analyses more visible.
References
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543–554. https://doi.org/10.1177/1745691612459060
Bergmann, C., & Cristia, A. (2016). Development of infants' segmentation of words from native speech: A meta-analytic approach. Developmental Science, 19, 901–917. https://doi.org/10.1111/desc.12341
Bergmann, C., Tsuji, S., Piccinini, P. E., Lewis, M. L., Braginsky, M., Frank, M. C., & Cristia, A. (2018). Promoting replicability in developmental research through meta-analyses: Insights from language acquisition research. Child Development, 89, 1996–2009. https://doi.org/10.1111/cdev.13079
Black, A., & Bergmann, C. (2017). Quantifying infants' statistical word segmentation: A meta-analysis. In G. Gunzelmann, A. Howes, T. Tenbrink, & E. Davelaar (Eds.), Proceedings of the 39th Annual Meeting of the Cognitive Science Society (pp. 124–129). Austin, TX: Cognitive Science Society.
Carbajal, M. J. (2018). Separation and acquisition of two languages in early childhood: A multidisciplinary approach (Unpublished doctoral dissertation). Ecole Normale Supérieure, Paris, France.
Chow, J. C., & Ekholm, E. (2018). Do published studies yield larger effect sizes than unpublished studies in education and special education? A meta-review. Educational Psychology Review, 30, 727–744. https://doi.org/10.1007/s10648-018-9437-7
Cristia, A. (2018). Can infants learn phonology in the lab? A meta-analytic answer. Cognition, 180, 312–327. https://doi.org/10.1016/j.cognition.2017.09.016
Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. British Medical Journal, 315, 629–634. https://doi.org/10.1136/bmj.315.7109.629
Ferguson, C. J., & Brannick, M. T. (2012). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods, 17, 120–128. https://doi.org/10.1037/a0024445
Ferguson, C. J., & Heene, M. (2012). A vast graveyard of undead theories: Publication bias and psychological science's aversion to the null. Perspectives on Psychological Science, 7, 555–561. https://doi.org/10.1177/1745691612459059
Fort, M., Lammertink, I., Peperkamp, S., Guevara-Rukoz, A., Fikkert, P., & Tsuji, S. (2018). SymBouki: A meta-analysis on the emergence of sound symbolism in early language acquisition. Developmental Science, 21, e12659. https://doi.org/10.1111/desc.12659
Gehanno, J. F., Rollin, L., & Darmoni, S. (2013). Is the coverage of Google Scholar enough to be used alone for systematic reviews. BMC Medical Informatics and Decision Making, 13, 7. https://doi.org/10.1186/1472-6947-13-7
Guyatt, G. H., Oxman, A. D., Montori, V., Vist, G., Kunz, R., Brozek, J., ... Williams, J. W. Jr. (2011). GRADE guidelines: 5. Rating the quality of evidence-publication bias. Journal of Clinical Epidemiology, 64, 1277–1282. https://doi.org/10.1016/j.jclinepi.2011.01.011
Huić, M., Marušić, M., & Marušić, A. (2011). Completeness and changes in registered data and reporting bias of randomized controlled trials in ICMJE journals after trial registration policy. PLoS One, 6, e25258. https://doi.org/10.1371/journal.pone.0025258
Kaplan, R. M., & Irvin, V. L. (2015). Likelihood of null effects of large NHLBI clinical trials has increased over time. PLoS One, 10, e0132382. https://doi.org/10.1371/journal.pone.0132382
Konstantopoulos, S. (2011). Fixed effects and variance components estimation in three-level meta-analysis. Research Synthesis Methods, 2, 61–76. https://doi.org/10.1002/jrsm.35
Lakens, D., van Assen, M. A. L. M., Anvari, F., Grange, J. A., Gerger, H., Hasselman, F., ... Zhou, S. (2017). Examining the reproducibility of meta-analyses in psychology: A preliminary report. BITSS Preprint. https://doi.org/10.31222/osf.io/xfbjf
Lau, J., Ioannidis, J. P., Terrin, N., Schmid, C. H., & Olkin, I. (2006). The case of the misleading funnel plot. British Medical Journal, 333, 597–600. https://doi.org/10.1136/bmj.333.7568.597
Macaskill, P., Walter, S. D., & Irwig, L. (2001). A comparison of methods to detect publication bias in meta-analysis. Statistics in Medicine, 20, 641–654. https://doi.org/10.1002/sim.698
Margoni, F., & Surian, L. (2018). Infants' evaluation of prosocial and antisocial agents: A meta-analysis. Developmental Psychology, 54, 1445–1455. https://doi.org/10.1037/dev0000538
Moher, D., Liberati, A., Tetzlaff, J., & Altman, D. G., The PRISMA Group. (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. British Medical Journal, 339, b2535. https://doi.org/10.1136/bmj.b2535
Polanin, J. R., Tanner-Smith, E. E., & Hennessy, E. A. (2016). Estimating the difference between published and unpublished effect sizes: A meta-review. Review of Educational Research, 86, 207–236. https://doi.org/10.3102/0034654315582067
Polanin, J. R., & Terzian, M. (2019). A data-sharing agreement helps to increase researchers' willingness to share primary data: Results from a randomized controlled trial. Journal of Clinical Epidemiology, 106, 60–69. https://doi.org/10.1016/j.jclinepi.2018.10.006
Rabagliati, H., Ferguson, B., & Lew-Williams, C. (2019). The profile of abstract rule learning in infancy: Meta-analytic and experimental evidence. Developmental Science, 22, e12704. https://doi.org/10.1111/desc.12704
R Core Team. (2019). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Tricco, A. C., Tetzlaff, J., Sampson, M., Fergusson, D., Cogo, E., Horsley, T., & Moher, D. (2008). Few systematic reviews exist documenting the extent of bias: A systematic review. Journal of Clinical Epidemiology, 61, 422–434. https://doi.org/10.1016/j.jclinepi.2007.10.017
Tsui, A. S. M., Byers-Heinlein, K., & Fennell, C. T. (2019). Associative word learning in infancy: A meta-analysis of the switch task. Developmental Psychology, 55, 934–950. https://doi.org/10.1037/dev0000699
Tsuji, S., Bergmann, C., & Cristia, A. (2014). Community-augmented meta-analyses: Toward cumulative data assessment. Perspectives on Psychological Science, 9, 661–665. https://doi.org/10.1177/1745691614552498
Tsuji, S., & Cristia, A. (2014). Perceptual attunement in vowels: A meta-analysis. Developmental Psychobiology, 56, 179–191. https://doi.org/10.1002/dev.21179
Van Erp, S., Verhagen, J., Grasman, R. P. P. P., & Wagenmakers, E.-J. (2017). Estimates of between-study heterogeneity for 705 meta-analyses reported in Psychological Bulletin from 1990–2013. Journal of Open Psychology Data, 5, 4. https://doi.org/10.5334/jopd.33
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36, 1–48. Retrieved from http://www.jstatsoft.org/v36/i03/
Von Holzen, K., & Bergmann, C. (2018). A meta-analysis of infants' mispronunciation sensitivity development. In T. T. Rogers, M. Rau, X. Zhu, & C. W. Kalish (Eds.), Proceedings of the 40th Annual Conference of the Cognitive Science Society (pp. 1159–1164). Austin, TX: Cognitive Science Society.
White, H. D. (1994). Scientific communication and literature retrieval. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 41–56). New York, NY: Russell Sage Foundation.
Wickham, H. (2017). tidyverse: Easily install and load the 'Tidyverse' (R package version 1.2.1). Retrieved from https://CRAN.R-project.org/package=tidyverse

History
Received May 20, 2019
Revision received August 26, 2019
Accepted September 19, 2019
Published online March 31, 2020

Open Data
Data and analysis scripts are shared on our Open Science Framework project site (https://osf.io/g6abn/) and on PsychArchives (data: https://www.psycharchives.org/handle/20.500.12034/2185, code: https://www.psycharchives.org/handle/20.500.12034/2186). Analyses were conducted with the tidyverse (Wickham, 2017) and metafor (Viechtbauer, 2010) packages in R (R Core Team, 2019).

Funding
This research was supported by grants from the Berkeley Initiative for Transparency in the Social Sciences, a program of the Center for Effective Global Action (CEGA), with support from the Laura and John Arnold Foundation. The authors were further supported by the H2020 European Research Council [Marie Skłodowska-Curie grant No. 659553] and the Agence Nationale de la Recherche [ANR-14-CE30-0003 MechELex, ANR-17-EURE-0017].

ORCID
Sho Tsuji, https://orcid.org/0000-0001-9580-4500

Sho Tsuji
International Research Center for Neurointelligence
Institutes for Advanced Studies
The University of Tokyo
7-3-1 Hongo, Bunkyo-ku
Tokyo 113-0033
Japan
tsujish@gmail.com
Call for Papers
“Web-Based Research in Psychology”
A Topical Open Access Issue of the Zeitschrift für Psychologie

Guest Editors: Ulf-Dietrich Reips (University of Konstanz, Germany) and Tom Buchanan (University of Westminster, London, UK)
Web-based research in psychology became possible with the development of the World Wide Web in the early 1990s. While at first only a few people began using Web browsers, now more than half of humanity uses a browser every day, providing easy access to participation in science. Traditional laboratory-based and field methods were transformed, adapted, studied, and scaled to properly apply to the online environment. Different methods and new possibilities for research emerged, quantitatively and qualitatively, leading for example directly to Open Access and Open Science. As with other methods in psychology, Web-based research methodology has since evolved, proliferated, and diversified with layers upon layers of new major developments in Internet technology and life (e.g., Google search; “Web 2.0”; social media; smartphones; automated agents; intensive and Big Data; Open Science).

With the current call we seek to create a collection of selected state-of-the-art pieces on and of Web-based research in psychology. We are looking for full original or review articles, shorter research notes, and opinion papers that focus on Web-Based Research in Psychology. While we particularly invite experimental work and methodology, we also welcome work using any other empirical method. Our goal is to show how different psychological topic areas best utilize Web-based methodology, broadly defined. Especially welcome is work that advances our current knowledge by addressing new methods, perspectives, or theory, and that shows the unique potential of Web-based research methods to advance our knowledge in psychology. While investigations of substantive topics in psychology are appropriate, manuscripts should have a strong focus on the methodology and how it enabled the investigation. Studies that simply happen to use methods such as online surveys, for example, would be better published elsewhere.
How to Submit
Interested authors should submit a letter of intent to zfp@uni.kn, including: (1) a working title for the manuscript, (2) names, affiliations, and contact information for all authors, and (3) an abstract of no more than 500 words detailing the content of the proposed manuscript, to the guest editors Ulf-Dietrich Reips (reips@uni.kn) and Tom Buchanan (T.Buchanan@westminster.ac.uk).

There is a two-stage submissions process. Initially, interested authors are requested to submit only abstracts of their proposed papers. Authors of the selected abstracts will then be invited to submit full papers. All papers will undergo blind peer review.

Deadline for submission of abstracts is October 1, 2020. Deadline for submission of full papers is February 1, 2021. The journal seeks to maintain a short turnaround time, with the final version of the accepted papers being due by May 15, 2021. The topical issue will be published as issue 4 (2021), with immediate free access to the issue for all readers.

For additional information, please contact: zfp@uni.kn
About the Journal
The Zeitschrift für Psychologie, founded in 1890, is the oldest psychology journal in Europe and the second oldest in the world. One of the founding editors was Hermann Ebbinghaus. Since 2007 it has been published in English and is devoted to publishing topical issues that provide state-of-the-art reviews of current research in psychology. For detailed author guidelines, please see the journal's website at www.hogrefe.com/j/zfp
Call for Papers
“Dark Personality in the Workplace”
A Topical Issue of the Zeitschrift für Psychologie

Guest Editors: Birgit Schyns (Neoma Business School, Reims, France), Susanne Braun (Durham University Business School, Durham, UK), and Pedro Neves (NOVA School of Business and Economics, Lisbon, Portugal)
In recent years, organizational and industrial psychology research has seen a shift towards the “dark side,” providing broader insights into the experiences of employees and moving away from an idealization of workplace relationships. It has thus become more common not only to look at positive employee characteristics (e.g., conscientiousness, proactive personality) and their relationship to workplace outcomes (e.g., trust, organizational citizenship behavior, performance), but also to investigate negative traits and their potential consequences. For example, the dark triad (narcissism, Machiavellianism, and psychopathy; e.g., Paulhus & Williams, 2002) or dark tetrad (adding sadism; Chabrol, Melioli, van Leeuwen, Rodgers, & Goutaudier, 2015) have become of increasing interest in workplace research due to their often substantial negative impact on others (e.g., Campbell, Hoffman, Campbell, & Marchisio, 2011) as well as on the bottom line (e.g., Chatterjee & Hambrick, 2007). However, there are still many open questions regarding how dark personality traits operate in the workplace as well as how to best conceptualize and measure them. For example, Ong, Roberts, Arthur, Woodman, and Akehurst (2016) found that narcissism relates differently to other ratings over time. But when does this change happen, and are there mitigating or enhancing factors? While there is also an interesting amount of research on different narcissist profiles based on Back et al.'s (2013) conceptualization (admiration vs. rivalry), we still know little about their implications in the workplace (for an exception, see Helfrich & Dietl, 2019). Schyns, Wisse, and Sanders (2019) suggested that followers with dark personality traits may use strategic behaviors to cause havoc in organizations, but what are the specific behaviors they use? What should organizations do to avoid hiring individuals with dark personality traits? Is training effective in minimizing the expression of these dark traits? These are just some of many possible topics for future research. For this topical issue of the Zeitschrift für
Psychologie, we seek contributions considering, but not limited to, the following areas:

1. Theory development
– Models proposing new or conceptually advanced configurations of dark side personality traits.
– Multi-level conceptualizations of the impact that dark side personality traits have at multiple levels of organizations.
2. Underlying processes and contexts
– Studies including models that explain why and under which circumstances dark side personality traits affect workplace outcomes.
– Studies that compare the impact of dark side personality traits between different contexts (e.g., intercultural or sectorial comparison).
3. Interplay between different actors in organizations
– Studies looking at dark side personality traits in leaders and followers, and their interaction (e.g., leader and follower narcissism).
– Team outcomes of dark side personality traits in teams (e.g., inter-team and intra-team trust vs. creation of control mechanisms).
4. Methodological advancements
– Studies that develop and validate new measures of dark side personality traits.
– New approaches to measuring dark side personality traits beyond traditional self-report surveys.
– Studies that examine dark side personality traits in combination with advanced statistical approaches (e.g., growth curve modeling, multilevel modeling).
5. Counterintuitive findings
– Studies that examine how and when such traits might not be toxic or might even be helpful in the workplace (e.g., protective mechanisms in highly competitive environments).
How to Submit
There is a two-stage submissions process. Initially, interested authors are requested to submit only abstracts of their proposed papers. Authors of the selected abstracts will then be invited to submit full papers. All papers will undergo blind peer review.

Interested authors should submit a letter of intent including: (1) a working title for the manuscript, (2) names, affiliations, and contact information for all authors, and (3) an abstract of no more than 500 words detailing the content of the proposed manuscript, to the guest editors:
Birgit Schyns (birgit.schyns@neoma-bs.fr)
Susanne Braun (susanne.braun@durham.ac.uk)
Pedro Neves (pedro.neves@novasbe.pt)

Deadline for submission of abstracts is June 1, 2020. Deadline for submission of full papers is February 1, 2021.
Detailed Timeline
Abstract submission: June 1, 2020
Decision on abstract: July 1, 2020
Full paper submission: February 1, 2021
Review round 1: Between February 15 and May 15, 2021
Decision on first revision communicated: June 1, 2021
1st revisions due: October 1, 2021
Review round 2: Between October 15 and December 15, 2021
Decision on first revision communicated: January 15, 2022
2nd revision due: April 1, 2022
Final decision communicated: June 1, 2022
Publication of topical issue as issue 4 (2022): October 2022

For additional information, please contact the guest editors. For detailed author guidelines, please see the journal's website at https://www.hogrefe.com/j/zfp
References
Back, M. D., Küfner, A. C. P., Dufner, M., Gerlach, T. M., Rauthmann, J. F., & Denissen, J. J. A. (2013). Narcissistic admiration and rivalry: Disentangling the bright and dark sides of narcissism. Journal of Personality and Social Psychology, 105, 1013–1037. https://doi.org/10.1037/a0034431
Campbell, W. K., Hoffman, B. J., Campbell, S. M., & Marchisio, G. (2011). Narcissism in organizational contexts. Human Resource Management Review, 21, 268–284. https://doi.org/10.1016/j.hrmr.2010.10.007
Chabrol, H., Melioli, T., van Leeuwen, N., Rodgers, R., & Goutaudier, N. (2015). The dark tetrad: Identifying personality profiles in adolescents. Personality and Individual Differences, 83, 97–101. https://doi.org/10.1016/j.paid.2015.03.051
Chatterjee, A., & Hambrick, D. C. (2007). It's all about me: Narcissistic chief executive officers and their effects on company strategy and performance. Administrative Science Quarterly, 52, 351–386. https://doi.org/10.2189/asqu.52.3.351
Helfrich, H., & Dietl, E. (2019). Is employee narcissism always toxic? The role of narcissistic admiration, rivalry, and leaders' implicit followership theories for employee voice. European Journal of Work and Organizational Psychology, 28, 259–271. https://doi.org/10.1080/1359432X.2019.1575365
Ong, C. W., Roberts, R., Arthur, C. A., Woodman, T., & Akehurst, S. (2016). The leader ship is sinking: A temporal investigation of narcissistic leadership. Journal of Personality, 84, 237–247. https://doi.org/10.1111/jopy.12155
Paulhus, D. L., & Williams, K. (2002). The Dark Triad of personality: Narcissism, Machiavellianism, and psychopathy. Journal of Research in Personality, 36, 556–568. https://doi.org/10.1016/S0092-6566(02)00505-6
Schyns, B., Wisse, B. M., & Sanders, S. (2019). Shady strategic behavior: Recognizing strategic behavior of dark triad followers. Academy of Management Perspectives, 33, 234–249. https://doi.org/10.5465/amp.2017.0005
Instructions to Authors
The Zeitschrift für Psychologie publishes high-quality research from all branches of empirical psychology that is clearly of international interest and relevance, and does so in four topical issues per year. Each topical issue is carefully compiled by guest editors. The subjects being covered are determined by the editorial team after consultation within the scientific community, thus ensuring topicality. The Zeitschrift für Psychologie thus brings convenient, cutting-edge compilations of the best of modern psychological science, each covering an area of current interest. Zeitschrift für Psychologie publishes the following types of articles: Review Articles, Original Articles, Research Spotlights, Horizons, and Opinions.

Manuscript submission: A call for papers is issued for each topical issue. Current calls are available on the journal's website at www.hogrefe.com/j/zfp. Manuscripts should be submitted as Word or RTF documents by e-mail to the responsible guest editor(s). An article can only be considered for publication in the Zeitschrift für Psychologie if it can be assigned to one of the topical issues that have been announced. The journal does not accept general submissions. Detailed instructions to authors are provided at http://www.hogrefe.com/j/zfp

Copyright Agreement: By submitting an article, the author confirms and guarantees on behalf of him-/herself and any coauthors that he or she holds all copyright in and titles to the submitted contribution, including any figures, photographs, line drawings, plans, maps, sketches and tables, and that the article and its contents do not infringe in any way on the rights of third parties. The author indemnifies and holds harmless the publisher from any third-party claims. The author agrees, upon acceptance of the article for publication, to transfer to the publisher on behalf of him-/herself and any coauthors the exclusive right to reproduce and distribute the article and its contents, both physically and in nonphysical, electronic, and other form, in the journal to which it
has been submitted and in other independent publications, with no limits on the number of copies or on the form or the extent of the distribution. These rights are transferred for the duration of copyright as defined by international law. Furthermore, the author transfers to the publisher the following exclusive rights to the article and its contents:
1. The rights to produce advance copies, reprints, or offprints of the article, in full or in part, to undertake or allow translations into other languages, to distribute other forms or modified versions of the article, and to produce and distribute summaries or abstracts.
2. The rights to microfilm and microfiche editions or similar, to the use of the article and its contents in videotext, teletext, and similar systems, to recordings or reproduction using other media, digital or analog, including electronic, magnetic, and optical media, and in multimedia form, as well as for public broadcasting in radio, television, or other forms of broadcast.
3. The rights to store the article and its content in machine-readable or electronic form on all media (such as computer disks, compact disks, magnetic tape), to store the article and its contents in online databases belonging to the publisher or third parties for viewing or downloading by third parties, and to present or reproduce the article or its contents on visual display screens, monitors, and similar devices, either directly or via data transmission.
4. The rights to reproduce and distribute the article and its contents by all other means, including photomechanical and similar processes (such as photocopying or facsimile), and as part of so-called document delivery services.
5. The right to transfer any or all rights mentioned in this agreement, as well as rights retained by the relevant copyright clearing centers, including royalty rights to third parties.

Online Rights for Journal Articles: Guidelines on authors' rights to archive electronic versions of their manuscripts online are given in the document “Guidelines on sharing and use of articles in Hogrefe journals” on the journal's web page at www.hogrefe.com/j/zfp

July 2017
This fourth “Hotspots in Psychology” issue is devoted to systematic reviews and meta-analyses in research-active fields that have generated a considerable number of primary studies. The common denominator is the research synthesis nature of the contributions, not a specific psychological topic or theme that all articles have to address. This issue explores methodological advances in research synthesis methods relevant for any subfield of psychology. The contributions include: the application of a network meta-analytic approach to analyze the effect of transcranial direct current stimulation on memory; analyzing the performance of a meta-analytic structural equation modeling approach when variables in primary studies have been artificially dichotomized; assessing quality-related aspects of systematic reviews with AMSTAR and AMSTAR2; as well as a graphical approach to depict study-level statistical power in the context of meta-analysis.
Contents include:

Systematic Review and Network Meta-Analysis of Anodal tDCS Effects on Verbal Episodic Memory: Modeling Heterogeneity of Stimulation Locations
Gergely Janos Bartl, Emily Blackshaw, Margot Crossman, Paul Allen, and Marco Sandrini

Response Rates in Online Surveys With Affective Disorder Participants: A Meta-Analysis of Study Design and Time Effects Between 2008 and 2019
Tanja Burgard, Michael Bosnjak, and Nadine Wedderhoff

Dealing With Artificially Dichotomized Variables in Meta-Analytic Structural Equation Modeling
Hannelies de Jonge, Suzanne Jak, and Kees-Jan Kan

Assessing the Quality of Systematic Reviews in Healthcare Using AMSTAR and AMSTAR2: A Comparison of Scores on Both Scales
Karina Karolina De Santis and Ilkay Kaplan

Power-Enhanced Funnel Plots for Meta-Analysis: The Sunset Funnel Plot
Michael Kossmeier, Ulrich S. Tran, and Martin Voracek

Addressing Publication Bias in Meta-Analysis: Empirical Findings From Community-Augmented Meta-Analyses of Infant Language Development
Sho Tsuji, Alejandrina Cristia, Michael C. Frank, and Christina Bergmann
Hogrefe Publishing Group Göttingen · Berne · Vienna · Oxford · Paris Boston · Amsterdam · Prague · Florence Copenhagen · Stockholm · Helsinki · Oslo Madrid · Barcelona · Seville · Bilbao Zaragoza · São Paulo · Lisbon www.hogrefe.com
ISBN 978-0-88937-574-1