multiple comparisons using R by Jungpin Wu

Multiple Comparisons Using R

W4?

Multiple Comparisons Using R

Frank Bretz Torsten H o t h o r n Peter Westfall

LBE 2 0 7 4 6 5 CRC Press Taylor & Francis Group Boca Raton London New York CRC Press fs an imprint of the Taylor & Francis Group an informa business A CHAPMAN

& HALL

BOOK

Chapman & Hall/CRC Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 ÂŠ 2011 by Taylor and Francis Group, LLC Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Printed in the United States of America on acid-free paper 1098765432 International Standard Book Number: 978-1-58488-574-0 (Hardback) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book, may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright. com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, M A 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Library of Congress Cataloging-in-Publication Data Bretz, Frank. Multiple comparisons using R / Frank Bretz, Torsten Hothorn, Peter Westfall. p. cm. Includes bibliographical references and index. ISBN 978-1-58488-574-0 (hardcover: alk. paper) 1. Multiple comparisons (Statistics) 2. R (Computer program language) I . Hothorn, Torsten. I I . Westfall, Peter H., 1957- I I I . Tide. QA278.4.B74 2011 519.5~dc22 Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

2010023538

To Jiamei and my parents — F.B. To Carolin, Elke and Ludwig — T.H.

To Marilyn and Amy — P.W.

Contents

L i s t of F i g u r e s

L i s t of Tables

Preface

xiii

Introduction

General Concepts

2.1

11 11 17 20 20 22 23 29 31 31 32 34 35

2.2

2.3

2.4 3

E r r o r rates and general concepts 2.1.1 E r r o r rates 2.1.2 General concepts Construction methods 2.2.1 Union intersection test 2.2.2 Intersection union test 2.2.3 Closure principle 2.2.4 Partitioning principle Methods based on Bonferroni's inequality 2.3.1 Bonferroni test 2.3.2 Holm procedure 2.3.3 Further topics Methods based on Simes' inequality

Multiple Comparisons in Parametric Models 3.1 General linear models 3.1.1 Multiple comparisons in linear models 3.1.2 T h e linear regression example revisited using R 3.2 Extensions to general parametric models 3.2.1 Asymptotic results 3.2.2 Multiple comparisons in general parametric models 3.2.3 Applications 3.3 T h e m u l t c o m p package 3.3.1 T h e g l h t function 3.3.2 T h e summary method 3.3.3 T h e conf i n t method vii

41 41 41 45 48 48 50 52 53 53 59 64

viii

CONTENTS

Applications 4.1 Multiple comparisons with a control 4.1.1 Dunnett test 4.1.2 Step-down Dunnett test procedure 4.2 A l l pairwise comparisons 4.2.1 Tukey test 4.2.2 Closed Tukey test procedure 4.3 Dose response analyses 4.3.1 A dose response study on litter weight i n mice 4.3.2 Trend tests 4.4 Variable selection in regression models 4.5 Simultaneous confidence bands 4.6 Multiple comparisons under heteroscedasticity 4.7 Multiple comparisons in logistic regression models 4.8 Multiple comparisons i n survival models 4.9 Multiple comparisons i n mixed-effects models

69 70 71 77 82 82 93 99 100 103 108 111 114 118 124 125

Further Topics 5.1 Resampling-based multiple comparison procedures 5.1.1 Permutation methods 5.1.2 Using R for permutation multiple testing 5.1.3 Bootstrap testing - Brief overview 5.2 Group sequential and adaptive designs 5.2.1 Group sequential designs 5.2.2 Adaptive designs 5.3 Combining multiple comparisons with modeling 5.3.1 M C P - M o d : Dose response analyses under model u n c e r tainty 5.3.2 A dose response example

127 127 127 137 140 140 141 150 156

Bibliography Index

157 162 167 183

L i s t of Figures

1.1 1.2 1.3

Probability of committing at least one T y p e I error Scatter plot of the t h u e s e n data Impact of multiplicity on selection bias

2.1 2.2 2.3

Visualization of two null hypotheses i n the parameter space Venn-type diagram for two null hypotheses Schematic diagram of the closure principle for two null hypotheses Schematic diagram of the closure principle for three null hypotheses Partitioning principle for two null hypotheses Schematic diagram of the closure principle for three null hypotheses under the restricted combination condition Rejection regions for the Bonferroni and Simes tests Power comparison for the Bonferroni and Simes tests

2.4 2.5 2.6 2.7 2.8 3.1 3.2 3.3

4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8

Boxplots of the warpbreaks data Two-sided simultaneous confidence intervals for the Tukey test i n the warpbreaks example Compact letter display for all pairwise comparisons i n the w a r p b r e a k s example Boxplots of the r e c o v e r y data One-sided simultaneous confidence intervals for the Dunnett test i n the r e c o v e r y example Closed Dunnett test in the r e c o v e r y example Two-sided simultaneous confidence intervals for the Tukey test i n the immer example Compact letter display for all pairwise comparisons i n the immer example Alternative compact letter display for all pairwise comparisons in the immer example Mean-mean multiple comparison and tiebreaker plot for all pairwise comparisons in the immer example Mean-mean multiple comparison plot for selected orthogonal contrasts i n the immer example ix

2 4 8 24 25 25 27 30 33 36 37 54 65 67 70 76 81 87 89 90 92 94

LIST O F FIGURES 4.9 4.10 4.11 4.12 4.13 4.14 4.15 4.16 4.17 4.18 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 5.12

Schematic diagram of the closure principle for all pairwise comparisons of four means Summary plot of the Thalidomide dose response study Plot of contrast coefficients for the Williams and modified Williams tests Simultaneous confidence band for the difference of two linear regression models Boxplots of alpha synuclein m R N A expression levels Comparison of simultaneous confidence intervals based on different estimates of the covariance matrix Association between smoking behavior and disease status stratified by gender Simultaneous confidence intervals for the probability of suffering from Alzheimer's disease Layout of the t r e e s 5 1 3 field experiment Probability of roe deer browsing damage Schematic diagram of the closed hypotheses set for the adverse event data Histograms of hypergeometric distributions Wang-Tsiatis boundary examples Hwang-Shih-DeCani spending function examples Stopping boundaries for group sequential design example Power plots for group sequential design example Average sample sizes for group sequential design example Schematic diagram of the closure principle for testing adaptively two null hypotheses Disjunctive power for different adaptive design options Schematic overview of the M C P - M o d procedure Model shapes for the selected candidate model set F i t t e d model for the biom data

96 102 105 114 115 119 121 123 125 126

130 132 143 144 147 149 150 153 156 158 164 166

List of Tables

2.1

T y p e I and T y p e I I errors i n multiple hypotheses testing

2.2

Comparison of the Holm a n d Hochberg procedures

3.1

Multiple comparison procedures available i n m u l t c o m p

4.1

Comparison of several multiple comparison procedures for

4.2 4.3 4.4 5.1 5.2 5.3 5.4 5.5 5.6

the r e c o v e r y d a t a Comparison of several multiple comparison procedures for the immer d a t a Summary data of the T h a l i d o m i d e dose response example S u m m a r y of the a l z h e i m e r d a t a Mock-up adverse event d a t a Adverse event contingency tables Upper-tailed p-values for F i s h e r ' s exact test Mock-up dataset with four treatment groups Summary results of a study w i t h two treatments a n d four outcome variables Dose response models implemented i n the D o s e F i n d i n g package

82 100 101 120 128 131 133 136 139 160

Preface Many scientific experiments subject to rigorous statistical analyses involve the simultaneous evaluation of more than one question. Multiplicity therefore b e comes an inherent problem with various unintended consequences. T h e most widely recognized result is that the findings of an experiment c a n be m i s l e a d ing: Seemingly significant effects occur more often t h a n expected by chance alone and not compensating for multiplicity can have important real world consequences. For instance, when the multiple comparisons involve drug efficacy, they may result in approval of a drug as an improvement over existing drugs, when there is in fact no beneficial effect. O n the other h a n d , when drug safety is involved, it could happen by chance that the new drug appears to be worse for some side effect, when it is actually not worse at all. B y contrast, m u l tiple comparison procedures adjust statistical inferences from a n experiment for multiplicity. Multiple comparison procedures thus enable better decision making and prevent the experimenter from declaring a n effect when there is none. Other books on multiple comparison procedures There are different ways to approach the subject of multiple comparison p r o cedures. Hochberg and Tamhane (1987) provided a n extensive mathematical treatise on this topic. Westfall and Young (1993) focused on resampling-based multiple test approaches that incorporate stochastic dependencies i n the data. Another approach is to introduce multiple comparison procedures by c o m p a r ison type (multiple comparisons'with a control, multiple comparisons w i t h the best, etc.), as done i n H s u (1996). Westfall, Tobias, R o m , Wolfinger, and Hochberg (1999) chose a n example-based approach, starting w i t h simple p r o b lems and progressing to more challenging applications while introducing new methodology as needed. Dudoit and van der L a a n (2008) described multiple comparison procedures relevant to genomics. Finally, Dmitrienko, Tamhane, and Bretz (2009) reviewed multiple comparison procedures used i n p h a r m a ceutical statistics by focusing on different drug development applications. H o w is t h i s b o o k different? T h e emphasis of this book is similar to the previous books i n the development of the theory, but differs in the way that applications of multiple comparison procedures are presented. Too often, potential users of these methods are overxiii

PREFACE

xiv

whelmed by the bewildering number of possible approaches. We adopt a u n i fying theme based on maximum statistics that, for example, shows Dunnett's and Tukey's methods to be essentially no different. T h i s theme is emphasized throughout the book, as we describe the common underlying theory of m u l tiple comparison procedures by use of several examples. I n addition, we give a detailed description of available software implementations i n R, a language and environment for statistical computing and graphics (Ihaka and G e n t l e m a n 1996). T h i s book thus provides a self-contained introduction to multiple comparison procedures, with discussion and analysis of many examples using available software. I n this book we focus on "classical" applications of multiple comparison procedures, where the number of comparisons is moderate a n d / o r where strong evidence is needed. H o w to read this book I n Chapter 1 we discuss the "difficult

and ubiquitous

problems

multiplic-

ity" (Berry 2007). We give several characterizations and provide examples to motivate the discussion i n the subsequent chapters. I n Chapter 2 we give a general introduction to multiple hypotheses testing. We describe different error rates and introduce standard terminology. We also cover basic m u l t i ple comparison procedures, including the Bonferroni method and the Simes test. Chapter 3 provides the theoretical framework for the applications i n Chapter 4. We briefly review standard linear model theory and show how to perform multiple comparisons in this framework. T h e resulting methods require the common analysis-of-variance assumptions and thus extend the b a sic approaches of Bonferroni and Simes. We then extend this framework and consider multiple comparison procedures in parametric models relying only on standard asymptotic normality assumptions. We further give a detailed introduction to the m u l t c o m p package i n R, which provides a convenient i n terface to perform multiple comparisons in this general context. Applications to illustrate the results from this chapter are given i n Chapter 4. Examples include the Dunnett test, Tukey's all-pairwise comparisons, and general m u l tiple contrast tests, both for one-way layouts as well as for more complicated models with factorial treatment structures a n d / o r covariates. We also give e x amples using mixed-effects models and applications to survival data. Finally, we review in Chapter 5 a selection of further multiple comparison procedures, which do not quite fit into the framework of Chapters 3 and 4. T h i s includes the comparison of means w i t h two-sample multivariate data using resamplingbased procedures, methods for group sequential or adaptive designs, and the combination of multiple comparison procedures with modeling techniques. Those who are facing multiple comparison problems for the first time might want to glance through Chapter 2, skip the detailed theoretical framework of Chapter 3 and proceed directly to the applications i n Chapter 4. T h e quick reader may start w i t h Chapter 4, which can be read mostly on its own; n e c essary theoretical results are finked backwards to Chapters 2 and 3. Similarly,

PREFACE

the selected multiple comparison procedures i n Chapter 5 can be read without necessarily having read the previous chapters. W h o should read this book T h i s book is for statistics teachers and students, undergraduate or graduate, who are learning about multiple comparison procedures, whether i n a s t a n d alone course on multiple testing, a course i n analysis-of-variance, a course i n multivariate analysis, or a course i n nonparameterics. Additionally, the. book is helpful for scientists who need to use multiple comparison procedures, i n cluding biometricians, clinicians, medical doctors, molecular biologists, a g r i cultural analysts, etc. I t is oriented towards users (i) who have only limited knowledge of multiple comparison procedures but need to apply those t e c h niques i n R, and (ii) who are already familiar w i t h multiple comparison p r o cedures but lack the related implementations in R. We assume that the reader has a basic knowledge of R and can perform elementary d a t a handling steps. Otherwise, we recommend standard textbooks on R for a quick introduction; see Dalgaard (2002) and Everitt and Hothorn (2009) among others. Conventions used in this book We use bold-face capital letters (e.g., C , R , X , . . . ) to denote matrices and bold-face small letters (e.g., c, x , y , . . . ) to indicate vectors. I f the distinction between column and row vectors matters, we introduce vectors as column vectors. A transpose of a vector or matrix is indicated by the superscript ; for example, c denotes the transpose of the (column) vector c. T o simplify the notation, we do not distinguish between random variables and their observed values, unless noted explicitly. We use many code examples i n R throughout this book. Samples of code that could be entered interactively at the R command line are formatted as follows: T

R> l i b r a r y ( " m u l t c o m p " ) Here, R> denotes the prompt sign from the R command line and the user enters everything else. I n some instances the expressions to be entered will be longer than a single line a n d it will appear as follows: R> s u m m a r y ( g l h t ( r e c o v e r y . a o v , l i n f c t = m c p ( b l a n k e t = c o n t r ) , + alternative = "less")) T h e symbol + indicates additional lines which are appropriately indented. Finally, output produced by function calls is shown below the associated code R> r n o r m ( l O ) [1] [8]

0.0310 0.8262

1.3378 -0.8603

0.4158 -0.7134

0.0413

-1.4383

1.0761

-1.2687

PREFACE

xvi C o m p u t a t i o n a l details a n d reproducibility

I n this book, we use several R packages to access different example datasets (such as I S w R , M A S S , etc.), standard functions for the general parametric analyses (such as aov, lm, nlme, etc.) and the m u l t c o m p package to perform the multiple comparsion procedures. A l l of the packages used i n this book are available at the Comprehensive R Archive Network ( C R A N ) , which can be accessed from h t t p : / / C R A N . R - p r o j e c t . org. T h e source code for the analyses presented in this book is available from the m u l t c o m p package. A demo containing the R code to reproduce the individual results is available for each chapter by invoking R> R> R> R> R> R>

library("multcomp") demo("Ch_Intro") demoC'Ch.Theory") demo("Ch_GLM") demo("Ch_Appl") demo("Ch_Misc")

T h e results presented in this book and obtained w i t h the m u l t c o m p package in general have been validated to the extent possible against the S A S macros described i n Westfall et a l . (1999). T h e m u l t c o m p package also includes SAS code for most of the examples presented in Chapters 1, 3 and 4. I t can be accessed i n R v i a R> f i l e . s h o w ( s y s t e m . f i l e ( " M C M T " , "multcomp.sas", + package = " m u l t c o m p " ) ) Readers this book. truncating differences

should not expect to get exactly the same results as stated in For example, differences in random number generating seeds or the number of significant digits in the results may result i n slight to the output shown here.

Acknowledgments T h i s book started as a presentation at the 2004 U s e R ! conference i n V i enna, Austria. Many individuals have influenced the writing of this book since then. O u r joint work over many years with Werner B r a n n a t h , E d g a r Brunner, A l a n Genz, E k k e h a r d G l i m m , Gerhard Hommel, L u d w i g Hothorn, Jason H s u , Franz Konig, W i l l i Maurer, Jose Pinheiro, Martin Posch, Stephen Senn, K l a u s Strassburger, Ajit Tamhane, R a n d y Tobias, and James Troendle had a direct impact on this book and we are grateful for the many fruitful discussions. We are indebted to Sylvia Davis and K e v i n Henning for their close reading and criticism of an earlier version of this book. T h e i r comments and suggestions have greatly improved the presentation of the material. B j o r n Bornkamp, Richard Heiberger, Wei L i u , Hans-Peter Piepho, and G e m o t Wassmer read sections of the manuscript and provided much needed help during the progress of this book. We are especially indebted to Richard Heiberger who, over many years, provided valuable feedback on new functionality i n m u l t c o m p and do-

PREFACE

xvii

nated code to this project. A c h i m Zeileis gave suggestions on the new user interface and was, is, and hopefully will continue being someone to discuss computational and other problems with. Andre Schutzenmeister developed the code for producing compact letter displays i n m u l t c o m p . We also thank the book's acquisitions editor, R o b Calver, for his support and his patience throughout this book publishing project. F r a n k B r e t z , Torsten H o t h o r n a n d Peter Westfall Basel, Munchen and Lubbock

CHAPTER 1

Introduction Many scientific experiments subject to rigorous statistical analyses involve the simultaneous evaluation of more than one question. F o r example, i n clinical trials one may compare more than one treatment group w i t h a control group, assess several outcome variables, measure at various time points, analyze m u l tiple subgroups or look at any combination of these and related questions; but multiplicity problems occur if we want to make simultaneous inference across multiple questions. Similar problems may arise i n agricultural field e x p e r i ments which simultaneously compare several irrigation systems, investigate the dose response relationship of a fertilizer, involve repeated assessments of growth curves for a particular culture, etc. Recently, high-dimensional s c r e e n ing studies have become widely available i n molecular biology and its a p p l i cations, such as gene expression experiments a n d high throughput screenings in early drug discovery. Those screening studies have i n common the problem of identifying a small subset of relevant variables from a huge set of c a n d i date variables (e.g., genes, compounds, proteins). Scientific research provides many examples of well-designed experiments involving multiple investigational questions. Multiplicity is likely to become important when strong evidence and good decision making is required. I n hypotheses test problems involving a single null hypothesis H the s t a tistical tests are often chosen to control the T y p e I error rate of incorrectly rejecting H at a pre-specified significance level a . I f multiple hypotheses, TO say, are tested simultaneously and the final inferences should be valid across all experimental questions of interest, the probability of declaring non-existing effects significant increases i n TO. Assume, for example, that m = 2 hypotheses Hi and H2 are each tested at level a = 0.05 using independent test statistics. For example, let Hi, i = 1,2, denote the null hypotheses that a drug does not show a beneficial effect over placebo for two primary outcome variables. A s sume further that both Hi and H2 are true. T h e n the probability of retaining both hypotheses is (1 — a ) = 0.9025 under the independence assumption. T h e complementary probability of incorrectly rejecting at least one null hypothesis is 1 — (1 —a ) = 2a — a = 0.0975. T h i s is substantially larger t h a n the i n i tial significance level of a = 0.05. I n general, when testing TO null hypotheses using independent test statistics, the probability of committing at least one T y p e I error is 1 — (1 — a) , which reduces to the previous expression for TO = 2. Figure 1.1 displays the probability of committing at least one T y p e I error for TO = 1 , . . . , 100 and a = 0.01,0.05, and 0.10. Clearly, the probability quickly reaches 1 for sufficiently large values of TO. I n other words, if there is 2

INTRODUCTION

a large number of experimental questions and no multiplicity adjustment, the decision maker will commit a T y p e I error almost surely and conclude for a seemingly significant effect when there is none.

m F i g u r e 1.1

Probability of committing at least one Type I error for different s i g nificance levels a and number of hypotheses m.

T o further illustrate the multiplicity issue, consider the problem of testing whether or not a given coin is fair. One may conclude that the coin was biased if after 10 flips the coin landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then the likelihood that a fair coin would come up heads at least 9 out of 10 times is 11 â&#x20AC;˘ ( 0 . 5 ) = 0.0107. T h i s is relatively unlikely, and under common statistical criteria such as whether the p-value is less than or equal to 0.05, one would reject the null hypothesis and conclude that the coin is unfair. While this approach may be appropriate for testing the fairness of a single coin, applying the same approach to test the fairness of many coins could lead to a multiplicity problem. Imagine if one 10

INTRODUCTION

was to test 100 fair coins by this method. Given that the probability of a fair coin coming up heads 9 or 10 times in 10 flips is 0.0107, one would expect that in flipping 100 fair coins 10 times each, it would still be very unlikely to see a particular (i.e., pre-specified) coin come up heads 9 or 10 times. However, it is more likely than not that at least one coin will behave this way, no matter which one. T o be more precise, the likelihood that all 100 fair coins are identified as fair by this criterion is only (1 â&#x20AC;&#x201D; 0 . 0 1 0 7 ) = 0.34. I n other words, the likelihood of declaring at least one of the 100 fair coins as unfair is 0.66; this result can also be approximated from Figure 1.1. Therefore, the application of our single-test coin-fairness criterion to multiple comparisons would more likely t h a n not lead to a false conclusion: W e would mistakenly identify a fair coin as unfair. Let us consider a real data example to illustrate yet a different perspective of the multiplicity problem. Consider the t h u e s e n regression example from A l t m a n (1991) and reanalyzed by Dalgaard (2002). T h e data are available from the I S w R package (Dalgaard 2010) and contain the ventricular s h o r t e n ing velocity and blood glucose measurements for 23 T y p e I diabetic patients with complete observations. I n Figure 1.2 we show a scatter plot of the data including the regression line. Assume that we are interested i n fitting a l i n ear regression model and subsequently in testing whether the intercept or the slope equal 0, resulting i n two null hypotheses of interest. W e will consider this example in more detail in Chapter 3, but we use it here to illustrate some of the multiplicity issues and how to address them in R. Consider the linear model fit 100

R> t h u e s e n . l m < - l m ( s h o r t . v e l o c i t y ~ + data = thuesen) R> summary(thuesen.lm) Call: lm (formula data =

= short.velocity thuesen)

Residuals: Min 1Q Median -0.401 -0.148 -0.022

3Q 0.030

blood.glucose,

Max 0.435

Coefficients: (Intercept) blood.glucose Signif.

codes:

Estimate 1.0978 0.0220 0

'***'

Std.

Error 0.1175 0.0105

0.001

t value 9.34 2.10

'**' 0.01

Pr(>\t\) 6.3e-09 *** 0.048 * 0.05

'.' 0.1

' ' 1

Residual standard error: 0.217 on 21 degrees of freedom (1 observation deleted due to missingness) Multiple R-squared: 0.174, Adjusted R-squared: 0.134 F - s t a t i s t i c : 4.41 on 1 and 21 DF, p-value: 0.04 79

INTRODUCTION

F i g u r e 1.2

Scatter p l o t of the t h u e s e n d a t a w i t h regression line.

F r o m t h e p-values associated w i t h the intercept (p < 0.001) a n d t h e slope (p = 0.048) one m i g h t be t e m p t e d t o conclude t h a t b o t h parameters differ significantly f r o m zero at t h e significance level a = 0.05. However, these a s sessments are based on the marginal p-values, w h i c h do n o t account for a m u l t i p l i c i t y a d j u s t m e n t . A better approach is t o adjust the m a r g i n a l p-values w i t h a suitable m u l t i p l e comparison procedure, such as t h e B o n f e r r o n i test. W h e n a p p l y i n g the B o n f e r r o n i m e t h o d , the m a r g i n a l p-values are essentially m u l t i p l i e d b y the n u m b e r of n u l l hypotheses t o be tested ( t h a t is, by 2 i n o u r example). One possibility t o p e r f o r m the B o n f e r r o n i test i n R is t o use the m u l t c o m p package. U n d e r s t a n d i n g the details of t h e statements below w i l l become clear w h e n i n t r o d u c i n g the m u l t c o m p package f o r m a l l y i n C h a p ter 3. For now we prefer t o illustrate the key ideas w i t h o u t g e t t i n g d i s t r a c t e d by theoretical details.

INTRODUCTION

R> l i b r a r y ( " m u l t c o m p " ) R> t h u e s e n . m c < - g l h t ( t h u e s e n . l m , l i n f c t = d i a g ( 2 ) ) R> summary ( t h u e s e n . mc, t e s t = a d j u s t e d ( t y p e = " b o n f e r r o n i " ) ) Simultaneous

Tests

F i t : lm(formula = short.velocity data = thuesen)

for General ~

Hypotheses: Estimate Std. Error t value 1 == 0 1.0918 0.1115 9.34 2 0 0.0220 0.0105 2.10

Linear

Hypotheses

blood.glucose,

Linear

Pr(>jtl) 1.3e-08 *** 0.096 .

Signif. codes: 0 '***' 0.001 '**' 0.01 (Adjusted p values reported â&#x20AC;&#x201D; bonferroni

'*' 0.05 '.' 0.1 method)

' '1

I n the column entitled P r ( > I t I ) , the slope parameter now fails to be significant. T h e associated adjusted p-value is 0.096, twice as large as the unadjusted p-value from the previous analysis. T h i s is because now the p-value is corrected (i.e., adjusted) for multiplicity using the Bonferroni test. T h u s , if the original aim was to draw a simultaneous inference across both experimental questions (that is, whether intercept or slope equal zero), one would conclude that the intercept is significantly different from 0 but the slope is not. T h e m u l t c o m p package provides a variety of powerful improvements over the Bonferroni test. A s a matter of fact, when calling the g l h t function above, the default approach accounts for the correlations between the parameter e s timates. T h i s leads to smaller p-values compared to the Bonferroni test, as shown i n the output below: R> summary(thuesen.mc) Simultaneous

Tests

F i t : lm(formula = short.velocity data = thuesen)

for General ~

Hypotheses: Estimate Std. Error t value 1 == 0 1.0918 0.1115 9.34 2 == 0 0.0220 0.0105 2.10

Linear

Hypotheses

blood.glucose,

Linear

Pr(>\t\) le-08 *** 0.064 .

Signif. codes: 0 '***' 0.001 '**' 0.01 (Adjusted p values reported â&#x20AC;&#x201D; single-step

'*' 0.05 '.' 0.1 method)

' '1

T h e adjusted p-value for the slope parameter is now 0.064 and thus considerably smaller t h a n the 0.096 from the Bonferroni test, although we still have not achieved significance. During the course of this book we will discuss even more powerful methods, which allow the user to safely claim that the slope is significant at the level OL = 0.05 after having adjusted for multiplicity.

INTRODUCTION

I n summary, this example illustrates (i) the necessity of using proper m u l tiple comparison procedures when aiming at simultaneous inference; (ii) a variety of different multiplicity adjustments, some of which are less powerful than others and should be avoided whenever possible; and (iii) the a v a i l a b i l ity of flexible interfaces i n R, such as the m u l t c o m p package, which provide fast results for data analysts interested in simultaneous inferences of multiple hypotheses. We conclude this chapter with a few general considerations. Broadly s p e a k ing, multiplicity is a concern whenever reproducibility of results is required. T h a t is, if an experiment needs to be repeated, not initially adjusting for multiplicity may increase the likelihood of observing different results i n later experiments. Because scientific research and development can often be r e garded as a sequence of experiments that confirm or add to the understanding of previously established results, reproducibility is an important aspect. A c cording to Westfall and Bretz (2010), replication failure can happen i n at least three ways. Assume that we are interested i n assessing the effect of multiple conditions (treatments, genes, . . . ) . I t may then happen that (i) a condition has a n effect i n the opposite direction t h a n was reported in an experiment; (ii) a condition has an effect, but one that is substantially smaller t h a n was reported i n an experiment; (iii) a condition is ineffective, despite being reported as efficacious i n an e x periment. Such failures to replicate are commonly attributed to flawed study designs and various types of biases; however, multiplicity is as likely a culprit. Further details about the three types of replication errors given above follow. Errors

of declaring

effects

in the "wrong

direction".

T h i s is related to

the

problem of testing two-sided hypotheses with associated directional decisions. Although one might not believe point-zero null hypotheses can truly exist, two-sided tests are common and directional errors are real. I t is likely that directions of the claims are erroneous when multiple comparisons are p e r formed without multiplicity adjustment. For example, if a test of a two-sided hypothesis is done at the 0.05 level, then there is (at most) a 0.025 probability that the test will be declared significant, but i n the wrong direction. W h e n multiple tests are performed, this probability increases, so that if 40 tests are performed, we may expect one directional error (in a worst case). Note that although the probability of committing a directional error may be small, it has a severe impact once it is made because of the decision to the wrong direction. Errors

of declaring

inflated

effect sizes.

A second characterization of the m u l -

tiplicity problem is the impact on selection effects. I n this scenario, we need not postulate directional errors. I n fact, we may believe w i t h a priori certainty that all effects are i n their expected directions. Nevertheless, when we isolate a single, "most significant" comparison from this collection, we can only p r e sume that the estimated effect size is biased upward due to selection effects.

INTRODUCTION

Assume, for example, a n experiment, where we investigate m treatments. I f at the final analysis we select the "best" treatment based on the observed r e sults, then the associated naive treatment estimate will be biased upward. To illustrate this phenomenon, we assume that m random variables x\,... ,x associated w i t h m distinct treatments are independently standard normally distributed. We consider selecting the treatment w i t h the m a x i m u m observed value m a x { a ; i , . . . , x } . Figure 1.3 displays the resulting density curves for different values of m. T h e shift in distribution towards larger values for i n creasing m is evident. I n other words, if we would repeat a n experiment a large number of times for m â&#x20AC;&#x201D; 5 (say) treatment groups a n d at each time report the observed estimate for the selected treatment (that is, the one with the largest observed value), then the average reported estimate would be around 1 instead of 0 (which is the true value). T h e same pattern holds if the true effects are of equal size, but different from 0. Note that i n practice the true bias may even be larger than suggested i n Figure 1.3, if one were to only report the treatment effect estimates from successful experiments w i t h s t a t i s tically significant results. T h a t is, in practice the selection bias observed in Figure 1.3 may be confounded with reporting bias (Bretz, Maurer, and Gallo 2009c). Finally, note from Figure 1.3 that multiplicity does not impact only the location of the distribution, but leads also to a reduction i n the variability and an increase i n the skewness as m increases. m

Errors of declaring effects when none exist. T h e classical characterization of multiplicity is in terms of the "1 in 20" significance criterion: I n 20 tests of hypotheses, all of which are (unknown to the analyst) truly null, we expect to commit one T y p e I error and incorrectly reject one of the 20 null hypotheses. T h u s , multiple testing can increase the likelihood of T V P I errors. T h i s c h a r acterization is closely related to the motivating discussion at the beginning of this chapter and the probabilities displayed i n Figure 1.1. I n fact, much of the material i n this book is devoted to this classical characterization and the description of suitable multiple comparison procedures, which account for the impact of multiple significance testing. While we often attribute lack of replication to poor designs and d a t a collection procedures, we should also consider selection effects related to multiplicity as a cause. I n many cases these effects can be subtle. Consider, for example, a clinical efficacy measure taken one month after administration of a drug. T h e efficacy can be determined (a) using the raw measure, (b) using the percentage change from baseline, (c) using the actual change from baseline or (d) using the baseline covariate-adjusted raw measure. I f we follow a n aggressive strategy and chose the "best" (and most significant) measure, then the reported effect size measure will clearly be inflated, because the m a x i m a l statistic capitalizes (unfairly) on random variations i n the data. I n such a case, it is not surprising that follow-up studies may produce less stellar results; this phenomenon is an example of regression to the mean. I n all three characterizations above, there is a concern that the presentation of the scientific findings from an experiment may be exaggerated. I n some areas e

INTRODUCTION

x F i g u r e 1.3

Density curves of the effect estimate after selecting the "best" treatment with the maximum observed response for different numbers m of treatments.

(especially i n the health care environment) regulatory agencies recognized this problem and released corresponding (international) guidelines to ensure a good statistical practice. I n 1998, the International Conference on Harmonization published a tripartite guideline for statistical principles in clinical trials ( I C H 1998), which reflects concerns with multiplicity: When multiplicity is present, the usual frequentist approach to the analysis of clinical trial data may necessitate an adjustment to the Type I error. Multiplicity may arise, for example, from multiple primary variables, multiple comparisons of treatments, repeated evaluation over time, and/or interim analyses ... [multiplicity] adjustment should always be considered and the details of any adjustment procedure or an explanation of why. adjustment is not thought to be necessary should be set out in the analysis plan. I n addition, the E u r o p e a n Medicines Agency in its Conrrnittee for Proprietary

INTRODUCTION

Medicinal Products document "Points to Consider on Multiplicity Issues i n Clinical Trials" ( E M E A 2002), states that ... multiplicity can have a substantial influence on the rate of false positive c o n clusions whenever there is an opportunity to choose the most favorable results from two or more analyses ... and later echoes the I C H recommendation to state details of the multiple comparison procedure in the analysis p l a n . W h i l e these documents allow that multiplicity adjustment might not be necessary, they also request justifications for such action. A s a result, pharmaceutical companies have routinely begun to incorporate adequate multiple comparison procedures i n their statistical analyses. B u t even if guidelines are not available or do not apply, control of multiplicity is to the experimenter's advantage as it ensures better decision making and safeguards against selection bias.

CHAPTER 2

General Concepts and Basic Multiple Comparison Procedures T h e objective of this chapter is to introduce general concepts related to m u l t i ple testing as well as to describe basic strategies for constructing multiple c o m parison procedures. I n Section 2.1 we introduce relevant error rates used for multiple hypotheses test problems. I n addition, we discuss general concepts, such as adjusted and unadjusted values, single-step and stepwise test p r o c e dures, etc. I n Section 2.2 we review different principles of constructing multiple comparison procedures, including union-intersection tests, intersection-union tests, the closure principle and the partitioning principle. We then review commonly used multiple comparison procedures constructed from marginal p-values. Methods presented here include Bonferroni-type methods and their improvements (Section 2.3) as well as modified Bonferroni methods based on the Simes inequality (Section 2.4). For each method, we describe its a s s u m p tions, advantages and limitations. 2.1 E r r o r rates a n d general concepts I n this section we introduce relevant error rates for multiple comparison p r o c e dures, which extend the familiar error rates used when testing a single null h y pothesis. We also introduce general concepts important to multiple hypotheses testing, including weak and strong error rate control, adjusted and unadjusted p-values, single-step and stepwise test procedures, simultaneous confidence i n tervals, coherence, consonance and,the properties of free and restricted c o m binations. For more detailed discussions of these and related topics we refer the reader to the books by Hochberg and Tamhane (1987) and H s u (1996) as well as to the articles on general multiple test methodology by L e h m a n n (1957a,b); Gabriel (1969); Marcus, Peritz, and Gabriel (1976); Sonnemann (1982); Stefansson, K i m , and H s u (1988); Finner and Strassburger (2002) and the references therein. 2.1.1

Error

rates

Type I error rates L e t m > 1 denote the number of null hypotheses Hi,..., H to be tested. Assume each elementary hypothesis Hi,i = l , . . . , m , is associated w i t h a given experimental question of interest. T h e number m of null hypotheses is m

GENERAL CONCEPTS

a p p l i c a t i o n specific a n d can vary substantially f r o m one case t o another. For example, s t a n d a r d clinical dose finding studies compare a s m a l l n u m b e r of d i s t i n c t dose levels of a new d r u g , m = 4 say, w i t h a c o n t r o l t r e a t m e n t . T h e elementary hypothesis Hi t h e n states t h a t the t r e a t m e n t effect a t dose level i is n o t b e t t e r t h a n the effect under the c o n t r o l , i = 1 , . . . , 4 . T h e n u m b e r m of hypotheses can be i n t h e 1,000s i n a m i c r o a r r a y e x p e r i m e n t , where a n u l l hypothesis Hi m a y state t h a t gene i is not differentially expressed under t w o c o m p a r a t i v e conditions ( D u d o i t and van der L a a n 2008). F i n a l l y , t h e number of hypotheses is infinite w h e n c o n s t r u c t i n g simultaneous confidence bands over a continuous covariate region ( L i u 2010). For any test p r o b l e m , there are three types of errors. A false positive d e c i sion occurs i f we declare an effect when none exists. Similarly, a false negative decision occurs i f we fail t o declare a t r u l y existing effect. I n hypotheses test problems, these errors are denoted as T y p e I and T y p e I I errors, respectively. T h e correct rejection of a n u l l hypothesis coupled w i t h a w r o n g d i r e c t i o n a l decision is denoted as T y p e I I I error. T h e related n o t a t i o n is s u m m a r i z e d i n Table 2 . 1 . L e t M = { 1 , . . . , m } denote t h e index set associated w i t h t h e n u l l hypotheses Hi,..., Hm a n d let M o C M denote the set of mo = | M | t r u e h y potheses, where |A| denotes t h e c a r d i n a l i t y of a set A. I n Table 2 . 1 , V denotes the number of T y p e I errors a n d R t h e number of rejected hypotheses. N o t e t h a t R is a n observable r a n d o m variable, S, T , Uand V are a l l unobservable r a n d o m variables, w h i l e m a n d mo are fixed numbers, where mo is u n k n o w n . 0

Hypotheses

N o t Rejected

Rejected

Total

True False

U T

V S

m m â&#x20AC;&#x201D; mo

Total

Table 2.1

Type I and I I errors i n multiple hypotheses testing.

A s t a n d a r d approach i n u n i v a r i a t e hypothesis t e s t i n g ( m = 1) is t o choose an a p p r o p r i a t e test, w h i c h m a i n t a i n s the T y p e I error r a t e a t a pre-specified significance level a. Different extensions of the u n i v a r i a t e T y p e I error r a t e t o m u l t i p l e test problems are possible. I n t h e following we review several error rate definitions c o m m o n l y used for m u l t i p l e comparison procedures. T h e per-comparison

error

rate P

m is the expected p r o p o r t i o n of T y p e I errors among the m decisions. I f each of the m hypotheses is tested separately at a pre-specified significance level a , then P C E R = a m / m < a. I n m a n y applications, however, c o n t r o l l i n g the per-comparison error rate a t 0

E R R O R RATES AND G E N E R A L CONCEPTS

level a is not considered adequate. Instead, the hypotheses Hi,..., H are considered jointly as a family, where a single T y p e I error already leads to an incorrect decision. T h i s motivates the use of the familywise error rate m

F W E R = ÂĽ(V

> 0),

which is the probability of committing at least one T y p e I error. T h e f a m ilywise error rate is the most common error rate used i n multiple testing, particularly historically, and also in current practice where the number of c o m parisons is moderate a n d / o r where strong evidence is needed. T h e familywise error rate is closely related to the motivating considerations from Chapter 1. Prom Figure 1.1 we can see that F W E R approaches 1 for moderate to large number of hypotheses m if there is no multiplicity adjustment. Note that the familywise error rate reduces to the common T y p e I error rate for m â&#x20AC;&#x201D; 1. W h e n the number m of hypotheses is very large a n d / o r when strong e v i dence is not required (as is typically the case in high-dimensional screening studies in molecular biology or early drug discovery), control of the familywise error rate can be too strict. T h a t is, if the probability of missing differential effects is too high, the use of the familywise error rate may not be a p p r o p r i ate. I n this case, a straightforward extension of the familywise error rate is to consider the probability of committing more t h a n k T y p e I errors for a pre-specified number k: I f the total number of hypotheses m is large, a small number k of T y p e I errors may be acceptable. T h i s leads to the generalized familywise error rate g F W E R = P ( V > A:), see V i c t o r (1982); Hommel and Hoffmann (1988), and L e h m a n n and Romano (2005) for details. Control of the generalized familywise error rate is less stringent t h a n control of the f a m ilywise error rate at a common significance level a: A method controlling the generalized familywise error rate may allow, with high probability, a few T y p e I errors, provided that the number is less t h a n or equal to the pre-specified number k. O n the other hand, methods controlling the familywise error rate ensure that the probability of committing any T y p e I error is bounded by a. A n alternative approach is to relate the number V of false positives to the total number R of rejections. L e t Q = V/R if R > 0 and Q = 0 otherwise. T h e false discovery rate FDR

E(Q) â&#x20AC;&#x201D; j R > 0 j P O R > 0 ) + 0J

^| R>0 ) P(H>0) J

is then the expected proportion of falsely rejected hypotheses among the r e jected hypotheses (Benjamini and Hochberg 1995). Earlier ideas related to the false discovery rate can be found in Seeger (1968) and Soric (1989). T h e i n t r o duction of the false discovery rate has initiated the investigation of alternative error control criteria and many further measures have been proposed recently. Storey (2003), for example, proposed the positive false discovery rate p F D R =

GENERAL

CONCEPTS

E,(V/R\R > 0), w h i c h is defined as the expected p r o p o r t i o n of falsely rejected hypotheses among the rejected hypotheses given t h a t some are rejected. A different concept is t o c o n t r o l t h e p r o p o r t i o n V/R d i r e c t l y : K o r n , Troendle, McShane, a n d Simon (2004) and van der L a a n , D u d o i t , and P o l l a r d (2004) independently i n t r o d u c e d computer-intensive m u l t i p l e comparison procedures t o c o n t r o l the proportion

of false

positives

PFP

= 짜(V/R

> p),0

< g <

For a recent review we refer the reader t o B e n j a m i n i (2010) and t h e references therein. I n general, PCER < FDR <

FWER

for a given m u l t i p l e comparison procedure. T h i s can be seen b y n o t i n g t h a t V/m < Q < l { y > o } > where the indicator f u n c t i o n 1 ^ == 1 i f an event A is t r u e a n d 1 = 0 otherwise, a n d t h a t P C E R = E(V/m), F D R = E ( Q ) , and F W E R = E ( l { v > o } ) - T h u s , a m u l t i p l e comparison procedure w h i c h controls the familywise error rate also controls the false discovery rate and t h e percomparison error r a t e , b u t not vice versa. I n contrast, familywise error rate c o n t r o l l i n g procedures are more conservative t h a n false discovery rate c o n t r o l l i n g procedures i n t h e sense t h a t they lead t o a smaller n u m b e r of rejected hypotheses. I n any case, good scientific practice requires t h e specification of the T y p e I error rate control t o be done p r i o r t o the d a t a analysis. For any of the error concepts above, the error c o n t r o l is denoted as weak, i f the T y p e I error rate is controlled only under t h e global null hypothesis A

p| H

w h i c h assumes t h a t a l l n u l l hypotheses H\,..., H are t r u e . Consequently, i n case of c o n t r o l l i n g t h e familywise error rate i n t h e weak sense, i t is required that M

P ( V > 0\H)

Consider an experiment investigating a new t r e a t m e n t for m u l t i p l e outcome variables. C o n t r o l l i n g the familywise error rate i n t h e weak sense t h e n implies t h e c o n t r o l of the p r o b a b i l i t y of declaring an effect for at least one outcome variable, when there is no effect on any variable. I n practice, however, i t is unlikely t h a t a l l n u l l hypotheses are t r u e and t h e global n u l l hypothesis H is r a r e l y expected t o h o l d . T h u s , a stronger error r a t e c o n t r o l under less r e s t r i c t i v e assumptions is often necessary. I f , for a given m u l t i p l e comparison procedure, the T y p e I error rate is controlled under any p a r t i a l c o n f i g u r a t i o n of t r u e a n d false n u l l hypotheses, the error control is denoted as strong. For example, t o c o n t r o l the familywise error rate strongly i t is required t h a t m a x P \V /CM \

> 0

H **i

) < a,

where the m a x i m u m is t a k e n over a l l possible configurations 0 ^ I C M of t r u e n u l l hypotheses. I n t h e previous example, c o n t r o l l i n g t h e familywise error

E R R O R R A T E S AND G E N E R A L C O N C E P T S

rate i n the strong sense implies the control of the probability of declaring an effect for at least one outcome variable, regardless of the effect sizes for any of the outcome variables. Note that if m = 0, then V = 0 and F D R = 0. I f m = m, then F D R = E(l|jR > 0)F(R > 0) = H?(R > 0) = F W E R . Hence, any false discovery rate controlling multiple comparison procedure also controls the familywise error rate i n the weak sense. Having introduced different measures for T y p e I errors in multiple test p r o b lems, we are now able to formally define a multiple comparison procedure as any statistical test procedure designed to account for and properly control the multiplicity effect through a suitable error rate. I n this book, we focus on a p plications where the number of comparisons is moderate a n d / o r where strong evidence is needed. T h u s , we restrict our attention to multiple comparison procedures controlling the familywise error rate i n the strong sense. T h e appropriate choice of null hypotheses being of primary interest is a controversial question. T h a t is, it is not always clear which set of hypotheses should constitute the family Hi,..., H . T h i s topic has often been i n dispute and there is no general consensus. A n y solution will necessarily be application specific and at its best serve as an example for other areas. Westfall arid Bretz (2010), for example, provided some guidance, on when and how to adjust for multiplicity at different stages of drug development. 0

Type I I error

rates

A common requirement for any statistical test is to maximize the power and thereby to minimize the T y p e I I error rate for a given T y p e I error criterion. Power considerations are thus an integral part of designing a scientific e x p e r i ment. Analogous to extending the T y p e I error rate, power can be generalized in various ways when moving from single to multiple hypotheses test problems. Power concepts to measure an experiment's success are then associated with the probability of rejecting an elementary null hypothesis Hi,i â&#x201A;Ź M , when in fact Hi is not true. T h e problem is that the individual events Hi is rejected" ,

i G M,

can be combined i n different ways, thus leading to different measures of s u c cess. Below we use the notation from Table 2.1 to briefly review some common power concepts. T h e individual power 7rj

= P(reject Hi),

i â&#x201A;Ź Mi = M \

M, 0

is the rejection probability for a false hypothesis Hi. T h e concept of average power is closely related to individual power. I t is defined as the average expected number of correct rejections among all false null hypotheses, that is,

where mi = \Mi\ denotes the number of false null hypotheses. Alternatively,

GENERAL CONCEPTS

16 t h e disjunctive

power

dis

= P ( 5 > 1)

is t h e p r o b a b i l i t y of rejecting a t least one false n u l l hypothesis. A n appealing feature of t h e d i s j u n c t i v e power is t h a t 7 r decreases t o t h e f a m i l y w i s e error r a t e as t h e effect sizes related t o t h e false n u l l hypotheses Hi,i ÂŁ Mi, decrease. dis

I n c o n t r a s t , t h e conjunctive

power

c o n

=P(5 = m ) 1

is t h e p r o b a b i l i t y of rejecting a l l false n u l l hypotheses. N o t e t h a t d i s j u n c t i v e and c o n j u n c t i v e power have also been referred t o as m u l t i p l e (or m i n i m a l ) a n d t o t a l (or complete) power, respectively; see M a u r e r a n d M e l l e i n (1988) a n d W e s t f a l l et a l . (1999). B u t since 7 r > 7 r , these n a m i n g conventions o f t e n lead t o confusion (Senn a n d B r e t z 2007). W h e n t h e f a m i l y of tests consists of pairwise mean comparisons, t h e previously m e n t i o n e d power measures have been i n t r o d u c e d as p e r - p a i r power, any-pair power, a n d all-pairs power ( R a m sey 1978). F i n a l l y , i t should be noted t h a t these power definitions are readily extended t o any subset M[ C M i of false n u l l hypotheses. N o t e also t h a t a l l of these p r o b a b i l i t i e s are c o n d i t i o n a l on w h i c h n u l l hypotheses are t r u e a n d w h i c h are false. T h e relevant p r a c t i c a l question is t o determine t h e a p p r o p r i a t e power c o n cept t o use for a given study. One m a y argue t h a t c o n j u n c t i v e power s h o u l d be used i n studies t h a t a i m at detecting a l l existing effects, such as i n i n t e r section u n i o n settings, see Section 2.2.2. D i s j u n c t i v e power is recommended i n studies t h a t a i m a t detecting at least one t r u e effect, such as i n u n i o n i n tersection settings, see Section 2.2.1. I n d i v i d u a l power is appealing i n clinical trials w i t h m u l t i p l e secondary outcome variables ( B r e t z , M a u r e r , a n d H o m m e l 2010) a n d average power can be useful for c o m p a r i n g different m u l t i p l e comparison procedures. I n general, however, a suitable power d e f i n i t i o n can be given o n l y on a case-by-case basis by choosing power measures t a i l o r e d t o t h e s t u d y objectives. dls

Directional

c o n

errors

A p a r t i c u l a r issue arises i n two-sided test problems, w h e n t h e elementary h y potheses H\, .. ., H are p o i n t - n u l l hypotheses. H a v i n g rejected Hi, t h e n a t u r a l i n c l i n a t i o n is t o make a directional decision on t h e sign of t h e effect b e i n g tested. T h i s requires c o n t r o l of b o t h T y p e I errors a n d errors i n d e t e r m i n i n g t h e sign of n o n - n u l l effects. A directional error (also k n o w n as T y p e I I I e r r o r ) is defined as t h e r e j e c t i o n of a false n u l l hypotheses, where t h e sign of t h e t r u e effect parameter is opposite t o t h e one of i t s sample estimate. m

L e t Ai denote t h e event of a t least one T y p e I error such t h a t P(^4i) = F W E R . L e t further A ^ denote t h e event t h a t there is at least one sign error a m o n g t h e t r u e n o n - n u l l effects. T h e p r o b l e m becomes how t o c o n t r o l t h e combined error rate P ( A i U A>ÂŁ) at a pre-specified level. Stepwise test procedures (see Section 2.1.2 for a definition) are powerful m e t h o d s for c o n t r o l l i n g t h e familywise error rate, b u t do not necessarily c o n t r o l t h e combined error rate.

ERROR RATES AND GENERAL CONCEPTS

Shaffer (1980) gave a counterexample involving shifted C a u c h y distributions; however, she also noted that for independent test statistics satisfying certain distributional conditions (which include the normal but rule out the Cauchy case), the combined error rate is controlled by the H o l m procedure from S e c tion 2.3.2 (Holm 1979a). Moreover, Holm (1979b) noted the control of the combined error rate for conditionally independent tests including noncentral multivariate t with identity dispersion matrix. F i n n e r (1999) further extended the class of stepwise tests that control the combined error rate to include some step-up tests, closed F tests, and modified S method tests. He also noted that while specialized procedures guaranteeing combined error rate control have been developed (see, for example, Bauer, Hackl, Hommel, and Sonnemann (1986)), they are often less powerful than standard closed and stepwise tests. Westfall, Tobias, and Bretz (2000) systematically investigated combined e r ror rates of stepwise test procedures relevant to analysis-of-variance models involving correlated comparisons, using both analytic and simulation-based methods. No cases of excess directional error were found for typical a p p l i c a tions involving noncentral multivariate t distributions. 2.1.2

General

Single-step

concepts

and stepwise

procedures

One possibility of classifying multiple comparison procedures is to divide them into single-step and stepwise procedures. Single-step procedures are c h a r a c t e r ized by the fact that the rejection or non-rejection of a null hypothesis does not take the decision for any other hypothesis into account. T h u s , the order in which the hypotheses are tested is not important and one c a n think of the multiple inferences as being performed in a single step. A well-known e x a m ple of a single-step procedure is the Bonferroni test. I n contrast, for stepwise procedures the rejection or non-rejection of a null hypothesis may depend on the decision of other hypotheses. T h e equally well-known H o l m procedure is a stepwise extension of the Bonferroni test using the closure principle u n der the free combination condition (see Section 2.3 for a description of these procedures). Stepwise procedures are further divided into step-down and step-up p r o c e dures. B o t h types of procedures assume a sequence of hypotheses H\ -<...-< H , where the ordering " X " of the hypotheses can be d a t a dependent. Stepdown procedures start testing the first ordered hypothesis Hi and step down through the sequence while rejecting the hypotheses. T h e procedure stops at the first non-rejection (at Hi, say), and Hi,..., H i - i are rejected. T h e Holm procedure is a n example of a step-down procedure. Step-up procedures start testing Hm and step up through the sequence while retaining the hypotheses. T h e procedure stops at the first rejection (at Hi, say), and Hi, ...,Hi are all rejected. Single-step procedures are generally less powerful t h a n their stepwise e x tensions in the sense that any hypothesis rejected by the former will also be m

GENERAL CONCEPTS rejected b y t h e l a t t e r , b u t not vice versa. T h i s w i l l become clear w h e n i n t r o ducing t h e closure p r i n c i p l e i n Section 2.2.3. T h e power advantage of stepwise test procedures, however, comes a t the cost of increased difficulties i n c o n s t r u c t i n g compatible simultaneous confidence intervals for the p a r a m e t e r of interest, w h i c h have a j o i n t coverage p r o b a b i l i t y of a t least 1 — a (see below). Adjusted

p-values

T h e c o m p u t a t i o n of p-values is a c o m m o n exercise i n u n i v a r i a t e hypothesis test problems. Therefore, i t is desirable t o also compute adjusted p-values for a given m u l t i p l e comparison procedure, w h i c h are d i r e c t l y comparable w i t h the significance level a. A n adjusted p-value qi is defined as t h e smallest significance level for w h i c h one s t i l l rejects the elementary hypothesis Hi,i € M , given a p a r t i c u l a r m u l t i p l e comparison procedure. I n case of t h e f a m i l y w i s e error r a t e , qi — i n f { o : E (0, l)\Hi is rejected a t level a } , i f such a n a exists, a n d qi = 1 otherwise; see W e s t f a l l a n d Y o u n g (1993) and W r i g h t (1992). A d j u s t e d p-values capture by c o n s t r u c t i o n t h e m u l t i p l i c i t y a d j u s t m e n t induced t h r o u g h a given m u l t i p l e comparison procedure a n d incorporate t h e p o t e n t i a l l y complex u n d e r l y i n g decision rules. Consequently, whenever qi < a, the associated elementary n u l l hypothesis Hi can be rejected w h i l e c o n t r o l l i n g the familywise error rate at level a. Examples of c o m p u t i n g adjusted p-values are given later when we describe t h e i n d i v i d u a l m u l t i p l e comparison procedures. T h e m a r g i n a l p-values pi are denoted as unadjusted p-values i n t h i s book. Simultaneous

confidence

intervals

T h e d u a l i t y between t e s t i n g and confidence intervals is well established i n u n i v a r i a t e hypotheses test problems ( L e h m a n n 1986). A general m e t h o d for c o n s t r u c t i n g a confidence set f r o m a significance test is as follows. L e t 0 d e note the parameter of interest. For each parameter p o i n t 0Q, test t h e hypothesis H: 0 = $Q a t level a. T h e set of all parameter p o i n t s 0Q, for w h i c h 'H: 6 = 0Q is accepted, constitutes a confidence set for t h e t r u e value of 0 w i t h coverage p r o b a b i l i t y 1 — a. T h i s m e t h o d is essentially based on p a r t i t i o n i n g t h e p a rameter space i n t o subsets, where each subset consists of a single parameter point. T h e p a r t i t i o n i n g p r i n c i p l e described i n Section 2.2.4 provides a n a t u r a l e x tension for d e r i v i n g c o m p a t i b l e simultaneous confidence intervals i n m u l t i p l e test problems, w h i c h have a j o i n t coverage p r o b a b i l i t y of at least 1 — a for the parameters of interest ( F i n n e r and Strassburger 2002). Here, c o m p a t i b i l i t y b e tween a m u l t i p l e comparison procedure and a set of simultaneous confidence intervals means t h a t i f a n u l l hypothesis is rejected w i t h t h e test procedure, t h e n the associated m u l t i p l i c i t y corrected confidence i n t e r v a l excludes a l l p a rameter values for w h i c h the n u l l hypothesis is t r u e ( H a y t e r a n d H s u 1994). A p p l y i n g t h e p a r t i t i o n i n g p r i n c i p l e , t h e p a r a m e t e r space is p a r t i t i o n e d i n t o

E R R O R RATES AND G E N E R A L CONCEPTS

small disjoint subhypotheses, where each is tested appropriately. T h e union of all non-rejected hypotheses then yields a confidence set C for the parameter vector of interest. Note that the finest possible partition is given by a pointwise partition such that each point of the parameter space represents an element of the partition. Most of the classical (simultaneous) confidence intervals can be derived by using the finest partition and a n appropriate family of oneor two-sided tests. I n general, however, this may not be the case, although a confidence set C can always be used to construct simultaneous confidence intervals by simply projecting C on the coordinate axes. Compatibility can be ensured by enforcing mild conditions on the partition and the test family (Strassburger, Bretz, and Hochberg 2004). Simultaneous confidence intervals are available i n closed form for many s t a n dard single-step procedures and will be used i n Chapters 3 and 4. Simultaneous confidence intervals for stepwise procedures, however, are usually more difficult to derive and often have limited practical use. We refer to Strassburger and Bretz (2008) and Guilbaud (2008, 2009) for recent results and discussions.

Free and restricted

combinations

A family of null hypotheses Hi,i â&#x201A;Ź M , satisfies the free combination condition if for any subset / C M the simultaneous t r u t h of Hi,i â&#x201A;Ź / , and falsehood of the remaining hypotheses is a plausible event. Otherwise, the hypotheses Hi,..., Hm satisfy the restricted combination condition (Holm 1979a; Westfall and Young 1993). A s an example of a hypotheses family satisfying the free combination c o n dition, consider the comparison of two treatments w i t h a control treatment (resulting i n m = 2 null hypotheses). A n y of the three events " n o n e / o n e / b o t h of the treatments is better than the control treatment" is then a plausible configuration and likely to be true i n practice. A s a n example of a hypotheses family satisfying the restricted combination condition, consider all pairwise comparisons of three treatments means 0 i , 02, and 03 (resulting i n m = 3 null hypotheses). I n this example, not all configurations of null and alternative hypotheses are logically possible. For example, if 0i ^ 02, then 0i = 0$ and 0 = 0 cannot be true simultaneously, thus restricting the possible configurations of true and false null hypotheses. 2

T h e motivation for the distinction between free and restricted c o m b i n a tions will become clear later when deriving stepwise test procedures based on the closure principle. T h e Holm procedure (Section 2.3.2) and the step-down Dunnett procedure (Section 4.1.2) are both examples of a large class of closed test procedures tailored to hypotheses satisfying the free combination c o n dition. Correspondingly, the Shaffer procedure (Section 2.3.2) and the closed Tukey test (Section 4.2.2) are examples of closed test procedures for restricted hypotheses.

GENERAL CONCEPTS

20 Coherence

and

consonance

A multiple comparison procedure is called coherent if it has the following property: I f Hi C Hj and Hj is rejected, then Hi is rejected as well (Gabriel 1969). Coherence is an important requirement for any multiple comparison procedure. I f coherence is not satisfied, problems with the interpretation of the test results may occur. T h e closed test procedures described i n Section 2.2.3 are coherent by construction. B y contrast, the Holm procedure described in Section 2.3.2 is coherent with free combinations, but i n special cases involving restricted combinations, may not be coherent (Hommel and Bretz 2008). Note that any non-coherent multiple comparison procedure can be replaced by a coherent procedure which is uniformly at least as powerful (Sonnemann and Finner 1988). Consonance is another desirable property of multiple comparison p r o c e dures, although it is not as important as coherence. L e t Hi = f) i ^* denote the intersection hypothesis for an index set I C M. Furthermore, denote a h y pothesis Hi as non-maximal if there is at least one J C M w i t h Hj D Hi; o t h erwise, denote Hi as as maximal. Consonance implies that if a non-maximal hypothesis Hi is rejected, one can reject at least one maximal hypothesis Hj 3 Hi (Gabriel 1969). I n many applications, the elementary null h y p o t h e ses J H I , . . . , Hm are maximal. Consonance then ensures that if an intersection hypothesis Hj is rejected, at least one elementary hypothesis Hi w i t h i G I can be rejected as well. Consonance will become important later when describing max-t tests that allow one to draw inferences about the individual null h y potheses Hi (Section 2.2.1), and it will become the basis to construct efficient shortcut procedures (Section 2.2.3). A more rigorous discussion of consonance can be found in B r a n n a t h and Bretz (2010). iâ&#x201A;Ź

2.2

C o n s t r u c t i o n m e t h o d s for m u l t i p l e c o m p a r i s o n p r o c e d u r e s

We now consider different methods to construct multiple comparison p r o c e dures. These include union intersection (and intersection union) tests to c o n struct multiple comparison procedures for the intersection (union) of several elementary null hypotheses. T h e closure principle and the recently introduced partitioning principle are powerful tools, which extend the union intersection test principle to obtain individual assessments for the elementary null hypotheses. 2.2.1

Union intersection

test

Historically, the union intersection test was the first construction method for multiple comparison procedures (Roy 1953; Roy and Bose 1953). Assume, for example, that several irrigation systems are compared with a control. I t is natural to claim success if at least one of the comparative irrigation systems leads to better results t h a n the control. I f Hi denotes the elementary hypoth-

CONSTRUCTION METHODS

esis of no difference in effect between irrigation system i and control, we wish to correctly reject any (but at least one) false Hi. T o formalize this multiple comparison problem, consider a family of null hypotheses Hi with associated alternative hypotheses Ki,i € M. We are then interested i n testing the intersection null hypothesis H = C\ieM H** ^ P~ proach is to use test statistics U,i € M, and reject H if any ti exceeds its associated critical value Ci. T h e rejection region is thus a union of rejection regions, (J {U > Ci}, giving rise to the " u n i o n " i n the "union intersection" term. I n summary, this construction leads to union intersection tests, which test the intersection of the null hypotheses Hi against the union of the a l t e r native hypotheses Ki, that is, n

ieM

H =

p | Hi i€M

against

K =

( J Ki. i€M

Note that union intersection tests consider the global intersection null h y pothesis H without formally allowing individual assessments of the elementary hypotheses Hi,..., H . T h a t is, if H is rejected by a union intersection test, one is still left w i t h the question, which of the elementary hypotheses Hi should be rejected. T h i s shortcoming can be dealt w i t h i n numerous ways, including simultaneous confidence interval construction i n Section 2.1.2, or by applying the closure principle described in Section 2.2.3. I t turns out that the union intersection test procedure dovetails nicely w i t h multiple comparison procedures i n general; many of the procedures described i n this book take the union intersection method as a foundation. Max-t tests form a n important class of union intersection tests and will become essential i n Chapters 3 and 4. L e t 11,..., t denote the individual test statistics associated w i t h the hypotheses Hi,..., Hm. Assume without loss of generality that larger values of ti favor the rejection of Hi. A m a t u r a l approach is then to consider the maximum of the individual test statistics U, leading to the max-t test m

*max — m a x { t i , . . . , t }-

(2-1)

We then reject the global null hypothesis H if and only if t > c, where the common constant c is chosen to control the T y p e I error rate at level a, that is, P ( t > c\H) = a. T h e critical value c is calculated from the joint distribution of the random variables t i , . . . , t . Determining c is often difficult or sometimes even impossible if that joint distribution is not available or numerically intractable, and conservative solutions have to be applied. It follows from Gabriel (1969) that applying max-t tests often leads to c o herent and consonant multiple comparison procedures, giving rise to their practical importance. Many popular multiple comparison procedures are in fact max-t tests by construction, such as the Bonferroni test (Section 2.3.1), the Dunnett test (Section 4.1.1), or the Tukey test (Section 4.2.1). I n a d d i tion, powerful stepwise test procedures can be derived based on the closure m a x

m a x

GENERAL CONCEPTS

principle, which allows us to assess the elementary hypotheses Hi,..., H (Hochberg and Tamhane 1987, p. 55); see Section 2.2.3 for further details. Note that if smaller values of U favor the rejection of Hi, the minimum of the individual test statistics U has to be taken instead. Because this is conceptually similar to Equation (2.1), we use the common term max-t tests, regardless of the sideness of the test problem. I n two-sided test problems the maximum is taken over the absolute values of the individual test statistics, that is, *max = m a x { | t i | , . . . , \t \}' Finally, note that the individual test statistics U (or, equivalently, their associated p-values Pi) can be combined i n other ways instead of taking their maximum; see Westfall (2005) for an overview. m

2.2.2

Intersection

union test

Consider the following example from drug development. International guidelines require that combination therapies (that is, the simultaneous a d m i n i s tration of two or more medications) have to show a clinical benefit against all individual monotherapies before being considered for market release ( E M E A 2009). I n contrast to the union intersection settings considered in Section 2.2.1, here it is required that all null hypotheses of no beneficial effect are rejected in order to claim that the combination therapy has a beneficial effect. Formally, we are given the test problem H' =

( J Hi iEM

versus

K' =

f] ieM

T h e intersection union test then rejects the union null hypothesis H' at overall level a, if all elementary hypothesis Hi are rejected by their local a-level tests (Berger 1982). I f all test statistics U,i â&#x201A;Ź M , have the same marginal d i s tribution, the intersection union test rejects H' if and only if m i n ^ M U > c, where c is the (1 â&#x20AC;&#x201D; a)-quantile from that marginal distribution. I n this p a r ticular case the intersection union test is also known as min-test, as coined by L a s k a and Meisner (1989). Suppose that in the example above we have t test ' statistics comparing the combination therapy with the individual monotherapies. T h e hypothesis H' is rejected and we conclude for a beneficial effect of the combination therapy, if the smallest t test statistic is larger t h a n the (1 â&#x20AC;&#x201D; a)-quantile from the univariate t distribution. Note that if only some of the null hypotheses Hi are rejected locally (for example, U > c for some i i n case of the min-test), the union null hypothesis H' is retained and no individual assessments are possible. I n such cases no elementary null hypothesis Hi can be rejected, because otherwise the familywise error rate may not be controlled. T h i s property often leads to the m i s conception that the intersection union test is conservative i n the sense that the nominal T y p e I error rate is not exhausted and therefore the test would lack i n power. I n fact, the intersection union test fully exploits the T y p e I error and is moreover uniformly most powerful within a certain class of m o n o tone a-level tests (Laska and Meisner 1989). Improvements, which discard

CONSTRUCTION METHODS

the monotonicity condition or restrict the parameter space, can be found in Sarkar, Snapinn, and Wang (1995) and Patel (1991), respectively. Compatible confidence intervals for intersection union tests involving two hypotheses are given by Strassburger et al. (2004).

2.2.3

Closure

principle

T h e union intersection tests from Section 2.2.1 test the global null hypothesis H without formally assessing the individual hypotheses Hi,..., H . T h a t is, if j r = n i€M rejected by a union intersection test, we cannot make any conclusions about the elementary hypotheses Hi. T h e closure principle i n t r o duced by Marcus et al. (1976) is a general construction method which leads to stepwise test procedures (Section 2.1.2) and allows one to draw individual conclusions about the elementary hypotheses Hi. T o describe the closure principle, we consider initially the case of m = 2 null hypotheses Hi and H2 and discuss the general case later. Suppose we want to assess whether any of two treatments (for example, two new drugs or irrigations systems) is better than a control treatment. L e t fij denote the mean effect for treatment j , where j = 0 (control), 1,2. L e t further 0i = Mt Mo denote the mean effect difference between treatment £ = 1,2 and the control. T h e $i are the parameters of interest and the resulting elementary null hypotheses are Hi: 0i < 0, i = 1,2. W h e n using the Bonferroni test (which is formally described in Section 2.3), each hypothesis Hi is tested at level a / 2 in order to control the familywise error rate at level a. However, the Bonferroni test can be improved by applying the closure principle, as described now. m

l s

—

I t is useful to formally consider the hypotheses Hi as subsets of the p a r a m e ter space, about which we want to draw our inferences. L e t O = R denote the parameter space w i t h 0 = (0i,02) € O. Figure 2.1 visualizes the hypotheses Hi = {0 G R : 0% < 0 } , i = 1,2, as subsets of the real plane (the parameter space). Clearly, the two elementary hypotheses Hi and H2 are not disjoint: T h e intersection of both is given by H = Hi(lH = {0 6 R : 0i < 0 and 0 < 0 } , which is the lower left quadrant in Figure 2.1. Testing the intersection h y p o t h esis H12 requires a n adjustment for multiplicity. T h i s is taken into account by the Bonferroni test, which actually tests the entire union H1UH2 at level a/2 and not j u s t the intersection hypothesis H12. However, Figure 2.1 suggests that the remaining parts Hi \ H12 and H2 \ H12 can each be tested at full level a, without the need to adjust further for multiplicity. T h i s leads to the " n a t u r a l " test strategy of first testing the intersection hypothesis H12 w i t h an appropriate union intersection test, and, if this is significant, continue testing Hi and H2, each at full level a. T h e null hypothesis Hi is rejected (while controlling the familywise error rate strongly at level a) if and only if both Hi and H12 are rejected, each at (local) level a. Conversely, H2 is rejected if both H2 and H12 are rejected. I f H12 is not rejected at first place, further testing is unnecessary; otherwise, if Hi (say) is rejected, but H12 is not, this 2

GENERAL

CONCEPTS

w o u l d lead t o i n t e r p r e t a t i o n problems (coherence p r o p e r t y , see Section 2.1.2). T h i s c o n s t r u c t i o n m e t h o d is the key idea of the closure p r i n c i p l e .

F i g u r e 2.1

Visualization of two null hypotheses Hi and H2 and their intersection H12 in the parameter space M . 2

There are a l t e r n a t i v e possibilities of visualizing t h e closure p r i n c i p l e t h a n shown i n F i g u r e 2.1. I n F i g u r e 2.2 we provide a V e n n - t y p e d i a g r a m for t w o n u l l hypotheses Hi and H and their intersection H12. I n F i g u r e 2.3 we show a related schematic d i a g r a m , w h i c h provides a convenient way of v i s u a l i z i n g t h e test dependencies a m o n g the hypotheses. T h e intersection hypothesis H is shown at t h e t o p , w h i l e t h e t w o elementary hypotheses Hi and H are shown at t h e b o t t o m of t h e d i a g r a m . Testing occurs i n a " t o p - d o w n " fashion. As described above, H12 is tested w i t h a u n i o n intersection test at level a. I f Hi is n o t rejected, further t e s t i n g is unnecessary. Otherwise, Hi and H are each tested at level a. F i n a l l y , Hi is rejected (while c o n t r o l l i n g t h e f a m i l y w i s e error rate strongly at level a) i f Hi and Hi are b o t h locally rejected. A similar decision rule holds also for H2. 2

We now consider the case of testing any n u m b e r m of n u l l hypotheses Hi,i = 1 , . . . , m (such as c o m p a r i n g m treatments w i t h a c o n t r o l ) . T h e c l o sure p r i n c i p l e considers a l l intersection hypotheses constructed f r o m t h e i n i t i a l hypotheses set. Each intersection hypothesis is tested at local level a. Note t h a t we can specify any a-level u n i o n intersection test for the intersection

CONSTRUCTION METHODS

F i g u r e 2.2

Visualization of two null hypotheses Hi and H2 and their intersection i f 12 using a Venn-type diagram.

H12 â&#x20AC;&#x201D; Hi n H2

F i g u r e 2.3

Schematic diagram of the closure principle for two null hypotheses Hi and H2 and their intersection.

hypotheses. I n particular, different tests can be used for different hypotheses. For the final inference, an elementary null hypothesis Hi is rejected if and only if all intersection hypotheses implying Hi are rejected by their individual tests at local level a , too. I t can be shown that the above procedure controls the familywise error rate strongly at level a. (Marcus et a l . 1976). Following the example below, a proof is sketched. Multiple comparison procedures based on the closure principle are called

GENERAL

CONCEPTS

below. Operationally, closed test procedures are p e r -

closed test procedures formed as follows:

(i) Define a set H = {Hi,...,

} of elementary hypotheses.

(ii) C o n s t r u c t the closure set H = I Hi = f]Hi

: / c {l,...,m},H

/ 0

i€l

of a l l n o n - e m p t y intersection hypotheses ( i i i ) For each intersection hypothesis Hj test.

Hi.

€ H find a suitable (local) a-level

(iv) Reject Hi, i f a l l hypotheses Hi € Ti w i t h i E I are rejected, each at (local) level a. I f a closed test procedure is performed, adjusted p-values qi for the n u l l h y potheses Hi are c o m p u t e d as follows. A n elementary n u l l hypothesis Hi is o n l y rejected, i f a l l Hi £ Ti w i t h i € I are rejected, a n d the m a x i m u m p-value f r o m this set needs t o be less t h a n or equal t o a . L e t pi denote t h e p-value for a given intersection hypothesis Hi, I C { 1 , . . . , m } . T h e n t h e adjusted p-value for Hi is f o r m a l l y defined as qi — m a x p / ,

i = 1 , . . . , m.

(2-2)

Example 2.1. We provide a n example of the closure p r i n c i p l e w i t h m = 3 hypotheses i n F i g u r e 2.4. Accordingly, we have three hypotheses of interest i n the f a m i l y Ti = {Hi, H , H } . These three hypotheses could involve, for e x a m ple, t h e comparison of three t r e a t m e n t s w i t h a c o n t r o l , a l t h o u g h t h e f o l l o w i n g considerations are generic. T h i s completes step 1 f r o m above. For step 2, we need t o consider a l l intersection hypotheses Hi = f] Hi, I C { l , . . . , m } . I n this example, m = 3 a n d four a d d i t i o n a l intersection hypotheses have t o be considered t o o b t a i n t h e closed hypotheses set. Specifically, t h e f u l l c l o sure H — {Hi,H Hs Hi2-> H , H 3, Hi } contains a l l seven intersection h y potheses, where Hij = Hi D Hj, 1 < i < 3, a n d H123 = Hi n H n H3 is the global n u l l hypothesis. N o t e t h a t Ti is closed under intersection. T h a t is, any intersection Hi n Hi' of some Hi, Hi' 6 H is contained i n Tt. For step 3, we assume suitable (local) a-level tests for each of the seven intersection hypotheses. A simple approach is t o use the B o n f e r r o n i test, w h i c h is f o r m a l l y i n t r o d u c e d i n Section 2.3. F i n a l l y , according t o step 4, we reject Hi (while c o n t r o l l i n g the f a m i l y w i s e error rate strongly a t level a) i f - H 1 2 3 , H\ , H13 a n d Hi are a l l rejected by t h e i r a-level tests. Similarly, we reject H i f Hi %, Hi , H$ a n d H are a l l rejected, a n d we reject H i f Hi , H13, H a n d # 3 are a l l rejected. I n F i g u r e 2.4, the arrows l i n k i n g t w o hypotheses at a t i m e reflect t h a t t h e hypotheses are nested, where the smallest hypothesis ^123 is d r a w n on t o p . For example, t h e a r r o w between Hi and Hi w i t h Hi d r a w n one level below Hi indicates t h a t Hi 3 Hi and H% can o n l y be rejected i f Hi is also rejected (coherence p r o p e r t y , see Section 2.1.2). • 2

ieI

}

CONSTRUCTION METHODS

T h e closure principle can be seen to control the familywise error rate in the example above as follows. F i r s t , recall that the familywise error rate is the probability of rejecting at least one null hypothesis incorrectly. We wish to control this probability at level a , regardless of which set Mq C M = { 1 , 2 , 3 } of null hypotheses happens to be true. Suppose, i n the example above, that H12 happens to be true. T h e n , Mo = { 1 , 2 } and a T y p e I error occurs if either Hi or H2 is rejected. I f the closure principle is used, Hi is rejected if and only if H123, #12> # 1 3 and Hi are all rejected. Note that the set of experiments for which # 1 2 3 , H12, His and Hi are all rejected is a subset of the set of experiments for which H12 is rejected. Mathematically, {Hi rejected using closure} = { # 1 2 3 rejected} fl { i f 12 rejected} H {His rejected} Pi {Hi rejected} C {H12 rejected}. Similarly, H2 is rejected if and only if i?i23> # 1 2 , # 2 3 and H2 are all rejected. Note that the set of experiments for which i?i23> H12, H23 a n d H2 are all r e jected is also a subset of the set of experiments for which H12 is rejected. M a t h ematically, {H2 rejected using closure} = { # 1 2 3 rejected} Pi {H12 rejected} fl {if23 rejected} H {H2 rejected} C { i f i 2 rejected}. Since both {Hi

rejected using closure} C { # 1 2 rejected}

{H2

rejected using closure} C {H12 rejected},

and

GENERAL

CONCEPTS

i t follows t h a t {{Hi

rejected using closure} U {H C {H12

rejected u s i n g c l o s u r e } }

rejected}

a n d therefore t h a t F({Hi

rejected using closure} U {H

rejected using closure})

< P ( # 1 2 rejected) = a. Hence the m e t h o d controls the familywise error rate a t level less t h a n or equal t o a w h e n t h e t r u e n u l l state of n a t u r e is # 1 2 - A n analogous a r g u m e n t shows t h a t we have f a m i l y w i s e error rate c o n t r o l when the t r u e state of n a t u r e is # 1 3 , H23, or H123. T h i s argument also clarifies w h y a l l intersection hypotheses have t o be tested: One needs t o protect oneself against every possible c o m b i n a t i o n of t r u e a n d false nulls, i f one wishes t o c o n t r o l the familywise error r a t e i n t h e s t r o n g sense. T h e closure p r i n c i p l e is a flexible construction m e t h o d w h i c h can be t a i l o r e d t o a v a r i e t y of applications. I n p a r t i c u l a r , i t provides a large degree of flexib i l i t y t o m a p t h e difference i n i m p o r t a n c e as well as t h e relationship between the various s t u d y objectives onto an adequate m u l t i p l e test procedure. M a n y c o m m o n m u l t i p l e comparison procedures are i n fact closed test procedures, such as the step-down procedures of H o l m (1979a), Shaffer (1986), a n d Westfall a n d Y o u n g (1993), fixed sequence tests ( M a u r e r , H o t h o r n , a n d Lehmacher 1995; W e s t f a l l a n d K r i s h e n 2001), fallback procedures (Wiens 2003; W i e n s a n d D m i t r i e n k o 2005) a n d gatekeeping procedures (Bauer, R o h m e l , M a u r e r , a n d H o t h o r n 1998; W e s t f a l l a n d K r i s h e n 2001; D m i t r i e n k o , Offen, a n d W e s t f a l l 2003). One disadvantage of the closure p r i n c i p l e is t h a t related simultaneous confidence intervals for t h e parameters of interest &i are difficult t o derive a n d often have l i m i t e d p r a c t i c a l use; see Strassburger a n d B r e t z (2008) for a recent discussion. N o t e t h a t for the closure p r i n c i p l e t h e number of operations is i n general of order 2 , where m is t h e number of hypotheses of interest. I t is o f t e n useful t o find shortcut procedures t h a t can reduce the n u m b e r of operations t o the order of m i n t h e best case scenario (Grechanovsky a n d Hochberg 1999). T h e a i m of s h o r t c u t procedures is t o reach decisions for t h e elementary hypotheses, b u t n o t necessarily for the entire closure test. B y c o n s t r u c t i o n , t h e decisions r e s u l t i n g f r o m a s h o r t c u t procedure coincide w i t h those of a closed test p r o cedure. S h o r t c u t procedures thus reduce the c o m p u t a t i o n a l d e m a n d , w h i c h can be s u b s t a n t i a l for large numbers m of hypotheses a n d / o r i f r e s a m p l i n g based tests are used together w i t h the closure p r i n c i p l e . We refer t o R o m a n o a n d W o l f (2005), H o m m e l , B r e t z , and M a u r e r (2007), W e s t f a l l a n d Tobias (2007), a n d B r a n n a t h a n d B r e t z (2010) for further details. Related g r a p h ical approaches for sequentially rejective closed test procedures have been described by B r e t z , M a u r e r , B r a n n a t h , and Posch (2009b); B r e t z et a l . (2010) a n d B u r m a n , Sonesson, a n d G u i l b a u d (2009). S h o r t c u t procedures w i l l b e come essential i n Chapters 3 and 4, where we use t h e m u l t c o m p package i n m

CONSTRUCTION METHODS

R, which implements efficient methods to reduce the complexity of the u n d e r lying closed test procedures. Related details are described i n Sections 4.1.2 and 4.2.2, where we consider closed test procedures based on max-t tests in the form of E q u a t i o n (2.1) for hypotheses satisfying the free and restricted combination condition, respectively. 2.2.4

Partitioning

principle

T h e partitioning principle was formally introduced by F i n n e r and Strassburger (2002), w i t h early ideas dating back to Takeuchi (1973, 2010), Stefansson et al. (1988) and Hayter and H s u (1994). Using the partitioning principle, one can derive powerful multiple test procedures and simultaneous confidence i n t e r vals, which otherwise would be difficult to obtain when applying the closure principle. T h e key idea of the partition principle is to partition the parameter space into disjoint subsets and test each partition element w i t h a suitable test at level a. Because the partition elements are disjoint to each other, only one of them contains the true parameter vector and c a n lead to a T y p e I error. Hence, test procedures based on the partitioning principle strongly control the familywise error rate at level a. T o motivate the partitioning principle, we consider again the problem from Section 2.2.3 of comparing two treatments w i t h a control. A s before, let $i = fj,i — /xo» i = 1» 2, denote the parameters of interest. L e t further 0 = 1R denote the parameter space w i t h 0 = (0i,02) € 0 . Recall the two hypotheses of interest, Hi = {0 € R : 0i < 0 } , i = 1,2, and let Ki = 0 \ Hi denote the associated alternative hypotheses. Figure 2.1 visualizes the hypotheses Hi and H2 as subsets of the real plane (the parameter space). We now partition the parameter space 0 into the following sets, see also Figure 2.5: O i = Hi, 0 = H O Ki, and 0 3 = Ki n K . Since © * , £ = 1,2,3, are disjoint and 0 i U 0 2 U 0 3 = 0 , these sets constitute a partition of the parameter space 0 . T h e true parameter vector 0 thus lies in one and only one of the disjoint subsets Oi. Applying tests at (local) level a to each of these subsets therefore leads to a multiple test procedure which controls the familywise error rate in the strong sense at level a. Note that the resulting test procedure is at least as powerful as the corresponding closed test procedure based on Hi and H , because it is sufficient to control the T y p e I error rate over 0 i = Hi and the smaller subspace 0 2 £ # 2 - I n addition, a confidence set for the parameter vector 0 is obtained by intersecting the complementary regions of those hypotheses, which have been rejected. 2

Note that the interpretation of 0 i is the same as of Hi ("treatment 1 is not better than the control"). However, the interpretation of 0 2 ("treatment 2 is not better t h a n the control, but treatment 1 is") has changed as compared with that of H ("treatment 2 is not better than the control"). T h e r e are many possibilities for partitioning the parameter space and it is not always clear which partition to apply for a given test problem. I n Figure 2.5 we illustrate one example of partitioning the real plane (and which has a straightforward 2

GENERAL CONCEPTS

i n t e r p r e t a t i o n , as described above). O t h e r p a r t i t i o n s m a y give similar or even a d d i t i o n a l i n f o r m a t i o n about the parameter vector 0.

F i g u r e 2.5

Partitioning principle for two null hypotheses Hi and H2.

T h e previous example illustrates the basic p a r t i t i o n i n g p r i n c i p l e for t w o hypotheses. I n the general case of m hypotheses Hi,..., Hm, the p a r t i t i o n i n g principle can be performed o p e r a t i o n a l l y as follows: (i) Choose a n a p p r o p r i a t e p a r t i t i o n { © £ : £ G L} of t h e parameter space for some index set L .

(ii) Test each ©^ w i t h a n a-level test. (iii) Reject the n u l l hypothesis Hi i f a l l ©^ w i t h ©^ n Hi =£ 0 are rejected. (iv) T h e u n i o n of a l l r e t a i n e d ©^ c o n s t i t u t e a confidence set for 9 a t level 1 - a. For extensions of t h e basic p a r t i t i o n i n g p r i n c i p l e we refer t o F i n n e r a n d S t r a s s burger (2002). A p p l i c a t i o n s of the p a r t i t i o n i n g p r i n c i p l e have been i n v e s t i gated for a number of test problems, such as dose response analyses ( H s u a n d Berger 1999; B r e t z , H o t h o r n , a n d H s u 2003; L i u , H s u , and R u b e r g 2007b; Strassburger, B r e t z , a n d F i n n e r 2007), m u l t i p l e outcome analyses ( L i u and H s u 2009), intersection u n i o n tests (Strassburger et a l . 2004), equivalence tests ( F i n n e r , G i a n i , a n d Strassburger 2006), and simultaneous confidence i n -

METHODS BASED ON BONFERRONI'S I N E Q U A L I T Y

tervals for step-up and step-down procedures (Finner and Strassburger 2006; Strassburger and Bretz 2008). 2.3

Methods based on Bonferroni's inequality

T h e Bonferroni method is probably the best known multiplicity adjustment. Although more powerful multiple comparison procedures exist, the B o n f e r roni test continues to be very popular because of its simplicity and its mild assumptions. I n this section we describe the Bonferroni test, a stepwise e x tension (Holm 1979a) and discuss further topics related to these methods. We also describe briefly some software implementations i n R, leaving the details for Section 3.3. 2.3.1

Bonferroni

test

T h e Bonferroni test is a single-step procedure, which compares the unadjusted p-values p i , . . . , p w i t h the common threshold a / m , where m is the number of hypotheses under investigation. Equivalently, a null hypothesis Hi,i € M , is rejected, if the adjusted p-value qi — m i n { m p i , 1} < a. Here, the minimum is used to ensure that the resulting adjusted p-value qi is not larger t h a n 1. T h e strong control of the familywise error rate follows directly from Bonferroni's inequality m

W>(V > 0) = P ( |J {qi < a } J < Vie M / 0

^2 ( S * ^ i€M P

) ^ ™a/m 0

< a,

(2.3)

where the probability expressions are conditional on DieMb H% and Mq C M denotes the set of mo = \Mq\ true null hypotheses (recall Table 2.1 for the related notation). Example 2.2. Consider the following numerical example w i t h m = 3 null hypotheses being tested at level a = 0.025. L e t p i = 0.01, P2 = 0.015, and P3 = 0.005 denote the unadjusted p-values. Because P3 < a / 3 = 0.0083, but p i , P 2 > a / 3 , only the null hypothesis Hs is rejected. Alternatively, the adjusted p-values qi = 0.03, q — 0.045, and q$ — 0.015 can be compared w i t h a = 0.025, leading to the same test decisions. • A convenient way to perform the Bonferroni test i n R is to call the p . a d j u s t function from the s t a t s package. Its use is self-explanantory. T o calculate the adjusted p-values qi i n the previous example, we first define a vector c o n t a i n ing the unadjusted p-values and subsequently call the p . a d j u s t function as follows: 2

R> p < - c C O . O l , 0 . 0 1 5 , 0 . 0 0 5 ) R> p . a d j u s t ( p , "bonferroni") [1]

0.030

0.045

0.015

I n Section 3.3 we provide a detailed description of the m u l t c o m p package i n

GENERAL CONCEPTS

R, w h i c h provides a n interface t o the m u l t i p l i c i t y adjustments i m p l e m e n t e d i n the p . ad j u s t f u n c t i o n and also allows one to p e r f o r m more sophisticated m u l t i p l e comparison procedures. T h e B o n f e r r o n i m e t h o d is a very general approach, w h i c h is v a l i d for any correlation s t r u c t u r e among the test statistics. However, the B o n f e r r o n i test is conservative i n t h e sense t h a t other test procedures exist, w h i c h reject at least as m a n y hypotheses as t h e Bonferroni test (often a t t h e cost of a d d i t i o n a l assumptions o n t h e j o i n t d i s t r i b u t i o n of the test statistics). I n t h e f o l l o w i n g we describe some improvements or generalizations of the B o n f e r r o n i approach.

2.3.2 Holm procedure H o l m (1979a) i n t r o d u c e d a m u l t i p l e comparison procedure, w h i c h u n i f o r m l y improves the B o n f e r r o n i approach. T h e Holm procedure is a step-down p r o c e dure, w h i c h basically consists of repeatedly a p p l y i n g B o n f e r r o n i ' s i n e q u a l i t y w h i l e t e s t i n g the hypotheses i n a data-dependent order. L e t p^ < . .. < p(m) denote the ordered unadjusted p-values w i t h associated hypotheses # ( i ) , - - -, i f ) . T h e n , H^ is rejected i f < a/(m -j + 1)J = 1 , . . ., z. T h a t is, is rejected i f p^ < a/{m — i + i) and a l l hypotheses H(j\ preceding Hn\ are also rejected. A d j u s t e d p-values for the H o l m procedure are given b y ( m

= m i n { l , m a x [ ( m - i + l)P(i)>

(i)

(2.4)

A l t e r n a t i v e l y , the H o l m procedure can be described b y t h e f o l l o w i n g sequent i a l l y rejective test procedure. S t a r t testing the n u l l hypothesis # ( i ) associated w i t h t h e smallest p-value p ( i ) . I f P(i) > a / m , the procedure stops a n d no h y pothesis is rejected. Otherwise, # ( i ) is rejected a n d t h e procedure continues testing # ( 2 ) a-t the larger significance level a/(m— 1). These steps are repeated u n t i l either t h e first non-rejection occurs or a l l n u l l hypotheses # ( i ) , • - -, - # ( ) are rejected. m

Example 2.3. Consider again the p-values f r o m E x a m p l e 2.2. Since p ( i ) = 0.005 < 0.0083 = a / 3 , H = H is rejected a t a = 0.025. A t t h e second step, P(2) = 0.01 < 0.0125 = a/2 a n d H( ) = Hi is also rejected: A t t h e f i n a l step, P ( 3 ) = 0.015 < 0.025 = a a n d H^ = H is also rejected. A l t e r n a t i v e l y , t h e adjusted p-values are calculated as = 0.015, <j(2) = 0.02, a n d q^ = 0.02, w h i c h are a l l smaller t h a n a = 0.025 a n d thus lead t o t h e same test decisions. U s i n g t h e p . a d j u s t f u n c t i o n i n t r o d u c e d i n Section 2.3.1, we o b t a i n these adjusted p-values by c a l l i n g (1)

R> p . a d j u s t ( p , [1J

0.020

"holm") 0.

015

N o t e t h a t t?( ) = 0.02 (and not 0.015, as one m i g h t expect) due t o t h e monot o n i c i t y enforcement induced b y t h e m a x i m u m argument i n (2.4). • I t becomes clear f r o m t h i s example t h a t t h e H o l m procedure is a step-down procedure, w h i c h by c o n s t r u c t i o n rejects a l l hypotheses rejected by t h e B o n f e r r o n i test and possibly others. We now give a different perspective on t h e H o l m 3

METHODS B A S E D ON BONFERRONI'S I N E Q U A L I T Y

procedure by considering the closure principle from Section 2.2.3. For null h y potheses Hi,..., H satisfying the free combination condition (Section 2.1.2), the Holm procedure is a shortcut of the closure principle when applying the Bonferroni test to each intersection hypothesis Hi = C\ Hi, I Q {1,..., m} (Holm 1979a). T h a t is, the Holm procedure leads to the same decisions for the elementary hypotheses Hi,...,H as a Bonferroni-based closed test p r o cedure. T o see this, consider again the example from Section 2.2.3 comparing two treatments with a control. T h e related closure principle for the two null hypotheses Hi and H is shown i n Figure 2.3. Assume now that the B o n f e r roni method is used to test the intersection hypothesis H12 = Hi D H , that is, H12 is rejected if m i n ( p i , p 2 ) < a/2. I f H12 is rejected, then one of the elementary hypotheses Hi and H2 is immediately rejected as well and only the remaining hypothesis needs to be tested at level a: If, say, pi < a/2, then trivially pi < a and Hi is rejected; H2 remains to be tested at level a. More generally, if Hi = f) Hi,I C M, is rejected, then there exists a n index i* € / such that pi» < a/\I\ and all hypotheses Hj w i t h i* € J Q I are also rejected, because \ J\ < \I\ and consequently Pi* < a/\I\ < a/\J\. Because the Holm procedure is based on the conservative Bonferroni i n equality (2.3), more powerful step-down procedures c a n be obtained by a c counting for the stochastic dependencies between the test statistics. I n S e c tion 4.1.2 we discuss a parametric extension of the H o l m procedure based on max-* tests i n the form (2.1). We also show how to implement these methods using R. m

i€f

ieI

#123

His

H12

F i g u r e 2.6

#23

Schematic diagram of the closure principle for the three null h y p o t h e ses Hij: in — p.j yl < i < j < 3.

I f the null hypotheses Hi,.. .,H satisfy the restricted combination c o n dition, however, the H o l m procedure is still applicable, but conservative. T o see this, consider all pairwise comparisons of three treatments means / i i , / 4 2 , and fjL , resulting in m = 3 null hypotheses Hij : m = fij. T h e related c l o sure principle for the three null hypotheses # i 2 ? # i 3 > and H23 is shown i n Figure 2.6, where H123: Hi = fJ-2 = M3 is e only non-trivial intersection m

t n

GENERAL CONCEPTS

hypothesis. A p p l y i n g t h e B o n f e r r o n i test leads t o t h e rejection of # 1 2 3 i f m i n ( p i , p 2 , p 3 ) < a / 3 . Once # 1 2 3 has been rejected, one of t h e elementary hypotheses is i m m e d i a t e l y rejected as well a n d o n l y t h e t w o r e m a i n i n g h y potheses need t o be tested a t level a . N o t e t h a t for t h e ordered set of p-values (p(i)>P(2)>P(3)) t h e H o l m procedure w o u l d a p p l y t h e significance thresholds ( a / 3 , a / 2 , a ) , whereas t h e thresholds ( a / 3 , a , a ) w o u l d suffice, as seen f r o m t h i s example. N o t e also t h a t i n some cases t h e H o l m procedure m a y n o t be c o herent under t h e r e s t r i c t e d c o m b i n a t i o n c o n d i t i o n ( H o m m e l a n d B r e t z 2008). Shaffer (1986) extended t h e H o l m procedure, u t i l i z i n g logical r e s t r i c t i o n s t o improve t h e power of closed tests. Her m e t h o d is a " t r u n c a t e d " closed test p r o cedure: Closed t e s t i n g is p e r f o r m e d , i n sequence #(2)1 • • • corresponding t o t h e ordered p-values, a n d t e s t i n g stops a t t h e first insignificance. T r u n c a t i o n ensures t h a t H^>) cannot be rejected i f is n o t rejected, for i' > i. W i t h o u t t r u n c a t i o n , a closed test procedure can result i n t h e possibly u n d e sirable outcome t h a t H^>) is rejected and H^ is n o t rejected, for i' > i; see B e r g m a n n a n d H o m m e l (1988), H o m m e l a n d B e r n h a r d (1999), a n d W e s t f a l l a n d Tobias (2007) for details. W h e n the tests satisfy t h e free combinations c o n d i t i o n , a l l these procedures reduce t o t h e o r d i n a r y H o l m procedure. A l t h o u g h based o n t h e conservative B o n f e r r o n i inequality, Shaffer's m e t h o d can be m u c h more powerful t h a n s t a n d a r d methods w h e n c o m b i n a t i o n s are r e s t r i c t e d . I n Section 4.2.2, we discuss a p a r a m e t r i c extension of Shaffer's m e t h o d (called t h e extended Shaffer-Royen procedure by W e s t f a l l a n d Tobias (2007)) t h a t is yet more p o w e r f u l t h a n Shaffer's m e t h o d . T h e increase i n power comes f r o m using specific dependence i n f o r m a t i o n r a t h e r t h a n t h e conservative B o n f e r r o n i inequality. We also show how t o i m p l e m e n t t h i s p o w e r f u l m e t h o d using t h e m u l t c o m p package i n R.

2.3.3 Further

topics

I n this section we describe a d d i t i o n a l extensions of t h e procedures by B o n f e r r o n i a n d H o l m . Simultaneous confidence intervals c o m p a t i b l e w i t h t h e B o n f e r r o n i test are easily o b t a i n e d by c o m p u t i n g t h e m a r g i n a l confidence intervals at the adjusted significance levels a/m. For the H o l m procedure, such confidence intervals are more difficult t o construct (Section 2.1.2). W e refer t o S t r a s s burger a n d B r e t z (2008) a n d G u i l b a u d (2008), w h o independently applied the p a r t i t i o n i n g p r i n c i p l e f r o m Section 2.2.4 t o derive c o m p a t i b l e s i m u l t a n e ous confidence intervals for t h e H o l m procedure a n d o t h e r Bonferroni-based closed test procedures. I f t h e test statistics were independent, t h e f a m i l y w i s e error r a t e w o u l d b e come F W E R = P ( V > 0) - 1 - F(V = 0) = 1 - (1 a) . m

T h i s gives hypothesis qi = 1 — correlated

t h e m o t i v a t i o n for t h e Sidak (1967) approach, w h i c h rejects a n u l l Hi, i f pi < 1 — (1 — a ) , or, equivalently, i f t h e adjusted p-value (1 — pi) < a . T h e Sidak approach also holds for non-negatively test statistics ( T o n g 1980). N o t e t h a t t h e Sidak approach is more 1 / / m

771

METHODS B A S E D ON SIMES' I N E Q U A L I T Y

powerful t h a n the Bonferroni test, although the gain i n power is marginal in practice. A further modification of the Bonferroni method, which has attracted substantial research interest, is based on the Simes inequality described i n S e c tion 2.4. Further improvements of the Bonferroni method, which incorporate stochastic dependencies i n the data, are parametric approaches described i n Chapter 3 and resampling-based approaches reviewed i n Section 5.1. I t follows from the inequality (2.3) that the Bonferroni test controls the f a m ilywise error rate strongly at level a or less, that is, F W E R < moa/m < a. I f the number mo of true null hypotheses was known, more powerful procedures could be obtained by applying the Bonferroni test at level moa/m instead of a . Several methods are available to estimate mo and five of them were c o m pared i n Hsueh, C h e n , and K o d e l l (2003). T h e authors concluded that the approach proposed by Benjamini and Hochberg (2000) gives satisfactory e m pirical results. T h e latter considered the slopes of the lines passing the points ( m + 1,1) and ( i , P ( i ) ) and took the lowest slope estimator to approximate mo- More recently, F i n n e r and Gontscharuk (2009) derived conditions which ensure familywise error rate control of Bonferroni-based test procedures using plug-in estimates similar to those proposed by Schweder and Spj0tvoll (1982) and Storey (2002). Westfall, Kropf, and Finos (2004) described weighted methods that are u s e ful when some hypotheses Hi are deemed more important t h a n others. For example, i n clinical trials the various patient outcomes might be ranked a priori, and the test procedure designed to give more power to the more i m portant hypotheses. T h e simplest weighted multiple comparison procedure is the weighted Bonferroni test, discussed i n , for example, Rosenthal and R u bin (1983). T h e weighted Bonferroni test divides the overall significance level a into portions w\a,..., w <*> such that £ mentary hypothesis Hi is rejected if pi < w\a. Holm (1979a) introduced the following weighted test procedure, which extends the weighted Bonferroni test described above. Order the weighted p-values pi = Pi/u)i as < ... < P( ), where p^) = pi (that is, ij denotes the index of the j - t h ordered weighted p-value). Define the sets Sj = {ij,..., i } > j = 1 , . . . , m . L e t H^ denote the hypothesis corresponding to p(j). T h e weighted Holm procedure rejects Hfy, if P(i) < a/ YlheSi h> f ° all i = 1 , . . . ,j. T h i s method controls the familywise error rate strongly at level a , and when the weights are equal, it reduces to the ordinary Holm procedure. Closed test procedures based on weighted B o n f e r roni tests have recently attracted much attention for the analysis of multiple outcome variables; see Dmitrienko et al. (2003), Hommel et al. (2007), and Bretz et al. (2009b). ?

}

2.4

Methods based on Simes

inequality

Simes (1986) proposed the following modification of the Bonferroni method to test the global intersection hypothesis H = OieM - ^ again p(i) < • • •< P(m) denote the ordered unadjusted p-values w i t h associated hypotheses et

GENERAL

CONCEPTS

- # " ( ! ) , . . . , i / ( ) . Using the Simes test, one rejects H i f there is an index j e M = { l , . . . , m } , such t h a t p^-) ^ joi/m. However, one cannot assess t h e elementary hypotheses H i w i t h t h e Simes test. I n p a r t i c u l a r , one cannot reject .fffi) i f p(j) < ia/m for some i € A f , because the familywise error rate is n o t controlled i n t h i s case ( H o m m e l 1988). B y c o n s t r u c t i o n , t h e Simes test is more powerful t h a n t h e B o n f e r r o n i test i n t h e sense t h a t whenever H is rejected by the l a t t e r i t w i l l also be rejected by t h e former, b u t n o t vice versa. Figure 2.7 compares t h e rejection regions of t h e B o n f e r r o n i and the Simes tests for m — 2. Recall t h a t w h e n m — 2 t h e B o n f e r r o n i test rejects the global intersection hypothesis H i f either p i < a / 2 or p2 < a / 2 . T h e Simes test rejects H i f either < a/2 or p^ ) < - As seen f r o m F i g u r e 2.7, the Simes test "adds" the square [ a / 2 , a] x [ a / 2 , a] t o the rejection region of the B o n f e r r o n i test and is therefore more powerful. m

I 2 F i g u r e 2.7

Rejection regions for the Bonferroni test (horizontal lines) and the Simes test (horizontal and vertical lines).

To i l l u s t r a t e t h e i m p a c t on the (disjunctive) power, F i g u r e 2.8 displays 100

METHODS B A S E D ON SIMES' I N E Q U A L I T Y

simulated pairs of independent p-values ( p i , P 2 ) under a given alternative. S e t ting a = 0.1, the Bonferroni test rejects i n those 88 cases, where in Figure 2.8 the dots lie in the lower left " L - s h a p e d " region { p i < a / 2 } U { p 2 < a / 2 } . T h e Simes test rejects i n one additional case, where ( p i , P 2 ) = (0.0807,0.0602), confirming its slight power advantage. I n the remaining 11 cases neither the Bonferroni nor the Simes test rejects.

CO O

o Q.

CM O

o o

o gÂŠ

o o

0.0

0.2

0.4

0.6

0.8

1.0

Pi F i g u r e 2.8

Illustration of the power difference for the Bonferroni and Simes tests by p l o t t i n g 100 pairs of independent p-values (pi,P2).

Note that the power advantage of the Simes test comes at the cost of a d ditional assumptions on the underlying dependency structure. Simes (1986) proved that for his test F W E R = a under independence of p i , . . . , p - I f the independence assumption is not met, the T y p e I error rate control is not a l ways clear. I n many practically relevant cases the inflation in size is marginal (Samuel-Cahn 1996), although pathological counter-examples exist, where the impact can be substantial (Hommel 1983). For one-sided tests, it can be shown m

GENERAL

CONCEPTS

t h a t t h e Simes test controls t h e f a m i l y w i s e error r a t e i f , for e x a m p l e , t h e test statistics are m u l t i v a r i a t e n o r m a l l y distributed w i t h non-negative correlations or m u l t i v a r i a t e t d i s t r i b u t e d w i t h t h e associated n o r m a l s h a v i n g n o n - n e g a t i v e c o r r e l a t i o n s , as l o n g as a < 1/2. W e r e f e r t o S a r k a r a n d C h a n g (1997) a n d S a r k a r (1998) f o r f u r t h e r r e s u l t s . H o c h b e r g (1988) p r o p o s e d a s t e p - u p e x t e n s i o n o f t h e S i m e s t e s t , w h i c h a l l o w s o n e t o m a k e i n f e r e n c e s a b o u t t h e e l e m e n t a r y n u l l h y p o t h e s e s Hi,i G M. T h e H o c h b e r g p r o c e d u r e c a n b e seen as a r e v e r s e d H o l m p r o c e d u r e , s i n c e i t uses t h e s a m e c r i t i c a l v a l u e s , b u t i n a r e v e r s e d t e s t sequence: is r e j e c t e d i f t h e r e i s a j = i , . . . , m , s u c h t h a t p^ < a/(m — j + 1). A d j u s t e d p - v a l u e s for t h e H o c h b e r g procedure are given b y

(i)

= min{l,min[(m - i +

l)p ,q ]}. (i)

(i+1)

A l t e r n a t i v e l y , t h e H o c h b e r g procedure can be described b y t h e f o l l o w i n g s e q u e n t i a l l y rejective test procedure. S t a r t testing t h e n u l l hypothesis # ( ) a s s o c i a t e d w i t h t h e l a r g e s t p - v a l u e P ( ) . I f P(m) < ot, t h e p r o c e d u r e s t o p s a n d a l l h y p o t h e s i s H(i),..., ffcpt) rejected. O t h e r w i s e , i ? ( ) is r e t a i n e d a n d t h e procedure continues testing # ( _ i ) a t t h e smaller significance level a / 2 . I f P ( - i ) < a / 2 , t h e procedure stops a n d a l l hypothesis -ff(i), • • - , - £ f ( _ i ) are rejected. T h e s e steps are r e p e a t e d u n t i l e i t h e r t h e first r e j e c t i o n occurs o r a l l n u l l hypotheses . . . jH/ ) retained. B y construction, the Hochberg m

p r o c e d u r e is m o r e p o w e r f u l t h a n t h e H o l m p r o c e d u r e . Example 2.4Consider t h e following numerical example w i t h m = 4 null h y p o t h e s e s t e s t e d a t t h e s i g n i f i c a n c e l e v e l a = 0.05. L e t p i = 0.022, p2 = 0.02, P3 = 0.01, a n d p4 = 0.09 b e t h e u n a d j u s t e d p - v a l u e s . C o n s i d e r f i r s t t h e H o l m p r o c e d u r e . S i n c e p ( i ) = p3 = 0.01 < 0.0125 = a / 4 , = Hs is r e j e c t e d . A t t h e s e c o n d s t e p , p ( ) = p2 = 0.02 > 0.0167 = a / 3 a n d t h e H o l m procedure stops w i t h no f u r t h e r rejections. T h e H o c h b e r g procedure, however, s t a r t s w i t h t h e largest p-value, steps t h r o u g h t h e o r d e r e d p-values, a n d stops w i t h t h e f i r s t s i g n i f i c a n t r e s u l t . B e c a u s e p ( ) = p = 0.09 > 0.05 = a b u t P(3) — Pi 0.022 < 0.025 = a / 2 , t h e H o c h b e r g p r o c e d u r e r e j e c t s H\,H^ a n d Hz b u t n o t H( ) — H4. T a b l e 2.2 s u m m a r i z e s t h i s e x a m p l e a n d g i v e s t h e associated adjusted p-values. • 2

H o m m e l (1988) i n t r o d u c e d a n i m p r o v e d p r o c e d u r e b y a p p l y i n g t h e S i m e s t e s t t o e a c h i n t e r s e c t i o n h y p o t h e s i s o f a closed t e s t p r o c e d u r e . T h i s p r o c e d u r e can be shown t o be u n i f o r m l y more powerful t h a n t h e Hochberg procedure ( H o m m e l 1989). A s a n e x a m p l e c o n s i d e r t h e case m = 3 a n d a s s u m e t h a t Pi < P2 < P3- T h e H o c h b e r g p r o c e d u r e t h e n r e j e c t s Hi i f a n y o f t h e e v e n t s

{P3 < ct}

or or

{ p 2 < a / 2 a n d pi < a / 2 } {pi < a/3}

METHODS B A S E D ON SIMES' I N E Q U A L I T Y Decision

Threshold

Adjusted p-value

P(<)

o / ( 4 - i + 1)

Holm

Hochberg

Holm

Hochberg

1 2 3 4

0.01 0.02 0.022 0.09

0.0125 0.0167 0.025 0.05

Rej. N.R. N.R. N.R.

Rej. Rej. Rej. N.R.

0.04 0.06 0.06 0.09

0.04 0.044 0.044 0.09

T a b l e 2.2

Comparison of the Holm and Hochberg procedures for m = 4 hypotheses and a = 0.05. Rej. = rejection, N.R. = no rejection.

is true. I n contrast, the procedure by Hommel rejects Hi i f any of the events or or

{P3 < ot} { p 2 < 2 a / 3 and p i < a / 2 } {pi < a / 3 }

is true. T h u s , if a / 2 < P2 < 2 a / 3 and p i < a / 2 , the Hommel procedure rejects Hi, but the Hochberg procedure does not. B o t h the Hochberg a n d the Hommel procedure are available i n R with the p . a d j u s t function introduced i n Section 2.3.1. For example, calling R> p < - c ( 0 . 0 1 , 0 . 0 2 , 0 . 0 2 2 , 0 . 0 9 ) R> p . a d j u s t ( p , "hochberg") [1]

0.040

0.044

0.090

gives the adjusted p-values from Table 2.2. Similarly, the Hommel procedure is available with R> p . a d j u s t ( p , [1]

0.030

0.040

"hommel") 0.044

0.090

Note that i n this example the Hommel procedure leads to smaller adjusted p-values for JHI a n d H than the Hochberg procedure, reflecting the previous comment about its power advantage. 2

CHAPTER 3

Multiple Comparison Procedures in Parametric Models I n this chapter we introduce a general framework for multiple hypotheses testing i n parametric and semi-parametric models. T h i s chapter provides the theoretical basis for the applications analyzed in Chapter 4. I n Section 3.1 we review briefly the standard linear model theory and show how to perform m u l tiple comparisons in this framework, including analysis-of-variance ( A N O V A ) , analysis-of-covariance ( A N C O V A ) and regression models as special cases. We extend the basic approaches from Chapter 2 by using inherent distributional assumptions, particularly by accounting for the structural correlation between the test statistics, thus achieving larger power. I n addition, we revisit the l i n ear regression example from Chapter 1 to illustrate the resulting methods. I n Section 3.2 we extend the previous linear model framework and introduce m u l tiple comparison procedures for general parametric models relying on standard asymptotic normality results. T h e methods apply, for example, to generalized linear models, linear and non-linear mixed-effects models as well as survival data. Again, the use of the inherent stochastic dependencies leads to p o w erful methods. T h e m u l t c o m p package in R provides a convenient interface to perform multiple comparisons for the parametric models considered in S e c tions 3.1 and 3.2. A n in-depth introduction to the m u l t c o m p package is given in Section 3.3. Detailed examples to illustrate and extend the results from this chapter are left for Chapter 4. 3.1 G e n e r a l l i n e a r m o d e l s I n Section 3.1.1 we introduce the canonical framework for multiple hypotheses testing in general linear models. We refer to several standard results from the theory of linear models which will be used subsequently; see Searle (1971) for a detailed mathematical treatise of this subject. I n Section 3.1.2 we revisit the linear regression example from Chapter 1 to illustrate some of the results using R. 3.1.1

Multiple

comparisons

in linear

models

Multiple comparisons in linear models have been considered previously in the literature; see, for example, the standard textbooks from Hochberg and Tamhane (1987) and H s u (1996). Here, we follow the description from Bretz, 41

MULTIPLE COMPARISONS IN P A R A M E T R I C

MODELS

H o t h o r n , a n d Westfall (2008a) a n d consider the c o m m o n general linear m o d e l y = X/3 + e,

(3.1)

where y = ( j / x , . . . , y ) denotes t h e n x l vector of observations, X = {xij)u denotes the fixed a n d k n o w n n x p design m a t r i x , a n d f3 = (/?i,... denotes t h e fixed a n d u n k n o w n parameter vector. T h e r a n d o m , unobservable n x 1 error vector e is assumed t o follow a n n-dimensional n o r m a l d i s t r i b u t i o n w i t h mean vector 0 = ( 0 , . . . , 0 ) a n d covariance m a t r i x < r I , e ~ iV* (0, o* In) for short, where N denotes an n-dimensional n o r m a l d i s t r i b u t i o n , a t h e c o m m o n variance, a n d I the i d e n t i t y m a t r i x of d i m e n s i o n n. M o d e l (3.1) implies t h a t each i n d i v i d u a l observation yi follows t h e linear model n

y% = Pixn

+ • • - + PpXip + e,

where e 7V(0, c r ) . Extensions of t h i s linear m o d e l w i l l be considered i n Section 3.2 a n d a variety of applications w i l l be discussed i n C h a p t e r 4. Assume t h a t we w a n t t o p e r f o r m pre-specified comparisons a m o n g t h e p a rameters Pi,.. . ,{3 . To accomplish this, define a p x l vector c = ( c i , . . . , Cp) of k n o w n constants. I f t h e vector c is chosen such t h a t c l = 0, where 1 = ( 1 , . . . , l ) T j . i t is denoted as contrast vector. T h e vector c t h u s reflects a single e x p e r i m e n t a l comparison of interest t h r o u g h the linear c o m b i n a t i o n c /3 w i t h associated n u l l hypothesis 2

H: c (3 T

(3.2)

= a

for a fixed a n d k n o w n constant a. I n the following, we refer t o c /3 as t h e (linear) f u n c t i o n of interest; a l l such functions are assumed t o be estimable as defined below. I f we have m u l t i p l e experimental questions, m say, we o b t a i n m vectors c i , . . . , c , w h i c h can be summarized by the m a t r i x C = ( c i , . . . , c ) , r e s u l t i n g i n t h e m elementary n u l l hypotheses T

Hj : c](3

= aj,

j = 1 , . . . , m.

Example 3.1. Recall the linear regression example f r o m C h a p t e r 1. T h e r e we considered t h e m u l t i p l e test p r o b l e m of whether intercept or slope f r o m a linear regression model differ significantly f r o m zero. Based o n t h e t h u e s e n d a t a , we w a n t t o predict the v e n t r i c u l a r shortening velocity y f r o m b l o o d glucose x using the m o d e l Vi

fi2Xi

~\-£i

for t h e i - t h p a t i e n t , where 0i denotes the intercept a n d /?2 denotes t h e slope. We can thus assume t h a t t h e n = 23 i n d i v i d u a l measurements follow t h e linear

GENERAL LINEAR

MODELS

m o d e l (3.1), where (

1.34 1.27

15.3 10.8 8.1

/ 1 1 1

and

1 1

1.03 1.12 V 1-70 J

(3 =

4.9 8.8 9.5 J

Pi fa

For t h e t h u e s e n d a t a example we are interested i n t e s t i n g t h e m = 2 e l e m e n t a r y n u l l hypotheses 0! = 0

H: x

and

(3 = 0,

H: 2

where

i n t h e n o t a t i o n f r o m E q u a t i o n (3.2). S t a n d a r d linear m o d e l t h e o r y provides t h e usual least squares estimates 0 = (X X)-X y T

•

(3.3)

and a

(y - X / 9 ) (y -

X£)

(3-4)

where v — n — r a n k ( X ) a n d ( XX ) denotes some generalized inverse of X X . U n d e r t h e m o d e l assumptions (3.1), a is a n unbiased e s t i m a t e of a . I f r a n k ( X ) = p, t h e estimate (3 = ( X X ) X y is also unbiased. W e can test t h e hypotheses Hj using t h e q u a n t i t i e s T

(3.5)

one for each e x p e r i m e n t a l question defined t h r o u g h Cj. B y c o n s t r u c t i o n , each test s t a t i s t i c tj,j = 1 , . . . , m , follows under t h e n u l l hypothesis (3.2) a c e n t r a l u n i v a r i a t e t d i s t r i b u t i o n w i t h v degrees o f freedom. W h e n t h e n u l l hypothesis Hj: Cj (3 = a is n o t t r u e , tj follows a n o n - c e n t r a l u n i v a r i a t e t d i s t r i b u t i o n w i t h v degrees o f freedom a n d n o n c e n t r a l i t y p a r a m e t e r _ T

cJ/3 - aj

j =

l,...,m.

^ c J ( X T X ) - c /

T h e j o i n t d i s t r i b u t i o n o f t\\ and c o r r e l a t i o n m a t r i x

, tm is m u l t i v a r i a t e t w i t h v degrees o f freedom

R =

DC (X X)-CD, T

where D = d i a g ( c J ( X X ) ~ C j ) / . I f a is k n o w n or i n t h e a s y m p t o t i c case v — • oo, t h e ( l i m i t i n g )m u l t i v a r i a t e n o r m a l d i s t r i b u t i o n holds. T

- 1

MULTIPLE COMPARISONS I N P A R A M E T R I C

MODELS

L e t uidenote the c r i t i c a l value derived f r o m t h e m u l t i v a r i a t e n o r m a l or t d i s t r i b u t i o n for a pre-specified significance level a. We t h e n reject t h e n u l l hypothesis Hj i f \tj\ > v>i- . A l t e r n a t i v e l y , i f qj denotes the adjusted p-value for Hj c o m p u t e d f r o m either t h e m u l t i v a r i a t e n o r m a l or t d i s t r i b u t i o n , we reject Hj i f qj < a (recall Section 2.1.2 for a generic d e f i n i t i o n o f adjusted p-values). Confidence intervals for cJ/3 — ay, j = 1 , . . . , ra, w i t h simultaneous coverage p r o b a b i l i t y 1 — a are given t h r o u g h a

jcJ/3 - dj - ui-o.ayjcj(X. X.)~Cj]

cjfi

— a,j 4- t f c i - c * a ^ / c T ( X X ) - c / T

We o b t a i n s i m i l a r results for one-sided test problems b y r e f o r m u l a t i n g t h e n u l l hypotheses, t a k i n g t h e test statistics tj instead o f t h e i r absolute values a n d c o m p u t i n g t h e associated one-sided c r i t i c a l values and/or adjusted p-values. Genz a n d B r e t z (1999, 2002, 2009) described n u m e r i c a l i n t e g r a t i o n m e t h o d s t o calculate m u l t i v a r i a t e n o r m a l a n d t p r o b a b i l i t i e s . For a general overview of these d i s t r i b u t i o n s we refer the reader t o t h e books o f T o n g (1990); K o t z , B a l a k r i s h n a n , a n d Johnson (2000) a n d K o t z a n d N a d a r a j a h (2004). T h e methods discussed i n t h i s section lead t o powerful m u l t i p l e c o m p a r i s o n procedures. These procedures extend t h e classical B o n f e r r o n i m e t h o d f r o m Section 2.3.1 b y considering t h e j o i n t m u l t i v a r i a t e n o r m a l or t d i s t r i b u t i o n of the i n d i v i d u a l test statistics (3.5). However, t h e m e t h o d s presented here belong t o t h e class of single-step procedures. A s explained i n Section 2.2.3, more powerful closed test procedures based o n m a x - t tests o f t h e f o r m (2.1) can be constructed, w h i c h u t i l i z e t h e c o r r e l a t i o n s t r u c t u r e i n v o l v e d i n t h e j o i n t d i s t r i b u t i o n of t h e test statistics (either m u l t i v a r i a t e t or m u l t i v a r i a t e n o r m a l ) . I t is t h e c o m b i n a t i o n of t h e methods f r o m t h i s section w i t h t h e closure p r i n c i p l e described earlier, t h a t results i n powerful stepwise procedures a n d forms t h e core m e t h o d o l o g y for t h e a p p l i c a t i o n analyses i n C h a p t e r 4. Example 3.2. R e v i s i t i n g t h e t h u e s e n d a t a from E x a m p l e 3 . 1 , we have (3 = ( 1 . 0 9 8 , 0 . 0 2 2 ) a n d a = 0.217. I n Section 3.1.2 we w i l l e x t r a c t t h i s i n f o r m a t i o n from t h e f i t t e d linear m o d e l i n R. These numbers can also be c o m p u t e d using E q u a t i o n s (3.3) a n d (3.4). P l u g g i n g t h e estimates f3 a n d a i n t o E q u a t i o n (3.5) results i n t h e test statistics t\ = 9.345 a n d t = 2.101 w i t h v = 21 degrees of freedom. Based o n t h e c o r r e l a t i o n of — 0 . 9 2 3 between t h e t w o test statistics, we can t h e n c o m p u t e t h e r e q u i r e d bivariate t p r o b a b i l i t i e s for t h e m u l t i p l i c i t y adjusted p-values, as shown i n Section 3.1.2. • We conclude t h i s section b y reviewing t h e e s t i m a b i l i t y c o n d i t i o n s for linear functions i n general linear models. We again refer t o Searle (1971) for f u r t h e r details. A f u n c t i o n c ( 3 is called estimable i f there exists a n x l vector z such t h a t E ( z y ) = E ( X ) I L i iYi) — cj3. I f such a vector z cannot be f o u n d , t h e f u n c t i o n c /3 is n o t estimable. A necessary a n d sufficient c o n d i t i o n for t h e e s t i m a b i l i t y o f c ( 3 is given b y T

( X

X ) " X

X =

c . T

GENERAL LINEAR MODELS F u r t h e r , i f c / 3 is estimable, t h e n E ( z y ) T

= c ( 3 , where

= c

(X X)"X .

Moreover, i f c ( 3 is estimable, t h e n c (3 estimator o f c ( 3 a n d i t s variance T

= z y T

is t h e best linear unbiased

V ( z y ) = (7 c 2

(X X)"c

does n o t depend o n t h e p a r t i c u l a r choice of the generalized inverse. T h u s , once t h e e s t i m a b i l i t y of c ( 3 is confirmed, we o b t a i n a g o o d a n d u n i q u e estimate o f t h e f u n c t i o n c ( 3 o f interest t h r o u g h c ( X X ) X y . N o t e t h a t we have used t h i s result already i n E q u a t i o n s (3.3) a n d (3.5). T

3.1.2

The linear

regression

example

revisited

using

A s a d i r e c t consequence of t h e linear m o d e l t h e o r y reviewed i n Section 3.1.1, t h e i m p l e m e n t a t i o n o f any advanced m u l t i p l e c o m p a r i s o n procedure requires careful consideration o f several i n d i v i d u a l steps. F l e x i b l e interfaces, such as t h e m u l t c o m p package i n R, f a c i l i t a t e t h e use of such advanced m e t h o d s . I n t h i s section we r e v i s i t t h e linear regression example based o n t h e t h u e s e n d a t a f r o m C h a p t e r 1 t o i l l u s t r a t e t h e key steps w h e n u s i n g t h e m u l t c o m p package. A d e t a i l e d i n t r o d u c t i o n t o t h e package is given i n Section 3.3. I f t h e m u l t c o m p package was n o t available, we w o u l d need t o m a k e t h e necessary calculations step b y step based o n t h e m e t h o d s f r o m Section 3.1.1. T h e necessary estimates o f t h e regression coefficients (3 a n d t h e i r covariance m a t r i x can be e x t r a c t e d f r o m t h e previously fitted m o d e l (see C h a p t e r 1 for t h e associated l m f i t ) b y c a l l i n g R> b e t a h a t < - c o e f ( t h u e s e n . l m ) R> V b e t a h a t < - v c o v ( t h u e s e n . l m ) G i v e n these q u a n t i t i e s we can c o m p u t e t h e vector t c o n t a i n i n g t h e t w o i n d i v i d u a l ÂŁ,test s t a t i s t i c s a n d i t s associated c o r r e l a t i o n m a t r i x as (see Section 3.1.1 for t h e t h e o r e t i c a l b a c k g r o u n d ) : R> R> R> R>

C <- d i a g ( 2 ) Sigma < - d i a g ( l / s q r t ( d i a g ( C %*% V b e t a h a t %*% t ( C ) ) ) ) t <- Sigma %*% C '/,*% b e t a h a t Cor < - Sigma %*7, (C %*% V b e t a h a t 7,*% t ( C ) ) %*% t ( S i g m a )

N o t e t h a t t = ( 9 . 3 4 5 , 2 . 1 0 1 ) w i t h associated c o r r e l a t i o n m a t r i x T

[1,] [2,]

[rU 1.000 -0.923

1,2] -0.923 1.000

A d j u s t e d p-values are finally c o m p u t e d f r o m t h e u n d e r l y i n g b i v a r i a t e t d i s t r i b u t i o n u s i n g t h e pmvt f u n c t i o n o f t h e m v t n o r m package ( H o t h o r n , B r e t z , a n d Genz 2001; Genz a n d B r e t z 2009):

M U L T I P L E COMPARISONS I N P A R A M E T R I C MODELS

R> l i b r a r y ( " m v t n o r m " ) R> t h u e s e n . d f <- n r o w ( t h u e s e n ) - l e n g t h ( b e t a h a t ) R> q <- s a p p l y ( a b s ( t ) , f u n c t i o n ( x ) + 1 - p m v t ( - r e p ( x , 2 ) , r e p ( x , 2 ) , c o r r = Cor, + df = thuesen.df)) W e o b t a i n t h e m u l t i p l i c i t y adjusted p-values qi < 0.001

and

q = 0.064,

(3.6)

i n d i c a t i n g t h a t t h e intercept is significantly different f r o m 0 b u t t h e slope is not. A l t e r n a t i v e l y , we can c o m p u t e a c r i t i c a l value U I - Q , derived f r o m t h e b i variate t d i s t r i b u t i o n a n d compare t h e test statistics t = (t\ t ) against i t . Using the function y 2

R> d e l t a < - r e p ( 0 , 2) R> m y f c t < - f u n c t i o n ( x , c o n f ) •{ + l o w e r < - r e p ( - x , 2) + u p p e r < - r e p ( x , 2) + p m v t ( l o w e r , upper, df = t h u e s e n . d f , c o r r = Cor, + d e l t a , abseps = 0 . 0 0 0 1 ) [ 1 ] - c o n f

+ > we can c o m p u t e t h e c r i t i c a l value U\-

w i t h the u n i r o o t function

R> u < - u n i r o o t ( m y f c t , l o w e r = 1 , u p p e r = 5 , R> r o u n d ( u , 3) [1]

conf = 0 . 9 5 ) $ r o o t

2.23

I n our example we set t h e confidence level as 1 — a — 0.95 a n d o b t a i n ui-^ = 2.229. Because t h e test statistics for t h e two-sided test p r o b l e m are \t\\ = 9.345 > i t i - a a n d \t \ = 2.101 < w i we o b t a i n t h e same test decisions as i n (3.6). I n a d d i t i o n , t h e c r i t i c a l value u i _ = 2.229 can be used t o c o m p u t e simultaneous confidence intervals for t h e parameters 0i a n d ' f3 . N o t e t h a t because t h e parameter estimates are h i g h l y correlated, t h e c r i t i c a l value 2.229 f r o m t h e b i v a r i a t e t d i s t r i b u t i o n is considerably smaller t h a n t h e B o n f e r r o n i c r i t i c a l value t i _ 0-/2,1/ == 2.414 from t h e u n i v a r i a t e t d i s t r i b u t i o n . As seen f r o m t h e t h u e s e n example, p e r f o r m i n g advanced m u l t i p l e c o m parisons can involve a n u m b e r o f i n d i v i d u a l steps. T h e m u l t c o m p package provides a f o r m a l f r a m e w o r k t o replace t h e previous calculations w i t h s t a n dardized f u n c t i o n calls. Details o f t h e m u l t c o m p package are given i n Sect i o n 3.3, b u t for i l l u s t r a t i o n purposes we a p p l y i t now t o t h e t h u e s e n d a t a . T h e f o l l o w i n g lines replicate some of t h e calculations f r o m C h a p t e r 1, where we first used t h e m u l t c o m p package. T h e g l h t f u n c t i o n f r o m m u l t c o m p takes a fitted response m o d e l a n d a m a t r i x C defining t h e hypotheses o f interest t o p e r f o r m t h e m u l t i p l e c o m p a r isons: 2

a >

GENERAL LINEAR MODELS

R> l i b r a r y ( " m u l t c o m p " ) R> t h u e s e n . m c < - g l h t ( t h u e s e n . l m , R> s u m m a r y ( t h u e s e n . m c ) Simultaneous

Tests

for

Fit: lm(formula = short.velocity data = thuesen) Linear

General

Linear

Hypotheses

blood.glucose,

Hypotheses:

(Intercept) blood.glucose Signif. (Adjusted

l i n f c t = C)

0 == 0

codes: 0 p values

Estimate 1.0978 0.0220

Std.

Error 0.1175 0.0105

'***' 0.001 '**' 0.01 reported â&#x20AC;&#x201D; single-step

value 9.34 2.10 '*'

Pr(>ftl) le-08 0.064

0.05 '.' method)

0.1

*** . '

For each parameter i = 1,2, m u l t c o m p reports i t s estimate a n d s t a n d a r d error. T a k i n g t h e r a t i o of these t w o values for each parameter results i n t h e reported test statistics. A d j u s t e d p-values are given i n t h e last c o l u m n of t h e s t a n d a r d o u t p u t f r o m m u l t c o m p . A s expected, t h e y are t h e same as c a l c u l a t e d i n (3.6). I n a d d i t i o n , simultaneous confidence intervals can be calculated for each parameter using t h e t h e c o n f i n t m e t h o d : R> c o n f i n t ( t h u e s e n . m c ) Simultaneous

* Confidence

Fit: lm(formula = short.velocity data = thuesen) Quantile =2.23 95% family-wise

Linear (Intercept) blood.glucose

Intervals

~ blood.glucose,

\ a \

confidence

level

Hypotheses: ==

0 == 0

Estimate 1.09781 0.02196

lwr ' 0.83591 -0.00134

upr 1.35972 0.04527

T h e two-sided confidence i n t e r v a l for the slope includes t h e 0, t h u s reflecting t h e previous test decision t h a t t h e slope is n o t s t a t i s t i c a l l y significant different f r o m 0. We f u r t h e r conclude t h a t t h e intercept lies r o u g h l y between 0.84 a n d 1.36. So far we have o n l y i l l u s t r a t e d single-step procedures, w h i c h account for t h e c o r r e l a t i o n a m o n g t h e test statistics. As we k n o w f r o m C h a p t e r 2, associated closed test procedures are available a n d u n i f o r m l y more powerful. I n m u l t c o m p we can call, for example,

M U L T I P L E COMPARISONS I N P A R A M E T R I C

R> s u m m a r y ( t h u e s e n . m c , t e s t = a d j u s t e d ( t y p e Simultaneous

T e s t s f o r General

F i t : lm (formula = short.velocity data = thuesen) Linear

MODELS

"Westfall"))

Linear

Hypotheses

blood.glucose,

Hypotheses:

(Intercept) blood.glucose Signif. (Adjusted

== 0 == 0

Estimate 1.0978 0.0220

Std. E r r o r 0.1175 0.0105

codes: 0 '*+*' 0.001 '**' 0.01 p values reported â&#x20AC;&#x201D; West f a l l

t value 9.34 2.10

Pr(>ltj) le-08 *** 0.048 *

'*' 0.05 method)

'.' 0.1

' '1

t o e x t r a c t related adjusted p-values. N o t e t h a t the p-value associated w i t h t h e slope parameter is now p = 0.048 < 0.05 a n d one can thus safely c l a i m t h a t the slope is significant a t level a = 0.05 after having adjusted for m u l t i p l i c i t y . Details o n t h e i m p l e m e n t e d stepwise procedures i n the m u l t c o m p package are given i n Section 3.3 a n d Chapter 4. 3.2 E x t e n s i o n s t o g e n e r a l p a r a m e t r i c m o d e l s I n Section 3.1 we i n t r o d u c e d m u l t i p l e comparison procedures, w h i c h are well established for c o m m o n regression a n d A N O V A models a l l o w i n g for covariates and/or factorial t r e a t m e n t structures w i t h independent a n d i d e n t i c a l l y d i s t r i b u t e d n o r m a l errors a n d constant variance. I n this section we extend these results a n d provide a unified description of m u l t i p l e comparison procedures i n p a r a m e t r i c models w i t h generally correlated parameter estimates. We relax the s t a n d a r d A N O V A assumptions, such as n o r m a l i t y a n d homoscedasticity, thus a l l o w i n g for simultaneous inference i n generalized linear models, m i x e d effects models, s u r v i v a l models, etc. As before, we assume t h a t each i n d i v i d u a l n u l l hypothesis is specified t h r o u g h a linear c o m b i n a t i o n o f p elemental model parameters a n d we simultaneously test ra n u l l hypotheses. I n Section 3.2.1 we i n t r o d u c e t h e necessary a s y m p t o t i c results for t h e linear functions of interest under r a t h e r weak conditions. I n Section 3.2.2 we d e scribe the framework for m u l t i p l e comparison procedures i n general p a r a m e t r i c models. W e give i m p o r t a n t applications of the methodology i n Section 3.2.3. N u m e r i c a l examples t o i l l u s t r a t e the methods using t h e m u l t c o m p package are left for C h a p t e r 4. M u c h of t h e following m a t e r i a l follows t h e o u t l i n e o f H o t h o r n , B r e t z , a n d Westfall (2008). 3.2.1

Asymptotic

results

We extend t h e n o t a t i o n f r o m Section 3.1 t o cope w i t h t h e generality required below. L e t M ( { z i , . . . , z } , 0, rj) denote a p a r a m e t r i c or semi-parametric s t a tistical model, where { z i , . . . , z } denotes t h e set of n observation vectors, 0 denotes t h e p x 1 fixed parameter vector a n d t h e vector r) contains other n

EXTENSIONS TO GENERAL PARAMETRIC

MODELS

( r a n d o m or nuisance) parameters. We are p r i m a r i l y interested i n the linear functions i9 = C 0 of t h e parameter vector 0 as specified t h r o u g h the p x m m a t r i x C w i t h fixed a n d k n o w n constants. Assume t h a t we are given an e s t i m a t e 0 of t h e parameter vector 0. I n w h a t follows we describe t h e u n d e r l y i n g m o d e l assumptions, t h e l i m i t i n g d i s t r i b u t i o n for t h e estimates o f o u r p a r a m eters of interest, # = C 0 , as well as t h e c o r r e s p o n d i n g test statistics for hypotheses i n v o l v i n g & a n d t h e i r l i m i t i n g j o i n t d i s t r i b u t i o n . Consider, for example, a s t a n d a r d regression m o d e l t o i l l u s t r a t e the new n o t a t i o n a n d how i t relates t o the one used i n Section 3 . 1 . Here, t h e o b s e r v a t i o n s Zi of subject i — 1 , . . . , n consist of a response variable yi a n d a vector of covariates x^ = ( x i i , • • •, # i p ) , such t h a t z$ = (yi, Xj) a n d xn = 1 for a l l i. T h e response is modeled b y a linear c o m b i n a t i o n of t h e covariates w i t h n o r m a l error Si ~ N(0,a ) a n d constant variance <7 , T

T h e parameter vector is t h e n 0 = (0i,..., 0 ), r e s u l t i n g i n t h e linear functions of interest •& = C 0. C o m i n g back t o t h e general theory, suppose 0 is a n estimate of 0 a n d S E MP is a n estimate of c o v ( 0 ) w i t h P

aS n

i S G

M ' p

(3.7)

for some positive, nondecreasing sequence a . F u r t h e r m o r e , we assume t h a t t h e m u l t i v a r i a t e c e n t r a l l i m i t t h e o r e m holds, t h a t is, n

-^>iV (0,S).

a}/ (d -0) 2

(3.8)

I f b o t h (3.7) a n d (3.8) are fulfilled, we w r i t e 0 ~ N (0,S ). Then, by T h e o r e m 3.3.A i n Serfling (1980), t h e estimate o f o u r p a r a m e t e r o f interest, *& = C 0 , is a p p r o x i m a t e l y m u l t i v a r i a t e n o r m a l l y d i s t r i b u t e d , t h a t is, n

w i t h covariance m a t r i x S* = C S C for any fixed m a t r i x C . T h u s , we do n o t need t o d i s t i n g u i s h between t h e m o d e l parameter 0 a n d t h e derived parameter & = C 0. I n analogy t o (3.7) a n d (3.8) we can t h u s assume t h a t T

tfn ~ J v ( t f , s ; ) m

holds, where aS n

S* = C S C 6 T

R ' , m

a n d t h a t t h e m parameters i n are t h e parameters o f interest. I t is a s sumed t h a t t h e d i a g o n a l elements o f t h e covariance m a t r i x are positive, t h a t is, S^- > 0 for j = l , . . . , m . Consequently, t h e s t a n d a r d i z e d e s t i m a t e ^ is again a s y m p t o t i c a l l y m u l t i v a r i a t e n o r m a l l y d i s t r i b u t e d . F o l l o w i n g H o t h o r n et a l . (2008), n

= ~D- (K 1/2

- 0) ~ W (0, Rn) m

MULTIPLE COMPARISONS IN P A R A M E T R I C

50 where D

MODELS

= d i a g ( S * ) contains the diagonal elements of S* a n d R

= D-^S^D" / 1

€

R ' m

is t h e correlation m a t r i x of t h e m-dimensional statistic t . T h i s leads t o the n

m a i n a s y m p t o t i c result i n t h i s section,

For the purpose of m u l t i p l e comparisons, we need convergence of m u l t i v a r i ate probabilities calculated for the vector t , where t is assumed n o r m a l l y d i s t r i b u t e d and R treated as i f i t was the t r u e correlation m a t r i x . However, the necessary probabilities are continuous functions of R (and t h e associated IP c r i t i c a l value) w h i c h converge by R • R as a consequence of T h e o r e m 1.7 i n Serfling (1980). I n cases where t is assumed t o be m u l t i v a r i a t e t d i s t r i b u t e d w i t h R treated as the estimated correlation m a t r i x , we have similar convergence as the degrees of freedom approach infinity. Since we o n l y assume t h a t the parameter estimates 0 are a s y m p t o t i c a l l y n o r m a l l y d i s t r i b u t e d w i t h a n available consistent estimate of the associated covariance m a t r i x , the framework i n this section covers a wide range of s t a tistical models, i n c l u d i n g linear regression a n d A N O V A models, generalized linear models, linear mixed-effects models, Cox models, robust linear models, etc. S t a n d a r d software packages can be used to fit such models and o b t a i n the estimates 9 a n d S , w h i c h are the only t w o quantities t h a t are needed here. n

7 l

3.2.2

Multiple

comparisons

in general parametric

models

Based o n the a s y m p t o t i c results f r o m Section 3.2.1, we can derive suitable m u l t i p l e comparison procedures for the class of p a r a m e t r i c models i n t r o d u c e d there. We s t a r t considering the global n u l l hypothesis H:

= a

where, as before, & = C 6 a n d a = ( a i , . .. , a ) denotes a vector of fixed a n d k n o w n constants. Under the conditions of H i t follows f r o m Section 3.2.1 that t = D " / ^ - a) £ N - ( 0 , R n ) . r

T h i s a p p r o x i m a t i n g d i s t r i b u t i o n w i l l be used as t h e reference d i s t r i b u t i o n when c o n s t r u c t i n g suitable m u l t i p l e comparison procedures below. N o t e t h a t the a s y m p t o t i c results can be sharpened i f we assume exact n o r m a l i t y 0 ~ N (0, S ) instead of the a s y m p t o t i c n o r m a l i t y assumption (3.8). I f the c o v a r i ance m a t r i x S is k n o w n , i t follows by s t a n d a r d arguments t h a t t ^ ~ N (0, R), where t is normalized using fixed and k n o w n variances. Otherwise, i n the t y p ical s i t u a t i o n of linear models w i t h independent, n o r m a l l y d i s t r i b u t e d errors and £J = c r A , where cr is u n k n o w n b u t A is fixed a n d k n o w n , t h e exact d i s t r i b u t i o n of t is m u l t i v a r i a t e t; see Section 3.1.1. T h e global n u l l hypothesis H can be tested using s t a n d a r d tests, such as F or X tests, see H o t h o r n et a l . (2008) for a n a l y t i c a l expressions. A n alternative n

EXTENSIONS TO GENERAL PARAMETRIC

MODELS

approach is t o consider t h e m a x i m u m o f t h e i n d i v i d u a l components of t h e vector t = (tin, • • - •> tmn)> leading t o the m a x - t test t n

max — m a x \"tn\ (see also Section 2.2.1 for a b r i e f discussion a b o u t m a x - £ tests). T h e d i s t r i b u t i o n of t h i s statistic under t h e c o n d i t i o n s of H can be c o m p u t e d u s i n g t h e m - d i m e n s i o n a l distribution function t t IP(*max < *) = J' • • • J —t

</?m(x; R , u)dx

=: s „ ( R , t)

(3.9)

-t

for some £ E K, where ipm denotes the density f u n c t i o n of either t h e l i m i t i n g m - d i m e n s i o n a l n o r m a l ( w i t h v = oo degrees of freedom a n d t h e operator) or t h e exact m u l t i v a r i a t e t d i s t r i b u t i o n ( w i t h v < oo a n d t h e " = " o p e r a t o r ) . Because R is u s u a l l y u n k n o w n , we p l u g i n t h e consistent estimate R , as discussed i n Section 3.2.1. T h e r e s u l t i n g (exact or a s y m p t o t i c ) p-value for H is t h e n given b y 1 - ^ ( R n , m a x | t | ) , where t = ( £ ? , . . . , t^ ) denotes t h e vector of observed test statistics £ ° - Efficient m e t h o d s t o calculate m u l t i v a r i a t e n o r m a l a n d t integrals are described i n Genz a n d B r e t z (1999, 2002, 2009). N o t e t h a t i n c o n t r a s t t o s t a n d a r d global F or x tests, m a x - £ tests of t h e form £ n u l l hypotheses (consonance property, see Section 2.2.1). T o t h i s end, recall t h a t C = ( c i , . . . , c ) a n d let flj = cJO denote t h e j - t h linear f u n c t i o n of interest, j = 1 , . . . , m. T h e elementary n u l l hypotheses of interest are given b y n

obs

o b s

b s

Hj : cJO

j = 1,..., m,

= a-,-,

for fixed a n d k n o w n constants a,, such t h a t H = H ^ L i Hj(exact or a s y m p t o t i c ) are given b y q j

= 1 - guiHn,

|£f |), s

Adjusted

p-values

j = l,...,m,

and calculated f r o m expression (3.9). B y c o n s t r u c t i o n , we reject a n elementary n u l l hypothesis Hj whenever t h e associated adjusted p-value qj is less t h a n or equal t o t h e pre-specified significance level a , t h a t is, qj < a. A l t e r n a t i v e l y , i f u i - a denotes t h e ( 1 — a ) - q u a n t i l e of t h ed i s t r i b u t i o n ( a s y m p t o t i c , i f necessary) of t , we reject Hj i f \tj \ > ui-. . I n a d d i t i o n , s i m u l t a n e o u s confidence bs

intervals for *&j w i t h coverage p r o b a b i l i t y 1 — a c a n be c o n s t r u c t e d f r o m i) ± i t i _ d i a g ( D ) / . S i m i l a r results also h o l d for one-sided test problems. S i m i l a r t o Section 3.1.1, t h e methods discussed here c a n be used t o c o n s t r u c t more p o w e r f u l closed test procedures, as first discussed b y W e s t f a l l (1997). T h a t is, a p p l y i n g closed test procedures (Section 2.2.3) based o n m a x - t tests of t h e f o r m (2.1) i n c o m b i n a t i o n w i t h t h e results f r o m here gives p o w e r f u l stepwise procedures for a large class o f p a r a m e t r i c models. T h e m u l t c o m p package i m p l e m e n t s these m e t h o d s w h i l e e x p l o i t i n g logical c o n s t r a i n t s , leading t o c o m p u t a t i o n a l l y efficient, yet powerful t r u n c a t e d closed test procedures (Westfall a n d Tobias 2007), as described i n Section 3.3 a n d i l l u s t r a t e d w i t h examples i n C h a p t e r 4. n

52 3.2.3

MULTIPLE COMPARISONS I N P A R A M E T R I C

MODELS

Applications

T h e methodological framework described i n Section 3.2.1 is very general a n d t h u s applicable t o a wide range of s t a t i s t i c a l models. M a n y e s t i m a t i o n t e c h niques, such as (restricted) m a x i m u m l i k e l i h o o d a n d M estimates, provide at least a s y m p t o t i c a l n o r m a l estimates of the o r i g i n a l parameter vector 0 a n d a consistent estimate of the covariance m a t r i x . I n t h i s section we review some p o t e n t i a l applications. D e t a i l e d numerical examples are discussed i n C h a p ter 4. I n Section 3.2.1 we already p r o v i d e d a connection between t h e general p a r a m e t r i c framework a n d s t a n d a r d regression models. S i m i l a r l y , ANOVA models are easily embedded i n t h a t framework as well. W i t h a n a p p r o p r i a t e change of n o t a t i o n , one can verify t h a t the results f r o m Section 3.1 are a special case of t h e general framework considered i n Section 3.2.1. I n generalized linear models, t h e exact d i s t r i b u t i o n o f t h e parameter e s t i m a t e s is usually u n k n o w n a n d t h e a s y m p t o t i c n o r m a l d i s t r i b u t i o n is t h e basis for a l l inference procedures. I f inferences a b o u t m o d e l parameters c o r r e s p o n d i n g t o levels of a c e r t a i n factor are of interest, one can use t h e m u l t i p l e comparison procedures described above. Similarly, i n linear a n d non-linear mixed-effects models f i t t e d b y restricted m a x i m u m l i k e l i h o o d , t h e a s y m p t o t i c n o r m a l d i s t r i b u t i o n using consistent covariance m a t r i x estimates forms t h e basis for inference. A l l assumptions of the general p a r a m e t r i c framework are satisfied again a n d one can set u p s i m u l taneous inference procedures for these models as w e l l . T h e same is t r u e for either t h e Cox model or other p a r a m e t r i c s u r v i v a l models such as t h e Weibull survival model. Yet another a p p l i c a t i o n is t o use robust variants of t h e previously discussed s t a t i s t i c a l models. One p o s s i b i l i t y is t o consider t h e use of sandwich estimators S for t h e covariance m a t r i x c o v ( 0 ) w h e n t h e variance homogeneity a s s u m p t i o n is questionable. A n alternative approach is t o a p p l y r o b u s t e s t i m a t i o n techniques i n linear models (such as 5-, M - or M M - e s t i m a t e s ) , w h i c h again provide a s y m p t o t i c a l l y n o r m a l estimates (Rousseeuw a n d Leroy 2003). n

H e r b e r i c h (2009) investigated t h r o u g h a n extensive s i m u l a t i o n s t u d y t h e o p e r a t i n g characteristics (size a n d power) of t h e m u l t i p l e comparison procedures described i n t h i s section for a variety of s t a t i s t i c a l models (generalized linear models, m i x e d effects models, s u r v i v a l data, etc.). I t transpires t h a t t h e f a m i l y wise error r a t e is generally w e l l m a i n t a i n e d even for m o d e r a t e t o s m a l l sample sizes, a l t h o u g h under c e r t a i n models a n d scenarios t h e r e s u l t i n g tests m a y b e come either conservative (especially for b i n a r y d a t a w i t h s m a l l sample sizes) or l i b e r a l ( s u r v i v a l d a t a , depending o n t h e censoring mechanism). F u r t h e r simulations i n d i c a t e d a good overall performance of t h e proposed p a r a m e t r i c m u l t i p l e comparison procedures compared w i t h e x i s t i n g procedures, such as t h e simultaneous confidence intervals for b i n o m i a l parameters i n t r o d u c e d by A g r e s t i , B i n i , B e r t a c c i n i , a n d R y u (2008).

THE

MULTCOMP

PACKAGE

3.3 T h e m u l t c o m p p a c k a g e I n this section we i n t r o d u c e t h e m u l t c o m p package i n R t o p e r f o r m m u l t i p l e comparisons under t h e p a r a m e t r i c m o d e l f r a m e w o r k described i n Sections 3.1 a n d 3.2. I n Section 3.3.1 we d e t a i l t h e g l h t f u n c t i o n , w h i c h provides t h e core f u n c t i o n a l i t y t o p e r f o r m single-step tests based o n a given m a t r i x C reflecting the e x p e r i m e n t a l questions of interest a n d t h e u n d e r l y i n g m u l t i v a r i a t e n o r m a l or t d i s t r i b u t i o n . I n Section 3.3.2 we describe t h e summary m e t h o d associated w i t h t h e g l h t f u n c t i o n , w h i c h provides detailed o u t p u t i n f o r m a t i o n , i n c l u d i n g t h e results of several p-value a d j u s t m e n t m e t h o d s a n d stepwise test p r o cedures. F i n a l l y , we describe i n Section 3.3.3 t h e c o n f i n t m e t h o d associated w i t h the g l h t f u n c t i o n , w h i c h provides t h e f u n c t i o n a l i t y t o c o m p u t e a n d p l o t simultaneous confidence intervals for some of t h e m u l t i p l e c o m p a r i s o n p r o c e dures. T h e m u l t c o m p package includes a d d i t i o n a l f u n c t i o n a l i t y n o t covered i n this section. I n C h a p t e r 4 we i l l u s t r a t e some of i t s enhanced capabilities. T h e complete d o c u m e n t a t i o n is available w i t h t h e package ( H o t h o r n , B r e t z , Westfall, Heiberger, a n d Schutzenmeister 2010a). A s any o t h e r package used i n t h i s book, t h e m u l t c o m p package can be d o w n l o a d e d f r o m C R A N . 3.3.1

The glht

function

I n t h i s section we consider t h e w a r p b r e a k s d a t a f r o m T u k e y (1977), w h i c h are also available f r o m t h e base R d a t a s e t s package. T h e d a t a give t h e n u m b e r of breaks i n y a r n d u r i n g weaving for t w o types of w o o l (A a n d B) a n d t h r e e levels of tension ( L , M , a n d H). For i l l u s t r a t i o n purposes, here we o n l y consider t h e effect of t h e tension o n t h e n u m b e r of breaks, neglecting t h e p o t e n t i a l effect of t h e w o o l t y p e . F i g u r e 3.1 shows t h e associated b o x p l o t s for t h e d a t a . Assume t h a t we are interested i n assessing w h e t h e r t h e t h r e e levels of t e n s i o n differ f r o m each o t h e r w i t h respect t o t h e n u m b e r of breaks. I n o t h e r words, i f fij denotes t h e m e a n n u m b e r of breaks for tension level j = L , M , H, we are interested i n t e s t i n g t h e three n u l l hypotheses Hiji

fM -fij

= 0

(3.10)

against t h e a l t e r n a t i v e hypotheses Kij

: \Xi - fij ^ 0,

e { L , M , H, } , i ^ j .

I n t h e f o l l o w i n g we use t h i s example t o i l l u s t r a t e t h e m u l t c o m p package. To analyze t h e w a r p b r e a k s d a t a we use t h e w e l l - k n o w n T u k e y test, w h i c h perfoms a l l pairwise comparisons between t h e t h r e e t r e a t m e n t s a n d w h i c h we f o r m a l l y i n t r o d u c e i n Section 4.2. W e use t h e aov f u n c t i o n t o f i t t h e one-factor A N O V A model R> w a r p b r e a k s . a o v < - a o v ( b r e a k s R> s u m m a r y ( w a r p b r e a k s . a o v ) tension

~ tension, data =

Df Sum Sq Mean Sq F value 2 2034 1017 7.21

Pr (>F) 0.0018

warpbreaks)

1 i 54

MULTIPLE COMPARISONS I N P A R A M E T R I C

MODELS

CO o in

CO CD

iâ&#x20AC;&#x201D;

o CO

o CM

i H

M Tension

F i g u r e 3.1

Residuals Signif.

7199

51 codes:

'***'

Boxplots of the warpbreaks data.

141 0.001

'**'

0.01

'*'

0.05

'.'

0.1

Based o n t h e A N O V A F test we conclude t h a t t h e factor t e n s i o n has a n overall significant effect. T h e g l h t f u n c t i o n f r o m t h e m u l t c o m p package provides a convenient a n d general f r a m e w o r k i n R t o test m u l t i p l e hypotheses i n p a r a m e t r i c models, i n c l u d i n g t h e general linear models i n t r o d u c e d i n Section 3.1, linear a n d n o n linear mixed-effects models as w e l l as s u r v i v a l models. Generally speaking, t h e g l h t f u n c t i o n takes a f i t t e d response m o d e l a n d a m a t r i x C defining t h e hypotheses of interest t o p e r f o r m t h e m u l t i p l e comparisons. T h e general s y n t a x for t h e g l h t f u n c t i o n is

MULTCOMP

THE

PACKAGE

glht(model, linfct, alternative = c("two.sided", rhs = 0,

"less",

"greater"),

) I n t h i s call, t h e m o d e l argument is a f i t t e d m o d e l , such as a n object r e t u r n e d b y l m , g l m , or aov. I t is assumed t h a t the p a r a m e t e r estimates a n d t h e i r c o v a r i ance m a t r i x are available for the m o d e l a r g u m e n t . T h a t is, t h e m o d e l a r g u m e n t needs suitable c o e f a n d v c o v methods t o be available. T h e a r g u m e n t l i n f c t specifies t h e m a t r i x C i n t r o d u c e d i n Section 3.1. T h e r e are a l t e r n a t i v e ways of specifying t h e m a t r i x C and we i l l u s t r a t e t h e m below u s i n g t h e w a r p b r e a k s example. T h e a l t e r n a t i v e argument is a character s t r i n g specifying t h e a l t e r native hypothesis, w h i c h m u s t be one of " t w o . s i d e d " ( d e f a u l t ) , " g r e a t e r " or " l e s s " , depending o n whether two-sided or one-sided hypotheses are of interest. T h e r h s a r g u m e n t is a n o p t i o n a l n u m e r i c vector specifying t h e r i g h t h a n d side of t h e hypothesis ( i n E q u a t i o n (3.10) t h e r i g h t h a n d side is 0 for a l l three hypotheses). F i n a l l y , w i t h " . . . " a d d i t i o n a l a r g u m e n t s can be passed o n t o t h e m o d e l p a r m f u n c t i o n i n a l l g l h t methods. We now review t h e different possibilities of specifying t h e m a t r i x C , w h i c h can, b u t does n o t have t o be a contrast m a t r i x ; see Section 3.1.1 for t h e d e f i n i t i o n of a contrast. T h e most convenient way is t o use t h e mcp f u n c t i o n . M u l t i p l e comparisons of means are defined b y objects of t h e mcp class as r e t u r n e d by t h e mcp f u n c t i o n . For each factor contained i n m o d e l as a n independent variable, a (contrast) m a t r i x or a symbolic d e s c r i p t i o n o f t h e comparisons of interest can be specified as an argument t o mcp. A s y m b o l i c d e s c r i p t i o n m a y be a c h a r a c t e r or a n e x p r e s s i o n where t h e factor levels are o n l y used as variables names. I n a d d i t i o n , t h e t y p e a r g u m e n t t o t h e ( c o n t r a s t ) m a t r i x g e n e r a t i n g f u n c t i o n c o n t r M a t m a y serve as a s y m b o l i c d e s c r i p t i o n as w e l l . To i l l u s t r a t e t h e l a t t e r , we invoke t h e T u k e y test consisting o f a l l p a i r wise c o m parisons for t h e factor t e n s i o n b y using a s y m b o l i c d e s c r i p t i o n , t h a t is, using the t y p e argument t o the c o n t r M a t function R> g l h t ( w a r p b r e a k s . a o v , General

Linear

Multiple

Comparisons

Linear

Hypotheses: Estimate 0 -10.00 0 -14.12 0 -4.72

M H H -

L == L == M ==

linfct

= mcp(tension =

"Tukey"))

Hypotheses of

Means:

Tukey

Contrasts

N o t e t h a t t h e c o n t r M a t f u n c t i o n also p e r m i t s pre-specifying o t h e r contrast matrices, such as " D u i m e t t " , " W i l l i a m s " , a n d " C h a n g e p o i n t " ; see Section 4.3.2

MULTIPLE

COMPARISONS I N P A R A M E T R I C MODELS

for f u r t h e r examples a n d the online d o c u m e n t a t i o n for a complete list ( H o t h o r n et a l . 2010a). A n o t h e r way of defining the m a t r i x C is t o use a symbolic description, where either a c h a r a c t e r or an e x p r e s s i o n vector is passed t o g l h t v i a its l i n f c t argument. A symbolic description m u s t be interpretable as a v a l i d R expression consisting o f b o t h t h e left a n d t h e r i g h t h a n d side of t h e expression for t h e n u l l hypotheses. O n l y the names of c o e f ( b e t a ) can be used as variable names. T h e alternative hypotheses are given by the d i r e c t i o n under the n u l l hypothesis (= or == refer t o " t w o . s i d e d " , <= refers t o " g r e a t e r " a n d >= refers t o " l e s s " ) . N u m e r i c vectors of l e n g t h one are v a l i d arguments for t h e r i g h t h a n d side. I f we call R> + + +

glht(warpbreaks.aov, l i n f c t = mcp(tension =

General

Linear

Multiple

Comparisons

Linear

Hypotheses: Estimate 0 -10.00 0 -14.12 0 -4.12

M H H -

L == L == M ==

c("M - L = 0 " , "H - L = 0 " , "H - M = 0")))

Hypotheses Means:

User-defined

Contrasts

we o b t a i n t h e same results as before. A l t e r n a t i v e l y , the m a t r i x C can be defined d i r e c t l y by calling R> c o n t r < - r b i n d ( " M - L " = c ( - l , 1, 0 ) , + "H - L" = c ( - l , 0, 1 ) , + " H - M" = c ( 0 , - 1 , 1 ) ) R> c o n t r t,l] M H H -

L L M

[,2] -1 -1 0

[,3] 0 1 1

1 0 -1

a n d again we o b t a i n t h e same results as before: R> g l h t ( w a r p b r e a k s . a o v , General Multiple

Linear

Comparisons

Hypotheses: Estimate

l i n f c t = mcp(tension =

contr))

Hypotheses Means:

User-defined

Contrasts

MULTCOMP

THE

L == 0 L == 0 M == 0

M H H -

PACKAGE

-10.00 -14.72 -4. 72

F i n a l l y , t h e m a t r i x C can be specified d i r e c t l y v i a the l i n f c t argument. I n this case, t h e n u m b e r of columns o f t h e m a t r i x needs t o m a t c h the number of parameters e s t i m a t e d b y m o d e l . I t is assumed t h a t suitable c o e f and v c o v m e t h o d s are available for m o d e l . I n the w a r p b r e a k s example, we can specify R> g l h t ( w a r p b r e a k s . a o v , + l i n f c t = c b i n d ( 0 , c o n t r %*% c o n t r . t r e a t m e n t ( 3 ) ) ) General Linear M H H -

L == L == M ==

Linear

Hypotheses

Hypotheses: Estimate 0 -10.00 0 -14.72 0 -4.72

a n d again we o b t a i n t h e same results as before. N o t e t h a t i n t h e last statement we used t h e so-called treatment contrasts R> c o n t r . t r e a t m e n t ( 3 ) 2 3 10 0 2 10 3 0 1

w h i c h are used as default i n R t o fit A N O V A a n d regression models. T h e first g r o u p is t r e a t e d as a c o n t r o l g r o u p , t o w h i c h t h e other groups are compared. T o be more specific, t h e analysis is performed as a m u l t i p l e regression analysis by i n t r o d u c i n g t w o d u m m y variables w h i c h are 1 for observations i n t h e relevant g r o u p a n d 0 elsewhere. We now t u r n our a t t e n t i o n t o t h e d e s c r i p t i o n o f t h e o u t p u t f r o m t h e g l h t f u n c t i o n a n d t h e available m e t h o d s for further analyses. U s i n g t h e g l h t f u n c t i o n , a list is r e t u r n e d w i t h t h e elements R> w a r p b r e a k s . m c < - g l h t ( w a r p b r e a k s . a o v , + l i n f c t = mcp(tension = R> name s ( w a r p b r e a k s . m c ) [1] [5] [9]

"model" "vcov" "focus"

"linfct" "df"

"rhs" "alternative"

"Tukey")) "coef" "type"

where p r i n t , summary, a n d c o n f i n t methods are available for f u r t h e r i n f o r m a t i o n h a n d l i n g . I f g l h t is called w i t h l i n f c t as a n mcp object, t h e a d d i t i o n a l element f o c u s is available, w h i c h stores t h e names of t h e factors tested. I n t h e f o l l o w i n g we describe t h e o u t p u t i n more d e t a i l . Some o f t h e associated m e t h o d s w i l l be discussed i n the subsequent sections. T h e first element o f t h e r e t u r n e d object

M U L T I P L E COMPARISONS IN P A R A M E T R I C

MODELS

R> w a r p b r e a k s . m c $ m o d e l Call: aov(formula

= breaks

- tension,

data

warpbreaks)

Terms: ten si on Sum of Squares 2034 Deg. of Freedom 2 Residual standard error: Estimated e f f e c t s may be

Residuals 1199 51 11.9 unbalanced

gives the f i t t e d model, as used i n t h e g l h t call. T h e element R> w a r p b r e a k s . m c $ l i n f c t ( I n t e r c e p t ) tensionM M - L 0 1 H - L 0 0 H â&#x20AC;&#x201D; M 0 -1 a t t r (, "type ") [1] "Tukey"

tensionH 0 1 1

returns t h e m a t r i x C used for the d e f i n i t i o n of the linear functions of interest, as discussed above. Each row of this m a t r i x essentially defines the left h a n d side of the hypotheses defined i n equation (3.2). T h e associated r i g h t h a n d side of equation (3.2) is r e t u r n e d by R> w a r p b r e a k s . m c $ r h s [1]

0 0 0

T h e next t w o elements of the list r e t u r n e d by t h e g l h t f u n c t i o n give the estimates a n d the associated covariance m a t r i x of the parameters specified t h r o u g h model, R>

warpbreaks.mc$coef

(Intercept) 36. 4 R>

tensionM -10. 0

tensionH -14 . 7

warpbreaks.mc$vcov

(Intercept) tensionM tensionH

(Intercept) 1.84 -1.84 -7.84

tensionM -7.84 15.68 7.84

tensionH -1.84 1.84 15. 68

As a n o p t i o n a l element the degrees of freedom R> w a r p b r e a k s . m c $ d f [1]

is r e t u r n e d i f the m u l t i v a r i a t e t d i s t r i b u t i o n is used for simultaneous inference. Because we have 54 observations i n t o t a l for the w a r p b r e a k s d a t a and the

MULTCOMP

THE

PACKAGE

factor t e n s i o n has 3 levels, we o b t a i n the displayed 5 1 degrees of freedom for t h e u n d e r l y i n g one-way layout. N o t e t h a t inference is based on the m u l t i v a r i a t e t d i s t r i b u t i o n whenever a linear model w i t h n o r m a l l y d i s t r i b u t e d errors is applied; the l i m i t i n g m u l t i v a r i a t e n o r m a l d i s t r i b u t i o n is used i n a l l other cases. T h e element R>

warpbreaks.mc$alternative

[1 ]

"two.

sided"

is a character s t r i n g specifying t h e sideness of t h e test p r o b l e m . F i n a l l y , R>

warpbreaks.mc$type

[1]

"Tukey"

o p t i o n a l l y specifies t h e name of t h e applied m u l t i p l e comparison 3.3.2

The summary

procedure.

method

T h e m u l t c o m p package provides a summary m e t h o d t o summarize a n d d i s play the results r e t u r n e d f r o m t h e g l h t f u n c t i o n . I n a d d i t i o n , t h e summary m e t h o d provides several functionalities t o p e r f o r m f u r t h e r analyses a n d a d j u s t m e n t s for m u l t i p l i c i t y . W e s t a r t a p p l y i n g t h e summary m e t h o d t o t h e object r e t u r n e d b y g l h t as R> summary(warpbreaks.mc) Simultaneous Multiple

Fit:

Comparisons

aov(formula

Tests of

= breaks

Hypotheses: Estimate M — L == 0 -10.00 H - L == 0 -14.12 H - M == 0 -4.12

for

Means:

General

Linear

Tukey

Hypotheses

Contrasts

~ tension,

data

warpbreaks)

Linear

Signif. (Adjusted

codes: 0 p values

Std.

Error 3.96 3.96 3.96

value -2.53 -3.12 -1.19

'***' 0.001 '**' 0.01 reported — single-step

Pr(>\t\) 0.0385 0.0014 0.4 631 '*'

* **

0.05 '.' method)

0.1

' 1

A s seen f r o m t h e o u t p u t above, for each of t h e m = 3 linear functions of interest t h e estimates cj J3 are reported together w i t h t h e associated s t a n d a r d errors -^/v(cJ/3), r e s u l t i n g i n t h e test statistics tj —

, J £ ==z, VV(cJ/3) C

j — 1,2,3,

M U L T I P L E COMPARISONS IN P A R A M E T R I C

MODELS

recall E q u a t i o n (3.5) for t h e general expression (note t h a t a = 0 i n t h e w a r p b r e a k s example). M u l t i p l i c i t y adjusted p-values qj are r e p o r t e d i n t h e last c o l u m n . These p-values are calculated f r o m t h e u n d e r l y i n g m u l t i v a r i a t e t (or n o r m a l , i f a p p r o p r i a t e ) d i s t r i b u t i o n and can be d i r e c t l y compared w i t h t h e significance level a . I n t h e w a r p b r e a k s example, we conclude t h a t b o t h tension levels M ( m e d i u m ) and H (high) are significantly different f r o m L (low) at level a â&#x20AC;&#x201D; 0.05, because qi â&#x20AC;&#x201D; qML = 0.0385 < 0.05 = a a n d Q2 = QHL = 0.0014 < 0.05. N o t e t h a t q = q = 0.4631 > 0.05 a n d t h e tension levels M a n d H are n o t declared t o be significantly different. I f we save t h e i n f o r m a t i o n p r o v i d e d by t h e summary m e t h o d i n t o t h e object 3

R> w a r p b r e a k s . r e s

summary(warpbreaks.mc)

we can e x t r a c t t h e n u m e r i c a l i n f o r m a t i o n f r o m t h e r e s u l t i n g list w a r p b r e a k s . r e s $ t e s t for further analyses. For example, R>

element

warpbreaks.res$test$pvalues

[1] 0.03840 0.00144 attr(, "error") [1] 0.000184

0.46307

returns t h e adjusted p-values f r o m t h e previous analysis. I n a d d i t i o n t o s u m m a r i z i n g a n d displaying t h e i n f o r m a t i o n f r o m t h e object r e t u r n e d b y t h e g l h t f u n c t i o n , t h e summary m e t h o d provides a t e s t a r g u m e n t , w h i c h allows one t o p e f o r m f u r t h e r analyses. Recall E q u a t i o n (3.10) for t h e elementary n u l l hypotheses of interest i n t h e w a r p b r e a k s example. T h e global n u l l hypothesis H = H M L n H H L n H H M can t h e n be tested w i t h t h e c o m m o n F test b y c a l l i n g R> summary ( w a r p b r e a k s . mc, t e s t = General

Linear

Multiple

Comparisons

Linear

Hypotheses: Estimate 0 -10.00 0 -14.72 0 -4.72

M H H -

L == L == M ==

Global Test: F DF1 DF2 1 7.2 2 51

FtestO)

Hypotheses of

Means:

Tukey

Contrasts

Pr(>F) 0.00175

N o t e t h a t t h e above F test result coincides w i t h t h e result f r o m t h e analysis following E q u a t i o n (3.10), where we used t h e aov f u n c t i o n t o fit t h e onefactor A N O V A m o d e l . S i m i l a r l y , a W a l d test can be performed b y specifying Chisqtest(). I f no m u l t i p l i c i t y a d j u s t m e n t is foreseen, t h e u n i v a r i a t e ( ) o p t i o n can be passed t o t h e t e s t a r g u m e n t

MULTCOMP

THE

PACKAGE

R> s u m m a r y ( w a r p b r e a k s . m c , Simultaneous Multiple

Fit:

aov(formula

Linear M H H -

L == L == M

Std.

0 '***' values

univariate())

for

Means:

= breaks

Hypotheses: Estimate 0 -10.00 0 -14.72 0 -4.72

Signif. codes: (Univariate p

test

Tests

Comparisons

General Tukey

data

value -2.53 -3.72 -1.19

0.001 '**' reported)

Hypotheses

Contrasts

~ tension,

Error 3.96 3.96 3.96

Linear

warpbreaks)

Pr(>\t\) 0.0147 * 0.0005 *** 0.2386

0.01

0.05

'.'

0.1

' 1

T h i s results i n u n a d j u s t e d p-values, as i f we h a d p e r f o r m e d m separate t tests w i t h o u t a c c o u n t i n g for m u l t i p l i c i t y . I f we compare t h e u n a d j u s t e d p-values pj f r o m t h e o u t p u t above w i t h t h e adjusted p-values qj f r o m t h e T u k e y test, we conclude t h a t pj < q j , j = 1, 2,3, as expected. Table 3.1 summarizes t h e m u l t i p l i c i t y a d j u s t m e n t m e t h o d s available w i t h t h e summary m e t h o d v i a t h e a d j u s t e d a r g u m e n t t o t h e t e s t f u n c t i o n . I n p a r t i c u l a r , an interface t o t h e m u l t i p l i c i t y a d j u s t m e n t s i m p l e m e n t e d i n t h e p . a d j u s t f u n c t i o n f r o m t h e s t a t s package is available. G i v e n a set o f u n a d j u s t e d p-values, t h e p . a d j u s t f u n c t i o n provides t h e r e s u l t i n g adjusted p-values using one o f several m e t h o d s . C u r r e n t l y , t h e f o l l o w i n g m e t h o d s i m p l e m e n t e d i n p . a d j u s t are available: " n o n e " (no m u l t i p l i c i t y a d j u s t m e n t ) , " b o n f e r r o n i " (Section 2.3.1), " h o l m " (Section 2.3.2), " h o c h b e r g " (Section 2.4), " h o m m e l " (Section 2.4), " B H " ( B e n j a m i n i a n d H o c h b e r g 1995), a n d "BY" ( B e n j a m i n i a n d Y e k u t i e l i 2001). T h e last t w o methods are t a i l o r e d t o c o n t r o l t h e false discovery rate; see Section 2.1.1. T o i l l u s t r a t e t h e f u n c t i o n a l i t y , assume t h a t we w a n t t o a p p l y t h e B o n f e r r o n i correction t o t h e w a r p b r e a k s d a t a . T h i s can be achieved b y specifying t h e t y p e argument t o t h e a d j u s t e d o p t i o n as R> s u m m a r y ( w a r p b r e a k s . m c , Simultaneous Multiple

Fit:

aov(formula

Linear M H H -

Comparisons

L == L == M ==

Tests of

= breaks

Hypotheses: Estimate 0 -10.00 0 -14.72 0 -4.72

test

Std.

= adjusted(type

for

Means:

General Tukey

~ tension,

Error 3.96 3.96 3.96

value -2.53 -3.72 -1.19

Linear

"bonferroni")) Hypotheses

Contrasts

data

warpbreaks)

Pr(>\t\) 0.0442 0.0015 0.7158

* **

a o o I—I

type argument to a d j u s t e d

"none" "BH","BY" "bonferroni" "holm" "hochberg" "hommel" "single-step" "free" "Shaffer" "Westfall"

i—i CO

CO I—I

<: a, S

o o w i—i

Table 3.1

Reference (section number)

Comments

no multiplicity adjustment; identical to t e s t = u n i v a r i a t e ( ) option procedures controlling the false discovery rate 2.3.1 2.3.2 2.4 2.4 3.1, 3.2 4.1.2 2.3.2 4.2.2

stepwise extension of and more powerful than " b o n f e r r o n i " based on Simes test; more powerful than "holm", but has additional assumptions more powerful than "hochberg" default option; incorporates correlations and is more powerful than " b o n f e r r o n i " stepwise extension of and more powerful than " s i n g l e - s t e p " stepwise extension of " b o n f e r r o n i " under restricted combination condition extension of "Shaffer" that incorporates correlations

Multiple comparison procedures available with the summary method from the multcomp package. In addition, global F and x tests are available as direct arguments to test. 2

MULTCOMP

THE

Signif. (Adjusted

PACKAGE

codes: 0 p values

'***' 0.001 '**' 0.01 reported â&#x20AC;&#x201D; bonferroni

'*'

0.05 '. method)

0.1

' 1

N o t e t h a t a l l m e t h o d s f r o m the p . a d j u s t f u n c t i o n are based o n u n a d j u s t e d p-values. M o r e powerful methods are available by t a k i n g t h e correlations b e tween t h e test statistics i n t o account, as described i n Sections 3.1 a n d 3.2. I n m u l t c o m p , four a d d i t i o n a l options for t h e t y p e a r g u m e n t t o a d j u s t e d are i m p l e m e n t e d t o accomplish this. T h e t y p e = " s i n g l e - s t e p " o p t i o n specifies a single-step test (Section 2.1.2), w h i c h incorporates t h e correlations between t h e test statistics using either t h e m u l t i v a r i a t e n o r m a l or t d i s t r i b u t i o n i m p l e m e n t e d i n t h e m v t n o r m package (Genz, B r e t z , a n d H o t h o r n 2010). C o n s e quently, c a l l i n g R> s u m m a r y ( w a r p b r e a k s . m c , Simultaneous Multiple

Fit:

aov(formula

Linear M H H -

Comparisons

L == L == M ==

Signif. (Adjusted

codes: 0 p values

Tests of

= breaks

Hypotheses: Estimate 0 -10.00 0 -14.72 0 -4.72

test = adjusted(type

Std.

for

Means:

General Tukey

~ tension,

Error 3.96 3.96 3.96

"single-step"))

Linear

Hypotheses

Contrasts

data

value -2.53 -3.72 -1.19

'***' 0.001 '**' 0.01 reported â&#x20AC;&#x201D; single-step

warpbreaks)

Pr(>\t\) 0.0385 0.0014 0.4631 '*'

* **

0.05 '.' method)

0.1

' 1

results i n t h e same adjusted p-values a n d test decisions as for t h e T u k e y test considered previously. T h e t y p e = " f r e e " o p t i o n leads t o a step-down test procedure under t h e free c o m b i n a t i o n c o n d i t i o n (Section 2.1.2), w h i c h i n c o r porates correlations a n d is t h u s more powerful t h a n t h e H o l m procedure; see Section 4.1.2 for a d e t a i l e d description of i t s usage. U n d e r t h e r e s t r i c t e d c o m b i n a t i o n c o n d i t i o n , t h e t y p e = " S h a f f e r " o p t i o n performs t h e S2 procedure from. Shaffer (1986), w h i c h uses B o n f e r r o n i tests for each intersection h y p o t h esis of t h e u n d e r l y i n g closed test procedure (Section 2.3.2). W h e n t h e tests satisfy t h e free c o m b i n a t i o n s c o n d i t i o n instead, Shaffer's procedure reduces t o t h e o r d i n a r y H o l m procedure. F i n a l l y , t y p e = " W e s t f a l l " is yet m o r e p o w e r f u l t h a n Shaffer's procedure. T h e increase i n power comes f r o m using specific dependence i n f o r m a t i o n r a t h e r t h a n t h e conservative B o n f e r r o n i i n e q u a l i t y (Westfall 1997; W e s t f a l l a n d Tobias 2007); see Section 4.2.2 for a detailed discussion.

MULTIPLE

3.3.3

The conf int

COMPARISONS IN P A R A M E T R I C

MODELS

method

So far we have focused o n the capabilities of m u l t c o m p to provide adjusted p-values for a variety of m u l t i p l e comparison procedures. I n this section we describe the available f u n c t i o n a l i t y to compute a n d p l o t simultaneous c o n f i dence intervals using the c o n f i n t m e t h o d . As mentioned i n Section 2.1.2, s i multaneous confidence intervals are available i n closed f o r m for m a n y s t a n d a r d single-step procedures, b u t t h e y are usually more d i f f i c u l t t o derive for s t e p wise procedures. Consequently, the c o n f i n t m e t h o d implements simultaneous confidence intervals for single-step tests i n the p a r a m e t r i c m o d e l framework f r o m Sections 3.1 a n d 3.2. Note t h a t the c o n f i n t m e t h o d is o n l y available for g l h t objects. Simultaneous confidence intervals for the B o n f e r r o n i test, for example, are n o t d i r e c t l y available b u t can be c o m p u t e d by specifying the c a l p h a argument; see f u r t h e r below. Consider again the w a r p b r e a k s example analyzed previously i n t h i s c h a p ter. We can calculate simultaneous confidence intervals for the T u k e y test by a p p l y i n g the c o n f i n t m e t h o d t o t h e w a r p b r e a k s .mc object f r o m t h e g l h t function: R> w a r p b r e a k s . c i <R> w a r p b r e a k s . c i

confint(warpbreaks.mc, l e v e l

S i m u l t a n e o u s Confidence Multiple

Fit:

Comparisons

aov(formula

of Means:

= breaks

Hypotheses: Estimate M - L == 0 -10.000 H — L == 0 -14.722 H - M == 0 -4.722

0.95)

Intervals Tukey

~ tension,

Quantile = 2. 41 95% family-wise confidence

Contrasts

data

warpbreaks)

level

Linear

lwr -19.559 -24.281 -14.281

upr -0.441 -5.164 4.836

We conclude from the o u t p u t t h a t the upper confidence bounds for t h e pairwise differences fJLj^ — &L a n d \IH — }J>L are negative, i n d i c a t i n g t h a t b o t h tension levels M ( m e d i u m ) a n d H (high) significantly reduce t h e n u m b e r of breaks as compared t o tension level L (low). Moreover, we cannot conclude at the 9 5 % confidence level t h a t the groups M a n d H differ significantly. I n a d d i t i o n , we can display t h e confidence intervals graphically using the associated p l o t method R> p l o t ( w a r p b r e a k s . c i , m a i n + xlab = "Breaks")

"" , y l i m

= c(0.5, 3.5),

THE

MULTCOMP

PACKAGE

see F i g u r e 3.2 for t h e r e s u l t i n g p l o t . A n i m p r o v e d greaphical 'display of the confidence intervals is available w i t h t h e p l o t .matchMMC c o m m a n d f r o m the H H package (Heiberger 2009); see Section 4.2.1 for an example of i t s use.

M-L

H-L

H-M

(

-25

F i g u r e 3.2

Two-sided simultaneous confidence intervals for the Tukey test i n the warpbreaks example.

N o t e t h a t u n a d j u s t e d ( m a r g i n a l ) confidence intervals can be c o m p u t e d b y specifying c a l p h a = u n i v a r i a t e _ c a l p h a ( ) t o c o n f i n t . A l t e r n a t i v e l y , t h e c r i t i c a l value can be d i r e c t l y specified as a scalar t o c a l p h a . I n t h e p r e v i ous example, t h e 9 5 % c r i t i c a l value uo.95 = 2.4142 was calculated f r o m t h e m u l t i v a r i a t e t d i s t r i b u t i o n using t h e m v t n o r m package. I f instead R> ebon <- q t ( 1 - 0 . 0 5 / 6 , R> ebon [1J

51)

2.48

was specified t o c a l p h a , one w o u l d o b t a i n t h e two-sided 9 5 % simultaneous confidence intervals for t h e B o n f e r r o n i test w i t h

MULTIPLE COMPARISONS IN P A R A M E T R I C

R> c o n f i n t ( w a r p b r e a k s . m c , Simultaneous Multiple

Fit:

Comparisons

aov(formula

MODELS

c a l p h a = ebon)

Confidence of Means:

= breaks

Intervals Tukey

~ tension,

Contrasts

data

warpbreaks)

Quantile =2.48 95% confidence level

Linear

Hypotheses: Estimate M - L == 0 -10.000 H - L == 0 -14.122 H - M == 0 -4.122

Iwr -19.804 -24.526 -14.526

upr -0.196 -4.919 5.081

I n case of a l l pairwise comparisons among several t r e a t m e n t s , the m u l t c o m p package also provides the o p t i o n of p l o t t i n g the results using a compact letter display (Piepho 2004). W i t h this display, t r e a t m e n t s t h a t are n o t s i g nificantly different are assigned a c o m m o n letter. I n other words, significantly different treatments have no letters i n common. T h i s t y p e of graphical display has advantages when a large number of treatments is being compared w i t h each other, as i t summarizes the test results more efficiently t h a n a simple collection of simultaneous confidence intervals. Using m u l t c o m p , the e l d f u n c t i o n extracts the necessary i n f o r m a t i o n f r o m g l h t , summary, g l h t or c o n f i n t . g l h t objects t o create a compact letter d i s play of a l l pairwise comparisons. I n case of c o n f i n t . g l h t objects, a pairwise comparison is reported significant i f the associated simultaneous confidence i n t e r v a l does not c o n t a i n 0. Otherwise, the associated adjusted p-value is c o m pared w i t h the given significance level a. For the w a r p b r e a k s example, we can e x t r a c t the necessary i n f o r m a t i o n f r o m the g l h t object by calling R> w a r p b r e a k s . e l d

eld(warpbreaks.mc)

Once this i n f o r m a t i o n has been extracted, we can use a p l o t m e t h o d associated w i t h e l d objects t o create the compact letter display of all pairwise comparisons. I f the fitted model contains any covariates, boxplots for each level are p l o t t e d . Otherwise, different types of plots are used, depending on the class of the response variable and the e l d object; see t h e online d o c u m e n t a t i o n for further details ( H o t h o r n et a l . 2010a). Figure 3.3 reproduces t h e boxplots for each of the three tension levels from Figure 3.1 together w i t h t h e letter display when using t h e c o m m a n d R>

plot(warpbreaks.eld)

Because tension level L has no letter i n c o m m o n w i t h any other tension level, i t is significantly different at the chosen significance level (a = 0.05 by default).

THE

MULTCOMP

PACKAGE

O CD

O lO CO

00 O CO

O CM

F i g u r e 3.3

Compact letter display for all pairwise comparisons i n the warpbreaks example.

F u r t h e r m o r e , t h e groups M a n d H do n o t differ significantly, as t h e y share a c o m m o n letter. These conclusions are i n line w i t h t h e previous results o b t a i n e d i n F i g u r e 3.2.

CHAPTER 4

Applications I n t h i s c h a p t e r we use s e v e r a l a p p l i c a t i o n s to i l l u s t r a t e t h e m u l t i p l e h y p o t h e ses t e s t i n g f r a m e w o r k developed i n C h a p t e r 3 for g e n e r a l p a r a m e t r i c m o d e l s . T h e c o m b i n a t i o n of t h e m e t h o d s from C h a p t e r 3 w i t h t h e c l o s u r e p r i n c i p l e d e s c r i b e d i n S e c t i o n 2.2.3 leads to powerful s t e p w i s e p r o c e d u r e s , w h i c h form the core m e t h o d o l o g y for a n a l y z i n g the e x a m p l e s i n t h i s c h a p t e r . A t t h e s a m e t i m e we i l l u s t r a t e t h e c a p a b i l i t i e s of t h e m u l t c o m p package i n R, w h i c h provides a n efficient i m p l e m e n t a t i o n of these m e t h o d s . I n S e c t i o n 4.1 w e a p p r o a c h t h e c o m m o n p r o b l e m of c o m p a r i n g s e v e r a l g r o u p s w i t h a c o n t r o l a n d d e s c r i b e t h e D u n n e t t test. I n S e c t i o n 4.2 w e c o n s i d e r t h e e q u a l l y i m p o r t a n t p r o b l e m of a l l p a i r w i s e c o m p a r i s o n s for s e v e r a l groups a n d d e s c r i b e t h e T u k e y test. S e c t i o n s 4.1 a n d 4.2 b o t h focus o n i m p o r t a n t applications a n d accordingly we discuss the illustrating examples i n detail. I n a d d i t i o n , we d e s c r i b e stepwise extensions, w h i c h l e a d t o m o r e p o w e r f u l p r o cedures t h a n the original D u n n e t t a n d T u k e y tests. T h e developed methods are v e r y g e n e r a l , h o l d for a n y t y p e of c o m p a r i s o n w i t h i n t h e f r a m e w o r k from C h a p t e r 3, a n d a r e a v a i l a b l e i n m u l t c o m p . I n t h e r e m a i n d e r of t h i s c h a p t e r we i l l u s t r a t e t h e c o n d u c t of m u l t i p l e c o m p a r i s o n p r o c e d u r e s for t h e general l i n e a r a n d p a r a m e t r i c m o d e l s f r o m C h a p ter 3, i n c l u d i n g a n u m b e r of a p p l i c a t i o n s b e y o n d t h e c o m m o n A N O V A or regression s e t t i n g , as d i s c u s s e d i n H o t h o r n et a l . ( 2 0 0 8 ) . I n S e c t i o n 4.3 we consider a n A N C O V A e x a m p l e of a dose r e s p o n s e s t u d y w i t h two c o v a r i a t e s . H e r e , we focus o n t h e e v a l u a t i o n of n o n - p a i r w i s e c o n t r a s t tests as o p p o s e d to t h e p r e c e e d i n g t w o sections, w h e r e we were o n l y i n t e r e s t e d i n p a i r w i s e c o m p a r i s o n s . I n S e c t i o n 4.4 we use a b o d y fat p r e d i c t i o n e x a m p l e to i l l u s t r a t e the a p p l i c a t i o n of m u l t i p l e c o m p a r i s o n p r o c e d u r e s to v a r i a b l e s e l e c t i o n i n l i n e a r regression m o d e l s . I n S e c t i o n 4.5 we consider t h e c o m p a r i s o n of t w o l i n e a r r e gression m o d e l s u s i n g s i m u l t a n e o u s confidence b a n d s . I n S e c t i o n 4.6 we look a t a l l p a i r w i s e c o m p a r i s o n s of e x p r e s s i o n levels for v a r i o u s g e n e t i c c o n d i t i o n s of a l c o h o l i s m i n a h e t e r o s c e d a s t i c o n e - w a y A N O V A m o d e l u s i n g s a n d w i c h e s t i m a t o r s . I n S e c t i o n 4.7 we use logistic regression t o e s t i m a t e t h e p r o b a b i l i t y of suffering f r o m A l z h e i m e r ' s disease. I n S e c t i o n 4.8 w e c o m p a r e s e v e r a l r i s k factors for s u r v i v a l of l e u k e m i a p a t i e n t s u s i n g a W e i b u l l m o d e l . F i n a l l y , i n S e c t i o n 4.9 we o b t a i n p r o b a b i l i t y e s t i m a t e s of deer b r o w s i n g for v a r i o u s tree species from mixed-effects m o d e l s . T h e e x a m p l e s i n t h i s c h a p t e r are s e l f - c o n t a i n e d . R e a d e r s , w h o o n l y g l a n c e d t h r o u g h C h a p t e r s 2 a n d C h a p t e r 3, s h o u l d b e a b l e to follow t h e e x a m p l e s a n d apply the techniques (and, in particular, the relevant function calls using 69

APPLICATIONS

Blanket

F i g u r e 4.1

B o x p l o t s of the r e c o v e r y d a t a .

m u l t c o m p ) t o t h e i r o w n p r o b l e m s . T h e o r e t i c a l r e s u l t s are l i n k e d b a c k t o p r e v i o u s c h a p t e r s , w h e r e necessary. F o r d e t a i l s o n t h e m u l t c o m p p a c k a g e w e refer t o S e c t i o n 3.3.

4.1

Multiple comparisons with a control

I n t h i s s e c t i o n we c o n s i d e r t h e p r o b l e m o f c o m p a r i n g s e v e r a l g r o u p s w i t h a c o m m o n c o n t r o l g r o u p i n an unbalanced one-way l a y o u t . I n Section 4.1.1 w e i n t r o d u c e t h e w e l l - k n o w n D u n n e t t t e s t , w h i c h is t h e s t a n d a r d m e t h o d i n t h i s s i t u a t i o n . I n S e c t i o n 4.1.2 w e t h e n c o n s i d e r a s t e p w i s e e x t e n s i o n o f t h e D u n n e t t t e s t b a s e d o n t h e c l o s u r e p r i n c i p l e , w h i c h is m o r e p o w e r f u l t h a n t h e original D u n n e t t test.

MULTIPLE COMPARISONS WITH A CONTROL 4-1.1

Dunnett

test

T o i l l u s t r a t e t h e Dunnett test w e consider t h e r e c o v e r y d a t a f r o m W e s t fall et a l . ( 1 9 9 9 ) . A c o m p a n y developed s p e c i a l i z e d h e a t i n g b l a n k e t s designed to h e l p t h e b o d y h e a t following a s u r g i c a l p r o c e d u r e . F o u r t y p e s of b l a n k e t s fro &i>&2> a n d &3 were t e s t e d o n s u r g i c a l p a t i e n t s to assess r e c o v e r y t i m e s . T h e b l a n k e t bo w a s a s t a n d a r d b l a n k e t a l r e a d y i n u s e a t v a r i o u s h o s p i t a l s . T h e p r i m a r y o u t c o m e of interest w a s r e c o v e r y t i m e i n m i n u t e s o f p a t i e n t s a l l o c a t e d r a n d o m l y t o one of t h e four t r e a t m e n t s . L o w e r r e c o v e r y t i m e s w o u l d i n d i c a t e a b e t t e r t r e a t m e n t effect. 5

T h e r e c o v e r y d a t a s e t is a v a i l a b l e from t h e m u l t c o m p

R> d a t a ( " r e c o v e r y " , p a c k a g e R> s u m m a r y ( r e c o v e r y )

blanket b0:20 bl: 3 b2: 3 b3:15

package,

"multcomp")

minutes Min. : 5.0 1st Qu. :12.0 Median :13.0 Mean :13.5 3rd Qu.:16.0 Max. :19.0

A s seen f r o m t h e s u m m a r y above, t h e group s a m p l e sizes differ c o n s i d e r a b l y : 20 p a t i e n t s r e c e i v e d b l a n k e t bo, 3 p a t i e n t s r e c e i v e d b l a n k e t &i, a n o t h e r 3 p a tients r e c e i v e d b l a n k e t 6 , a n d 15 p a t i e n t s r e c e i v e d b l a n k e t 63. F i g u r e 4.1 d i s p l a y s t h e b o x p l o t s for t h e r e c o v e r y d a t a . W e c o n c l u d e f r o m t h e b o x p l o t s t h a t t h e o b s e r v a t i o n s a r e a p p r o x i m a t e l y n o r m a l l y d i s t r i b u t e d w i t h e q u a l group v a r i a n c e s . B l a n k e t 62 seems to r e d u c e t h e m e a n r e c o v e r y t i m e a s c o m p a r e d to t h e s t a n d a r d b l a n k e t bo, b u t w e w a n t to m a k e t h i s c l a i m w h i l e c o n t r o l l i n g t h e familywise error rate w i t h a suitable multiple comparison procedure. 2

T o analyze the d a t a more formally we assume the unbalanced one-way l a y out

Vij =-y + /J>i + £ij

(4.1)

w i t h i n d e p e n d e n t , h o m o s c e d a s t i c a n d n o r m a l l y d i s t r i b u t e d r e s i d u a l errors £ij ~ N(0, t r ) . I n E q u a t i o n ( 4 . 1 ) , yij denotes t h e j ' - t h o b s e r v a t i o n i n t r e a t m e n t g r o u p i, j — 1 , . . . , r i i , w h e r e denotes t h e s a m p l e size of g r o u p z, 7 denotes t h e i n t e r c e p t (i.e., t h e o v e r a l l m e a n ) , a n d fjii denotes t h e m e a n effect of t r e a t m e n t g r o u p i = 0 , . . . , 3. N o t e t h a t m o d e l ( 4 . 1 ) i s a s p e c i a l c a s e of t h e 2

APPLICATIONS

72 general l i n e a r m o d e l ( 3 . 1 ) , w h e r e /

15 13 12 13

(

1 1 0 1 1 0

0 0

0 \ 0

1 1 0 0 1 0 1 0

0 0

( and

1 Mo

J3 = Mi

9 14

V 13

1 0 1 0

0 0

1 0 0 1

T h e n a t u r a l q u e s t i o n of t h i s s t u d y is w h e t h e r a n y of t h e b l a n k e t t y p e s 61,62, or 63 significantly r e d u c e s t h e recovery t i m e c o m p a r e d w i t h 60. T h i s is t h e c l a s s i c a l many-to-one p r o b l e m of c o m p a r i n g s e v e r a l t r e a t m e n t s w i t h a c o n t r o l . T h u s , we a r e i n t e r e s t e d i n t e s t i n g t h e t h r e e o n e - s i d e d n u l l h y p o t h e s e s Hi:

fj, < flu 0

i =

1,2,3.

T h e n u l l h y p o t h e s i s Hi therefore i n d i c a t e s t h a t t h e m e a n recovery t i m e for b l a n k e t 60 is lower t h a n it is for b l a n k e t 6*. A c c o r d i n g l y , t h e a l t e r n a t i v e h y potheses a r e g i v e n b y Ki \ fjL > Mi> 0

i =

1,2,3.

R e j e c t i n g a n y of t h e t h r e e n u l l h y p o t h e s e s Hi t h u s e n s u r e s t h a t a t least one of t h e n e w b l a n k e t s is b e t t e r t h a n t h e s t a n d a r d 60 a t a g i v e n confidence level 1 — ce, if s u i t a b l e m u l t i p l e c o m p a r i s o n p r o c e d u r e s a r e e m p l o y e d . T h e s t a n d a r d m u l t i p l e c o m p a r i s o n p r o c e d u r e to a d d r e s s the m a n y - t o - o n e p r o b l e m is t h e D u n n e t t (1955) test. I n essence, t h e o n e - s i d e d D u n n e t t test t a k e s t h e m i n i m u m (or t h e m a x i m u m , d e p e n d i n g o n t h e sideness of t h e test p r o b l e m ) of t h e m , say, p a i r w i s e t tests U

V% -2/0

,m,

(4.2)

w h e r e pi = J ^ ^ i i 3 / * j / * denotes t h e a r i t h m e t i c m e a n of group i — 0 , . . . , m , a n d s = YliLo Y^Liilfij ~~ Vi) l denotes t h e p o o l e d v a r i a n c e e s t i m a t e w i t h v = Y^iLo i ~~ ( + 1) degrees of freedom. N o t e t h a t these a r e t h e s t a n d a r d expressions for o n e - w a y A N O V A m o d e l s , w h i c h a r e s p e c i a l cases of t h e m o r e g e n e r a l e x p r e s s i o n s (3.3) a n d ( 3 . 4 ) . I n t h e r e c o v e r y d a t a e x a m p l e , we h a v e m = 3 t r e a t m e n t - c o n t r o l c o m p a r i s o n s , l e a d i n g to t h r e e test s t a t i s t i c s of t h e form (4.2). n

2 v

A s i m m e d i a t e l y seen f r o m e x p r e s s i o n ( 4 . 2 ) , e a c h t e s t s t a t i s t i c U is u n i v a r i a t e t d i s t r i b u t e d . T h e vector of test s t a t i s t i c s t = ( £ i , . . . , t ) follows a n m - v a r i a t e t d i s t r i b u t i o n w i t h v degrees of freedom a n d c o r r e l a t i o n m a t r i x R = (pij)ij, m

MULTIPLE COMPARISONS WITH A w h e r e for i ^

CONTROL

I n~i I rTJ \ — T \ T — > i,j = l,...,m. (4.3) V rii + no y njr + n I n t h e b a l a n c e d c a s e , no = ri\ = . . . = n and the correlations are constant, Pij — 0.5 for a l l i ^ j . A s d i s c u s s e d i n C h a p t e r 3, e i t h e r m u l t i d i m e n s i o n a l i n t e g r a t i o n r o u t i n e s , s u c h as those from G e n z a n d B r e t z ( 2 0 0 9 ) , or u s e r - f r i e n d l y interfaces o n t o p of s u c h r o u t i n e s , s u c h as t h e m u l t c o m p p a c k a g e i n R, c a n be u s e d to c a l c u l a t e a d j u s t e d p-values or c r i t i c a l v a l u e s . Pv =

I n t h e following w e s h o w h o w to a n a l y z e the r e c o v e r y d a t a w i t h m u l t c o m p . W e s t a r t f i t t i n g a n A N O V A m o d e l by c a l l i n g t h e a o v f u n c t i o n R> r e c o v e r y . a o v

aov(minutes

~ blanket,

data =

recovery)

T h e g l h t f u n c t i o n f r o m m u l t c o m p t a k e s t h e fitted r e s p o n s e m o d e l to p e r f o r m the multiple comparisons through R> l i b r a r y ( " m u l t c o m p " ) R> r e c o v e r y . m c < - g l h t ( r e c o v e r y . a o v , + l i n f c t = mcp(blanket = + alternative = "less")

"Dunnett"),

I n t h e p r e v i o u s c a l l , we u s e d t h e mcp f u n c t i o n for t h e l i n f c t a r g u m e n t to specify t h e c o m p a r i s o n s t y p e (i.e., t h e c o n t r a s t m a t r i x ) w e a r e i n t e r e s t e d i n . T h e s y n t a x is a l m o s t self-descriptive: S p e c i f y t h e factor of r e l e v a n c e ( b l a n k e t i n our e x a m p l e ) a n d select one of s e v e r a l p r e - d e f i n e d c o n t r a s t m a t r i c e s ; see t h e a p p l i c a t i o n s d i s c u s s e d i n t h e s u b s e q u e n t s e c t i o n s of t h i s c h a p t e r for o t h e r e x a m p l e s of p r e - d e f i n e d c o m p a r i s o n t y p e s . B e c a u s e we h a v e a o n e - s i d e d test p r o b l e m a n d w e a r e i n t e r e s t e d i n s h o w i n g a r e d u c t i o n i n r e c o v e r y t i m e , we h a v e to p a s s t h e a l t e r n a t i v e = " l e s s " a r g u m e n t to g l h t . W e o b t a i n a d e t a i l e d s u m m a r y of t h e r e s u l t s b y u s i n g t h e summary m e t h o d associated w i t h the g l h t function, R>

summary(recovery.mc) Simultaneous

Multiple

Fit:

bl b2 b3

Comparisons

aov(formula

Linear - bO - bO - bO

Signif. (Adjusted

Tests of

codes: 0 p values

Means:

= minutes

Hypotheses: Estimate >= 0 -2.133 >= 0 -7.467 >= 0 -1.667

for

Std.

General Dunnett

~ blanket,

Error 1.604 1.604 0.885

'***' 0.001 reported —

Linear

Hypotheses

Contrasts

data

t value -1.33 -4.66 -1.88

'**' 0.01 single-step

recovery)

Pr(<t) 0.241 <0.001 *** 0.092 . '*'

0.05 '.' method)

0.1

APPLICATIONS

T h e m a i n p a r t of t h e o u t p u t consists of a table w i t h t h r e e r o w s , o n e for e a c h of t h e m — 3 hypotheses. F r o m left to right we h a v e a s h o r t d e s c r i p t o r of t h e c o m p a r i s o n s , the effect e s t i m a t e s w i t h a s s o c i a t e d s t a n d a r d e r r o r s a n d t h e test s t a t i s t i c s , a s defined i n E q u a t i o n ( 4 . 2 ) . M u l t i p l i c i t y a d j u s t e d p-values a r e r e p o r t e d i n t h e last c o l u m n . B y default, these p-values a r e c a l c u l a t e d from t h e u n d e r l y i n g m u l t i v a r i a t e t d i s t r i b u t i o n ( t h u s a c c o u n t i n g for t h e c o r r e l a t i o n s b e t w e e n t h e test s t a t i s t i c s ) a n d c a n be c o m p a r e d d i r e c t l y w i t h t h e pre-specified significance level a. F o r t h e r e c o v e r y e x a m p l e we c o n c l u d e a t level a = 0.05 t h a t b l a n k e t 62 leads to significantly lower recovery t i m e s a s c o m p a r e d w i t h the s t a n d a r d b l a n k e t boF o r t h e p u r p o s e of i l l u s t r a t i o n , we c o m p a r e t h e D u n n e t t test w i t h t h e s t a n d a r d B o n f e r r o n i a p p r o a c h . R e c a l l t h a t t h e B o n f e r r o n i a p p r o a c h does n o t a c c o u n t for t h e c o r r e l a t i o n s between t h e test s t a t i s t i c s ( S e c t i o n 2.3.1) a n d is t h u s less powerful t h a n t h e D u n n e t t test. W i t h t h e m u l t c o m p package we c a n apply the Bonferroni approach by using the a d j u s t e d option from the summary m e t h o d , R> s u m m a r y ( r e c o v e r y . m c , Simultaneous Multiple

Fit: Linear bl b2 b3

- bO - bO - bO

Signif. (Adjusted

Tests

Comparisons

aov(formula

test

for

of Means:

= minutes

Hypotheses: Estimate >= 0 -2.133 >= 0 -1.467 >= 0 -1.667 codes: 0 p values

= adjusted(type

Std.

General

Error 1.604 1.604 0.885

'***' 0.001 reported —

"bonferroni"))

Linear

Dunnett

~ blanket,

Hypotheses

Contrasts

data

t value -1.33 -4.66 -1.88

'**' 0.01 bonferroni

recovery)

Pr(<t) 0.29 6.1e-05 0.10

***

'*' 0.05 '.' 0.1 method)

' ' 1

A s e x p e c t e d , t h e a d j u s t e d p - v a l u e s for t h e D u n n e t t test a r e u n i f o r m l y s m a l l e r t h a n t h o s e from t h e B o n f e r r o n i a d j u s t m e n t . T h e differences i n p-values a r e not t h a t large i n t h i s e x a m p l e b e c a u s e of t h e r e l a t i v e l y s m a l l c o r r e l a t i o n s P12 = 0.13, p = 0.236, a n d p = 0.236; see E q u a t i o n ( 4 . 3 ) . 13

N e x t , w e c o m p u t e 9 5 % o n e - s i d e d s i m u l t a n e o u s confidence i n t e r v a l s for t h e m e a n effect differences Hi — / i o , i = 1 , 2 , 3 . I tfollows f r o m S e c t i o n 3.1.1 t h a t for t h e D u n n e t t test these a r e g i v e n b y

(

— 0 0 ; yi — 2/o + ui-<x s

w h e r e ui-^ denotes t h e (1 — a ) - q u a n t i l e from t h e m u l t i v a r i a t e t d i s t r i b u t i o n w i t h the correlations (4.3). Alternatively, we c a n use t h e c o n f i n t m e t h o d associated w i t h the g l h t function,

MULTIPLE COMPARISONS WITH A CONTROL R> r e c o v e r y . c i < R> r e c o v e r y . c i

confint(recovery.mc,

Simultaneous Multiple

Fit:

Confidence

Comparisons

aov(formula

of Means:

= minutes

Quantile = 2.18 95% family-wise confidence

Linear bl b2 b3

- bO - bO - bO

Hypotheses: Estimate >= 0 -2.133 >= 0 -7.467 >= 0 -1.667

level

75 =

0.95)

Intervals Dunnett

~ blanket,

Contrasts

data

recovery)

level

Iwr upr - I n f 1.367 - I n f -3.966 - I n f 0.265

W e c o n c l u d e f r o m t h e o u t p u t t h a t o n l y t h e u p p e r confidence b o u n d for \x — fio is n e g a t i v e , reflecting t h e p r e v i o u s test d e c i s i o n t h a t b l a n k e t b leads to a significant r e d u c t i o n i n r e c o v e r y t i m e c o m p a r e d w i t h t h e s t a n d a r d b l a n k e t 60 • M o r e o v e r , w e c o n c l u d e a t t h e d e s i g n a t e d confidence level of 95% t h a t t h e m e a n r e c o v e r y t i m e for b is a t least 4 m i n u t e s s h o r t e r t h a n for bo. I n a d d i t i o n , w e c a n d i s p l a y t h e confidence i n t e r v a l s g r a p h i c a l l y w i t h 2

R> p l o t ( r e c o v e r y . c i , m a i n = + xlab = "Minutes")

ylim = c(0.5,

3.5),

see F i g u r e 4.2 for t h e r e s u l t i n g plot. A s e x p l a i n e d i n S e c t i o n 3.3.1, t h e r e a r e s e v e r a l w a y s t o specify t h e c o m p a r i s o n s of i n t e r e s t . A convenient w a y is to c a l l p r e - d e f i n e d c o n t r a s t m a t r i c e s , as d o n e a b o v e w h e n s e l e c t i n g t h e m c p ( b l a n k e t = " D u n n e t t " ) o p t i o n for t h e l i n f c t a r g u m e n t . A l t e r n a t i v e l y , we c a n d i r e c t l y specify t h e c o n t r a s t m a t r i x C reflecting t h e c o m p a r i s o n s o f interest. T h e c o n t r a s t m a t r i x a s s o c i a t e d w i t h t h e m a n y - t o - o n e c o m p a r i s o n s for t h e levels of t h e factor b l a n k e t is g i v e n b y

such that

/ Mo Mo \ •T Mi M2 M3 / V M3

Mi - Mo M2 - Mo M3 - Mo

l e a d s t o t h e p a i r w i s e c o m p a r i s o n s of i n t e r e s t . U s i n g m u l t c o m p , w e first define the contrast m a t r i x manually a n d then call the g l h t function,

APPLICATIONS

b1 - bO

b2 - bO

b3 - bO

~i -6

-4

-2 Minutes

F i g u r e 4.2

One-sided simultaneous confidence intervals for t h e D u n n e t t test i n t h e r e c o v e r y example.

R> c o n t r < - r b i n d O ' b l - b O " = c ( - l , 1 , 0 , 0), + "b2 -bO" = c ( - l , 0, 1 , 0 ) , + "b3 -bO" = c ( - l , 0, 0, 1)) R> s u m m a r y ( g l h t ( r e c o v e r y . a o v , l i n f c t = mcp(blanket + alternative = "less"))

Multiple

Fit:

Tests

Comparisons

of Means:

aov(formula

Linear bl b2 b3

Simultaneous

-bO -bO -bO

Signif. (Adjusted

>= >= >=

- minutes

Hypotheses: Estimate 0 -2.133 0 -7.467 0 -1.667

codes: 0 p values

Std.

f o r General

Error 1.604 1.604 0.885

'***' 0.001 reported â&#x20AC;&#x201D;

Linear

User-defined

~ blanket,

t value -1.33 -4.66 -1.88

contr),

Hypotheses Contrasts

data

recovery)

Pr(<t) 0.241 <0.001 *** 0.093 .

'**' 0.01 '*' single-step

0.05 '.' method)

0.1

' ' 1

A s e x p e c t e d , w e o b t a i n t h e s a m e a d j u s t e d p-values as i n t h e p r e v i o u s c a l l w i t h t h e m c p ( b l a n k e t = " D u n n e t t " ) o p t i o n for t h e l i n f c t a r g u m e n t . T h e a d v a n t a g e o f d e f i n i n g t h e c o n t r a s t m a t r i x m a n u a l l y is t h e r e t a i n e d

MULTIPLE COMPARISONS WITH A

CONTROL

flexibility to p e r f o r m m u l t i p l e c o m p a r i s o n s w h i c h a r e less often a p p l i e d i n p r a c t i c e a n d t h u s not pre-defined i n m u l t c o m p . T o i l l u s t r a t e t h i s a d v a n t a g e , consider t h e following m o d i f i c a t i o n of t h e r e c o v e r y e x a m p l e . S u p p o s e t h a t b o t h b l a n k e t s bo a n d 61 are s t a n d a r d t r e a t m e n t s . A s s u m e f u r t h e r t h a t we are i n t e r e s t e d i n c o m p a r i n g t h e two n e w b l a n k e t s 62 a n d 63 w i t h b o t h 60 a n d 61. T h e o r i g i n a l D u n n e t t test is not a p p l i c a b l e , b e c a u s e i t a s s u m e s only a single c o n t r o l t r e a t m e n t . C o m p a r i n g s e v e r a l t r e a t m e n t s w i t h m o r e t h a n one c o n t r o l g r o u p w a s i n v e s t i g a t e d by S o l o r z a n o a n d S p u r r i e r ( 1 9 9 9 ) a n d S p u r r i e r a n d S o l o r z a n o ( 2 0 0 4 ) . W i t h m u l t c o m p w e c a n s i m p l y specify t h e c o m p a r i s o n t y p e ourselves a n d r u n t h e g l h t f u n c t i o n as d o n e a b o v e for t h e D u n n e t t test. T o i l l u s t r a t e t h i s , a s s u m e t h a t we w a n t to c o m p a r e e a c h of t h e n e w b l a n k e t s 62 a n d 63 w i t h b o t h 60 a n d 61, r e s u l t i n g i n a t o t a l of m = 4 c o m p a r i s o n s . Accordingly, R> c o n t r 2 + + +

rbind("b2 "b2 "b3 "b3

-bO" -bl" -bO" -bl"

= = = =

c( c( c( c(

-l, 0, -l, 0,

0, -1, 0, -1,

1, 1, 0, 0,

0), 0), 1), 1))

defines t h e c o n t r a s t m a t r i x of interest a n d w e p a s s c o n t r 2 to mcp, R> s u m m a r y ( g l h t ( r e c o v e r y . a o v , l i n f c t = m c p ( b l a n k e t + alternative = "less")) Simultaneous Multiple

Fit:

Comparisons

aov(formula

Linear b2 b2 b3 b3

-bO -bl -bO -bl

Signif. (Adjusted

Tests

>= >= >= >=

Means:

= minutes

Hypotheses: Estimate 0 -7.467 0 -5.333 0 -1.667 0 0.4 67

codes: 0 p values

for

Std.

General User-defined

~ blanket,

Error 1.604 2.115 0.885 1.638

'***' 0.001 reported â&#x20AC;&#x201D;

Linear

'**' 0.01 single-step

contr2),

Hypotheses Contrasts

data

t value -4.66 -2.52 -1.88 0.28

recovery)

Pr(<t) <0.001 *** 0.027 * 0.105 0.915 '*'

0.05 '.' method)

0.1

' ' 1

T h e o u t p u t gives t h e c o r r e c t r e s u l t s for t h i s m u l t i p l e c o m p a r i s o n p r o b l e m a n d we c o n c l u d e t h a t b l a n k e t 62 is b e t t e r t h a n b o t h 60 a n d b\.

4-1.2

Step-down

Dunnett

test

procedure

I n S e c t i o n 4.1.1 w e c o n s i d e r e d t h e o r i g i n a l D u n n e t t t e s t , w h i c h is a s i n g l e - s t e p t e s t . A s e x p l a i n e d i n S e c t i o n 2.1.2, stepwise e x t e n s i o n s of s i n g l e - s t e p m u l t i p l e c o m p a r i s o n p r o c e d u r e s a r e often a v a i l a b l e a n d l e a d to m o r e p o w e r f u l m e t h o d s

APPLICATIONS

i n t h e sense t h a t t h e y reject a t least as m a n y h y p o t h e s e s as t h e i r s i n g l e step c o u n t e r p a r t s . S t e p - d o w n extensions of t h e o r i g i n a l D u n n e t t t e s t w e r e i n v e s t i g a t e d b y N a i k ( 1 9 7 5 ) ; M a r c u s et a l . ( 1 9 7 6 ) ; D u n n e t t a n d T a m h a n e (1991) a n d others. S t e p - u p versions of the D u n n e t t test are also a v a i l a b l e ( D u n n e t t a n d T a m h a n e 1991) b u t w i l l not be c o n s i d e r e d here. I n t h e following, we a p p l y t h e closure p r i n c i p l e d e s c r i b e d i n S e c t i o n 2.2.3 to t h e m a n y - t o - o n e c o m p a r i s o n p r o b l e m , give a general s t e p - d o w n t e s t i n g a l g o r i t h m b a s e d o n m a x - t tests ( w h i c h i n c l u d e s t h e D u n n e t t test as a s p e c i a l case) a n d t h e n c o m e b a c k to t h e r e c o v e r y d a t a e x a m p l e to i l l u s t r a t e t h e s t e p - d o w n D u n n e t t t e s t , w h i c h is also available i n t h e m u l t c o m p package. T o m a k e t h e ideas c o n c r e t e , r e c a l l t h e m = 3 n u l l h y p o t h e s e s Hi: fio < \±i, i = 1 , 2 , 3 , i n t h e r e c o v e r y e x a m p l e from S e c t i o n 4.1.1. F o l l o w i n g t h e closure p r i n c i p l e , we c o n s t r u c t from t h e set ?i = {Hi, H , H } of e l e m e n t a r y n u l l h y p o t h e s e s t h e full closure H — {Hi, H, #3, #12 > #13> # 2 3 , #123} of a l l i n t e r s e c t i o n h y p o t h e s e s Hi = (~) Hi,I C { 1 , 2 , 3 } . T h e closure p r i n c i p l e d e m a n d s t h a t we r e j e c t a n e l e m e n t a r y n u l l h y p o t h e s i s Hi,i = 1 , 2 , 3 , only, if a l l i n t e r s e c t i o n h y p o t h e s e s Hj € H w i t h i € I are r e j e c t e d by t h e i r l o c a l a - l e v e l tests. F o r e x a m p l e , i n order to reject H\, we need to e n s u r e t h a t a l l i n t e r s e c t i o n h y p o t h e s e s c o n t a i n e d i n Hi a r e also r e j e c t e d (i.e., Hi Hi3, a n d i f 123). T h i s i n d u c e s a n a t u r a l t o p - d o w n t e s t i n g order; see also F i g u r e 4.3 for a v i s u a l i z a t i o n of t h e r e s u l t i n g closed test p r o c e d u r e . W e s t a r t w i t h t e s t i n g i f 123. I f w e do not reject #123, none of t h e e l e m e n t a r y hypotheses Hi, H , H$ c a n be r e j e c t e d a n d t h e p r o c e d u r e stops. O t h e r w i s e , we reject Hi a n d continue t e s t i n g one level d o w n , t h a t i s , a l l p a i r w i s e i n t e r s e c t i o n h y p o t h e s e s Hi , His, H $; a n d so on. 2

ieI

W e h a v e not m e n t i o n e d before, w h i c h tests to use for the i n t e r s e c t i o n h y potheses. C o n s i d e r i n m o r e d e t a i l t h e global i n t e r s e c t i o n h y p o t h e s i s

#123 = Hi PI H

PI Hz : Mo < Mi a n d Mo < M2 a n d

(jlq <

^3,

(4.4)

w h i c h s t a t e s t h a t t h e b l a n k e t t y p e s 61, b , a n d 63 a r e a l l inferior to t h e s t a n d a r d b l a n k e t 60. I f a t least one of t h e n e w b l a n k e t s is s u p e r i o r to 60 a n d r e d u c e s t h e r e c o v e r y t i m e , t h e n u l l h y p o t h e s i s #123 w o u l d no longer be t r u e a n d we w o u l d like to r e j e c t i t w i t h h i g h p r o b a b i l i t y . T h i s is t h e u n i o n i n t e r s e c t i o n s e t t i n g d i s c u s s e d i n S e c t i o n 2.2.1. A n a t u r a l a p p r o a c h is to consider t h e m i n i m u m of t h e m s t a n d a r d i z e d p a i r w i s e differences U from E q u a t i o n ( 4 . 2 ) , l e a d i n g to m a x - t tests of t h e form ( 2 . 1 ) ; r e c a l l t h a t we keep t h e c o m m o n t e r m " m a x - t " regardless of t h e sideness of t h e test p r o b l e m . T h u s , we c a n use t h e s i n g l e - s t e p D u n n e t t test ( w h i c h , of c o u r s e , is a m a x - £ t e s t ) for t h e global i n t e r s e c t i o n h y p o t h e s i s #123 • S i m i l a r l y , w e c a n a p p l y t h e D u n n e t t test for a n y of t h e p a i r w i s e i n t e r s e c t i o n h y p o t h e s e s Hi ,Hi , a n d H $ w h i l e only a c c o u n t i n g for t h e p a i r of t r e a t m e n t groups involved i n the r e s p e c t i v e i n t e r s e c t i o n . F i n a l l y , t h e e l e m e n t a r y h y p o t h e s e s Hi,H , a n d H3 are t e s t e d u s i n g t h e t tests from E q u a t i o n ( 4 . 2 ) . T h i s l e a d s to a closed test p r o c e d u r e w h i c h uses D u n n e t t tests for e a c h i n t e r s e c t i o n h y p o t h e s i s ( a d j u s t e d for t h e n u m b e r of t r e a t m e n t s entering the corresponding intersection). 2

MULTIPLE COMPARISONS WITH A CONTROL

A s m e n t i o n e d i n S e c t i o n 2.2.1, however, a p a r t i c u l a r l y a p p e a l i n g p r o p e r t y of m a x - £ tests is t h a t t h e y a d m i t s h o r t c u t s of t h e full c l o s u r e , significantly r e d u c i n g t h e c o m p l e x i t y . I n other w o r d s , i f m n u l l h y p o t h e s e s s a t i s f y i n g t h e free c o m b i n a t i o n c o n d i t i o n ( w h i c h is t r u e for m a n y - t o - o n e c o m p a r i s o n p r o b l e m s ; see S e c t i o n 2.1.2) a r e tested w i t h m a x - £ t e s t s , s h o r t c u t s of size m c a n be a p p l i e d . T h i s avoids t e s t i n g t h e 2 — 1i n t e r s e c t i o n h y p o t h e s e s i n t h e full c l o s u r e . B e l o w w e give a general s t e p - d o w n a l g o r i t h m t o test m h y p o t h e s e s u n d e r t h e free c o m b i n a t i o n c o n d i t i o n , a s s u m i n g t h a t larger v a l u e s of U favour t h e r e j e c t i o n of Hi. F o r t w o - s i d e d a n d l o w e r - s i d e d test p r o b l e m s t h e a r g m a x i ^ i o p e r a t i o n s h a v e t o b e modified accordingly. m

Step-down algorithm based o n max-t tests u n d e r t h e free c o m b i n a t i o n c o n d i t i o n

Step

T e s t t h e g l o b a l i n t e r s e c t i o n h y p o t h e s i s Hj = H t e / i ^ { 1 , . . . , m } , w i t h a s u i t a b l e m a x - t t e s t , r e s u l t i n g i n t h e p-value Pi ; if pj < a , reject Hi w i t h a d j u s t e d p-value q^ = pi a n d p r o c e e d , w h e r e i\ = a r g m a x j g / j U; o t h e r w i s e s t o p . x

Step

L e t I = I\\ {h}T e s t Hj = f] Hi w i t h a s u i t a b l e m a x - t test, r e s u l t i n g i n t h e p-value p / ; i f p / < a , r e j e c t Hi w i t h a d j u s t e d p-value *fe = m a x - j ^ , p / } a n d p r o c e e d , w h e r e i = a r g m a x j j ti\ o t h e r w i s e stop. 2

iel2

Step j : L e t Ij = Ij-i \ {ij-i}. T e s t Hj — Diei a b l e m a x - i test r e s u l t i n g i n t h e p-value pjj; Hij w i t h a d j u s t e d p-value = max{qi _ ,pi } w h e r e ij = a r g m a x i g ^ . ti\ o t h e r w i s e s t o p .

^ ^ suiti f pi < a, r e j e c t a n d proceed, w

Step

t n

m: L e t I = Im-\ \ {im-i} = {^m}- Test H w i t h a t test, r e s u l t i n g i n t h e p-value Pi \ if Pi < a:, r e j e c t Hi with a d j u s t e d p-value qi = max{qi _ , P i } ; the procedure stops i n any case. m

irn

T h e H o l m p r o c e d u r e d e s c r i b e d i n S e c t i o n 2.3.2 is a w e l l - k n o w n e x a m p l e of t h i s a l g o r i t h m , b e c a u s e i t r e p e a t e d l y applies B o n f e r r o n i ' s i n e q u a l i t y i n a t m o s t m steps ( r e c a l l t h a t t h e B o n f e r r o n i m e t h o d is also a m a x - t t e s t ) . H o w e v e r , b e c a u s e t h e B o n n f e r r o n i m e t h o d does n o t a c c o u n t for t h e c o r r e l a t i o n s b e t w e e n t h e test s t a t i s t i c s , t h e H o l m p r o c e d u r e c a n b e i m p r o v e d . B y a p p l y i n g D u n n e t t tests a t e a c h s t e p for t h e m a n y - t o - o n e c o m p a r i s o n p r o b l e m of t h e r e c o v e r y e x a m p l e , w e o b t a i n t h e powerful step-down Dunnett procedure. T o i l l u s t r a t e t h e s t e p - d o w n D u n n e t t t e s t , c o n s i d e r F i g u r e 4.3, w h i c h s h o w s

APPLICATIONS

t h e closed r e p r e s e n t a t i o n of the D u n n e t t tests for t h e r e c o v e r y e x a m p l e . T h e observed one-sided t statistics s h o w n a t the b o t t o m level for t h e comparisons w i t h t h e c o n t r o l are £ j -1.33, t§ = - 4 . 6 5 6 , a n d t% = -1.884, respect i v e l y for t h e b l - bO, b 2 - bO, a n d b 3 - bO c o m p a r i s o n s . A t t h e t o p l e v e l , t h e g l o b a l i n t e r s e c t i o n h y p o t h e s i s (4.4) is Hj i n Step 1 o f t h e s t e p - d o w n a l g o r i t h m , w h e r e I \ = { 1 , 2 , 3 } . W e t e s t H123 u s i n g i j ^ l —4.656, which is h i g h l y s i g n i f i c a n t w i t h p = P ( m h v ( > i , £ > £3} < - 4 . 6 5 6 ) < 0 . 0 0 1 . T h i s a l l o w s o n e t o use a s h o r t c u t f o r t h e r e s t o f t h e t r e e f o r t e s t s t h a t i n c l u d e t h e b 2 - bO c o m p a r i s o n : I t f o l l o w s f r o m P ( m i n { £ i , £ that P(min{£ ,£ } < - 4 . 6 5 6 ) < 0.001, P ( m i n { £ , £ } < - 4 . 6 5 6 ) < 0.001, and P ( £ < — 4 . 6 5 6 ) < 0.001. Hence, a l l intersection hypotheses i n c l u d i n g t h e b 2 - bO c o m p a r i s o n c a n b e r e j e c t e d b y v i r t u e o f r e j e c t i n g t h e g l o b a l n u l l h y p o t h e s i s . T h u s , t h e b 2 - bO c o m p a r i s o n c a n i t s e l f b e d e e m e d s i g n i f i c a n t . T h i s l o g i c e x p l a i n s Step 1 o f t h e s t e p - d o w n a l g o r i t h m a b o v e . T h e r e m a i n i n g s t e p s are e x p l a i n e d s i m i l a r l y . I n t h e n o t a t i o n f r o m t h e s t e p - d o w n algorithm, i = 2 and consequently 7 = I i \ { 2 } = { 1 , 3 } . Consider F i g u r e 4.3 a g a i n . T h e r e m a i n i n g n o n - r e j e c t e d p a i r w i s e i n t e r s e c t i o n h y p o t h e s i s is H-13- Mo < M i a n d [1Q < [13. T h e r e l a t e d t e s t s t a t i s t i c is t° = —1.884 with p — P ( m i n { £ i , £} < — 1 . 8 8 4 ) = 0.064. T h u s , w e c a n n o t r e j e c t i f a t s i g n i f i c a n c e l e v e l a — 0.05, w h i c h r e n d e r s f u r t h e r t e s t i n g u n n e c e s s a r y . T h i s l o g i c e x p l a i n s Step 2 o f t h e a l g o r i t h m , a n d , because w e c a n n o t r e j e c t a t t h i s s t e p , we have t o t e r m i n a t e the s t e p - d o w n procedure. b s

b s

123

I n s u m m a r y , w e c o n c l u d e f r o m t h i s e x a m p l e t h a t s t e p - d o w n p r o c e d u r e s are s p e c i a l cases o f c l o s e d t e s t p r o c e d u r e s a n d t h a t t h e use o f m a x - £ t e s t s a l l o w s for s h o r t c u t s t o t h e f u l l c l o s u r e , w h e r e o n l y t h e s u b s e t s c o r r e s p o n d i n g t o t h e ordered test statistics need t o be tested. N o t e t h a t t h e s t e p - d o w n a l g o r i t h m p r e s e n t e d h e r e is v a l i d u n d e r t h e free c o m b i n a t i o n c o n d i t i o n ; for r e s t r i c t e d h y p o t h e s e s w e r e f e r t o t h e d i s c u s s i o n i n S e c t i o n 4.2.2. We conclude this section by showing how to invoke t h e step-down D u n n e t t test w i t h t h e m u l t c o m p package using t h e r e c o v e r y d a t a . T h e s t e p - d o w n a l g o r i t h m for h y p o t h e s e s s a t i s f y i n g t h e free c o m b i n a t i o n c o n d i t i o n is a v a i l a b l e w i t h t h e a d j u s t e d ( t y p e = " f r e e " ) o p t i o n i n the summary m e t h o d R> s u m m a r y ( r e c o v e r y . m c , Simultaneous Multiple

Fit: Linear bl b2 b3

- bO - bO - bO

Tests

Comparisons

aov(formula

test

= adjusted(type f o r General

of Means:

= minutes

Hypotheses: Estimate >= 0 -2.133 >= 0 -7.467 >= 0 -1.667

Std. E r r o r 1.604 1.604 0.885

"free"))

Linear

Dunnett

~ blanket,

Hypotheses

Contrasts

data

t value -1.33 -4.66 -1.88

recovery)

Pr(<t) 0.096 . 6e-05 *** 0.064 .

MULTIPLE COMPARISONS WITH A CONTROL

#123 *?23 = - 4 - 6 5 6

Signif. (Adjusted

codes: 0 p values

'***' 0.001 reported â&#x20AC;&#x201D;

'**' free

0.01 '*' 0.05 method)

'.'

0.1

' ' 1

T h e r e s u l t s f r o m t h e o u t p u t above m a t c h t h e r e s u l t s from t h e p r e v i o u s d i s c u s sion. I n p a r t i c u l a r , t h e s t e p - d o w n D u n n e t t test p r o v i d e s s u b s t a n t i a l l y s m a l l e r a d j u s t e d p-values t h a n t h e single-step D u n n e t t test f r o m S e c t i o n 4.1.1; see also T a b l e 4.1 for a s i d e - b y - s i d e c o m p a r i s o n . I n T a b l e 4.1 w e h a v e also i n c l u d e d for i l l u s t r a t i o n p u r p o s e s t h e u n a d j u s t e d p-values Pi a s w e l l as t h e a d j u s t e d p-values from t h e B o n f e r r o n i test a n d t h e H o l m p r o c e d u r e R> s u m m a r y ( r e c o v e r y . m c ,

test

= adjusted(type

"holm"))

B o t h step-down procedures (Holm, step-down D u n n e t t ) are clearly more p o w erful t h a n t h e i r s i n g l e - s t e p c o u n t e r p a r t s ( B o n f e r r o n i , D u n n e t t ) . I n a d d i t i o n , we c o n c l u d e f r o m T a b l e 4.1 t h a t m e t h o d s a c c o u n t i n g for c o r r e l a t i o n s b e t w e e n t h e test s t a t i s t i c s ( D u n n e t t , s t e p - d o w n D u n n e t t ) a r e m o r e p o w e r f u l t h a n m e t h o d s w h i c h do not ( B o n f e r r o n i , H o l m ) . A s e x p l a i n e d i n S e c t i o n 4.1.1, t h e c o r r e l a tions a r e fairly s m a l l i n t h i s e x a m p l e a n d t h u s t h e B o n f e r r o n i - b a s e d p r o c e d u r e s b e h a v e r e a s o n a b l y w e l l . F i n a l l y , note t h a t s i m u l t a n e o u s confidence i n t e r v a l s for s t e p w i s e p r o c e d u r e s a r e not i m p l e m e n t e d i n m u l t c o m p , s i n c e t h e y are t y p i c a l l y h a r d to c o m p u t e , if a v a i l a b l e at a l l . S i m u l t a n e o u s confidence i n t e r vals h a v e b e e n i n v e s t i g a t e d b y Bofinger (1987) for t h e s t e p - d o w n D u n n e t t p r o c e d u r e a n d b y S t r a s s b u r g e r a n d B r e t z (2008) a n d G u i l b a u d ( 2 0 0 8 ) for the H o l m procedure.

APPLICATIONS

Bonferroni

0.096 0.001. 0.034

0.287

Comparison bl b2 b3

bO bO bO

Table 4.1

4.2

A d j u s t e d p - v a l u e s qi Dunnett Holm step-down D u n n e t t 0.241 0.001 0.092

0.00] 0.101

(U)!)(i 0.001 0.067

0.096 0.001 0.064

C o m p a r i s o n of several m u l t i p l e comparison procedures for t h e r e c o v e r y data.

A l l pairwise

comparisons

I n t h i s s e c t i o n we c o n s i d e r a l l p a i r w i s e c o m p a r i s o n s o f s e v e r a l m e a n s i n a t w o - w a y l a y o u t . I n S e c t i o n 4.2.1 we i n t r o d u c e t h e w e l l - k n o w n T u k e y t e s t , w h i c h is t h e s t a n d a r d p r o c e d u r e i n t h i s s i t u a t i o n . I n S e c t i o n 4.2.2 w e c o n s i d e r a s t e p w i s e e x t e n s i o n b a s e d o n t h e c l o s u r e p r i n c i p l e , w h i c h is m o r e p o w e r f u l t h a n t h e original T u k e y test. 4.2.1

Tukey

test

T o i l l u s t r a t e t h e Tukey test w e c o n s i d e r t h e i m m e r d a t a f r o m V e n a b l e s a n d R i p l e y (2002) describing a field e x p e r i m e n t on barley yields. F i v e varieties of b a r l e y w e r e g r o w n i n s i x l o c a t i o n s i n b o t h 1931 a n d 1932. F o l l o w i n g V e n a b l e s a n d R i p l e y ( 2 0 0 2 ) , w e a v e r a g e t h e r e s u l t s for t h e t w o y e a r s . T o a n a l y z e t h e d a t a we consider the t w o - w a y layout Vij = 7 + Mi + ocj + ÂŁij

(4-5)

w i t h independent, homoscedastic a n d n o r m a l l y d i s t r i b u t e d residual errors Sij ÂťJ N(0,a' ). I n E q u a t i o n ( 4 . 5 ) , yij d e n o t e s t h e a v e r a g e b a r l e y y i e l d f o r v a r i e t y i a t l o c a t i o n j " , 7 t h e i n t e r c e p t (i.e., t h e o v e r a l l m e a n ) , (JLI t h e m e a n effect o f v a r i e t y i = 1 , . . . , 5, a n d otj t h e m e a n effect o f l o c a t i o n j = 1 , . . . , 6. N o t e t h a t m o d e l (4.5) is a s p e c i a l case o f t h e g e n e r a l l i n e a r m o d e l ( 3 . 1 ) , w h e r e 2

109.35 \ 77.30

1 1

X =

123.50 100.25 V 131.45

(

)

0 0

V 1

1 0

0 1

0 0

1 J

a n d (5 = ( 7 , /J, . . . , / x , ot\,. . . , a ) T h e immer d a t a are available w i t h t h e M A S S package (Venables a n d R i p l e y 2002). Consequently, we can r u n a s t a n d a r d analysis of variance using t h e aov f u n c t i o n as f o l l o w s : U

ALL PAIRWISE COMPARISONS R> d a t a ( " i m m e r " , p a c k a g e = " M A S S " ) R> i m m e r . a o v < - a o v ( ( Y l + Y 2 ) / 2 ~ V a r + L o c , R> s u m m a r y ( i m m e r . a o v ) Df Sum Sq 4 2655 5 10610 20 2211

Var Loc Residuals Signif.

codes:

Mean

'***'

Sq F value 664 5.99 2122 19.15 111

0.001

'**'

data =

Pr (>F) 0.0025 5.2e-07

0.01

immer)

** ***

0.05

'.'

0.1

' ' 1

T h e A N O V A F t e s t s i n d i c a t e t h a t b o t h V a r a n d L o c h a v e a significant effect o n t h e average b a r l e y y i e l d . F o l l o w i n g V e n a b l e s a n d R i p l e y ( 2 0 0 2 ) , we are i n t e r e s t e d i n c o m p a r i n g the k = 5 varieties w i t h e a c h o t h e r w h i l e a v e r a g i n g t h e r e s u l t s for t h e t w o y e a r s , r e s u l t i n g i n k(k —1)/2 = 10 p a i r w i s e c o m p a r i s o n s . W e c a n r e t r i e v e t h e m e a n y i e l d s for t h e five v a r i e t i e s b y a c a l l to m o d e l . t a b l e s R> m o d e l . t a b l e s ( i m m e r . a o v ,

type

"means")$tables$Var

Var M 94.4

P 102.5

S T 91.1 118.2

V 99.2

T h i s l e a d s to t h e c l a s s i c a l all-pairwise comparison problem. W e are thus i n t e r e s t e d i n t e s t i n g t h e 10 t w o - s i d e d n u l l h y p o t h e s e s Hij : m = p,j,

{ M , P, 5, T , V}

i + j.

T h e n u l l h y p o t h e s i s Hij i n d i c a t e s t h a t t h e m e a n y i e l d i n b a r l e y , d o e s n o t differ for t h e v a r i e t i e s i a n d j . A c c o r d i n g l y , the a l t e r n a t i v e h y p o t h e s e s a r e g i v e n by Kij

: pi ^ pj,

€

{ M , P, S, T , V}, i ^ j .

R e j e c t i n g a n y of t h e 10 n u l l h y p o t h e s e s Hij e n s u r e s t h a t a t l e a s t t w o of t h e five varieties differ w i t h respect to their average y i e l d . W e m a k e t h i s c l a i m a t a g i v e n confidence level 1 — a , if we e m p l o y s u i t a b l e m u l t i p l e c o m p a r i son procedures. T h e s t a n d a r d multiple comparison procedure to address the a l l - p a i r w i s e c o m p a r i s o n p r o b l e m is t h e T u k e y ( 1 9 5 3 ) t e s t , also k n o w n as studentized range test. N o t e t h a t s o m e t e x t b o o k s suggest p e r f o r m i n g t h e T u k e y test o n l y after o b s e r v i n g a significant A N O V A F test r e s u l t . W e r e c o m m e n d u s i n g t h e m e t h o d s d e s c r i b e d below, regardless of w h e t h e r or n o t t h e A N O V A F test is s i g n i f i c a n t . I n essence, t h e T u k e y test t a k e s t h e m a x i m u m over t h e a b s o l u t e v a l u e s of a l l p a i r w i s e test s t a t i s t i c s , w h i c h i n t h e c o m p l e t e l y r a n d o m i z e d l a y o u t of t h e immer e x a m p l e t a k e t h e f o r m Ui =

(4.6)

I n t h e l a s t e x p r e s s i o n , yi denotes the least s q u a r e s m e a n e s t i m a t e for v a r i e t y i ( w h i c h c o i n c i d e s w i t h t h e u s u a l a r i t h m e t i c m e a n e s t i m a t e i n t h i s c a s e ) , s the p o o l e d s t a n d a r d d e v i a t i o n a n d n t h e c o m m o n s a m p l e s i z e , t h a t i s , n = 6 as

APPLICATIONS

we h a v e s i x l o c a t i o n s a n d o n e o b s e r v a t i o n f o r e a c h c o m b i n a t i o n o f v a r i e t y a n d l o c a t i o n ; see E q u a t i o n s (3.3) a n d (3.4) f o r t h e m o r e g e n e r a l e x p r e s s i o n s . E a c h t e s t s t a t i s t i c Uj is u n i v a r i a t e t d i s t r i b u t e d . T h e v e c t o r o f t h e t e s t statistics follows a m u l t i v a r i a t e t d i s t r i b u t i o n . U n t i l the emergence of m o d e r n c o m p u t i n g facilities, t h e exact c a l c u l a t i o n of c r i t i c a l values for t h e T u k e y test was o n l y p o s s i b l e f o r l i m i t e d cases. O n e e x a m p l e is t h e b a l a n c e d i n d e p e n d e n t o n e - w a y l a y o u t . H a y t e r ( 1 9 8 4 ) p r o v e d a n a l y t i c a l l y t h a t i n cases, w h e r e t h e c o v a r i a n c e m a t r i x o f t h e m e a n e s t i m a t e s is d i a g o n a l ( w h i c h i n c l u d e s t h e u n b a l a n c e d o n e - w a y l a y o u t ) , u s i n g t h e c r i t i c a l values f r o m t h e b a l a n c e d case is c o n s e r v a t i v e . I n f a c t , i t has b e e n c o n j e c t u r e d t h a t u s i n g t h e b a l a n c e d c r i t i c a l p o i n t s w i l l a l w a y s b e c o n s e r v a t i v e . T h i s was p r o v e d w h e n A; â&#x20AC;&#x201D; 3 ( B r o w n 1984) and i n a m o r e general c o n t e x t by H a y t e r (1989). However, efficient n u m e r i c a l i n t e g r a t i o n m e t h o d s , s u c h as t h o s e f r o m B r e t z , H a y t e r , a n d G e n z ( 2 0 0 1 ) a n d Genz a n d B r e t z (2009), or user-friendly interfaces o n t o p of such r o u t i n e s , such as t h e m u l t c o m p p a c k a g e i n R, c a n be u s e d t o c a l c u l a t e a d j u s t e d p - v a l u e s o r c r i t i c a l values w i t h o u t r e s t r i c t i o n s . I n t h e f o l l o w i n g we s h o w h o w t o a n a l y z e t h e i m m e r d a t a w i t h m u l t c o m p . F o r d e t a i l s o n t h e m u l t c o m p p a c k a g e w e r e f e r t o S e c t i o n 3.3. W e use t h e fitted A N O V A model immer. aov and apply the g l h t f u n c t i o n to p e r f o r m the multiple comparisons t h r o u g h R> i m m e r . m c

< â&#x20AC;&#x201D; glht(immer.aov, l i n f c t

= mcp(Var

"Tukey"))

I n t h e p r e v i o u s c a l l , w e u s e d t h e mcp f u n c t i o n f o r t h e l i n f c t a r g u m e n t t o specify t h e c o m p a r i s o n t y p e (i.e., the c o n t r a s t m a t r i x ) t h a t we are i n t e r e s t e d i n . T h e s y n t a x is a l m o s t s e l f - d e s c r i p t i v e : S p e c i f y t h e f a c t o r o f r e l e v a n c e a n d select one o f s e v e r a l p r e - d e f i n e d c o n t r a s t m a t r i c e s ( V a r = " T u k e y " i n o u r example). W e o b t a i n a d e t a i l e d s u m m a r y of the results by u s i n g t h e summary m e t h o d associated w i t h t h e g l h t f u n c t i o n , R>

summary(immer.mc) Simultaneous

Multiple

Fit:

Comparisons

aov(formula

Linear p S T V s T V

M M M M P P P

== == == == == == ==

0 0 0 0 0 0 0

Tests

f o r General

of Means:

Hypotheses: Estimate 8. 15 -3.26 23. 81 4. 79 -11.41 15.66 -3.36

(Yl

+ Y2)/2

Tukey

~ Var

Linear

Hypotheses

Contrasts

+ Loc,

data

Std. E r r o r t value P r ( > l t 1) 6. 08 1 . 34 0. 6701 -0.54 0.9824 6. 08 0.0067 6. 08 3. 92 0.9310 6. 08 0. 79 -1. 88 0.3607 6. 08 0.1132 6. 08 2.58 0.9803 6. 08 -0.55

immer)

ALL PAIRWISE COMPARISONS T ~ S == V - S == V - T == Signif. (Adjusted

0 0 0

27. 07 8.05 -19.02

codes: 0 p values

85 4.45 1.32 -3.13

6. 08 6. 08 6. 08

'***' 0.001 reported â&#x20AC;&#x201D;

'**' 0.01 single-step

0.0020 0.6798 0.0377 '*'

** *

0.05 '.' method)

0.1

' ' 1

T h e m a i n p a r t of t h e o u t p u t consists of a t a b l e w i t h 10 r o w s , one for e a c h p a i r w i s e c o m p a r i s o n . P r o m left to right we h a v e a s h o r t d e s c r i p t o r of t h e c o m p a r i s o n s , t h e effect e s t i m a t e s w i t h a s s o c i a t e d s t a n d a r d e r r o r s a n d t h e test s t a t i s t i c s , as defined i n E q u a t i o n ( 4 . 6 ) . M u l t i p l i c i t y a d j u s t e d p-values are r e p o r t e d i n t h e l a s t c o l u m n . B y default, these p-values a r e c a l c u l a t e d f r o m t h e u n d e r l y i n g m u l t i v a r i a t e t d i s t r i b u t i o n ( t h u s a c c o u n t i n g for t h e c o r r e l a t i o n s b e t w e e n t h e test s t a t i s t i c s ) a n d c a n be c o m p a r e d d i r e c t l y w i t h t h e p r e - s p e c i f i e d significance level a . F o r the immer e x a m p l e w e c o n c l u d e a t level OL = 0.05 t h a t t h e p a i r w i s e c o m p a r i s o n s T - M, T - S , a n d V - T a r e a l l s i g n i f i c a n t l y different. A s a n a s i d e , w e p o i n t out t h a t t h e immer e x a m p l e c a n also b e a n a l y z e d u s i n g t h e T u k e y H S D f u n c t i o n from t h e s t a t s p a c k a g e . C a l l i n g R> i m m e r . m c 2 < - T u k e y H S D ( i m m e r . a o v , R> i m m e r . m c 2 $ V a r

p- -M s- -M T--M V--M S--P T--P V--P T--s V--s V--T

diff 8. 15 -3. 26 23. 81 4. 79 -11. 41 15. 66 -3. 36 27. 07 8. 05 -19. 02

lwr -10. 04 -21. 45 5. 62 -13. 40 -29. 60 -2. 53 -21. 55 8. 88 -10. 14 -37. 20

upr 26. 338 14. 929 41 . 996 22. 979 6. 779 33. 846 14. 829 45. 254 26. 238 -0. 829

0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

which =

"Var")

P adj 67006 98242 00678 93102 36068 11322 98035 00203 67982 03771

we c o n c l u d e t h a t t h e r e s u l t s o b t a i n e d w i t h g l h t a r e t h e s a m e a s t h o s e f r o m TukeyHSD. N o t e however t h a t t h e TukeyHSD f u n c t i o n is s p e c i f i c a l l y d e s i g n e d to p e r f o r m t h e T u k e y t e s t for m o d e l s w i t h o n e - w a y s t r u c t u r e s ( H s u 1996, S e c t i o n 7.1). C o n s e q u e n t l y , t h e TukeyHSD f u n c t i o n l a c k s t h e flexibility of t h e m u l t c o m p package, w h i c h allows one to p e r f o r m t h e T u k e y test ( a n d o t h e r t y p e s of c o n t r a s t t e s t s ) for t h e g e n e r a l p a r a m e t r i c m o d e l s d e s c r i b e d i n C h a p t e r 3. F o r e x a m p l e , t h e p r e v i o u s c a l l to g l h t c a n e a s i l y b e c h a n g e d to R> g l h t ( i m m e r . a o v , l i n f c t = m c p ( V a r = + alternative = "greater")

"Tukey"),

i n order to p e r f o r m t h e o n e - s i d e d s t u d e n t i z e d r a n g e t e s t i n v e s t i g a t e d b y H a y t e r ( 1 9 9 0 ) , w h i c h is a o n e - s i d e d v e r s i o n of t h e T u k e y test w i t h p o t e n t i a l a p p l i c a t i o n to dose r e s p o n s e a n a l y s e s . W e n o w c o n s i d e r t h e c o m p u t a t i o n of 9 5 % t w o - s i d e d s i m u l t a n e o u s confidence

APPLICATIONS

i n t e r v a l s for t h e m e a n effect differences \M — fij. I t f o l l o w s f r o m E q u a t i o n (4.6) t h a t i n the completely randomized layout of the immer example the s i m u l t a n e o u s c o n f i d e n c e i n t e r v a l s f o r t h e T u k e y t e s t are g i v e n b y Vi - Vj - i-cv u

— ; Ui - vj + u>i-a $

w h e r e uidenotes t h e (1 — a ) - q u a n t i l e of t h e m u l t i v a r i a t e i d i s t r i b u t i o n . A l t e r n a t i v e l y , w e c a n use t h e c o n f i n t m e t h o d a s s o c i a t e d w i t h t h e g l h t f u n c tion, a

R> i m m e r . c i < R> i m m e r . c i

Multiple

Fit:

confint(immer.mc,

Simultaneous

Confidence

Comparisons

of Means:

aov (formula

(Yl

p s T V s T V T V V

== 0 == 0

- M = = 0 — M == 0 - p == 0 - P == 0 - p mm 0 - s == 0 -'s == 0 - T == 0

Hypotheses: Est imate 8.150 -3.258 23.808 4. 792 -11.408 15.658 -3.358 27. 067 8. 050 -19.017

1 wr -10. -21. 5. -13. -29.

038 446 620 396 596 -2. 530 -21. 546 8. 879 -10. 138 -37. 205

0.95)

Intervals Tukey

+ Y 2 ) / 2 ~ Var

Quantile = 2.99 95% family-wise confidence

Linear

level

Contrasts

+ Loc,

data

immer)

level

upr 338 14. 930 41 . 996 22. 980 6. 780 33. 846 14 . 830 45. 255 26. 238 -0. 829 26.

W e c o n c l u d e f r o m t h e o u t p u t t h a t o n l y t h e c o n f i d e n c e i n t e r v a l s for t h e T - M, T - S, a n d V - T c o m p a r i s o n s e x c l u d e 0, w h i c h is c o n s i s t e n t w i t h t h e previously established test results. Moreover, we conclude at t h e designated c o n f i d e n c e l e v e l o f 9 5 % t h a t t h e a v e r a g e b a r l e y y i e l d f o r v a r i e t y T is l a r g e r t h a n t h e y i e l d f o r t h e v a r i e t i e s M, S, a n d V. T h e c r i t i c a l v a l u e is tto.95 — 2.993. I n a d d i t i o n , we can display the confidence intervals g r a p h i c a l l y w i t h R> p l o t ( i m m e r . c i ,

main = "",

xlab

"Yield")

see F i g u r e 4.4 f o r t h e r e s u l t i n g p l o t . T h e p l o t .matchMMC m e t h o d f r o m t h e H H package orders t h e confidence intervals i n a different, often m o r e i n t u i t i v e way a n d c a n b e u s e d a l t e r n a t i v e l y ; see t h e b o t t o m p l o t o f F i g u r e 4.7. I f t h e n u m b e r k o f t r e a t m e n t s o r c o n d i t i o n s is l a r g e , t h e n u m b e r o f p a i r w i s e

ALL PAIRWISE COMPARISONS

P-M

S - M

T - M

V - M

S - P

T - P

V - P

T - S

V - S

V - T

l -40

-20

Yield

F i g u r e 4.4

Two-sided simultaneous confidence intervals for the T u k e y test in the immer example.

c o m p a r i s o n s , k(k â&#x20AC;&#x201D; l ) / 2 , i n c r e a s e s r a p i d l y . I n s u c h c a s e s , p l o t t i n g t h e r e s u l t i n g set of s i m u l t a n e o u s confidence i n t e r v a l s or a d j u s t e d p-values m i g h t be too e x t e n s i v e a n d m o r e efficient m e t h o d s to d i s p l a y t h e t r e a t m e n t effects together w i t h t h e a s s o c i a t e d significances a r e r e q u i r e d . C o m p a c t letter d i s p l a y s are a convenient t o o l t o p r o v i d e the n e c e s s a r y s u m m a r y i n f o r m a t i o n a b o u t the o b t a i n e d significances. W i t h t h i s display, t r e a t m e n t s a r e a s i g n e d l e t t e r s to i n d i c a t e significant differences. T r e a t m e n t s t h a t a r e n o t s i g n i f i c a n t l y different are a s s i g n e d a c o m m o n letter. I n other w o r d s , t w o t r e a t m e n t s w i t h o u t a c o m m o n letter a r e s t a t i s t i c a l l y significant a t t h e c h o s e n s i g n i f i c a n c e l e v e l . U s i n g s u c h a letter display, a n efficient p r e s e n t a t i o n of t h e test r e s u l t s is a v a i l a b l e , w h i c h is often easier to c o m m u n i c a t e t h a n e x t e n s i v e t a b l e s a n d confidence i n t e r v a l p l o t s , s u c h a s t h o s e p r e s e n t e d i n F i g u r e 4.4. P i e p h o ( 2 0 0 4 ) p r o v i d e d a n efficient a l g o r i t h m for a c o m p a c t l e t t e r - b a s e d r e p r e s e n t a t i o n of a l l p a i r w i s e c o m p a r i s o n s , w h i c h is also i m p l e m e n t e d i n t h e m u l t c o m p p a c k a g e . T h i s

APPLICATIONS

m e t h o d w o r k s for g e n e r a l lists of p a i r w i s e p-values, regardless of h o w t h e y were o b t a i n e d . I n p a r t i c u l a r , t h i s m e t h o d is a p p l i c a b l e to t h e c a s e of u n b a l a n c e d data. T h e e l d f u n c t i o n e x t r a c t s t h e n e c e s s a r y i n f o r m a t i o n from g l h t o b j e c t s t h a t is r e q u i r e d to c r e a t e a c o m p a c t letter d i s p l a y of a l l p a i r w i s e c o m p a r i s o n s . U s i n g t h e immer e x a m p l e from above, we c a n c a l l R> i m m e r . e l d < -

cId(immer.mc)

N e x t , we c a n use t h e p l o t m e t h o d a s s o c i a t e d w i t h e l d o b j e c t s . F i g u r e 4.5 s h o w s t h e b o x p l o t s for e a c h of t h e five varieties together w i t h t h e letter d i s p l a y when calling R>

plot(immer.eld)

B e c a u s e v a r i e t y T h a s no letter i n c o m m o n w i t h a n y other v a r i e t y t h a n P, it differs significantly from M, S , a n d V, b u t not from P a t t h e c h o s e n significance level (a = 0.05 b y d e f a u l t ) . N o further significant differences c a n b e d e c l a r e d , b e c a u s e t h e r e m a i n i n g four varieties a l l s h a r e t h e c o m m o n letter b. T h e s e c o n c l u s i o n s a r e i n line w i t h t h e p r e v i o u s results o b t a i n e d i n F i g u r e 4.4. M o r e details a b o u t t h e e l d f u n c t i o n are given i n S e c t i o n 3.3.3. O n e a d v a n t a g e of u s i n g R is t h e r e t a i n e d flexibility to fine t u n e s t a n d a r d o u t p u t . F o r e x a m p l e , one m a y w i s h to i m p r o v e F i g u r e 4.5 b y o r d e r i n g t h e b o x p l o t s a c c o r d i n g to t h e m e a n y i e l d s for t h e five v a r i e t i e s . S u c h o r d e r i n g m a y e n h a n c e t h e r e a d a b i l i t y , as t r e a t m e n t s w i t h c o m m o n letters a r e m o r e likely to b e d i s p l a y e d s i d e - b y - s i d e . O n e w a y of d o i n g t h i s is s h o w n below; see F i g u r e 4.6 for t h e r e s u l t i n g boxplots. R> R> R> R> + R> R> + R> R> R> R> R> + R> + + + + +

d a t a ( " i m m e r " , package = "MASS") library("HH") immer2 < - immer immer2$Var < - o r d e r e d ( i m m e r 2 $ V a r , l e v e l s = c ( " S " , "M", " V " , " P " , " T " ) ) i m m e r 2 . a o v < - a o v ( ( Y l + Y 2 ) / 2 ~ V a r + L o c , d a t a = immer2) position(immer2$Var) <model.tables(immer2.aov, "type = " m e a n s " ) $ t a b l e s $ V a r immer2.mc <- g l h t ( i m m e r 2 . a o v , l i n f c t = mcp(Var = " T u k e y " ) ) immer2.eld <- eld(immer2.mc) immer2.cld$pos.x <- immer2.cld$x p o s i t i o n ( i m m e r 2 . c l d $ p o s . x ) <- position(immer2$Var) lab <immer2.cld$mcletters$monospacedLetters[levels(immer2$Var)] bwplotdp ~ pos.x, data = immer2.eld, panel^function(...){ p a n e l . b w p l o t . i n t e r m e d i a t e . hh ( . . . ) c p l <- c u r r e n t . p a n e l . l i m i t s ( ) pushViewport(viewport(xscale = cpl$xlim, yscale = cpl$ylim,

ALL PAIRWISE COMPARISONS

o • 4 — »

o "O < iD _ Q. i— (0 <D cz

Var

F i g u r e 4.5

+ + + + + + + + +

Compact letter display for all pairwise comparisons in the immer e x ample.

panel.axis("top", labels upViewport()

clip = "off")) at = position(immer2$Var), = l a b , o u t s i d e = TRUE)

>, scales

= l i s t ( x = l i s t ( l i m i t s = c(90, 120), at = position(immer2$Var), labels = levels(immer2$Var))), main = " " , x l a b = " V a r " , y l a b = " l i n e a r p r e d i c t o r " )

A n a l t e r n a t i v e g r a p h i c a l r e p r e s e n t a t i o n of t h e T u k e y test w a s developed b y H s u a n d P e r u g g i a (1994) a n d e x t e n d e d t o a r b i t r a r y c o n t r a s t s b y H e i b e r g e r a n d H o l l a n d (2004, 2 0 0 6 ) . T h e r e s u l t i n g m e a n - m e a n m u l t i p l e c o m p a r i s o n p l o t allows t h e d i s p l a y of m u l t i p l e relevant i n f o r m a t i o n i n t h e s a m e plot, i n c l u d i n g

APPLICATIONS

90 .a CO

o o

140

120

100

Var

F i g u r e 4.6

A l t e r n a t i v e compact letter display for a l l pairwise comparisons i n t h e immer example.

t h e observed g r o u p m e a n values, w i t h correct r e l a t i v e distances, a n d p o i n t estimates for a r b i t r a r y contrasts together w i t h t h e i r associated simultaneous confidence intervals. I n t h e f o l l o w i n g we i l l u s t r a t e t h e m e a n - m e a n m u l t i p l e comparison plot w i t h the immer data. M e a n - m e a n m u l t i p l e c o m p a r i s o n p l o t s are a v a i l a b l e w i t h t h e H H p a c k a g e ( H e i b e r g e r 2 0 0 9 ) v i a t h e g l h t . m m c f u n c t i o n . I t s s y n t a x is v e r y s i m i l a r t o t h e one used for g l h t , R> l i b r a r y ( " H H " ) R> immer.nunc < - g l h t . m m c ( i m m e r . a o v , l i n f c t = m c p ( V a r = " T u k e y " ) , + focus = "Var", lmat.rows = 2:5) I n t h e p r e v i o u s s t a t e m e n t , t h e f o c u s a r g u m e n t defines t h e f a c t o r w h o s e c o n t r a s t s a r e t o b e c o m p u t e d . I n a d d i t i o n , t h e l m a t . r o w s a r g u m e n t specifies t h e r o w s i n l m a t f o r t h e f o c u s f a c t o r , w h e r e b y c o n v e n t i o n l m a t is t h e t r a n s p o s e

ALL PAIRWISE COMPARISONS

of t h e l i n f c t c o m p o n e n t p r o d u c e d by g l h t (see S e c t i o n 3.3.1 g l h t objects). I n our example, R>

for d e t a i l s on

t(immer.mc$linfct)

P-M S-M T-M V-M S-P 0 0 0 0 (Intercept) 0 VarP 1 0 0 0 -1 1 0 1 VarS 0 0 VarT 0 0 0 1 0 VarV 0 0 0 1 0 LocD 0 0 0 0 0 LocGR 0 0 0 0 0 LocM 0 0 0 0 0 LocUF 0 0 0 0 0 LocW 0 0 0 0 0 attr( "type") [1] "Tukey"

T-P V-P 0 0 -1 -1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0

T-S v-s V-T 0 0 0 0 0 0 -1 -1 0 1 0 -1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

a n d we see t h a t t h e T u k e y c o n t r a s t s for the f o c u s factor V a r a r e specified i n rows 2 t h r o u g h 5; t h u s the l m a t . r o w s = 2 : 5 s p e c i f i c a t i o n i n t h e g l h t . m m c c a l l above. W i t h these p r e p a r a t i o n s , we c a n d i s p l a y the m e a n - m e a n m u l t i p l e c o m p a r i s o n plot for t h e t h e immer d a t a u s i n g R> p l o t (immer .nunc, r y = c ( 8 5 , 1 2 2 ) , + main = , main2 = " " )

x.offset

w h e r e r y a n d x . o f f s e t are a r g u m e n t s to fine t u n e t h e a p p e a r a n c e of t h e plot; see t h e top g r a p h i n F i g u r e 4.7 for the r e s u l t s . T h e 9 5 % confidence i n t e r v a l s o n fix ~ M M , M T — Ms> d PT — Pv lie e n t i r e l y to t h e r i g h t of 0, w h i l e a l l other confidence i n t e r v a l s i n c l u d e 0. W e t h u s c o n c l u d e a t t h e 5 % level t h a t the average b a r l e y y i e l d for v a r i e t y T is larger t h a n t h e y i e l d for t h e v a r i e t i e s M, S , a n d V. N o further significances b e t w e e n t h e v a r i e t i e s a r e u n c o v e r e d , w h i c h reflects t h e test decisions o b t a i n e d from F i g u r e 4.4. a

E a c h of t h e s i m u l t a n e o u s confidence i n t e r v a l s for t h e 10 p a i r w i s e differences pi — fij i n F i g u r e 4.7 is centered a t a point w h o s e height o n t h e left y - a x i s is e q u a l to t h e average of t h e c o r r e s p o n d i n g m e a n s yi a n d yj a n d w h o s e l o c a t i o n a l o n g t h e rc-axis is a t d i s t a n c e yi — yj f r o m t h e v e r t i c a l line a t 0. N o t e t h a t t h e confidence i n t e r v a l s o n pp — ps a m d pv — M M a p p e a r a l m o s t at t h e s a m e height i n F i g u r e 4.7, l e a d i n g to a n i n f o r m a t i v e o v e r p r i n t i n g of the c o n t r a s t n a m e s a t t h e right y - a x i s . I n cases of o v e r p r i n t i n g , H e i b e r g e r a n d H o l l a n d (2006) suggested a tiebreaker plot of t h e s i m u l t a n e o u s confidence i n t e r v a l s s i m i l a r to F i g u r e 4.4, w h e r e the c o n t r a s t s a r e e v e n l y s p a c e d v e r t i c a l l y i n t h e s a m e o r d e r as i n t h e m e a n - m e a n m u l t i p l e c o m p a r i s o n plot a n d o n the s a m e h o r i z o n t a l s c a l e . T h e tiebreaker plot for t h e immer d a t a d i s p l a y e d a t t h e b o t t o m of F i g u r e 4.7 w a s o b t a i n e d w i t h t h e c o m m a n d R> p l o t . m a t c h M M C ( i m m e r . m m c $ m c a ,

main =

"")

T h e i s o m e a n s g r i d i n t h e center of F i g u r e 4.7 is p a r t i c u l a r l y useful w h e n

APPLICATIONS

92 Mean-mean multiple comparison plot

118.20

TQP

Ml!V

n iiv TLIS 102.54

99.18

94.39

91.13

-h

POV

POK 90S VDS - MQS

mean ( Y I + Y2)/2 _ Var level

4 Q

-20

60 contrast

contrast value

Tiebreaker plot

-TOP

rnv T!"fv TfiS

• PDV

• PDN •PDS

• Vrjiv VDS

Mas

-Hi •40

F i g u r e 4.7

•20

Mea.n-iTiea.ii m u l t i p l e comparison p l o t ( t o p ) a n d tiebreaker p l o t ( b o t t o m ) for a l l pairwise comparisons i n t h e immer example.

ALL PAIRWISE COMPARISONS

p l o t t i n g s i m u l t a n e o u s confidence intervals for o r t h o g o n a l or m o r e g e n e r a l c o n t r a s t s , as s h o w n now. I t is w e l l k n o w n t h a t t h e use of c a r e f u l l y selected o r t h o g o n a l c o n t r a s t s is often preferable to a r e d u n d a n t set of c o n t r a s t s . B e c a u s e we h a v e a t o t a l of five v a r i e t i e s , we c a n choose a b a s i s of four o r t h o g o n a l c o n t r a s t s , s u c h t h a t a n y other c o n t r a s t c a n be c o n s t r u c t e d a s a l i n e a r c o m b i n a t i o n of t h e elements of t h i s b a s i s set. A s s u m e t h a t w e a r e i n t e r e s t e d i n t h e four orthogonal contrasts R> i m m e r . l m a t < - c b i n d ( " M - S " + "MS-V" + "VMS-P" + "PVMS-T" R> r o w . n a m e s C i m m e r . l m a t ) < - c ( "

= = = = M"

c ( l , 0,-1, 0, 0), c ( l , 0 , 1, 0,-2), c ( l , - 3 , 1, 0 , 1), c ( l , 1, 1 , - 4 , 1)) ,"P","S","T","V")

W e w o u l d like to r e p e a t a s i m i l a r a n a l y s i s b a s e d o n a m e a n - m e a n c o m p a r i s o n p l o t as d o n e a b o v e for t h e T u k e y t e s t . T h e s t a t e m e n t R> immer.mmc2 +

glht.nunc(immer.aov, focus.lmat

multiple

l i n f c t = mcp(Var = " T u k e y " ) , = immer.lmat)

is sufficient to c o n s t r u c t t h e plot i n F i g u r e 4.8. C l e a r l y , t h e T - P VMS c o m p a r i s o n is h i g h l y significant, t h u s u n d e r p i n n i n g t h e p r e v i o u s l y o b s e r v e d g o o d r e s u l t of v a r i e t y T . B y c o m p a r i n g F i g u r e s 4.7 a n d 4.8, one n o t i c e s t h a t t h e p o i n t e s t i m a t e s a s s o c i a t e d w i t h t h e p a i r w i s e c o n t r a s t s a r e c o n s t r u c t e d as a l i n e a r c o m b i n a t i o n of p o i n t e s t i m a t e s of the o r t h o g o n a l c o n t r a s t s . N o t e also t h a t t h e confidence i n t e r v a l s i n F i g u r e 4.7 a r e not n e c e s s a r i l y c e n t e r e d a n y m o r e a t t h e i n t e r s e c t i o n s of t h e i s o m e a n s g r i d . F i n a l l y , we n o t e t h a t i n t e r v a l s o b t a i n e d w i t h t h e g l h t . nunc s t a t e m e n t above a r e b y default b a s e d o n t h e c o n s e r v a t i v e g e n e r a l i z e d T u k e y t e s t , w h i c h allows for selection of c o n t r a s t s b a s e d o n findings suggested b y t h e p a i r w i s e a n a l y s i s ( H o c h b e r g a n d T a m h a n e 1987, p. 8 0 ) . A s mentioned previously, the m e a n - m e a n multiple comparison plots i m p l e m e n t e d i n t h e H H p a c k a g e a r e a p p l i c a b l e to g e n e r a l c o n t r a s t s i n f a c t o r i a l designs. W e refer to H e i b e r g e r a n d H o l l a n d (2004, 2 0 0 6 ) for f u r t h e r d e t a i l s and examples.

4-2.2

Closed

Tukey

test

procedure

I n S e c t i o n 4.2.1 we c o n s i d e r e d t h e o r i g i n a l T u k e y t e s t , w h i c h is a s i n g l e - s t e p t e s t . S i m i l a r to t h e D u n n e t t test d e s c r i b e d i n S e c t i o n 4.1, s t e p w i s e e x t e n s i o n s are also a v a i l a b l e for t h e T u k e y t e s t , l e a d i n g to m o r e p o w e r ; see, for e x a m p l e , F i n n e r ( 1 9 8 8 ) . A s e x p l a i n e d i n S e c t i o n 2.1.2, s t e p w i s e test p r o c e d u r e s r e j e c t at least as m a n y h y p o t h e s e s as t h e i r single-step c o u n t e r p a r t s . I n t h e following, we a p p l y t h e c l o s u r e p r i n c i p l e d e s c r i b e d i n S e c t i o n 2.2.3 to t h e a l l - p a i r w i s e c o m p a r i s o n p r o b l e m , d i s c u s s efficient m e t h o d s to p r u n e t h e full c l o s u r e a n d t h e n c o m e b a c k t o t h e immer d a t a e x a m p l e to i l l u s t r a t e t h e c l o s e d T u k e y t e s t , w h i c h is also a v a i l a b l e i n t h e m u l t c o m p package. R e c a l l from S e c t i o n 2.2.3 t h a t t h e c l o s u r e p r i n c i p l e c o n s t r u c t s a l l possible i n t e r s e c t i o n h y p o t h e s e s f r o m a n i n i t i a l set of e l e m e n t a r y h y p o t h e s e s . I n s o m e

APPLICATIONS

Var level

contrast contrast value

F i g u r e 4.8

Mean-mean multiple comparison plot for selected orthogonal contrasts in the immer example.

cases, d e p e n d i n g u p o n t h e s t r u c t u r e of t h e h y p o t h e s e s t e s t e d , t h e c l o s u r e tree c a n be p r u n e d considerably, r e s u l t i n g i n m u c h s m a l l e r a d j u s t e d p-values and h e n c e m o r e powerful tests. F o r e x a m p l e , consider t h e p a i r w i s e c o m p a r i s o n s case w i t h t h r e e g r o u p s : T h e b a s i c hypotheses a r e H\ : Mi M2 H13: M I â&#x20AC;&#x201D; M3> a n d #23: M2 = M3- A s seen i n F i g u r e 2.6, these a r e t h e t h r e e e l e m e n t a r y h y p o t h e s e s . H o w e v e r , u n l i k e t h e m a n y - t o - o n e case d i s p l a y e d i n F i g u r e 4.3, t h e r e is o n l y one possible i n t e r s e c t i o n h y p o t h e s i s , r a t h e r t h a n four, b e c a u s e e v e r y i n t e r s e c t i o n h y p o t h e s i s h a s t h e form H123: Mi = M2 = M3- H a v i n g fewer intersections to test i m p l i e s greater power. I n t h i s case, t h e a d j u s t e d p-values for e a c h of t h e t h r e e e l e m e n t a r y hypotheses Hij are t h e m a x i m u m of j u s t two p-values: T h e u n a d j u s t e d p-value from Hij a n d t h e p-value f r o m t h e global i n t e r s e c t i o n h y p o t h e s i s #123; see E q u a t i o n (2.2) for a g e n e r a l definition of a d j u s t e d p - v a l u e s u s i n g t h e closure p r i n c i p l e . =

P r u n i n g of t h e closure set h a p p e n s w h e n Hi

â&#x20AC;&#x201D;

for d i s t i n c t s u b s e t s

ALL PAIRWISE COMPARISONS

a n d I', a n d i n t h i s case the h y p o t h e s e s a r e s a i d to be r e s t r i c t e d ; see also S e c t i o n 2.1.2. O n t h e o t h e r h a n d , if Hi ^ Hp for e a c h p a i r of d i s t i n c t s u b s e t s / and t h e n t h e h y p o t h e s e s are s a i d to satisfy t h e free c o m b i n a t i o n c o n d i t i o n . E x a m p l e s of free c o m b i n a t i o n s i n c l u d e t h e m a n y - t o - o n e c o m p a r i s o n s c o n s i d ered i n S e c t i o n 4.1 a s w e l l as c o m p a r i s o n s of m e a n s w i t h t w o - s a m p l e m u l t i v a r i a t e d a t a (as d i s c u s s e d , for e x a m p l e , i n S e c t i o n 5.1). E x a m p l e s of r e s t r i c t e d combinations include all-pairwise comparisons, response surface comparisons, a n d g e n e r a l c o m p l e x c o n t r a s t sets. F i g u r e 4.9 s h o w s h o w the p r u n e d tree looks i n t h e c a s e of a l l p a i r w i s e c o m p a r i s o n s a m o n g four g r o u p s . T h e size of t h e tree is c o n s i d e r a b l y s m a l l e r t h a n t h e 2 — 1 = 63 i n t e r s e c t i o n h y p o t h e s e s t h a t w o u l d a p p l y if t h e t e s t s were i n free c o m b i n a t i o n s . B e c a u s e i n t h e r e s t r i c t e d c o m b i n a t i o n c a s e m a n y of t h e i n t e r s e c t i o n s p r o d u c e i d e n t i c a l i n t e r s e c t i o n h y p o t h e s e s , t h e r e a r e o n l y 14 d i s tinct subsets i n the closure. 6

Shaffer (1986) p r e s e n t e d a m e t h o d for u s i n g t h e p r u n e d c l o s u r e t r e e i n t h e case of r e s t r i c t e d c o m b i n a t i o n s to p r o d u c e m o r e p o w e r f u l t e s t s , u s i n g B o n f e r r o n i t e s t s for e a c h node; see also S e c t i o n 2.3.2. B e r g m a n n a n d H o m m e l (1988) e x t e n d e d t h e c o n s e p t of t h e Shaffer p r o c e d u r e a n d f o u n d a n a l g o r i t h m for p r u n i n g t h e c l o s u r e t e s t , w h i c h w a s s u b s e q u e n t l y i m p l e m e n t e d by B e r n h a r d ( 1 9 9 2 ) , see also H o m m e l a n d B e r n h a r d ( 1 9 9 2 ) . R o y e n ( 1 9 9 1 ) n o t e d t h a t t h e Shaffer m e t h o d is i n fact a truncated closed test procedure, w h e r e t h e i n d i v i d u a l tests a r e p e r f o r m e d i n the order of t h e u n a d j u s t e d p-values, s t o p p i n g as s o o n as a n insignificance occurs. H o m m e l a n d B e r n h a r d (1999) investigated more powerful closed test p r o c e d u r e s t h a t a r e o b t a i n e d if t e s t i n g is d o n e w i t h o u t t r u n c a t i o n , b u t t h e r e s u l t i n g inference m a y no longer b e m o n o t o n e i n t h e pv a l u e s . H o m m e l a n d B r e t z (2008) gave a n e x a m p l e w h e r e a n o n - m o n o t o n i c d e c i s i o n p a t t e r n s t i l l l e a d s to a m e a n i n g f u l i n t e r p r e t a t i o n . F i n a l l y , W e s t f a l l (1997) a n d W e s t f a l l a n d T o b i a s (2007) s h o w e d h o w to p e r f o r m t h e t r u n c a t e d closed test p r o c e d u r e for g e n e r a l c o n t r a s t s . T h e m e t h o d of W e s t f a l l ( 1 9 9 7 ) is a v a i l a b l e i n t h e m u l t c o m p package a n d w i l l b e i l l u s t r a t e d b e l o w w i t h t h e immer d a t a e x a m p l e . R e c a l l f r o m S e c t i o n 4.1.2 t h a t t h e s t e p - d o w n D u n n e t t p r o c e d u r e m a y be o b t a i n e d a s follows: (i) order t h e u n a d j u s t e d p-values p ( i ) < . . . < P( ) as u s u a l , c o r r e s p o n d i n g to h y p o t h e s e s i ? ( i ) , • • •, H( y, (ii) test i ^ ( i ) , . . •, H( ) s e q u e n t i a l l y a n d o b t a i n t h e a d j u s t e d p-values q^ = m a x / ^ / p / , stopping as s o o n a s q^ > a , w h e r e a is t h e d e s i r e d s i g n i f i c a n c e l e v e l to c o n t r o l t h e f a m i l y w i s e e r r o r r a t e . T r u n c a t e d closed test p r o c e d u r e s u s e t h e s a m e m e t h o d , b u t a p p l y it t o r e s t r i c t e d c o m b i n a t i o n s , w h e r e t h e set of s u b s e t s I c o n s i d e r e d i n t h e c o m p u t a t i o n of q^ = m a x ; i j pi is often m u c h s m a l l e r t h a n t h e s a m e set u n d e r t h e free c o m b i n a t i o n c o n d i t i o n , l e a d i n g t o s m a l l e r a d j u s t e d p-values <7(i) a n d h e n c e m o r e p o w e r f u l tests. T h e m e t h o d is c a l l e d " t r u n c a t e d " b e c a u s e it is a p p l i e d i n t h e o r d e r of t h e a d j u s t e d p-values; i t is p o s s i b l e to m a k e t h e method even more powerful by removing this constraint. O n the other h a n d , t r u n c a t i o n e n s u r e s t h a t t h e order of t h e a d j u s t e d p-values is t h e s a m e a s t h e m

F i g u r e 4.9

Closure principle for all pairwise comparisons of four means (schematic diagram). Here, i f 1 2 3 4 : Mi = M2 = M3 = M4, i2ijfc: Mi = MJ M*> ^{ij'Mfc,*} M< Mi d Mfc = M^J d : Mi Mj f Â° suitable i n dices i , j , and ÂŁ . Different line types are used to distinguish th dividual implications between hypotheses and have no other meaning. =

a n

ALL PAIRWISE COMPARISONS

order of t h e u n a d j u s a t e d p-values; w i t h o u t t r u n c a t i o n t h e o r d e r s m a y differ, l e a d i n g to possible i n t e r p r e t a t i o n difficulties. T o i l l u s t r a t e t h e m e t h o d , consider t h e immer d a t a f r o m a b o v e together w i t h m o d e l ( 4 . 5 ) . O f interest a r e t h e 10 e l e m e n t a r y h y p o t h e s e s Hij: ^ = Pji i,3 £ S T, V},i 7^ j . A s w i t h t h e s t e p - d o w n D u n n e t t p r o c e d u r e , it is helpful here to express t h e p r o c e d u r e i n t e r m s of t s t a t i s t i c s r a t h e r t h a n p - v a l u e s . L e t t^ denote t h e observed t s t a t i s t i c s for t e s t i n g Hij a n d let Uj d e n o t e t h e r a n d o m c o u n t e r p a r t s from E q u a t i o n ( 4 . 6 ) . F o r t h e immer d a t a , t h e o b s e r v e d a b s o l u t e t s t a t i s t i c s a r e given i n order a s = 4.453, i g j T = 3.917, = 3.129, tgjl = 2.576, igg = 1.877, tgjg = 1.341, = 1.324, tgjft = 0.788, = 0.553, = 0.536. T h e s e values c a n b e o b t a i n e d f r o m , for e x a m ple, t h e summary o u t p u t of t h e immer .mc o b j e c t , a s d e s c r i b e d i n S e c t i o n 4.2.1. t

U s i n g t h e t r u n c a t e d closed test p r o c e d u r e , w e s t e p t h r o u g h t h e r e l a t e d s e quence of h y p o t h e s e s HST, HMT, • • •> HMS a n d test r e l e v a n t s u b s e t s f r o m t h e full closure u s i n g m a x - i tests of t h e form (2.1) w h i l e i n c o r p o r a t i n g t h e p a r a m e t r i c m o d e l r e s u l t s f r o m C h a p t e r 3. A p p l i e d to t h e immer e x a m p l e , we o b t a i n t h e following i n d i v i d u a l steps: (i) A l l s u b s e t s a r e s t r i c t l y s m a l l e r t h a n t h e global i n t e r s e c t i o n s e t . T e s t HST using q = P ( m a x - Uj > 4.453) = 0.002. {1)

i j

(ii) R e l e v a n t s u b s e t s a n d t h e i r p-values

are given by

P(max{tMTi*TV,tpT,*MPj*MV,*pv}

> 3.917) = 0.004,

P ( m a x { t T , t p T , * M P , * s v } > 3.917) = 0.003, M

P ( m a x { t T , i T V , * P S i M v } > 3.917) = 0.003, a n d M

P(max{tMT,tps,*sv,tpv} T e s t HMT

> 3.917) = 0.003.

using q)

= m a x { 0 . 0 0 2 , 0 . 0 0 4 , 0 . 0 0 3 } = 0.004.

(iii) R e l e v a n t s u b s e t s a n d t h e i r p-values

a r e given b y

P ( m a x { £ v , * P S , * M P , * M s } > 3.129) = 0.019 a n d T

P ( m a x { t v , * P T , £ p v , t M s } > 3.129) = 0.019. T

T e s t HTV

using 3(3) = m a x { 0 . 0 0 4 , 0 . 0 1 9 } = 0.019.

( i v ) T h e r e is o n l y one relevant subset here, a n d i t s p-value

is g i v e n b y

P ( m a x { £ , * s v , * M V , * M s } > 2.576) = 0.062. P T

T e s t HPT

using q

(4)

= m a x { 0 . 0 1 9 , 0 . 0 6 2 } = 0.062.

( v ) T h e r e is o n l y one r e l e v a n t subset h e r e , a n d i t s p-value P(max{tps,tMP,*sv,^MV,^PV,*Ms}

> 1.877)

is g i v e n b y =0.269.

APPLICATIONS T e s t Hps

using (7(5) = m a x { 0 . 0 6 2 , 0 . 2 6 9 } -

0.209.

( v i ) R e l e v a n t s u b s e t s a n d t h e i r p - v a l u e s are g i v e n b y F ( m a x { t p , t M V , / ; p v } > 1-341) = 0.39, M

P(max{t p,*sv} M

T e s t HMP

> 1-341) -

and

0.347.

using (7( ) 6

m a x j O . 2 6 9 , 0.39} =

0.39.

( v i i ) T h e r e is o n l y o n e r e l e v a n t s u b s e t here, a n d i t s p - v a l u e is g i v e n b y P(max{t v,tMv,^Ms} S

T e s t Hsv

> 1-324) -

0.399.

using q

{7)

= m a x { 0 . 3 9 , 0 . 3 9 9 } = 0.399.

( v i i i ) T h e r e is o n l y one r e l e v a n t s u b s e t here, t h e s i n g l e t o n . I t s p - v a l u e is g i v e n by P O M V > 0.788) =

0.44,

t h a t is, t h e u n a d j u s t e d p-value. Test £ / M V u s i n g q) (8

m a x { 0 . 3 9 9 , 0.44} =

0.44,

( i x ) T h e r e is o n l y one r e l e v a n t s u b s e t h e r e , a n d i t s p - v a l u e is g i v e n b y P ( m a x { £ p , £ M s } > 0.553) = 0 . 8 2 6 . V

T e s t Hp-v

using q

(Q)

= m a x { 0 . 4 4 , 0 . 8 2 6 } — 0.826.

( x ) T h e r e is o n l y one r e l e v a n t s u b s e t here, t h e s i n g l e t o n . I t s p - v a l u e is g i v e n by P ( £ M S > 0.536) = 0 . 5 9 8 , t h a t i s , t h e u n a d j u s t e d p - v a l u e . T e s t -HMS u s i n g <? ) = m a x { 0 . 8 2 6 , 0 . 5 9 8 } = 0 . 8 2 6 . (10

I n t h i s e x a m p l e , t h i s is t h e o n l y case w h e r e " t r u n c a t i o n " t o o k p l a c e , s i n c e t h e a d j u s t e d p - v a l u e was t r u n c a t e d u p w a r d t o a c c o u n t f o r t h e o r d e r e d t e s t i n g . W e s t f a l l (1997) explained h o w t o a u t o m a t e t h e process of f i n d i n g t h e r e l e v a n t s u b s e t s b y u s i n g r a n k c o n d i t i o n s i n v o l v i n g c o n t r a s t m a t r i c e s for t h e v a r i o u s s u b s e t s . I n R, t r u n c a t e d closed t e s t p r o c e d u r e s are i m p l e m e n t e d i n the m u l t c o m p package v i a t h e t e s t = a d j u s t e d ( t y p e = " W e s t f a l l " ) o p t i o n . T h e code below provides exactly the results f r o m t h e m a n u a l calculations above:

DOSE RESPONSE ANALYSES R> s u m m a r y ( i m m e r . m c ,

test

Simultaneous Multiple

Fit:

aov(formula

Linear P-M S-M T-M V-M S-P T-P V-P T-S V-S V-T

== == == == == == == == == ==

Signif. (Adjusted

0 0 0 0 0 0 0 0 0 0

(YI

Hypotheses: Estimate Std. 8.15 -3.26 23. 81 4. 79 -11.41 15.66 -3.36 27. 07 8. 05 -19.02 codes: 0 p values

= adjusted(type

Tests

Comparisons

for

Means:

+ Y2)/2

General Tukey

~ Var

"Westfall"))

Linear

Hypotheses

Contrasts

+ Loc,

data

Error t value Pr(>\t\) 6. 08 1. 34 0.3899 0.8257 6. 08 -0.54 6. 08 3.92 0. 0045 6. 08 0. 79 0.4397 6. 08 -1. 88 0.2691 0. 0617 6. 08 2.58 0.8257 6. 08 -0.55 4.45 0.0020 6. 08 1.32 0.3985 6. 08 6. 08 0.0192 -3.13

' ***' 0.001 ' **' 0.01 reported â&#x20AC;&#x201D; Westfall

immer)

** *

'*' 0. 05 method)

I n T a b l e 4.2 we s u m m a r i z e t h e r e s u l t s f r o m four m e t h o d s a p p l i e d to the immer d a t a s e t . T o b e precise, T a b l e 4.2 r e p o r t s s i d e - b y - s i d e t h e u n a d j u s t e d p-values Pi from R> s u m m a r y ( i m m e r . m c ,

test

= adjusted(type

"none"))

together w i t h t h e a d j u s t e d p - v a l u e s from t h e T u k e y t e s t , i t s s t e p w i s e e x t e n s i o n u s i n g t r u n c a t e d c l o s u r e , a n d t h e S2 p r o c e d u r e (Shaffer 1986) R> s u m m a r y ( i m m e r . m c ,

test

= adjusted(type

"Shaffer"))

T h e S2 p r o c e d u r e differs from t y p e = " W e s t f a l l " b y u s i n g B o n f e r r o n i tests i n s t e a d of t h e p a r a m e t r i c r e s u l t s from C h a p t e r 3. W e f i r s t c o n c l u d e t h a t t h e p a i r w i s e c o m p a r i s o n s T - M, T - S, a n d V - T a r e a l l significant a t level a = 0.05, i r r e s p e c t i v e of w h i c h m u l t i p l e c o m p a r i s o n p r o c e d u r e is u s e d . A t a n u n a d j u s t e d l e v e l , T - P is also significant, a l t h o u g h t h e f a m i l y w i s e e r r o r r a t e m a y n o t be c o n t r o l l e d i n t h i s case. N o t e f u r t h e r , t h a t t h e t r u n c a t e d closed test p r o c e d u r e f r o m W e s t f a l l (1997) - b a s e d o n m a x - t tests w h i l e e x p l o i t i n g logical c o n s t r a i n t s a n d t h e p a r a m e t r i c r e s u l t s f r o m C h a p t e r 3 â&#x20AC;&#x201D; l e a d s to u n i f o r m l y s m a l l e r a d j u s t e d p - v a l u e s t h a n b o t h t h e T u k e y test a n d t h e Shaffer procedure. 4.3

Dose response

analyses

I n t h i s s e c t i o n w e c o n s i d e r a n A N C O V A m o d e l for a dose r e s p o n s e s t u d y w i t h t w o c o v a r i a t e s . I n S e c t i o n 4.3.1 we i n t r o d u c e t h e e x a m p l e a n d a p p l y t h e

APPLICATIONS

100

Comparison P S T V S T V T V V T a b l e 4.2

- M - M - M - M - P - P - P - S - S - T

A d j u s t e d p-values qi Tukey Shaffer Westfall

0.195 0.598 0.001 0.440 0.075 0.018 0.587 0.001 0.200 0.005

0.670 0.982 0.007 0.931 0.361 0.113 0.980 0.002 0.680 0.038

0.585 1.000 0.005 0.601 0.451 0.072 1.000 0.002 0.601 0.021

0.390 0.826 0.004 0.440 0.269 0.062 0.826 0.002 0.399 0.019

Comparison of several multiple comparison procedures for the immer data. Tukey: original Tukey (1953) test (single-step); Shaffer: S2 p r o cedure from Shaffer (1986); Westfall: truncated closed test procedure from Westfall (1997).

D u n n e t t test b a s e d o n p a i r w i s e differences a g a i n s t t h e c o n t r o l . I n S e c t i o n 4.3.2 we focus o n e v a l u a t i n g n o n - p a i r w i s e c o n t r a s t tests for d e t e c t i n g a dose r e l a t e d t r e n d . I n S e c t i o n 5.3 w e e x t e n d t h e d i s u s s i o n a n d d e s c r i b e r e l a t e d a p p r o a c h e s w h i c h c o m b i n e m u l t i p l e c o m p a r i s o n procedures w i t h m o d e l i n g t e c h n i q u e s .

4.3.1

A dose response

study

on litter weight in

mice

U n d e r s t a n d i n g t h e dose response r e l a t i o n s h i p is a f u n d a m e n t a l s t e p i n i n v e s t i g a t i n g a n y n e w c o m p o u n d , w h e t h e r i t is a n h e r b i c i d e or fertilizer, a m o l e c u l a r entity, a n e n v i r o n m e n t a l t o x i n , or a n i n d u s t r i a l c h e m i c a l ( R u b e r g 1995a,b; B r e t z , H s u , Pinheiro, a n d L i u 2008b). F o r example, determining a n adequate dose level for a m e d i c i n a l d r u g , a n d , more broadly, c h a r a c t e r i z i n g i t s dose r e sponse r e l a t i o n s h i p w i t h r e s p e c t to b o t h efficacy a n d safety, a r e k e y o b j e c t i v e s of a n y c l i n i c a l d r u g development p r o g r a m . I f t h e dose is s e t t o o h i g h , safety a n d t o l e r a b i l i t y p r o b l e m s m a y r e s u l t , b u t i f t h e dose is t o o l o w i t m a y b e c o m e difficult t o e s t a b l i s h a d e q u a t e efficacy. T h e r e is a v a s t l i t e r a t u r e o n dose findi n g , e s p e c i a l l y i n t h e a r e a of p h a r m a c e u t i c a l s t a t i s t i c s , a n d w e refer t h e reader to t h e e d i t e d b o o k s b y T i n g ( 2 0 0 6 ) ; C h e v r e t ( 2 0 0 6 ) , a n d K r i s h n a (2006) for general reading on this topic. W e u s e a dose response s t u d y i n v o l v i n g t w o c o v a r i a t e s to i l l u s t r a t e t h e m e t h o d s f r o m C h a p t e r 3. C o n s i d e r t h e s u m m a r y s t a t i s t i c s i n T a b l e 4.3 of a T h a l i d o m i d e s t u d y t a k e n from W e s t f a l l a n d Y o u n g (1993) a n d r e a n a l y z e d by B r e t z ( 2 0 0 6 ) . F o u r dose levels (0, 5, 50 a n d 500 u n i t s of t h e s t u d y c o m p o u n d ) w e r e a d m i n i s t e r e d to p r e g n a n t m i c e . T h e i r litters w e r e e v a l u a t e d for defects a n d weights. A c c o r d i n g to W e s t f a l l a n d Y o u n g ( 1 9 9 3 ) , t h e p r i m a r y response

DOSE RESPONSE

101

ANALYSES

v a r i a b l e w a s t h e average p o s t - b i r t h weight i n t h e e n t i r e l i t t e r for t h e first three weeks of t h e s t u d y . S i n c e t h e l i t t e r weight m a y d e p e n d o n g e s t a t i o n t i m e a n d n u m b e r of a n i m a l s i n t h e l i t t e r , these two v a r i a b l e s w e r e i n c l u d e d as covariates i n t h e a n a l y s i s . W e t h u s consider the A N C O V A m o d e l Vij = Po + Pi +

foziij

+ 0ez ij 2

+ eij,

(4.7)

w h e r e yij denotes t h e weight of t h e j - t h l i t t e r a t dose level i , z\ a n d z t h e two c o v a r i a t e s a n d / ? i , . . . , / ? 4 t h e p a r a m e t e r s of i n t e r e s t , n a m e l y , t h e effect of dose level i after a d j u s t i n g for t h e effects of t h e c o v a r i a t e s . F i n a l l y , t h e r a n d o m e r r o r s a r e a s s u m e d to be i n d e p e n d e n t , h o m o s c e d a s t i c a n d n o r m a l l y d i s t r i b u t e d , Sij ~ N(0 a ). N o t e t h a t m o d e l (4.7) is a s p e c i a l c a s e of t h e g e n e r a l l i n e a r m o d e l (3.1) a n d t h e m e t h o d s f r o m S e c t i o n 3.1.1 t h u s a p p l y to t h e T h a l i d o m i d e e x a m p l e . F i g u r e 4.10 d i s p l a y s t h e c o v a r i a t e - a d j u s t e d l i t t e r weight d a t a from t h e T h a l i d o m i d e s t u d y , t h a t i s , t h e least s q u a r e s m e a n s of t h e four t r e a t m e n t groups together w i t h t h e a s s o c i a t e d m a r g i n a l confidence intervals. 2

Weight Gestation time L i t t e r size

T a b l e 4.3

Dose G r o u p s a m p l e size

0 20

5 19

50 18

500 17

Mean Standard deviation Mean Standard deviation Mean Standard deviation

32.31 2.70 22.08 0.44 13.40 3.02

29.31 5.09 22.21 0.45 13.11 2.58

29.87 3.76 21.89 0.40 14.67 1.50

29.65 5.40 22.18 0.43 12.53 2.58

S u m m a r y d a t a of the Thalidomide dose response example from Westfall and Y o u n g (1993).

T h e e x p e r i m e n t a l q u e s t i o n is w h e t h e r one c a n c l a i m a s t a t i s t i c a l l y s i g n i f i c a n t decrease i n average p o s t - b i r t h weight w i t h i n c r e a s i n g doses of T h a l i d o m i d e . A s t r a i g h t f o r w a r d a p p r o a c h is to p e r f o r m t h e p a i r w i s e c o m p a r i s o n s b e t w e e n t h e t h r e e a c t i v e dose levels 5, 50, 500 a n d t h e zero-dose c o n t r o l , a d j u s t e d for c o v a r i a t e effects. I n t h i s case t h e s i n g l e - s t e p D u n n e t t t e s t i n t r o d u c e d i n S e c t i o n 4.1 is a r e a s o n a b l e a p p r o a c h . T o t h i s e n d , w e first fit t h e A N C O V A m o d e l (4.7) to t h e l i t t e r d a t a w i t h t h e a o v f u n c t i o n , R> d a t a ( " l i t t e r " , p a c k a g e = " m u l t c o m p " ) R> l i t t e r . a o v < - a o v ( w e i g h t ~ d o s e + g e s t t i m e + data = l i t t e r )

number,

a n d t h e n use t h e m u l t c o m p package to p e r f o r m t h e D u n n e t t t e s t , R> l i t t e r . m c < - g l h t ( l i t t e r . a o v , l i n f c t = m c p ( d o s e = " D u n n e t t " ) , + alternative = "less") R> s u m m a r y ( l i t t e r . m c , t e s t = a d j u s t e d ( t y p e = "single-step"))

102

APPLICATIONS

-2 32

< LU

-i

1 *3

100

200

400

300

500

Dose

F i g u r e 4.10

Multiple

S u m m a r y p l o t of t h e T h a l i d o m i d e dose response study.

Simultaneous

Tests

Comparisons

of Means:

F i t : aov(formula â&#x20AC;&#x201D; weight data = litter) Hypotheses: Estimate 5 - 0 >= 0 -3.35 50 - 0 >= 0 -2.29 500 - 0 >= 0 -2.68

f o r General

Linear

Dunnett

~ dose

Hypotheses

Contrasts

+ gesttime

number,

Linear

Signif. (Adjusted

codes: 0 p values

Std.

Error 1.29 1.34 1.33

'***' 0.001 reported â&#x20AC;&#x201D;

t value -2.60 -1.71 -2.00

Pr(<t) 0.016 0.112 0.063

'**' 0.01 '*' single-step

* .

0.05 '.' method)

0.1

' 1

A s seen f r o m t h e o u t p u t , t h e s m a l l e s t m u l t i p l i c i t y a d j u s t e d p - v a l u e is 0.016. W e c o n c l u d e t h a t t h e r e is a n o v e r a l l dose r e l a t e d s i g n i f i c a n t decrease i n l i t t e r w e i g h t . N o t e t h a t t h e m o d e l f i t l i t t e r , a o v a c c o u n t s for t h e t w o c o v a r i a t e s g e s t a t i o n t i m e g e s t t i m e a n d l i t t e r size n u m b e r . T h e a r i t h m e t i c m e a n s and t h e p o o l e d v a r i a n c e e s t i m a t e s f r o m t h e t e s t s t a t i s t i c s (4.2) for t h e D u n n e t t test i n a n A N O V A m o d e l w i t h o u t covariates have therefore t o be replaced by 2

DOSE RESPONSE

103

ANALYSES

t h e least s q u a r e s e s t i m a t e s from t h e m o r e g e n e r a l e x p r e s s i o n s (3.3) a n d ( 3 . 4 ) , w h i c h also i n c l u d e c o v a r i a t e i n f o r m a t i o n . C o n s e q u e n t l y , t h e c o r r e l a t i o n s b e t w e e n t h e test s t a t i s t i c s do not follow t h e s i m p l e p a t t e r n f r o m E q u a t i o n ( 4 . 3 ) . U s i n g t h e a d j u s t e d ( t y p e = " s i n g l e - s t e p " ) o p t i o n i n m u l t c o m p ensures t h a t t h e s t o c h a s t i c dependencies a r e a l l t a k e n i n t o a c c o u n t w h i l e p e r f o r m i n g t h e closed test p r o c e d u r e , as d e s c r i b e d i n C h a p t e r 3. I n a d d i t i o n t o t h e o v e r a l l significant r e s u l t , we c a n assess t h e a d j u s t e d p-values i n d i v i d u a l l y . W e therefore c o n c l u d e t h a t the s m a l l e s t dose level leads t o a significant r e s u l t a n d t h e r e m a i n i n g two dose levels l e a d to o n l y b o r d e r l i n e r e s u l t s . W e specified a l t e r n a t i v e = " l e s s " b e c a u s e w e a r e i n t e r e s t e d i n t e s t i n g for a d e c r e a s i n g t r e n d , t h a t i s , w h e t h e r t h e r e w a s a significant r e d u c t i o n i n l i t t e r weight. N o t e t h a t t h e m o r e powerful s t e p - d o w n D u n n e t t test c a n b e a p p l i e d i n s t e a d of t h e single-step D u n n e t t test by u s i n g t h e a d j u s t e d ( t y p e = " f r e e " ) o p t i o n ; see also S e c t i o n 4.1.2. U s i n g t h e s t e p - d o w n D u n n e t t t e s t , t h e o t h e r two dose levels b e c o m e b a r e l y significant: p = 0.046 a n d p = 0.045 for t h e m e d i u m a n d h i g h dose, respectively.

4.3.2

Trend

tests

T h e D u n n e t t test c o n s i d e r e d i n S e c t i o n 4.3.1 uses o n l y p a i r w i s e c o m p a r i s o n s of t h e i n d i v i d u a l dose levels w i t h t h e c o n t r o l . A v a r i e t y of t r e n d tests h a v e b e e n i n v e s t i g a t e d , w h i c h b o r r o w s t r e n g t h f r o m n e i g h b o r i n g dose levels w h e n t e s t i n g for a dose r e s p o n s e effect. T h e a p p r o a c h e s p r o p o s e d by, for e x a m p l e , B a r t h o l o m e w (1961); W i l l i a m s (1971), a n d M a r c u s (1976) are popular trend t e s t s , w h i c h a r e k n o w n to be m o r e p o w e r f u l t h a n t h e D u n n e t t t e s t . H o w e v e r , d u e to t h e i n c l u s i o n of c o v a r i a t e s a n d u n e q u a l g r o u p s a m p l e sizes, t h e s e t r e n d tests c a n n o t b e a p p l i e d to t h e T h a l i d o m i d e e x a m p l e . I n t h e following w e w i l l d e m o n s t r a t e h o w t h e f r a m e w o r k from C h a p t e r 3 c a n b e a p p l i e d to t h e c u r r e n t p r o b l e m , l e a d i n g t o powerful t r e n d tests for g e n e r a l p a r a m e t r i c m o d e l s . C o n s i d e r m o d e l ( 4 . 7 ) , w h e r e t h e p a r a m e t e r v e c t o r (3 c o n s i s t s of p = 7 e l e m e n t s Pi. W h e n t e s t i n g for a dose r e l a t e d t r e n d , one is t y p i c a l l y i n t e r e s t e d i n t e s t i n g t h e n u l l h y p o t h e s i s of no t r e a t m e n t effect, H:

/?! = . . . =

to reflect t h e u n c e r t a i n t y a b o u t t h e u n d e r l y i n g dose r e s p o n s e s h a p e . T h e first r o w of C ^ v e is a s i m p l e o r d i n a l t r e n d c o n t r a s t ; t h e s e c o n d r o w is a t r e n d S t f a l l

104

APPLICATIONS

c o n t r a s t suggested b y t h e a r i t h m e t i c dose levels; a n d t h e t h i r d r o w suggests a l o g - o r d i n a l dose response r e l a t i o n s h i p . T h u s , t h e use of n o n - p a i r w i s e c o n t r a s t s allows one to set e a c h of the i n d i v i d u a l tests into c o r r e s p o n d e n c e to s o m e dose response s h a p e s . B y u s i n g i n f o r m a t i o n from a l l dose levels u n d e r i n v e s t i g a t i o n , n o n - p a i r w i s e c o n t r a s t tests t e n d to be m o r e powerful t h a n t h e D u n n e t t t e s t , w h i c h is b a s e d o n p a i r w i s e c o n t r a s t s . N o t e t h a t i n t h e definition of t h e m a t r i x C ^ t h e weights for the intercept Po a n d t h e two c o v a r i a t e p a r a m e t e r s p$ a n d Pq are 0, so t h a t t h e linear f u n c t i o n C ^ e s t f a i i / ^ c o m p a r e s o n l y t h e p a r a m e t e r s of interest / 3 * . . . , j3 . I n S e c t i o n 5.3 we w i l l e x t e n d these ideas to f o r m a l l y i n c l u d e dose response m o d e l i n g i n a r i g i d h y p o t h e s e s t e s t i n g framework. V e s t f a l l

1 }

T h e a p p r o a c h from W e s t f a l l (1997) sought to let t h e c o n t r a s t coefficients m i r r o r p o t e n t i a l dose response s h a p e s . A l t e r n a t i v e l y , t h e c o n t r a s t m a t r i x c a n be specified i n a m o r e p r i n c i p l e d w a y by reflecting g e n e r a l c o m p a r i s o n s a m o n g t h e dose levels, as c o n s i d e r e d b y W i l l i a m s (1971) a n d M a r c u s ( 1 9 7 6 ) . D u e to t h e presence of covariates a n d u n e q u a l s a m p l e sizes, however, t h e i r o r i g i n a l a p p r o a c h e s c a n n o t be a p p l i e d to t h e T h a l i d o m i d e e x a m p l e . I n s t e a d , w e follow the p r o p o s a l of B r e t z (2006) a n d describe the o r i g i n a l t r e n d tests of W i l l i a m s (1971) a n d M a r c u s (1976) as c o n t r a s t tests, so t h a t the r e s u l t s from C h a p t e r 3 are a p p l i c a b l e . W e first consider the W i l l i a m s test extended to t h e framework of m u l t i ple c o n t r a s t tests. W i t h t h e s a m p l e sizes t a k e n f r o m T a b l e 4.3, t h e c o n t r a s t coefficients a r e given b y 0 Cv/niiams = (

0 -0.3519

0 -0.5143 -0.3333

-0.4857 -0.3148

F i g u r e 4.11 d i s p l a y s the c o n t r a s t coefficients for / ? i , . . . , / ? 4 i n t h e b a l a n c e d case. E a c h single c o n t r a s t test consists of c o m p a r i s o n s b e t w e e n t h e c o n t r o l group a n d t h e weighted average over t h e last t t r e a t m e n t groups, i = 1 , . . . , 3, respectively. T o i l l u s t r a t e t h i s , consider the second r o w of C ^ . The cont r o l group is c o m p a r e d w i t h the weighted average of t h e t w o highest dose levels. T h e weight for t h e c o n t r o l group is 1. T h e weights for t h e highest a n d s e c o n d highest dose levels are - 1 7 / ( 1 7 + 1 8 ) = - 0 . 4 8 5 7 a n d - 1 8 / ( 1 7 + 1 8 ) = - 0 . 5 1 4 3 , r e s p e c t i v e l y ; here, 17 a n d 18 are the respective group s a m p l e sizes t a k e n from T a b l e 4.3. T h e lowest dose level is not i n c l u d e d i n t h i s c o m p a r i s o n a n d is therefore assigned t h e weight 0. N o t e t h a t the weights for the i n t e r c e p t Pq a n d t h e two c o v a r i a t e p a r a m e t e r s P5 a n d Pq are set to 0, so t h a t C ^ /3 involves o n l y t h e c o m p a r i s o n of t h e p a r a m e t e r s of i n t e r e s t . A s s h o w n b y B r e t z (2006), using the m a t r i x C ^ ensures t h a t the s a m e t y p e of c o m p a r i s o n is p e r f o r m e d as w i t h t h e o r i g i n a l test of W i l l i a m s (1971) w h i l e t a k i n g into a c c o u n t t h e i n f o r m a t i o n of t h e covariates t h r o u g h t h e c o m p u t a t i o n of t h e least s q u a r e s e s t i m a t e s /3 a n d <x. i l l i a m s

i l l i a m s

W e can compute the contrast matrix C ^ w i t h t h e c o n t r Mat f u n c t i o n from t h e m u l t c o m p package. T h i s f u n c t i o n c o m p u t e s t h e c o n t r a s t m a t r i c e s i l l i a m s

DOSE RESPONSE

ANALYSES

105

0.0

0.5

Williams

1.0

1.5

2.0

2.5

3.0

modified Williams

\ G>_

b -

-, o o

;

•

o 0.0

0.5

1.0

1.5

2.0

2.5

3.0

Dose

F i g u r e 4.11

Plot of contrast coefficients for the W i l l i a m s test (left, dot dashed lines w i t h circles) and the modified W i l l i a m s test (right, solid lines with triangles and dot dashed lines w i t h circles), in the balanced case.

for s e v e r a l m u l t i p l e c o m p a r i s o n p r o c e d u r e s , i n c l u d i n g , a m o n g o t h e r s , t h e tests of D u n n e t t a n d T u k e y , as w e l l as t h e t r e n d tests of W i l l i a m s a n d M a r c u s ( c o n t r a s t v e r s i o n ) . I t s s y n t a x is s t r a i g h t f o r w a r d a n d w e o b t a i n t h e c o n t r a s t matrix C ^ w i t h the statements i l l i a m s

R> n < - c ( 2 0 , 1 9 , 1 8 , 17) R> - c o n t r M a t ( n , t y p e = " W i l l i a m s " ) Multiple 1 2 C 1 1 0.000 C 2 1 0.000 C 3 1 -0.352

Comparisons

3 0.000 -0.514 -0.333

Means:

Williams

Contrasts

4 -1.000 -0.486 -0.315

N o t e t h a t c o n t r M a t c o m p u t e s t h e c o n t r a s t coefficients u n d e r t h e a s s u m p t i o n of a n i n c r e a s i n g t r e n d . I n t h e T h a l i d o m i d e e x a m p l e w e a r e i n t e r e s t e d i n d e t e c t i n g a r e d u c t i o n i n l i t t e r weight a n d therefore t h e r e s u l t i n g coefficients from c o n t r M a t n e e d to be m u l t i p l i e d by — 1 . T h e m o d i f i e d W i l l i a m s test ( W i l l i a m s 1971; M a r c u s 1976) c a n b e e x t e n d e d

APPLICATIONS

106

s i m i l a r l y a n d t h e c o r r e s p o n d i n g c o n t r a s t coefficients a r e g i v e n b y /

0 0 0 0 0 \ 0

1 1 1 0.5128 0.5128 0.3509

0 0 -0.3519 0.4872 0.4872 0.3333

0 -0.5143 -0.3333 -0.5143 0 0.3158

- 1 -0.4857 -0.3148 -0.4857 - 1 - 1

0 0 0 0 0 0

0 \ 0 0 0 0 0 /

F i g u r e 4 . 1 1 also p l o t s t h e c o e f f i c i e n t s for t h e c o n t r a s t v e r s i o n o f t h e m o d i f i e d W i l l i a m s test. Note t h a t C is a s u b s e t o f C ^ n i n t h e sense t h a t t h e w e i g h t s for t h e f o r m e r t e s t are a l l c o n t a i n e d i n t h e c o n t r a s t m a t r i x o f t h e l a t t e r t e s t ( t h e f i r s t t h r e e r o w s ) . T h e s e w e i g h t s a r e p a r t i c u l a r l y s u i t a b l e for t e s t i n g c o n c a v e dose r e s p o n s e s h a p e s , as t h e h i g h e r dose g r o u p s a r e p o o l e d a n d c o m p a r e d w i t h t h e z e r o dose g r o u p . T h e r o w s five a n d s i x o f C ^ Williams a p p r o p r i a t e f o r d e t e c t i n g c o n v e x s h a p e s , as t h e y a v e r a g e t h e l o w e r t r e a t m e n t s . T h e f o u r t h r o w is p a r t i c u l a r l y p o w e r f u l for l i n e a r o r a p p r o x i m a t e l y l i n e a r r e l a t i o n s h i p s . T o i l l u s t r a t e t h e c o m p u t a t i o n o f t h e coefficients for t h e f o u r t h c o n t r a s t , w e n o t e t h a t t h e w e i g h t e d average o f t h e t w o h i g h e r d o s e levels is c o m p a r e d w i t h t h e w e i g h t e d average o f t h e r e m a i n i n g t r e a t m e n t s . W i t h t h e g r o u p s a m p l e sizes t a k e n f r o m T a b l e 4.3, we t h e r e f o r e o b t a i n 2 0 / ( 2 0 + 19) = 0 . 5 1 2 8 , 1 9 / ( 2 0 + 19) = 0 . 4 8 7 2 , 1 8 / ( 1 8 + 17) - 0 . 5 1 4 3 , a n d 1 7 / ( 1 8 + 17) = 0.4857. W e refer t o B r e t z ( 2 0 0 6 ) for t h e a n a l y t i c a l e x p r e s s i o n s o f t h e c o n t r a s t coefficients i n general linear models. U s i n g t h e c o n t r M a t f u n c t i o n , we can compute C w i t h the call W

o d

iple

1 c 1 c 2 c 3 c 4 c 5 c 6

1 . 000 1 . 000 0. 513 1 . 000 0. 513 0. 351

W i ] l i a m s

R> - c o n t r M a t ( n , t y p e Mult

2 -0. 352 0. 000 0. 487 0. 000 0. 487 0. 333

"Marcus")

Comparisons 3 -0. 333 -0. 514 -0. 514 0. 000 0. 000 0. 31 6

of Means:

Marcus

Contrasts

4 -0. -0. -0. -1. -1. -1.

315 486 486 000 000 000

W e n o w i l l u s t r a t e , how t h e m u l t c o m p package can be used t o analyze the T h a l i d o m i d e example w i t h t h e contrast versions of t h e t r e n d tests f r o m W i l l i a m s (1971) a n d M a r c u s (1976). Similar t o w h a t was done i n Section 4.3.1, w e c a n a p p l y t h e g l h t f u n c t i o n t o t h e fitted a o v o b j e c t . I n o r d e r t o p e r f o r m t h e W i l l i a m s c o n t r a s t t e s t w e pass t h e mcp ( d o s e = " W i l l i a m s " ) o p t i o n t o the l i n f c t a r g u m e n t , R> l i t t e r . m c 2 < - g l h t ( l i t t e r . a o v , a l t e r n a t i v e = " l e s s " , + l i n f c t - mcp(dose = " W i l l i a m s " ) ) R> s u m m a r y ( l i t t e r . m c 2 ) Simultaneous

Tests

f o r General

Linear

Hypotheses

DOSE RESPONSE

Multiple

ANALYSES

Comparisons

F i t : aov(formula data = litter) Linear C 1 >= C 2 >= C 3 >= Signif. (Adjusted

107

Means:

= weight

~ dose

Hypotheses: Estimate Std. Error 0 -2.68 1.33 0 -2.48 1.12 0 -2.79 1.05 codes: 0 p values

Williams

+ gesttime

t value -2.00 -2.21 -2.66

'***' 0.001 reported â&#x20AC;&#x201D;

Contrasts

Pr(<t) 0.0436 0.0285 0.0097

'**' 0.01 single-step

number,

* * **

'*'

0.05 '.' method)

0.1

' ' 1

T h e a d j u s t e d p - v a l u e for t h e W i l l i a m s c o n t r a s t t e s t is t h e m i n i m u m of t h e set of t h r e e p - v a l u e s l i s t e d i n t h e last c o l u m n . I n o u r e x a m p l e , t h e a d j u s t e d p - v a l u e is 0.01 a n d w e c o n c l u d e for a significant dose r e s p o n s e s i g n a l a t t h e 5 % significance l e v e l . N o t e t h a t t h i s p - v a l u e is s m a l l e r t h a n t h e a d j u s t e d p - v a l u e 0.016 for t h e D u n n e t t test from S e c t i o n 4.3.1. T h i s i n d i c a t e s t h a t t r e n d tests i n d e e d t e n d to b e m o r e powerful t h a n t h e p a i r w i s e c o m p a r i s o n s f r o m D u n n e t t . I n a n a l o g y to t h e p r e v i o u s g l h t c a l l , we c a n also a p p l y t h e m o d i f i e d W i l l i a m s test b y c a l l i n g R> g l h t ( l i t t e r . a o v , l i n f c t = m c p ( d o s e = + alternative = "less")

"Marcus"),

W e c a n i m p r o v e t h e W i l l i a m s test by a p p l y i n g t h e c l o s u r e p r i n c i p l e from S e c t i o n 2.2.3. T e c h n i c a l l y , t h e m = 3 e l e m e n t a r y o n e - s i d e d h y p o t h e s e s a r e g i v e n b y Hj : c J / 3 < 0, w h e r e f3 denotes t h e 7 x 1 p a r a m e t e r v e c t o r i n t r o d u c e d i n S e c t i o n 4.3.1 a n d cj is t h e j-th r o w of C , j f = 1 , 2 , 3 . Note that the c o n t r a s t coefficients h a v e b e e n s c a l e d s u c h t h a t l a r g e v a l u e s of j3 i n d i c a t e a dose r e l a t e d weight loss. W i t h t h e closure p r i n c i p l e , w e c o n s i d e r i n a d d i t i o n all p a i r w i s e i n t e r s e c t i o n h y p o t h e s e s Hi PI Hj,' 1 = i < j = 3, a n d t h e global i n t e r s e c t i o n n u l l h y p o t h e s i s Hi PI H n ^ 3 . T h e W i l l i a m s c o n t r a s t s s a t i s f y t h e free c o m b i n a t i o n p r o p e r t y ( S e c t i o n 2.1.2), b e c a u s e t h e r e is one a d d i t i o n a l free p a r a m e t e r i n e a c h s u c c e s s i v e c o n t r a s t . W e c a n t h u s a p p l y t h e mcp ( d o s e = " f r e e " ) o p t i o n d i s c u s s e d i n S e c t i o n 4.1.2, W i l l i a m s

R> s u m m a r y ( l i t t e r . m c 2 , Simultaneous Multiple

Comparisons

test Tests of

F i t : aov(formula = weight data = litter)

= adjusted(type for

Means:

~ dose

General

"free"))

Linear

Williams

+ gesttime

Hypotheses

Contrasts

number,

APPLICATIONS

108 Linear C 1 >= C 2 >= C 3 >=

Hypotheses: Estimate Std. -2.68 0 -2.48 0 -2.79 0

Signif. (Adjusted

codes: 0 p values

Error 1.33 1.12 1.05

t value -2.00 -2.21 -2.66

'***' 0.001 reported â&#x20AC;&#x201D;

'**' free

Pr(<t) 0.0245 0.0237 0.0091

* * **

0.01 '*' 0.05 method)

'.'

0.1

' ' 1

A s s e e n from the o u t p u t , the a p p l i c a t i o n of the closure p r i n c i p l e leads to a r e d u c t i o n of t h e a d j u s t e d p - v a l u e s . 4.4

Variable selection in regression

models

G a r c i a , W a g n e r , H o t h o r n , K o e b n i c k , Z u n f t , a n d T r i p p o (2005) a p p l i e d p r e d i c t i v e regression e q u a t i o n s to b o d y fat content w i t h n i n e c o m m o n a n t h r o p o m e t r i c m e a s u r e m e n t s , w h i c h were o b t a i n e d from 71 h e a l t h y G e r m a n w o m e n . I n addition, the body composition was measured by D u a l E n e r g y X - R a y A b s o r p t i o m e t r y ( D X A ) . T h i s reference m e t h o d is v e r y a c c u r a t e i n m e a s u r i n g b o d y fat b u t h a s l i m i t e d p r a c t i c a l a p p l i c a t i o n b e c a u s e of i t s h i g h c o s t s a n d c h a l l e n g i n g methodology. T h e r e f o r e , a s i m p l e regression e q u a t i o n for p r e d i c t ing D X A b o d y fat m e a s u r e m e n t s is of s p e c i a l interest for t h e p r a c t i t i o n e r . B a c k w a r d - e l i m i n a t i o n w a s a p p l i e d to select i m p o r t a n t v a r i a b l e s from t h e a v a i l able a n t h r o p o m e t r i c a l m e a s u r e m e n t s . G a r c i a et a l . (2005) r e p o r t e d a final linear model utilizing hip circumference, knee breadth a n d a compound c o v a r i a t e defined as t h e s u m of the l o g a r i t h m i z e d c h i n , t r i c e p s a n d s u b s c a p u l a r skinfolds. H e r e , we fit t h e s a t u r a t e d m o d e l to the d a t a a n d use a d j u s t e d p-values to select i m p o r t a n t v a r i a b l e s , w h e r e the m u l t i p l i c i t y a d j u s t m e n t a c c o u n t s for the c o r r e l a t i o n b e t w e e n t h e test s t a t i s t i c s ; see C h a p t e r 3. W e use t h e l m f u n c t i o n s u c h t h a t t h e l i n e a r m o d e l i n c l u d i n g a l l covariates w i t h t h e u n a d j u s t e d p-values is g i v e n b y R> d a t a ( " b o d y f a t " , package R> b o d y f a t . l m < - l m ( D E X f a t R> s u m m a r y ( b o d y f a t . l m ) Call: lm(formula

= DEXfat

Residuals: Min 1Q Median -6.954 -1.949 -0.219

= ~

., data

3Q 1.169

"mboost") ., data =

bodyfat)

Max 10.812

Coefficients: (Intercept) age waistcirc

Estimate -69.0283 0.0200 0.2105

Std.

Error 7.5169 0.0322 0.0671

t value -9.18 0.62 3.13

Pr(>\t\) 4.2e-13 0.5378 0.0026

*** **

V A R I A B L E S E L E C T I O N IN R E G R E S S I O N M O D E L S hipcirc elbowbreadth kneebreadth anthro3a anthro3b anthro3c anthro4 Signif.

0. -0. 1. 5. 9. 0. -6.

codes:

3435 4124 7580 7423 8664 3874 5744

'***'

Residual standard Multiple R-squared: F - s t a t i s t i c : 81.3

0. 0804 1 . 0229 0. 7250 5. 2075 5. 6579 2. 0875 6. 4892 0.001

4. 27 -0. 40 2. 42 1 . 10 1 . 74 0. 19 -1 . 01

'**'

0.01

109

6.9e-05 0. 6883 0.0183 0.2745 0.0862 0.8534 0.3150 '*'

error: 3.28 on 61 degrees 0.923, Adjusted on 9 and 61 DF, p-value:

0.05

'.'

0.1

of freedom R-squared: <2e-16

' ' 1

0.912

T h e m a t r i x C , w h i c h defines the e x p e r i m e n t a l q u e s t i o n s of i n t e r e s t , is e s s e n t i a l l y t h e i d e n t i t y m a t r i x , e x c e p t for t h e i n t e r c e p t , w h i c h is o m i t t e d . T h i s reflects our i n t e r e s t i n assessing e a c h i n d i v i d u a l v a r i a b l e : R> K < - c b i n d ( 0 , d i a g ( l e n g t h ( c o e f ( b o d y f a t . l m ) ) R> r o w n a m e s ( K ) < names(coef(bodyfat.lm))[-1] R> K 1,2] 1 0 0 0 0 0 0 0 0

[,1] 0 0 0 0 0 0 0 0 0

age waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b anthro3c anthro4

1,3) 0 1 0 0 0 0 0 0 0

[,4] 0 0 1 0 0 0 0 0 0

[,5] 0 0 0 1 0 0 0 0 0

L, 61 0 0 0 0 1 0 0 0 0

[, 7] 0 0 0 0 0 1 0 0 0

1))

[,8] 0 0 0 0 0 0 1 0 0

[, 9] 0 0 0 0 0 0 0 1 0

O n c e t h e m a t r i x C is defined, it c a n b e u s e d to set u p t h e m u l t i p l e parison problem using the g l h t function i n m u l t c o m p R> b o d y f a t . m c

glht(bodyfat.lm,

linfct

[,10] 0 0 0 0 0 0 0 0 1 com-

= K)

T r a d i t i o n a l l y , one w o u l d p e r f o r m a n F test to c h e c k i f a n y of t h e regression coefficients is s i g n i f i c a n t , R> summary ( b o d y f a t . m c , General Linear

test

Linear

Hypotheses:

age == 0 waistcirc == 0 hipcirc == 0 elbowbreadth == 0 kneebreadth == 0 anthro3a == 0

Estimate 0.0200 0.2105 0.3435 -0.4124 1. 7580 5.7423

FtestO)

Hypotheses

APPLICATIONS

110 anthro3b anthro3c anthro4

== 0 == 0 == 0

9.8664 0.3874 -6.5744

Global 1

Test: F DF1 DF2 81.3 9 61

Pr(>F) 1.39e-30

A s seen from the o u t p u t , t h e F test is h i g h l y significant. H o w e v e r , b e c a u s e t h e F test is a n o m n i b u s test, we c a n n o t assess w h i c h covariates significantly d e v i a t e f r o m 0 w h i l e controlling the f a m i l y w i s e error r a t e . C a l c u l a t i n g t h e i n d i v i d u a l a d j u s t e d p - v a l u e s provides the n e c e s s a r y i n f o r m a t i o n , R>

summary(bodyfat.mc) Simultaneous

Fit:

lm (formula

Linear

= DEXfat

for

General

., data

Linear

Hypotheses

bodyfat)

Hypotheses:

age == 0 waistcirc == 0 hipcirc == 0 elbowbreadth == 0 kneebreadth == 0 anthro3a == 0 anthro3b =- 0 anthro3c == 0 anthro4 == 0 Signif. (Adjusted

Tests

codes: p

Estimate Std. Error t value 0.0200 0.0322 0. 62 0.2105 0. 0671 3. 13 0.3435 0.0804 4. 27 -0.4124 1.0229 -0. 40 1.7550 0.7250 2. 42 5.7423 1 . 10 5.2075 9. 8664 1. 74 5.6579 0.3874 2.0875 0. 19 -6. 5744 6.4892 -1 . 01

0.001 ' * * ' 0.01 0 ' ***' values reported â&#x20AC;&#x201D; single -step

Pr(>\t\) 0. 996 0. 022 <0. 001 1. 000 0. 132 0. 895 0. 478 1. 000 0. 930

0. 05 '. ' method)

* ***

0.1

W e c o n c l u d e t h a t o n l y t w o c o v a r i a t e s , w a i s t a n d h i p c i r c u m f e r e n c e , s e e m to be relevant a n d c o n t r i b u t e to t h e significant F test. N o t e t h a t t h i s c o n c l u s i o n is d r a w n w h i l e c o n t r o l l i n g t h e f a m i l y w i s e error r a t e i n t h e s t r o n g sense; see S e c t i o n 2.1.1 for a d e s c r i p t i o n of s u i t a b l e error r a t e s i n m u l t i p l e h y p o t h e s e s testing. A l t e r n a t i v e l y , M M - e s t i m a t e s ( Y o h a i 1987) c a n be a p p l i e d a n d w e c a n use t h e r e s u l t s f r o m S e c t i o n 3.2 t o p e r f o r m s u i t a b l e m u l t i p l e c o m p a r i s o n s for the robustified l i n e a r m o d e l . W e use t h e l m r o b f u n c t i o n from t h e r o b u s t b a s e package ( R o u s s e e u w , C r o u x , T o d o r o v , R u c k s t u h l , S a l i b i a n - B a r r e r a , V e r b e k e , a n d M a e c h l e r 2009) to fit a r o b u s t v e r s i o n of the p r e v i o u s l i n e a r m o d e l . I n t e r estingly e n o u g h , t h e r e s u l t s coincide r a t h e r nicely, R> v c o v . l m r o b < - f u n c t i o n ( o b j e c t ) o b j e c t $ c o v R> s u m m a r y ( g l h t ( l m r o b ( D E X f a t ~ . , d a t a = b o d y f a t ) , + l i n f c t = K)) Simultaneous

Tests

for

General

Linear

Hypotheses

SIMULTANEOUS CONFIDENCE BANDS

Fit:

lmrob

Linear

(formula

4.5

~ ., data

bodyfat)

Hypotheses:

age == 0 waistcirc == 0 hipcirc == 0 elbowbreadth == 0 kneebreadth == 0 anthro3a == 0 anthro3b == 0 anthro3c == 0 anthro4 == 0 Signif. (Adjusted

= DEXfat

111

codes: 0 p values

Estimate 0. 0264 0. 2318 0. 3293 -0. 2549 0. 1138 2. 0106 10. 1589 1. 9133 -5. 6131

Std.

Error 0. 0193 0. 0665 0. 0155 0. 9245 0. 5280 3. 6591 4. 1110 1. 4040 5. 5195

z value Pr ( > I z I ) .31 1. 0. 1242 3.. 49 0. 0041 4..36 <0 .001 -0. .28 1. 0000 1 ..41 0. 6523 0..51 0. 9914 2.. 13 0. 2190 1 ..36 0. 1262 -1 .. 02 0. 9194 1

** ***

•' 0.001 ' * *•' O . i 01 ' *' 0.05 ' . ' 0.1 reported — s i ngle- -step method)

S i m u l t a n e o u s confidence b a n d s for t h e c o m p a r i s o n of t w o linear regression models

K l e i n b a u m , K u p p e r , M u l l e r , a n d N i z a m (1998) c o n s i d e r e d h o w s y s t o l i c b l o o d p r e s s u r e c h a n g e s w i t h age for b o t h females a n d m a l e s . F r o m t h e d a t a t h e y h a v e c o l l e c t e d i t is c l e a r t h a t t h e r e l a t i o n s h i p b e t w e e n s y s t o l i c b l o o d p r e s s u r e y a n d age x c a n b e r e a s o n a b l y d e s c r i b e d b y a l i n e a r r e g r e s s i o n m o d e l of y o n x for b o t h gender g r o u p s . A n a t u r a l q u e s t i o n is w h e t h e r t h e t w o l i n e a r regression m o d e l s for female a n d m a l e a r e t h e s a m e . T h a t i s , w e w a n t to c o m p a r e t h e two l i n e a r regression m o d e l s for a c o n t i n u o u s r a n g e of a c o v a r i a t e x ( t h a t is, age i n o u r e x a m p l e ) . Let y%i = A ) i + PiiXij

+ Sij

(4.8)

denote t w o s i m p l e l i n e a r regression m o d e l s , one for females {i = F) a n d one for m a l e s (i = M). E a c h gender h a s i t s o w n regression p a r a m e t e r s , t h e i n t e r c e p t Poi a n d t h e slope Pu. I n E q u a t i o n ( 4 . 8 ) , t h e i n d e x j denotes t h e j-th o b s e r v a t i o n ( w i t h i n gender i = F,M). Finally, we assume independent a n d normally distributed residuals ~ N(0,cr ). I n t h e m a t r i x n o t a t i o n from S e c t i o n 3.1, m o d e l (4.8) c a n b e w r i t t e n a s yi = X i / 3 4- £i,i = F, M. 2

T h e frequently u s e d a p p r o a c h to t h e p r o b l e m of c o m p a r i n g t w o l i n e a r r e g r e s s i o n m o d e l s is t o u s e a n F test for t h e n u l l h y p o t h e s i s H : / 3 = / 3 . I f H is r e j e c t e d t h e n t h e t w o regression m o d e l s a r e d e e m e d different. O t h e r w i s e , if H is n o t r e j e c t e d , t h e n t h e r e i s insufficient s t a t i s t i c a l e v i d e n c e to c o n c l u d e t h e two regression m o d e l s a r e different. B u t n o t a n g i b l e i n f o r m a t i o n o n t h e m a g n i t u d e of t h e difference b e t w e e n t h e two m o d e l s is p r o v i d e d b y t h i s h y p o t h e s e s t e s t i n g a p p r o a c h , w h e t h e r H is r e j e c t e d or n o t . F o r i n s t a n c e , i n t h e e x a m p l e from K l e i n b a u m et a l . (1998) t h e p - v a l u e for t h e F test is less t h a n 0.0001 a n d so t h e r e is a s t r o n g s t a t i s t i c a l i n d i c a t i o n t h a t t h e t w o r e g r e s s i o n m o d e l s a r e X

APPLICATIONS

112

different. B u t no m e a s u r e m e n t is p r o v i d e d for the degree of difference b e t w e e n these two regression models. F o l l o w i n g L i u , J a m s h i d i a n , Z h a n g , B r e t z , a n d H a n ( 2 0 0 7 a ) , we n o w d i s c u s s t h e c o n s t r u c t i o n of a t w o - s i d e d simultaneous confidence band as a m o r e i n t u itive a l t e r n a t i v e to t h e F t e s t . T h e n u l l h y p o t h e s i s H is r e j e c t e d i f t h e y = 0 line does not lie c o m p l e t e l y inside t h e confidence b a n d . T h e a d v a n t a g e of t h i s confidence b a n d a p p r o a c h over t h e F test is t h a t it p r o v i d e s i n f o r m a t i o n o n t h e m a g n i t u d e of the difference b e t w e e n the two regression m o d e l s , w h e t h e r or not H is r e j e c t e d . N o t e t h a t i n p r a c t i c e we often h a v e l i n e a r regression m o d e l s t h a t a r e of interest o n l y over a r e s t r i c t e d region of t h e c o v a r i a t e s . I n t h e s y s t o l i c b l o o d p r e s s u r e e x a m p l e above, it is n a t u r a l to exclude n e g a t i v e x ( = age) v a l u e s a n d p e r h a p s r e s t r i c t t h e r a n g e for x even further. T h e p a r t of t h e confidence b a n d outside t h i s r e s t r i c t e d region is useless for inference. I t is therefore u n n e c e s s a r y to g u a r a n t e e t h e 1 — a s i m u l t a n e o u s coverage p r o b a b i l i t y over t h e e n t i r e r a n g e of t h e c o v a r i a t e . F u r t h e r m o r e , inferences d e d u c e d f r o m t h e confidence b a n d outside t h e r e s t r i c t e d region, s u c h as t h e r e j e c t i o n of H, m a y not b e v a l i d s i n c e t h e a s s u m e d m o d e l m a y be w r o n g outside the r e s t r i c t e d region. T h i s c a l l s for t h e c o n s t r u c t i o n of a 1 — a. s i m u l t a n e o u s confidence b a n d over a r e s t r i c t e d region of t h e c o v a r i a t e . N o t e t h a t u n l i k e other a p p l i c a t i o n s covered i n t h i s b o o k , t h i s is a n e x a m p l e of inferences for infinite sets of l i n e a r f u n c t i o n s . B e c a u s e w e test for e a c h single x €E [a, 6], say, w h e t h e r t h e two regression m o d e l s a r e t h e s a m e , we h a v e a n infinite n u m b e r of h y p o t h e s e s . T h e s i m u l t a n e o u s confidence b a n d to be c o n s t r u c t e d is of t h e f o r m T

x /3

- x j3

€ [x /3i - x / 3 T

± ui_ sVx Ax] ,

w h e r e x = ( l , x ) , x e [a, 6], A = ( X 7 X i ) + ( X X ) ~ , and s denotes t h e p o o l e d v a r i a n c e e s t i m a t e . T h e c o n s t r u c t i o n of t h e s i m u l t a n e o u s confidence b a n d s d e p e n d s o n t h e c r i t i c a l v a l u e U\- . I n order to e n s u r e a s i m u l t a n e o u s coverage p r o b a b i l i t y of a t least 1 — a , the c r i t i c a l v a l u e t ^ i - a is c h o s e n s u c h t h a t P ( t m a x < ui-oc) — 1 —OJ, w h e r e T

l * K & - / 9 i ) - ( & - 02)11 T

*max =

SUp x€[a,6]

(4.9)

A x

B e c a u s e i n t h e present c a s e t h e d i s t r i b u t i o n of £ f o r m , efficient s i m u l a t i o n s m e t h o d s h a v e to be u s e d to c a l c u l a t e t h e c r i t i c a l v a l u e u i _ . L i u et a l . ( 2 0 0 7 a ) , e x t e n d i n g earlier m e t h o d s from L i u , J a m s h i d i a n , a n d Z h a n g ( 2 0 0 4 ) , suggested s i m u l a t i n g a large n u m b e r of t i m e s , R say, replicates of t h e r a n d o m v a r i a b l e t from (4.9) a n d t a k e t h e (1 — a)R-th largest s i m u l a t e d v a l u e as t h e c r i t i c a l values ui- . a

m & x

T h e m e t h o d of L i u et a l . (2007a) is r a t h e r c o m p l e x a n d r e q u i r e s s p e c i a l s o f t w a r e i m p l e m e n t a t i o n s . H e r e , w e a p p r o x i m a t e t h e c r i t i c a l v a l u e u\— b y disc r e t i z i n g t h e c o v a r i a t e region [a, 6]. T h a t i s , i n s t e a d of c o n s i d e r i n g t h e s u p r e m u m over t h e c o n t i n u o u s i n t e r v a l [a, 6] i n (4.9), we suggest c o n s i d e r i n g the m a x i m u m over t h e e q u a l l y s p a c e d values a = x\ < X2 < . . • < Xfc = 6; see a

SIMULTANEOUS CONFIDENCE

113

BANDS

also W e s t f a l l et a l . (1999) for a r e l a t e d a p p r o a c h t o c o n s t r u c t s i m u l t a n e o u s confidence b a n d s for a single l i n e a r regression m o d e l . T h e r e s u l t i n g c r i t i c a l value s h o u l d b e v e r y close to t h e c o r r e c t one w h e n k is large. T h e a d v a n t a g e of t h i s a p p r o a c h is t h a t one c a n u s e t h e m u l t c o m p package for t h e n e c e s s a r y c a l c u l a t i o n s . S u p p o s e w e fit l i n e a r r e g r e s s i o n m o d e l s for b o t h gender w i t h t h e l m f u n c t i o n to t h e s b p d a t a f r o m K l e i n b a u m et a l . (1998) R> d a t a ( " s b p " , p a c k a g e = " m u l t c o m p " ) R> s b p . l m < - l m ( s b p ~ g e n d e r * a g e , d a t a = s b p ) R> c o e f ( s b p . l m ) (Intercept) 110.0385 genderfemale:age -0.0120

genderfemale -12.9614

age 0.9614

W e t h e n define a g r i d for t h e c o v a r i a t e r e g i o n , t h e r e b y defining t h e l i n e a r functions, by calling R> a g e < - s e q ( f r o m = 1 7 , t o = 4 7 , b y = 1 ) R> K < - c b i n d ( 0 , 1, 0 , a g e ) R> r o w n a m e s ( K ) < - p a s t e ( " a g e " , a g e , s e p =

"")

Finally, we call the g l h t function a n d obtain a n approximate 9 9 % s i m u l t a n e ous confidence b a n d w i t h R> s b p . m c

<- glht(sbp.lm,

R> s b p . c i

linfct

confint(sbp.mc,

= K)

level =

0.99)

N o t e t h a t t h e r e s u l t i n g c r i t i c a l v a l u e is R> a t t r ( s b p . c i $ c o n f i n t , [1]

"calpha")

2.97

w h i c h is v e r y close t o t h e v a l u e uiâ&#x20AC;&#x201D;ce â&#x20AC;&#x201D; 2.969 o b t a i n e d b y L i u et a l . (2007a) based on R = 1,000,000 simulations using their sophisticated method. T h e r e s u l t i n g confidence b a n d c a n b e p l o t t e d u s i n g R> R> R> R>

plot(age, coef(sbp.mc), type = " 1 " , y l i m = c ( - 3 0 , lines(age, sbp.ci$confint[,"upr"] ) lines(age, sbp.ci$confint[,"lwr"]) a b l i n e ( h = 0, l t y = 2)

2))

I t follows f r o m F i g u r e 4.12 t h a t H is r e j e c t e d ( a t a = 0.01) b e c a u s e t h e y = 0 line is n o t i n c l u d e d i n t h e b a n d for a n y 20 < x < 4 7 . F u r t h e r m o r e , o n e c a n infer f r o m t h e b a n d t h a t females t e n d to h a v e s i g n i f i c a n t l y lower b l o o d p r e s s u r e t h a n m a l e s b e t w e e n t h e ages of 20 a n d 47, b e c a u s e t h e u p p e r c u r v e of t h e b a n d lies b e l o w t h e zero line for 20 < x < 47. T h i s inference c a n n o t b e m a d e from t h e F t e s t , b e c a u s e i t does n o t p r o v i d e i n f o r m a t i o n b e y o n d w h e t h e r or n o t two regression m o d e l s a r e t h e s a m e . T h i s i l l u s t r a t e s t h e a d v a n t a g e of u s i n g a s i m u l t a n e o u s confidence b a n d i n s t e a d of t h e F t e s t .

114

APPLICATIONS

W e conclude this example by noting that the approximation methods p r e s e n t e d here c a n b e e x t e n d e d e a s i l y u s i n g m u l t c o m p t o a c c o m m o d a t e m o r e t h a n one c o v a r i a t e , c o n s t r u c t one-sided s i m u l t a n e o u s confidence b a n d s , a n d derive s i m u l t a n e o u s confidence b a n d s for other t y p e s of a p p l i c a t i o n s , s u c h as a single l i n e a r regression m o d e l . W e refer to L i u (2010) for d e t a i l s o n e x a c t s i m u l t a n e o u s inference i n regression m o d e l s . 4.6

Multiple comparisons under heteroscedasticity

V a r i o u s studies h a v e l i n k e d a l c o h o l dependence p h e n o t y p e s to c h r o m o s o m e 4. O n e c a n d i d a t e gene is NACP ( n o n - a m y l o i d c o m p o n e n t of p l a q u e s ) , c o d i n g for a l p h a s y n u c l e i n . B o n s c h , L e d e r e r , R e u l b a c h , H o t h o r n , K o r n h u b e r , a n d B l e i c h (2005) found longer alleles of ATA C P - R E P 1 i n a l c o h o l - d e p e n d e n t p a -

MULTIPLE COMPARISONS UNDER HETEROSCEDASTICITY

115

tients c o m p a r e d w i t h h e a l t h y volunteers. T h e y r e p o r t e d t h a t allele lengths s h o w s o m e a s s o c i a t i o n w i t h t h e e x p r e s s i o n level a l p h a s y n u c l e i n m R N A i n a l c o h o l - d e p e n d e n t s u b j e c t s (see F i g u r e 4 . 1 3 ) . A l l e l e l e n g t h is m e a s u r e d as a s u m m a r y score f r o m a d d i t i v e dinucleotide r e p e a t l e n g t h a n d c a t e g o r i z e d into three groups: s h o r t (0 — 4, n = 24 s u b j e c t s ) , m e d i u m (5 — 9, n = 5 8 ) , a n d long (10 — 12, n = 15). H e r e , we are interested i n c o m p a r i n g t h e m e a n expression level of a l p h a s y n u c l e i n m R N A i n t h e t h r e e groups of s u b j e c t s defined b y allele length. T o s t a r t w i t h , we l o a d t h e d a t a s e t , R> d a t a ( " a l p h a " ,

package

n= CO

—

"4-

—

"coin")

n =

s CO CO <D Q. X LU

CM I

short

long

med N A C P - R E P 1 Allele Length

F i g u r e 4.13

Distribution of expression levels for a l p h a synuclein m R N A in three groups defined by the NACP-KEPl allele lengths.

fit a s i m p l e o n e - w a y A N O V A m o d e l to t h e d a t a w i t h t h e a o v f u n c t i o n R> a l p h a . a o v

aov(elevel

~ alength,

data =

alpha)

116

APPLICATIONS

a n d a p p l y t h e (single-step) T u k e y test u s i n g t h e g l h t f u n c t i o n i n m u l t c o m p R> a l p h a . m c < - g l h t ( a l p h a . a o v ,

linfct

= mcp(alength

"Tukey"))

W e therefore let t h e c o n t r a s t m a t r i x , w h i c h defines t h e e x p e r i m e n t a l questions of i n t e r e s t , c o n t a i n a l l p a i r w i s e differences b e t w e e n t h e t h r e e groups a n d u s e t h e mcp ( a l e n g t h = " T u k e y " ) o p t i o n for t h e l i n f c t a r g u m e n t ; see S e c t i o n 4.2 for a d e t a i l e d d e s c r i p t i o n of t h e T u k e y test. A s e x p l a i n e d i n S e c t i o n 3.3.1, t h e default i n R to fit A N O V A a n d regression m o d e l s is u s i n g a s u i t a b l e r e p a r a m e t r i z a t i o n of t h e p a r a m e t e r v e c t o r b a s e d o n t h e s o - c a l l e d t r e a t m e n t c o n t r a s t s . T h e first g r o u p i s t r e a t e d a s a c o n t r o l g r o u p , w i t h w h i c h t h e other groups a r e c o m p a r e d . A c c o r d i n g l y , t h e c o n t r a s t m a t r i x a c c o u n t i n g for t h e r e p a r a m e t r i z a t i o n is g i v e n b y R>

alpha.mc$linfct (Intercept)

med - short long - short long - med attr(, "type") [1] "Tukey"

alengthmed 0 0 0

1 0 -

alengthlong 0 1 1

W i t h t h e f r a m e w o r k of m o d e l (3.1) we o b t a i n a l l p a i r w i s e c o m p a r i s o n s of t h e m e a n e x p r e s s i o n levels t h r o u g h 02 ~ Pi P3-P2 w h e r e PQ denotes t h e i n t e r c e p t a n d pi denotes t h e m e a n effect of g r o u p % = 1,2,3. T h e a l p h a . m c o b j e c t also c o n t a i n s i n f o r m a t i o n a b o u t t h e e s t i m a t e d l i n e a r functions of interest a n d t h e a s s o c i a t e d c o v a r i a n c e m a t r i x . T h e s e q u a n t i t i e s can be retrieved w i t h the c o e f a n d vcov methods: R>

coef(alpha.mc)

med R>

- short 0.434

long

- short 1.189

long

- med 0.755

vcov(alpha.mc) med

med - short long - short long - med

- short 0.1472 0.1041 -0.0431

long

- short 0.104 0.271 0.167

long

- med -0.0431 0.1666 0.2096

T h e summary a n d c o n f i n t m e t h o d s c a n b e u s e d to c o m p u t e t h e s u m m a r y s t a t i s t i c s , i n c l u d i n g a d j u s t e d p - v a l u e s a n d s i m u l t a n e o u s confidence i n t e r v a l s , R> c o n f i n t ( a l p h a . m c )

MULTIPLE COMPARISONS UNDER HETEROSCEDASTICITY Simultaneous Multiple

Fit:

Confidence

Comparisons

aov(formula

Means:

= elevel

Linear

Tukey

Contrasts

data

alpha)

level

Hypotheses:

Estimate med - short == 0 0.4342 long - short == 0 1.1888 long - med == 0 0.1546 R>

Intervals

~ alength,

Quantile = 2.31 95% family-wise confidence

117

lwr -0.4158 -0.0452 -0.3314

upr 1.3441 2.4221 1.8406

summary(alpha.mc) Simultaneous

Multiple

Fit:

Comparisons

aov(formula

Linear

Tests of

Means:

= elevel

General Tukey

~ alength,

Linear

Hypotheses

Contrasts

data

alpha)

Hypotheses:

Estimate med - short == 0 0.434 long - short == 0 1.189 long - med == 0 0.155 Signif. (Adjusted

for

codes: 0 p values

Std.

'***' 0.001 reported â&#x20AC;&#x201D;

Error 0.384 0.520 0.458 '**' 0.01 single-step

t value 1.13 2.28 1.65 '*'

Pr(>\t\) 0.492 0.061 . 0.221 0.05 '.' method)

0.1

' ' 1

F r o m t h i s o u t p u t w e c o n c l u d e t h a t t h e r e is n o significant difference b e t w e e n a n y c o m b i n a t i o n of t h e t h r e e allele lengths. L o o k i n g a t F i g u r e 4.13, however, t h e v a r i a n c e h o m o g e n e i t y a s s u m p t i o n is questionable a n d one m i g h t challenge t h e v a l i d i t y of t h e s e r e s u l t s . O n e m a y argue t h a t a s a n d w i c h e s t i m a t e is m o r e a p p r o p r i a t e i n t h i s s i t u a t i o n . B a s e d o n t h e r e s u l t s f r o m S e c t i o n 3.2, we use the s a n d w i c h f u n c t i o n f r o m t h e s a n d w i c h package (Zeileis 2 0 0 6 ) , w h i c h provides a h e t e r o s c e d a s t i c i t y - c o n s i s t e n t e s t i m a t e of t h e c o v a r i a n c e m a t r i x . T h e v c o v a r g u m e n t of g l h t c a n b e u s e d to specify the alternative estimate,

118

APPLICATIONS

R> a l p h a . m c 2 < - g l h t ( a l p h a . a o v , l i n f c t + vcov = sandwich) R> s u m m a r y ( a l p h a . m c 2 ) Simultaneous Multiple

Fit:,

Comparisons

aov (formula

Linear

for

of Means:

= elevel

General

Linear

Tukey

~ alength,

= "Tukey"),

Hypotheses

Contrasts

data

alpha)

Hypotheses:

med - short == 0 long - short == 0 long - med == 0 Signif. (Adjusted

Tests

= mcp(alength

codes: 0 p values

Estimate 0.434 1.189 0.755

Std.

'***' 0.001 reported â&#x20AC;&#x201D;

Error 0.424 0.443 0.318

t value 1.02 2.68 2.37

'**' 0.01 single-step

P r f>/1/J 0.559 0.023 * 0.050 .

'*' 0.05 '.' 0.1 method)

' ' 1

N o w , h a v i n g a p p l i e d t h e s a n d w i c h e s t i m a t e , t h e g r o u p w i t h l o n g allele lengths is s i g n i f i c a n t l y different from t h e other t w o groups a t t h e 5 % significance level. T h i s r e s u l t m a t c h e s p r e v i o u s l y p u b l i s h e d s t u d y r e s u l t s b a s e d o n n o n p a r a m e t r i c a n a l y s e s . A c o m p a r i s o n of t h e s i m u l t a n e o u s confidence i n t e r v a l s b a s e d o n t h e o r d i n a r y e s t i m a t e of t h e c o v a r i a n c e m a t r i x a n d t h e s a n d w i c h e s t i m a t e is g i v e n i n F i g u r e 4.14. 4.7

M u l t i p l e c o m p a r i s o n s i n logistic regression m o d e l s

S a l i b a n d H i l l i e r (1997) r e p o r t e d t h e results of a c a s e - c o n t r o l s t u d y to i n vestigate A l z h e i m e r ' s disease a n d s m o k i n g b e h a v i o r of 198 female a n d m a l e A l z h e i m e r p a t i e n t s a n d 164 controls. T h e A l z h e i m e r d a t a s h o w n i n T a b l e 4.4 h a v e b e e n r e c o n s t r u c t e d from T a b l e 4 i n S a l i b a n d H i l l i e r (1997) a n d are d e p i c t e d i n F i g u r e 4.15; see also H o t h o r n , H o r n i k , v a n de W i e l , a n d Zeileis ( 2 0 0 6 ) . O r i g i n a l l y , t h e a u t h o r s were interested i n a s s e s s i n g w h e t h e r t h e r e is a n y a s s o c i a t i o n b e t w e e n s m o k i n g a n d A l z h e i m e r ' s diseases (or other t y p e s of d e m e n t i a ) . A f t e r a n a l y z i n g t h e s t u d y d a t a , t h e y c o n c l u d e d t h a t cigarette s m o k i n g i s less frequent i n m e n w i t h A l z h e i m e r ' s disease. I n t h i s s e c t i o n we describe how a potential association c a n be investigated i n subgroup analyses using suitable multiple comparison procedures. I n w h a t follows below w e consider a logistic regression m o d e l w i t h t h e two factors g e n d e r ( m a l e a n d female) a n d s m o k i n g . S m o k i n g h a b i t w a s classified into four levels, d e p e n d i n g o n d a i l y cigarette c o n s u m p t i o n : no s m o k i n g , less t h a n 10, b e t w e e n 10 a n d 20, a n d m o r e t h a n 20 c i g a r e t t e s p e r day. T h e response is a b i n a r y v a r i a b l e d e s c r i b i n g t h e diagnosis of t h e p a t i e n t (either suffering or n o t f r o m A l z h e i m e r ' s d i s e a s e ) . U s i n g t h e g l m f u n c t i o n , w e fit a logistic regression m o d e l t h a t i n c l u d e s b o t h m a i n effects a n d a n i n t e r a c t i o n effect of smoking a n d gender,

MULTIPLE COMPARISONS IN LOGISTIC REGRESSION MODELS

119

Ordinary covariance matrix estimate med - s h o r t long - s h o r t long - m e d 2.5 Difference

Sandwich estimate med - s h o r t long - s h o r t long - m e d -0.5

0.0

0.5

1.0

1.5

2.0

Difference

Figure 4.14

Simultaneous confidence intervals based on the ordinary estimate of the covariance matrix (top) and a sandwich estimate (bottom).

R> d a t a ( " a l z h e i m e r " , p a c k a g e = " c o i n " ) R> y < - f a c t o r ( a l z h e i m e r $ d i s e a s e â&#x20AC;&#x201D; " A l z h e i m e r " , + labels = cO'other", "Alzheimer")) R> a l z h e i m e r . g l m < - g l m ( y ~ s m o k i n g * g e n d e r , + data = alzheimer, family = binomial()) a n d use t h e summary m e t h o d a s s o c i a t e d w i t h t h e g l h t f u n c t i o n t o get a d e tailed output: R> summary ( a l z h e i m e r . g l m ) Call: glm (formula data â&#x20AC;&#x201D; Deviance

= y ~ smoking alzheimer) Residuals:

* gender,

family

= binomial

(),

120

APPLICATIONS N o . of c i g a r e t t e s d a i l y None <10 10-20 >20 Female Alzheimer Other dementias O t h e r diagnoses

91 55 80

7 7 3

15 16 25

21 9 9

Male Alzheimer Other dementias O t h e r diagnoses

35 24 24

8 1 2

15 17 22

6 35 11

T a b l e 4.4

Min -1.61

1Q -1.02

S u m m a r y of the a l z h e i m e r data.

Median -0. 79

3Q 1. 31

Max 2.08

Coefficients: Estimate -0. 3944 0. 0377 -0. 6111 0. 5486 0. 0786 1 .2589 -0 . 0286 -2. 2696

(Intercept) smoking<10 smokinglO-20 smoking>20 genderMale smoking<l0:genderMale smokingl0â&#x20AC;&#x201D;20:genderMale smoking>20:genderMale (Dispersion

parameter

Null deviance: Residual deviance: AIC: 689.5 Number

Fisher

f o r binomial 707.90 673.55

Scoring

on on

537 530

Std.

Error 0. 1356 0. 5111 0. 3308 0. 3487 0. 2604 0. 8769 0. 5012 0. 5995

z val ue Pr ( > l z f ) -2. 91 0. 00364 0. 07 0. 94114 - 1 . 85 0. 0 64 73 1 .57 0. 11565 0.30 0. 76287 1. 44 0. 15111 -0.06 0. 95457 -3. 79 0. 00015

family

taken

t o be

degrees degrees

of of

freedom freedom

iterations:

T h e n e g a t i v e r e g r e s s i o n c o e f f i c i e n t for m a l e s w h o a r e h e a v y s m o k e r s i n d i cates t h a t A l z h e i m e r ' s disease m i g h t be less f r e q u e n t i n t h i s g r o u p . B u t t h e m o d e l remains d i f f i c u l t t o i n t e r p r e t based on the coefficients a n d c o r r e s p o n d i n g p-values only. Therefore, we c o m p u t e simultaneous confidence intervals o n the p r o b a b i l i t y scale for t h e d i f f e r e n t r i s k g r o u p s . F o r e a c h f a c t o r c o m b i n a t i o n o f g e n d e r a n d s m o k i n g , t h e p r o b a b i l i t y o f s u f f e r i n g f r o m A l z h e i m e r ' s disease c a n be e s t i m a t e d by c o m p u t i n g t h e logit f u n c t i o n of t h e linear p r e d i c t o r f r o m the a l z h e i m e r . g l m m o d e l f i t . U s i n g the p r e d i c t m e t h o d for generalized linear m o d e l s is a c o n v e n i e n t w a y t o c o m p u t e these p r o b a b i l i t y e s t i m a t e s . A l t e r n a -

122

APPLICATIONS

R> K s>20 gMale ( I c p t ) s<10 s 10-20 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 s i 0-20: gMale s>20:gMale 0 0 None:Fema1e 0 0 <10:Female 0 0 10-20 -.Female 0 0 >20 -.Female 0 0 None:Male 0 0 <10:Male 0 1 10-20-.Male 1 0 >20:Male

s<10:gMale 0 0 0 0 0 1 0 0

None:Fema1e <10:Female 10-20 -.Female >20:Female None:Male <10:Male 10-20:Male >20:Male

w h e r e t h e r o w s represent t h e eight factor c o m b i n a t i o n s of g e n d e r a n d s m o k i n g a n d t h e c o l u m n s m a t c h t h e p a r a m e t e r vector. R e c a l l f r o m S e c t i o n 3.3.1 t h a t t h e default i n R to fit A N O V A a n d regression m o d e l s is u s i n g a s u i t a b l e r e p a r a m e t r i z a t i o n of t h e p a r a m e t e r vector based o n t h e s o - c a l l e d t r e a t m e n t c o n t r a s t s . I n t h e a l z h e i m e r e x a m p l e , t h e n o n - s m o k i n g female group N o n e : F e m a l e is t r e a t e d as a c o m m o n reference group, to w h i c h t h e other groups a r e c o m p a r e d . F o r e x a m p l e , to o b t a i n t h e effect < 1 0 : M a l e of m a l e s s m o k i n g less t h a n 10 cigarettes, we need to a d d the female effect s<10 of s m o k i n g less t h a n 10 cigarettes, t h e m a l e gender effect g M a l e a n d t h e i r i n t e r a c t i o n effect s < 1 0 : g M a l e . T h u s , r o w 6 i n t h e m a t r i x above h a s t h e entries 1 i n t h e a s s o c i a t e d c o l u m n s a n d 0 o t h e r w i s e . T h e other seven r o w s a r e i n t e r p r e t e d s i m i l a r l y . H a v i n g set u p t h e m a t r i x C t h i s way, we c a n c a l l R> a l z h e i m e r . c i < -

confint(glht(alzheimer.glm,

linfct

K))

where R> a t t r ( a l z h e i m e r . c i $ c o n f i n t , [IJ

"calpha")

2.73

gives t h e c r i t i c a l v a l u e f r o m t h e j o i n t a s y m p t o t i c m u l t i v a r i a t e n o r m a l d i s t r i b u t i o n . N o t e , however, t h a t we h a v e eight factor c o m b i n a t i o n s w i t h i n d e p e n d e n t p a t i e n t s . T h e test s t a t i s t i c s a r e therefore u n c o r r e l a t e d a s l o n g as no c o v a r i a t e s are i n c l u d e d i n t h e a n a l y s i s . T h e S i d a k a p p r o a c h d e s c r i b e d i n S e c t i o n 2.3.3 is e x a c t i n t h i s s i t u a t i o n a n d t h e c r i t i c a l value a b o v e c a n also b e o b t a i n e d by calling R> q n o r m ( ( l - ( l - 0 . 0 5 ) ~ ( l / 8 ) ) / 2 , [1 ] 2. 73

lower.tail *

FALSE)

M U L T I P L E COMPARISONS IN L O G I S T I C R E G R E S S I O N M O D E L S

123

T h i s e x a m p l e s h o w s t h a t a l t h o u g h m u l t c o m p a l w a y s gives c o r r e c t r e s u l t s , it c a n be r e p l a c e d by m o r e efficient d i r e c t a p p r o a c h e s i n s p e c i a l s i t u a t i o n s . F i n a l l y , w e c a n c o m p u t e t h e s i m u l t a n e o u s confidence i n t e r v a l s (4.10) a n d plot t h e m w i t h R> a l z h e i m e r . c i $ c o n f i n t < - a p p l y ( a l z h e i m e r . c i $ c o n f i n t , + binomial()$linkinv) R> p l o t ( a l z h e i m e r . c i , m a i n = , xlim = c(0, 1))

see F i g u r e 4.16. U s i n g t h i s g r a p h i c a l d i s p l a y of t h e r e s u l t s , it b e c o m e s evident t h a t A l z h e i m e r ' s disease is less frequent i n m e n w h o s m o k e h e a v i l y c o m p a r e d to a l l other c o m b i n a t i o n s of t h e two c o v a r i a t e s . T h i s r e s u l t m a t c h e s t h e findings from S a l i b a n d H i l l i e r (1997). N o t e t h a t these c o n c l u s i o n s a r e d r a w n w h i l e c o n t r o l l i n g t h e f a m i l y w i s e error r a t e i n t h e s t r o n g sense.

None:Female < 10: Female

(

â&#x20AC;˘

)

1 0 - 2 0 : Female >20:Female

None:Male

<10:Male

10-20:Male

>20:Male

0.0

0.2

0.4

0.6

0.8

1.0

Probability of suffering from Alzheimer's d i s e a s e

F i g u r e 4.16

Simultaneous confidence intervals for the probability of suffering from Alzheimer's disease.

124 4.8

APPLICATIONS Multiple comparisons in survival models

T h e t r e a t m e n t of p a t i e n t s suffering from a c u t e m y e l o i d l e u k e m i a ( A M L ) is d e t e r m i n e d by a t u m o r classification s c h e m e w h i c h t a k e s v a r i o u s c y t o g e n e t i c a b e r r a t i o n s t a t u s e s into a c c o u n t . B u l l i n g e r , D o h n e r , B a i r , F r o h l i c h , S c h l e n k , T i b s h i r a n i , D o h n e r , a n d P o l l a c k (2004) i n v e s t i g a t e d a n e x t e n d e d t u m o r c l a s sification s c h e m e i n c o r p o r a t i n g m o l e c u l a r s u b g r o u p s of t h e disease o b t a i n e d b y gene e x p r e s s i o n profiling. T h e a n a l y s e s r e p o r t e d h e r e a r e b a s e d o n c l i n i c a l d a t a ( t h u s o m i t t i n g a v a i l a b l e gene e x p r e s s i o n d a t a ) p u b l i s h e d online at www.ncbi.nlm.nih.gov/geo, accession number G S E 4 2 5 . T h e overall survival t i m e a n d c e n s o r i n g i n d i c a t o r a s w e l l as t h e c l i n i c a l v a r i a b l e s age, sex, l a c t i c d e h y d r o g e n a s e level ( L D H ) , w h i t e b l o o d cell count ( W B C ) , a n d t r e a t m e n t group a r e t a k e n f r o m S u p p l e m e n t a r y T a b l e 1 i n B u l l i n g e r et a l . ( 2 0 0 4 ) . I n a d d i t i o n , t h i s t a b l e p r o v i d e s two m o l e c u l a r m a r k e r s , t h e fmslike t y r o s i n e k i n a s e 3 ( F L T 3 ) a n d t h e m i x e d - l i n e a g e l e u k e m i a ( M L L ) gene, as w e l l as c y t o g e n e t i c i n f o r m a t i o n helpful i n defining a r i s k score ( l o w : k a r y o t y p e t ( 8 ; 2 1 ) , t ( 1 5 ; 1 7 ) a n d i n v ( 1 6 ) ; i n t e r m e d i a t e : n o r m a l k a r y o t y p e a n d t ( 9 ; l l ) ; a n d h i g h : a l l other f o r m s ) . O n e i n t e r e s t i n g q u e s t i o n focuses o n t h e usefulness of t h i s r i s k score. U s i n g t h e s u r v r e g f u n c t i o n from t h e s u r v i v a l package, we fit a W e i b u l l s u r v i v a l m o d e l t h a t i n c l u d e s a l l above m e n t i o n e d c o v a r i a t e s as w e l l as t h e i r i n t e r a c t i o n s w i t h t h e p a t i e n t ' s gender. W e use t h e r e s u l t s f r o m C h a p t e r 3 to a p p l y a s u i t a b l e m u l t i p l e c o m p a r i s o n p r o c e d u r e , w h i c h a c c o u n t s for t h e a s y m p t o t i c c o r r e l a t i o n s b e t w e e n t h e test s t a t i s t i c s . W e c o m p a r e t h e three groups l o w , i n t e r m e d i a t e , a n d h i g h u s i n g T u k e y ' s a l l p a i r w i s e c o m p a r i s o n s m e t h o d d e s c r i b e d i n S e c t i o n 4.2. T h e o u t p u t below i n d i c a t e s a difference w h e n c o m p a r i n g h i g h to b o t h l o w a n d i n t e r m e d i a t e . T h e o u t p u t also s h o w s t h a t low a n d i n t e r m e d i a t e are indistinguishable, R> l i b r a r y ( " s u r v i v a l " ) R> a m l . s u r v < - s u r v r e g ( S u r v ( t i m e , e v e n t ) ~ S e x + + Age + WBC + LDH + F L T 3 + r i s k , + data - c l i n i c a l ) R> s u m m a r y ( g l h t ( a m i . s u r v , l i n f c t = m c p ( r i s k = " T u k e y " ) ) ) Simultaneous Multiple

Comparisons

F i t : survreg(formula LDH + FLT3 + risk, Linear

Tests of

for

Means:

General Tukey

= Surv(time, event) data = clinical)

Linear

Hypotheses

Contrasts

~ Sex

+ Age

+ WBC

Hypotheses:

intermediate - high == 0 low - high == 0 low - intermediate == 0 (Adjusted p values reported

Estimate Std. 1.110 1.477 0.367 â&#x20AC;&#x201D; single-step

Error 0.385 0.458 0.430

z value 2.88 3.22 0.85 method)

(>/zI) 0.0108 0.0036 0.6692

MULTIPLE COMPARISONS IN MIXED-EFFECTS MODELS 4.9

Multiple comparisons i n mixed-effects

125

models

I n m o s t p a r t s of G e r m a n y , t h e n a t u r a l or a r t i f i c i a l r e g e n e r a t i o n of forests is difficult b e c a u s e of intensive a n i m a l b r o w s i n g . Y o u n g t r e e s suffer from b r o w s i n g d a m a g e , m o s t l y b y r o e a n d r e d deer. I n order t o e s t i m a t e t h e b r o w s i n g i n t e n s i t y for s e v e r a l tree species, t h e B a v a r i a n S t a t e M i n i s t r y o f A g r i c u l t u r e a n d F o r e s t r y c o n d u c t s a s u r v e y e v e r y three y e a r s . B a s e d o n t h e e s t i m a t e d p e r centage of d a m a g e d trees, t h e m i n i s t r y m a k e s suggestions for i m p l e m e n t i n g or m o d i f y i n g deer m a n a g e m e n t p r o g r a m s . T h e s u r v e y t a k e s p l a c e i n a l l 756 g a m e m a n a g e m e n t d i s t r i c t s i n B a v a r i a . H e r e , we focus o n t h e 2006 d a t a of t h e g a m e m a n a g e m e n t d i s t r i c t n u m b e r 513 " H e g e g e m e i n s c h a f t U n t e r e r A i s c h g r u n d " , R> d a t a ( t r e e s 5 1 3 , M

package

"multcomp")

T h e d a t a of 2700 trees i n c l u d e t h e species a n d a b i n a r y v a r i a b l e i n d i c a t i n g w h e t h e r or n o t t h e tree suffers from d a m a g e c a u s e d b y deer b r o w s i n g . F i g u r e 4.17 d i s p l a y s t h e l a y o u t of t h e field e x p e r i m e n t . A p r e - s p e c i f i e d , e q u a l l y s p a c e d g r i d w a s l a i d over t h e d i s t r i c t . W i t h i n e a c h of t h e r e s u l t i n g 36 r e c t a n g u l a r a r e a s , a 1 0 0 m t r a n s e c t w a s identified o n w h i c h 5 e q u i d i s t a n t plots were d e t e r m i n e d ( t h a t i s , a plot after every 2 5 m ) . A t e a c h p l o t , 15 trees were o b s e r v e d , r e s u l t i n g i n 75 trees for e a c h a r e a , or 2700 t r e e s i n t o t a l .

o o o o

F i g u r e 4.17

L a y o u t of the t r e e s 5 1 3 field experiment. E a c h open dot denotes a plot w i t h 15 trees. Further explanations are provided in the text.

126

APPLICATIONS

W e are interested i n t h e p r o b a b i l i t y e s t i m a t e s a n d t h e confidence i n t e r v a l s for e a c h of s i x tree species u n d e r investigation. W e fit a mixed-effects logistic regression m o d e l u s i n g t h e l m e r f u n c t i o n from t h e l m e 4 package ( B a t e s a n d S a r k a r 2 0 1 0 ) . W e exclude t h e intercept b u t i n c l u d e r a n d o m effects a c c o u n t i n g for t h e s p a t i a l v a r i a t i o n of t h e trees. T h u s , we h a v e s i x fixed p a r a m e t e r s , one for e a c h of t h e s i x tree species, a n d C = d i a g ( 6 ) is t h e m a t r i x defining our e x p e r i m e n t a l questions, R> t r e e s 5 1 3 . l m e < - l m e r ( d a m a g e ~ s p e c i e s - 1 + (1 I l a t t i c e / p l o t ) , + data = trees513, family = binomial()) R> K < diag(length(fixef(trees513.1me))) W e first c o m p u t e s i m u l t a n e o u s confidence i n t e r v a l s for t h e l i n e a r functions of interest a n d t h e n t r a n s f o r m these into t h e r e q u i r e d p r o b a b i l i t i e s , R> R> R> R>

t r e e s 5 1 3 . c i <- c o n f i n t ( g l h t ( t r e e s 5 1 3 . l m e , l i n f c t = K)) prob <binomial()$linkinv(trees513.ci$confint) t r e e s 5 1 3 . c i $ c o n f i n t <- 1 - prob trees513.ci$confint[, 2:3] <- t r e e s 5 1 3 . c i $ c o n f i n t [ , 3:2]

T h e r e s u l t s a r e s h o w n i n F i g u r e 4.18. B r o w s i n g is less frequent i n h a r d w o o d . I n p a r t i c u l a r , s m a l l oak trees are a t h i g h r i s k . A s a consequence, t h e m i n i s t r y h a s d e c i d e d to i n c r e a s e t h e h a r v e s t r y of roe deers i n t h e following y e a r s . R e l a t i v e l y s m a l l s a m p l e sizes c a u s e d t h e large confidence i n t e r v a l s for a s h , m a p l e , e l m a n d l i m e trees.

spruce (119)

pine (823)

beech (266)

oak (1258)

ash/maple/elm (30)

hardwood (191)

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Probability of browsing damage

F i g u r e 4.18

Probability of roe deer browsing damage for six tree species. Sample sizes are given in brackets.

CHAPTER 5

Further Topics I n this chapter we review a selection of multiple comparison problems, which do not quite fit into the framework of Chapters 3 and 4. T h e selection reflects a biased choice from a large number of possible topics and was partly driven by the availability of software implementations i n R. I n Section 5.1 we discuss the comparison of means with two-sample multivariate d a t a using resamplingbased methods. I n Section 5.2 we review group sequential and adaptive d e signs, which allow for repeated significance testing while controlling the T y p e I error rate. Finally, in Section 5.3 we extend the ideas from Section 4.3 and describe a hybrid methodology combining multiple comparisons with m o d e l ing techniques for dose finding experiments. Because the individual sections in this chapter are not directly linked to previous parts of this book, separate notation is introduced where needed. 5.1

Resampling-based multiple comparison procedures

Resampling-based multiple comparison procedures have been investigated s y s tematically since Westfall and Young (1993) and Troendle (1995); see also Dudoit and van der L a a n (2008) for a description of such multiple comparison procedures relevant to genomics. I n this section we discuss i n detail the c o m parison of means w i t h two-sample multivariate data using resampling-based methods. T h e general case of multi-sample multivariate data is not considered here and we refer to Westfall and Troendle (2008) for a recent discussion. I n Section 5.1.1, we describe the technical details underlying permutationbased multiple test procedures for two-sample multivariate data. T o illustrate the methodology, we analyze two datasets in Section 5.1.2 using the c o i n package (Hothorn, Hornik, van de Wiel, and Zeileis 2010b), which allows us to use R for permutation multiple testing. I n Section 5.1.3 we briefly review related bootstrap-based multiple test procedures. 5.1.1

Permutation

General

methods

considerations

W i t h the closure principle from Section 2.2.3 in place, permutation-based m u l tiple test procedures are often quite simple. A l l that is required is a p e r m u tation test for every intersection null hypothesis. Advantages of permutation tests are 127

128

FURTHER TOPICS

(i) the resulting m u l t i p l e test inference procedure is exact, despite n o n n o r m a l characteristics of the data., and despite u n k n o w n covariance s t r u c tures; (ii) t h e r e s u l t i n g m u l t i p l e comparison procedures incorporate correlations, like the max-ÂŁ-based methods, to become less conservative t h a n t y p i c a l Bonferroni-based procedures; (iii) i n cases w i t h sparse d a t a , permutation-based methods offer q u a n t u m i m provements over s t a n d a r d methods, i n the sense t h a t the effective number of tests m' can be m u c h less t h a n the actual n u m b e r of tests m. Coupled w i t h (i) and ( i i ) , the final p o i n t ( i i i ) suggests a m a j o r benefit. A s sume, for example, t h a t m = 30 tests are performed for a specific analysis. A p p l y i n g the simple-minded B o n f e r r o n i test leads t o a c r i t i c a l value at level 0.05/30 = 0.0017. B u t using p e r m u t a t i o n tests t h e effective n u m b e r of tests can be s u b s t a n t i a l l y lower. I f the d a t a are sparse i t m i g h t be as s m a l l as m' = 4 (or even smaller, as seen i n the example f r o m Table 5.1). T h i s gives a B o n f e r r o n i c r i t i c a l value a t level 0.05/4 = 0.0125, thus leading t o considerably higher power. A further benefit is t h a t p e r m u t a t i o n - b a s e d m e t h o d s use the correlation s t r u c t u r e , so t h a t the actual c r i t i c a l value w i l l be higher s t i l l t h a n the Bonferroni-based 0.05/4 == 0.0125, as h i g h as 0.05 itself ( t h a t is, no a d j u s t m e n t a t a l l ) , depending on the n a t u r e of the dependence s t r u c t u r e . T o u n d e r s t a n d the key concepts, i t helps t o go t h r o u g h an example, stepby-step. T h e dataset displayed i n Table 5.1 is a m o c k - u p adverse event (side effects) dataset f r o m a safety analysis for a new d r u g , w i t h three possible events E\ and There are four t o t a l adverse events i n t h e t r e a t m e n t group for Ei, and none i n the c o n t r o l group. There is o n l y one t o t a l adverse event observed for E a n d E3. t

Table 5.1

Group

Trt Trt Trt Trt Ctrl Ctrl Ctrl

1 1 1 1 0 0 0

1 0 0 0 0 0 0

M o c k - u p dataset f r o m a safety analysis for a new d r u g . T h e e n t r y " 1 " denotes an observed adverse event; otherwise " 0 " is used.

T y p i c a l l y for two-sample b i n a r y d a t a p e r m u t a t i o n tests, the test statistic is the t o t a l number of occurrences i n the t r e a t m e n t group, a n d the one-sided pvalue is the p r o p o r t i o n of p e r m u t a t i o n s y i e l d i n g a t o t a l greater or equal t o the observed t o t a l . T h i s is k n o w n as Fisher's exact test. I n the example dataset

RESAMPLING-BASED MULTIPLE COMPARISON PROCEDURES

129

from Table 5.1, there are 7! possible permutations of the observations, but many of these are redundant. For example, the permutation that switches observations 1 and 2 (i.e., the first two rows), but leaves the remaining o b servations unchanged is redundant with the original dataset, since the same observations are in the treatment group as before, and the test statistic cannot change. There are 7!/(4!3!) = 35 non-redundant permutations of the observations. For Ei, only one of these permutations yields 4 occurrences in the treated group, hence the Fisher exact upper-tailed p-value is 1/35 = 0.0286. For Ei and E%, the single observed occurrence can take place in the treatment group for 4 / 7 = 0.5714 of the permutations, hence the upper tailed p-values for E and E are 0.5714. To implement the closed test procedure to judge the significance of the tests when the family wise error rate is controlled, we need to specify suitable tests for all intersection hypotheses. Figure 5.1 shows all the intersection hypotheses schematically. T h e null hypotheses are 2

H,:Ff

= Ff,

J C {1,2,3},

which states that the joint distributions F f and F f of the measurements in the set / are identical for treated and control groups, respectively. E a c h of these hypotheses can be tested "exactly" by (i) selecting a statistic ÂŁ

/ to test Hj\

(ii) enumerating the permutations of the treatment/control labels, and finding the value of t} for each permutation; (iii) rejecting Hj at level a if the proportion of permutation samples yielding t*i > ti is less than or equal to a . T h e reason that "exactly" is used in quotes is that the familywise error rate may be strictly less than OL due to discreteness of the permutation d i s t r i b u tion of t j . However, the familywise error rate is guaranteed to be no more than a. Another reason for quotes around the term "exactly" is that with a large sample size, typically the permutation distribution must be randomly sampled rather than enumerated completely, and in such cases the method is not "exact", even though the simulation error can be reduced with sufficient sampling. W i t h these caveats, the method controls the familywise error rate in the strong sense, with no approximation, provided that the true null state of nature is one of the states shown in the closure tree of Figure 5.1. A comment is in order concerning the statement, "provided that the true null state of nature is one of the states shown i n the closure tree." T h i s s t a t e ment is i n fact a n assumption needed for permutation tests based on the closure principle. I t is possible, for example, that the marginal distributions are equal but the joint distributions are not, and the methods discussed in this chapter do not control the familywise error rate under this scenario. O n the other hand, if the joint distributions are not equal for some subsets, then it is a fact that the treatment differs from the control, so the determination of some significant difference in that case is correct, even though some spe-

FURTHER TOPICS

130

#123 : ^i23 =

H12 : f S =

.Ff =

Hj.:

F i g u r e 5.1

\23

H23 â&#x20AC;˘ F23 =

H: 2

: F

Schematic diagram of the closed hypotheses set for the adverse event data.

cine determinations regarding marginal significances might be erroneous. A n example where this assumption fails is where the treatment has no effect on the marginal rates of occurrences of adverse events E\ and E , but does affect the joint distribution of the occurrences. T h i s assumption is called the joint exchangeability assumption by Westfall and Troendle (2008). I n the following we continue the discussion in terms of min-p tests instead of the max-t tests introduced i n Equation (2.1) and used throughout this book. T h e results are equivalent i n standard parametric applications when one takes U = i ^ ( l â&#x20AC;&#x201D; P i ) , where Fi denotes the distribution of the test statistic U. T h e explanation that follows differs slightly in that permutation-based p-values are used. 2

_ 1

I n the adverse events example above, the hypothesis # 1 2 3 : F^ = ^1^3 * tested using the min-p statistic ÂŁ123 = m i n { p i , p 2 , P 3 } . Note that for each permutation of the dataset, different p-values are obtained; call p* the pvalue for a random permutation. T h e global p-value for testing # 1 2 3 is then the proportion of the 7!/(4!3!) = 35 permutations of the dataset (where the rows are kept intact and only the treatment/control labels are permuted) for which m i n { p J , p j , P 3 } is less than or equal to m i n { p i , p 2 , P 3 } . I n this example m i n { p , p , p } = min{0.0286,0.5714,0.5714} = 0.0286. Here is where the advantage of permutation testing for sparse data becomes evident. Notice that the possible p-values p\ for E\ are either 1.0000, 0.8857, 0.3714, or 0.0286, corresponding to 1,2,3, or 4 occurrences in the treatment 3

RESAMPLING-BASED MULTIPLE COMPARISON PROCEDURES

131

group, respectively. For E and E , the possible p* are either 1.0000 or 0.5714, corresponding to 0 or 1 occurrences in the treatment group, respectively. T h u s , it is impossible for p*, i = 2,3, to be smaller than 0.0286, hence the p-value for #123, P123 = P ( m i n { p t , p J , p 5 } < 0.0286) = P(p? < 0.0286) = 0.0286. Similarly, p = 0.0286, p = 0.0286, and of course p i = 0.0286. Hence, the exact multiplicity-adjusted p-value for testing E\ is identical to the unadjusted p-value, due to the sparseness of the data in the E and E variables. E v e n when the data are not sparse, there are still benefits from permutation tests. T o illustrate, consider the 2 x 2 contingency tables in Table 5.2 that show the pattern of occurrences for two variables E\ and E . T h e Fisher exact upper-tailed p-value for E\ is p i = 0.0261, and that for E is p 2 = 0.0790. T h e standard Holm and Hochberg methods fail to find significances at level OL â&#x20AC;&#x201D; 0.05, with adjusted p-values q\ â&#x20AC;&#x201D; 0.0523 and q = 0.0790. 2

Trt Ctrl Total T a b l e 5.2

Event

Non-Event

Total

Event

Non-Event

Total

7 1 8

43 51 94

50 52 102

8 3 11

43 52 95

51 55 106

Two contingency tables for two types of adverse events E i (left) and E2 (right). Differing sample sizes are due to missing values.

However, using permutation tests within the closure paradigm we obtain Qi < 0.0451 and q = 0.0790. T o see why, examine the graphs in Figure 5.2 of the distribution of the possible number of treatment occurrences of E\ and E based on the hypergeometric distributions H(x, 50,8,102) and # ( 2 ; , 5 1 , 1 1 , 1 0 6 ) when the total number of events is 8 and 11, respectively. F r o m these d i s t r i butions, the tail probabilities give the possible Fisher upper-tailed p-values for each of the two tests; see Table 5.3. T h e p-value for Hi is P ( m i n {Pi,p } < 0.0261). B u t 2

P ( m i n { p J , p ; } < 0.0261)

P(pJ < 0.0261) + P(pJ < 0.0261)

0.0261 + 0.0190

0.0451,

where, from Table 5.2, 0.0190 is the probability that p j is less than or equal to 0.0261, and where the inequality in the first line follows from the Bonferroni inequality (2.3). T h u s , the adjusted p-value for Hi is m a x { P ( m i n { p J , p j } < 0.0261),P(pJ < 0.0261)} < 0.0451. Permutation tests take advantage of d i s creteness of the permutation distributions to reduce the p-values and hence improve power.

£2. 05

ct-

o CD'

§- ^ S- p

I *

w to

Probability

§ CD

^ 6. 3

CD CD cr PTjq O p *d ctctCD P ct- ft- CD p o O P P" o o co P o CD CO *d £ D (I 00 CO t> (D CD CD C DO a C " p S <! CO — P p ct- CD g . B » £l CD pr ^

0.0

0.1

0.2

0.3

C T t-j 2 P CO e+ ft CD O O C 8D P§ " 0 ! ~ cijj* ty cr CO 0 D 5 R PO C 3 2 O cr S ^ O o CD (jq O c »-•• CD CD g: P CO CO p CD 3 CD *D 6 H P i__> Pi cr M' cf C cV O or P ^ *d * g s CD P P 8 Cr 3 ff co 2 p U a » *d CD S- o cr H S* P CD PCD cr cr o P P P o P 5 CD tf DT CD CO ct- ^ O C cr 8 a 8? ^ ctr P a o p 8 cT M § B P- o cRo g3 aP 98 CD CD |j P - ^ CD g CD fl, CO CD *0 8§ P s CD o £ Ci CD Tp cr S 2- P Oq C D & O CD O P^ CD P P >1 3 o P & CD P CO cr *< Sr. CSJ cr CO CD 1—' o CD P o i-» P £T. cr P a P P; o CD <jq CO P CD B cr ct- ch <-i s s C^D *-i P •d C D t-j v.. P C D CD ctP XCT ^ CD ct- M o CD C? Q ^ O p CO CD CDCD t-j C ctCD co rj 2 <^ 2 CD ^ & a) a c+^ ^ a p- CD 3 52 ^ CD ^ p 3 CL cr crq . O co* ct> " i-J CD CD < CO CD B 2. CD < O P . ^ CD CO 5 « f « ^ " 1 7 P CD

z -

CO o

* CD

c CD

ffl

CD 1

o OS PCD »1 ¬

CD CD

I CD CO

m 00

S. (W

C oD

CO CD

Probability

p ci

0.0

d C O cr HI

c 3 cr

»-»• R cr P P

CD ^ Cn H to Cn • p

00 O to

§ a

O) -

0.1

0.2

0.3 .

n H

O id

1—1

RESAMPLING-BASED MULTIPLE COMPARISON PROCEDURES

Trt Events

p-value

T r t Events

p-value

0 1 2 3 4 5 6 7 8

1.00 0.9966 0.9661 0.8523 0.6201 0.3357 0.1222 0.0261 0.0025

0 1 2 3 4 5 6 7 8 9 10 11

1.00 0.9996 0.9942 0.9650 0.8738 0.6914 0.4465 0.2211 0.0790 0.0190 0.0027 0.0002

T a b l e 5.3

133

Upper-tailed p-values for the adverse event data in Table 5.2.

can make two simplifying assumptions that drastically reduce the computing burden from 0(2 ) to 0(m). These concepts are described here briefly, but see Westfall and Troendle (2008) for further details. T h e first simplifying assumption, one made throughout this book, is that each intersection hypothesis Hi is tested using either a min-p statistic m i n i / pi or a max-t statistic m a x { / U; see also E q u a t i o n (2.1). T h e second is the " s u b set pivotality" assumption coined by Westfall and Young (1993), which states that the distribution of m a x i / ti is identical under Hi a n d H = HM for all / C M = { 1 , . . . , m } . I f the first assumption is met, one need only test m h y potheses corresponding to the ordered U rather than all 2 — 1intersections. I n addition, if the second assumption is met, permutation resampling can be done simultaneously for all m tests using the global null hypothesis i f , rather than perform permutation resampling separately for each of the m intersection hypotheses corresponding to the ordered test statistics. T h e subset pivotality assumption fails notably when testing pairwise comparisons of distributions in the case of multiple treatment groups. I n this case, use of the global p e r m u tation distribution, which states exhangeability among a l l treatment groups, cannot be used to test for exchangeability among two of the treatment groups, as differences i n variability among the treatment groups can cause bias i n the statistical tests. T h e r e are several ways that one might perform permutation tests correctly i n this case; see Petrondas and Gabriel (1983) for one solution. T h e subset pivotality assumption is mainly used to simplify computations in the closed test algorithm. Along with the use of m a x - i statistics, it provides a shortcut procedure that allows the researcher to test the m hypotheses corresponding to the ordered p-values, rather than all 2 — 1intersection hyTn

€

134

FURTHER TOPICS

potheses. If it is possible to test all 2 — 1 hypotheses using valid permutation tests, then subset pivotality is not needed for multiple testing using p e r m u tation tests. Note that shortcut procedures can also be derived under other assumptions than subset pivotality; see Calian, L i , and H s u (2008) for a recent discussion. Suppose the observed test statistics are £ ° > . . . > t^ , corresponding to hypotheses Hi,... , £ f , and that larger t° suggest alternative hypotheses. Suppose a p-value for testing Hj using the statistic max^g/ ti is available. Then, m

b s

and Hi is rejected at unadjusted level a if pi < a. T h e closure method along with the assumptions of use of m a x - i tests and of subset pivotality give the following algorithm for rejecting sequentially the ordered hypotheses Hi, H ,... Step 1: B y closure, 2

reject Hi if

P [ max£i > max£?

max

I:/D{1}

B u t if I D { 1 } , then m a x

i € J

i€l

iei

tf> = t j s

b s

max P ( m a x t j > £? I: J D { 1 } \ iei Using subset pivotality, the rule becomes I:

max

/D{i}

and the rule is

reject Hi if

Hi J < ex.

max** > £ ? \ iei

Hi J < a. )

< a.

Use of the m a x - £ statistic implies

Hence, by subset pivotality and by use of the max-t statistic, the rule by which we reject Hi simplifies to reject H

if Pi

maxt; > t ?

b s

< a.

Step 2: Again by closure and subset pivotality,

we require P

maxti > £?

< a. for all / G S\

RESAMPLING-BASED MULTIPLE COMPARISON PROCEDURES

135

and P

^ m a x i i >

H ) < a for all I e

S. 2

Since we are using the max-t statistic, these conditions allow us to reject H if 2

(maxii

£?

H ) < a

b s

and P I

max

i€{2,...,m}

iobs H)

U >

<Ct.

Step j : Continuing in this fashion, we reject Hj if P

( m a x i i > £?

b s

) <

and

(

P I

max

ie{2,...,m}

ti >

iobs H)

< a,

obs 3t° H

. ..,

and max

ie{j,..-,m}

ti >

ct.

Note that at step j in the algorithm we can equivalently reject Hj, maxi<7 Pi...m < ct. Hence, the rejection rule reduces to

reject Hj if qj < a , where qj = maxi<j Pi...m denotes the adjusted p-value. One need not use resampling at all to apply the algorithm above. It only becomes a resampling-based procedure if one uses resampling to obtain the probabilities P ( m a x i g ^ . . . ^ } U >t? \H). I n Section 4.1.2 we presented a similar algorithm, where the necessary probabilities are computed from the multivariate normal or t distribution. I n addition to subset pivotality and use of max-t statistics, we also assumed that there are no logical constraints among the hypotheses (recall Section 2.1.2 for the definition of the restricted combination property). T h i s assumption is needed to assert that the algorithm is identical to closed testing using m a x - £ statistics. If there are logical constraints, power can be improved by r e s t r i c t ing attention only to admissible subsets I. I n this case, shortcut procedures are more difficult to derive; see B r a n n a t h and Bretz (2010) for details. T h e algorithm above can still be used for logically constrained hypotheses, though, despite not being fully closed, as it provides a conservative procedure relative to the fully closed algorithm. hs

FURTHER TOPICS

136

Comments

on subset pivotality

Subset pivotality is most problematic in cases where there are multiple (i.e., m > 2) groups, especially when there are drastic differences in variability b e tween the groups. Specifically, suppose the data are univariate, in three groups i = 1, 2,3. Assume further that we are interested i n the .pairwise equality h y potheses Hij : Fi = Fj, tested using permutation distributions of test statistics tij = Xi â&#x20AC;&#x201D; Xj. Subset pivotality requires, for example, that the permutation distribution of X\ â&#x20AC;&#x201D; x is identical, no matter whether one permutes the data between the groups 1 and 2 alone, excluding group 3, or by permuting the entire dataset. I n the latter case, data from all three groups is involved, in the former case only data from groups 1 and 2 is involved. W h e n data from group 3 differ dramatically, particularly i n the variance, the global permutation d i s tribution can be far different from the local distribution, and provide very bad results. T h e problem is nearly identical to the use of a pooled variance in linear model contrasts when the variances are quite different. A n example similar to one found in Westfall et al. (1999) illustrates the problem with count data. Suppose there are four treatment groups A , B , C , and D with the outcomes summarized in Table 5.4. T h e goal is to compare the three treatment groups B , C , and D with A , using permutation tests. T h e unadjusted upper-tailed Fisher exact p-values are P B A = 0.9041, P C A = 0.0297, and P D A = 0.1811. I f one uses the algorithm above, blindly, without considering the subset pivotality assumption, then the adjusted p-value for comparing C with A would be # C A = P(miiiiPi < 0.0297|i2ABCD) = 0.0049, which is smaller than the unadjusted p-value. T h e problem is caused by the large amount of data in the B group with small variance; pooling this data with all other groups and permuting similarly drives down the variability, inappropriately, for the individual comparisons. 2

T a b l e 5.4

Group

Events

Sample Size

A B C D

1 5 7 4

50 500 50 50

Mock-up dataset with four treatment groups.

There are various fixes to the problem. One is to fit a binomial model allowing different variances for the groups, and proceeding with the multiple comparisons in approximate (asymptotic) fashion as described in Section 3 . 2 . If exact inference is desired, it appears necessary to use the full closure. For example, to compare C with A , one needs exact p-values for HAC, which is P A C = 0 . 0 2 9 7 , as well as the intersection hypotheses HABC-> -HACD> l # A B C D - T h e p-value for HXBCD is shown above as P A B C D = 0 . 0 0 4 9 . Deleta

(

RESAMPLING-BASED MULTIPLE COMPARISON PROCEDURES

137

ing group B and D successively and recomputing, the remaining intersection p-values are P A C D = 0.0296 and P A B C = 0.0012. Hence, the correct, exact a d justed p-value is q A = max{0.0297,0.0296,0.0012, 0.0049} = 0.0297. Once again, we see the fortunate case where the adjusted and unadjusted p-values do not differ, when exact permutation-based tests are used. C

5.1.2

Using R for permutation

multiple

testing

R software for resampling-based multiple testing includes the m u l t t e s t p a c k age (Pollard, Gilbert, Ge, Taylor, and Dudoit 2010); see Dudoit and van der L a a n (2008) for mathematical details. Here, we analyze adverse event and multiple endpoint data with the c o i n package in R, which uses slightly different test statistics than disucussed in Section 5.1.1. Suppose that we are given observations ( y i , X i ) , where y ; denotes the individiual, possibly m u l t i variate outcome and denotes a binary treatment indicator, i = 1 , . . . , n . T h e multivariate linear test statistic t = vec is a vector containing the sum over all observations within each treatment group. Strasser and Weber (1999) derived the conditional expectation fj, and covariance matrix X under the null hypothesis of independence between y and x; see also Hothorn et al. (2006). T h u s , we can use the standardized test statistic t â&#x20AC;&#x201D; /x diag(S) / 1

and therefore obtain one statistic for each of the multiple outcome variables in yi. For binary responses, this corresponds (up to a factor (n â&#x20AC;&#x201D; l)/n) to the root of the x test statistic in a two-by-two contingency table. For continuous responses, this is roughly equivalent to a t statistic. 2

Adverse events data Testing adverse events in clinical trials is a common application. T h e dataset provided by Westfall et al. (1999, p. 242) contains binary indicators of 28 adverse events, denoted as Ei through E s- I n R? we can set up a formula with the adverse event data on the left hand side and the treatment indicator on the right hand side: 2

R> R> R> + + R>

d a t a O ' a d e v e n t " , package = "multcomp") library("coin") fm < - a s . f o r m u l a ( p a s t e ( p a s t e ( " E " , 1 : 2 8 , sep = " " , c o l l a p s e = "~ g r o u p " ) ) fm

"+"),

138 El

FURTHER TOPICS + E2 + E3 + E4 + E5 + E6 + E7 + E8 + E9 + E10 + E l l + E12 + E l 3 + E l 4 + E l 5 + E l 6 + E l 7 + E l 8 + E l 9 + E20 + E21 + E22 + E23 + E24 + E25 + E26 + E2 7 + E28 ~ group

T h e permutation distribution of the data (for 10, 000 random permutations) is evaluated using the i n d e p e n d e n c e _ t e s t function. Its output provides the necessary information to inspect the standardized statistics and the c o r r e sponding adjusted p-values: R> i t < - i n d e p e n d e n c e _ t e s t ( f m , data = adevent, + d i s t r i b u t i o n = approximate(B = 10000)) R> s t a t i s t i c ( i t , "standardized") El.no A E6.no A Ell.no A E15.no A E19.no A E23.no A E27.no A

event E2.no event E3.no event E4.no event 3.31 -0.256 0.311 -0.342 event E7.no event E8.no event E9.no event 2.02 -0.453 2.26 -1.01 event E12.no event E13.no event E14.no -1 -1.42 -1 event E16.no event E17.no event E18.no 1.42 0 -1 event E20.no event E21.no event E22.no -1.42 1 0 event E24.no event E25.no event E26.no 1 1 1 event E28.no event -1 0.318

R> p v a l u e ( i t , El.no A E6.no A Ell.no A E15.no A E19.no A E23.no A E27.no A

event 0.003 event 0.276 event

method =

E5.no E10.no

event 1.16 event 1.42

event -1.42 event -1 event -1 event -1

"single-step")

E2.no

event

E3.no

E7.no

event

E8.no

event 1

E4.no 1 E9.no

event

E5.no 1 E10.no

event event 1 0.113 1 E12.no event E13.no event E14.no event 1 0 . 708 1 0. 708 event E16.no event E17.no event E18.no event 0. 708 1 1 1 event E20.no event E21.no event E22.no event 0.708 1 1 1 event E24.no event E25.no event E26.no event 1 1 1 1 event E28.no event l l

event 1 event 0.708

T h e null hypothesis of equal distribution of the events in both treatment arms can be rejected only for E\. T h e above analysis is the single-step test procedure implemented i n the c o i n package. Step-down testing can be accomplished by successively excluding adverse events according to the size of the test statistics, and repeating the analysis.

RESAMPLING-BASED MULTIPLE COMPARISON PROCEDURES Analysis

of multiple

139

endpoints

A dataset described i n Westfall, Krishen, and Young (1998) contains measurements of patients in treatment (active drug) and control (placebo) groups, with four outcome variables (i.e., endpoints) labeled JE^L, E%, and E 4 . Table 5.5 displays the summary statistics.

Sample Size

Mean

Standard deviation

54 54 54 54

2.4444 3.2222 2.7778 3.2593

4.3683 1.4623 1.6673 1.6390

57 57 57 57

0.9298 2.5439 2.4035 2.5088

1.3740 1.3372 1.3740 1.6810

Control E x

Ez E 4

Treatment Ek E2 E3

E T a b l e 5.5

Summary results of a study with two treatments and four outcome variables.

Using the c o i n package, we again apply the i n d e p e n d e n c e _ t e s t function to investigate deviations from the null hypotheses that each response variables' distribution is the same in both arms. Here, the permutation distribution of all four test statistics are approximated by 50.000 random permutations of the data: R> d a t a ( " m t e p t " , package = "multcomp") R> i t < - i n d e p e n d e n c e s e s t ( E l + E 2 + E 3 + E 4 ~ t r e a t m e n t , + d a t a = mtept, d i s t r i b u t i o n = approximate(B = 50000)) R> s t a t i s t i c ( i t , " s t a n d a r d i z e d " )

Drug

El -2.49

E2 -2.43

E3 -1.29

R> p v a l u e ( i t , method =

Drug

El 0.0343

E2 0.0409

E4 2.33

"single-step")

E3 0.489

E4 0.0525

Differences between treatment groups can be postulated for all except the third outcome variable.

140 5.1.3

FURTHER TOPICS Bootstrap

testing - Brief

overview

Bootstrap testing is related to permutation testing. T h e simplest comparison is that bootstrap sampling is done with replacement, while permutation s a m pling is done without replacement, but there is more to it than that. Briefly, one advantage of bootstrap testing is that it can be performed using a wider class of models, for example, those with covariates, while permutation testing applies to a more rigid class of models. A disadvantage of bootstrap testing is that it is always inexact and approximate, especially for a popular separate sample type of bootstrap. Troendle, K o r n , and McShane (2004) demonstrated spectacular failure of the separate sample bootstrap for genomics applications compared to the more reasonable approximations of the pooled bootstrap and the exactness of the permutational approach. For typical linear models a p plications involving pairwise comparisons, the parametric approximation and bootstrap approximation give reasonably similar results, despite nonnormality (Bretz and Hothorn 2003). A case where bootstrapping is necessary for multiple comparisons, works well, and where other methods are not available is given by Westfall and Young (1993, Section 2.5.1).

5.2

M e t h o d s for g r o u p s e q u e n t i a l a n d a d a p t i v e d e s i g n s

I n standard experiments, the sample size is fixed and calculated based on the assumed effect sizes, variability, etc. to achieve a designated power level while controlling the T y p e I error rate. I n sequential sampling schemes, the sample size is not fixed, but a random variable. Interim analyses are performed during an ongoing experiment to make conclusions before the end of the experiment or allow one to adjust for incorrect assumptions at the design stage. For example, when interim results suggest a change in sample sizes for the subsequent stages, sample size reestimation methods become an important tool (Chuang-Stein, Anderson, Gallo, and Collins 2006). However, repeatedly looking at the data with the possibility for interim decision making may inflate the overall T y p e I error rate and appropriate analysis methods are required to guarantee its strong control at a pre-specified significance level a. Group sequential and adaptive designs are particularly widely applied in clinical trials because of ethical and financial reasons. Patients should not be treated with inefficacious treatments and interim analyses offer the possibility to stop a trial early for futility (i.e., lack of efficacy). O n the other hand, efficacious and safe treatments should be released quickly to the market so that patients in need can benefit from the potentially ground-breaking therapy. I n Section 5.2.1 we describe the basic theory of group sequential designs and illustrate the methodology with a numerical example using the g s D e s i g n package in R (Anderson 2010). I n Section 5.2.2 we briefly review adaptive designs and describe the a s d package (Parsons 2010).

GROUP SEQUENTIAL ANDA D A P T I V E DESIGNS 5.2.1

Group sequential

141

designs

I n this section we focus on group sequential designs, which have received much attention since Pocock (1977) and O ' B r i e n and Fleming (1979). I t is not the aim of this section to develop the theory i n detail. Instead, we refer to the books by Jennison and Turnbull (2000) and Proschan, L a n , and Wittes (2006) for a complete mathematical treatise of this subject. For details on g s D e s i g n we refer to the manual accompanying the package (Anderson 2010). Basic

theory

Consider normally distributed responses with unknown mean 9 and known variance a . Assume that we are interested i n testing the null hypothesis H: 9 = 9 (the one-sided case is treated similarly). Assume further that there is interest in stopping the trial early either for success (reject the null h y p o t h esis H) or for futility (retain H). I n group sequential designs, inspections are made after groups of observations. Assume a maximum number k of i n s p e c tions for some positive integer k. T h e first k — 1analyses are referred to as interim analyses, while the &-th analysis is referred to as the final analysis. L e t n i , i = 1 , . . . , /c, denote the sample sizes i n the k sequences of observations. Further, let Xi denote the mean value at stage i. For i = 1 , . . . , k consider the overall standardized test statistics 2

Ej=i

\/ ~j j n

no-

where Zi = y/h~i(xi — 9o)/cr denotes the test statistic from the ith stage of the trial. T h e test statistics z\,..., z£ are jointly multivariate normally distributed with means _ , ^ 9 Vi = E « ) = N

— On

i=i

and covariances V ( i i , t i 0 = W Towner e i < i' and ti = Y^j=i j/Yljz=i j denotes the information fraction at stage i. A s shown i n Jennison and Turnbull (2000), this set-up is a s y m p t o t i cally valid i n many practically relevant situations, including two-armed trials with normal, binary or time-to-event outcomes; see also Wassmer (1999) and Wassmer (2009). A group sequential design consists of specifying the continuation regions C i , i = 1 , . . . , k — 1, at theinterim analyses a n d the acceptance region Ck at the final analysis. I f at any interim analysis z* & Ci, the trial stops and the null hypothesis H is rejected. Otherwise, the trial is continued as long as z* 6 Ci. I f the trial continues until the final analysis and z* 6 Cfc, the null hypothesis H is retained. T h a t is, in this simplest case, the rejection region is n

142

FURTHER TOPICS

the complement of the continuation region d, i = 1,..., k — 1, and no other stopping rule is considered. Accordingly, any group sequential design needs to satisfy — a. T h e crossing boundaries defining the decision regions C< can be computed efficiently by accounting for the special structure of the covariance matrix described above (Armitage, McPherson, and Rowe 1969). Once a group s e quential design has been defined, the average sample size A S N is readily given by

where TJ = (771, ... ,rjk) and rji is defined above. A s seen from the example f u r ther below, the average sample size can be used to assees the efficiency of group sequential designs as compared to fixed designs without interim analyses. Many choices of Ci lead to valid group sequential designs, thus allowing the investigator to fine tune the trial design according to the study objectives. A common choice of group sequential designs is the A - c l a s s of boundaries introduced by Wang and Tsiatis (1987). T h e y introduced a one-parameter class of symmetric boundaries, which in case of two-sided testing and equally spaced interim analyses are given by A-0.5

Ci = (—Ui\u^, Ui = c(k, a, A)i where the constant c(k, a, A ) is chosen appropriately to control the T y p e I error rate. Note that for A = 0.5 the boundary values are all equal and thus lead to the well-known design of Pocock (1977). T h e value A = 0 generates the design of O ' B r i e n and Fleming (1979) as special case. Figure 5.3 displays the upper rejection boundaries from Wang and Tsiatis (1987) for A = 0,0.25,0.5 in a group sequential trial with k = 4 analyses and a = 0.025 (one-sided test problem). T h e error spending function method proposed by L a n and DeMets (1983) is an alternative approach to define a group sequential design. T h e idea is to specify the cumulative T y p e I error rate ex*(U) spent up to the i-th i n terim analysis and derive the critical values based on these values. T h e f u n c tion a*(ti) must be specified in advance and is assumed to be non-decreasing with a*(0) = 0 and a * ( l ) = ex. T h e time points U do not need to be prespecified before the actual course of the trial. Consequently, the number of observations at the z-th analysis and the maximum number k of analyses are flexible, although their determination must be independent of any emerging data. I n the two-sided case, the critical value for the first analysis is given by Ui = (1 — ex*(t\)/2), where 3> denotes the cumulative distribution function of the standard normal distribution. T h e remaining critical values

GROUP S E Q U E N T I A L AND A D A P T I V E DESIGNS

143

Information fraction F i g u r e 5.3

Several examples of rejection boundaries from Wang and Tsiatis (1987). Here, A = 0 generates an O'Brien-Fleming design, while A = 0.5 produces a Pocock design.

U2 > • • •, Uk are computed successively through P H tf]

{|*;| < ui} n

> Ui}\ = a * f a ) -

a*(U-i)-

T h e one-sided case is treated analogously. Several choices for the form of the error spending function a*(ti) posed. T h e choices a*(ti) = a l n ( l + (e - l)t<) and 2 (l - $ ( -° ))

(one-sided case)

(two-sided case)

1 - $ ( ^/ )j Ul

were p r o -

144

FURTHER TOPICS

approximate the group sequential boundaries from Pocock (1977) and O ' B r i e n and Fleming (1979), respectively, where u\- = <J> (1 —p). Likewise, Hwang, Shih, and D e C a n i (1990) introduced the one-parameter family -1

I ati

for 7 = 0

and showed that its use yields approximately optimal plans similar to the A class of W a n g and Tsiatis (1987). A value of 7 = — 4is used to approximate an O'Brien-Fleming design, while a value of 7 = 1 approximates a Pocock design well. Figure 5.4 displays the error spending functions from Hwang et a l . (1990) for 7 = - 4 , - 2 , 1 .

Information fraction Figure 5 . 4

Several examples of error spending functions from Hwang et al. (1990) for ex. = 0.025. Here, 7 = — 4approximates an O'Brien-Fleming design, while 7 = 1 approximates a Pocock design.

GROUP S E Q U E N T I A L AND A D A P T I V E DESIGNS

145

An example I n this section we use a numerical example to illustrate the above methodology with the g s D e s i g n package, which supports the design of group sequential trials in R. Assume that we plan a one-sided group sequential design w i t h 90% power to detect a standardized treatment effect size of 6Q = 0.15 to compare patients under treatment (active drug) and control (placebo). I n the following, we show details for an O'Brien-Fleming design, which employs more stringent boundaries at early interim analyses than later. We plan for k = 4 equally spaced analyses and a = 0.025, although the concepts below also apply to other design options. T h e g s D e s i g n function from the g s D e s i g n package provides sample size and boundaries for a group sequential design based on treatment effects, spending functions for boundary crossing probabilities, and relative timing of each analysis. According to the trial specifications above, the call R> l i b r a r y ( " g s D e s i g n " ) R> g s d . O F < - g s D e s i g n ( k = 4, t e s t . t y p e = 1, s f u = " O F " , + alpha = 0.025, beta = 0.1, timing = + d e l t a = 0.15)

provides the relevant information. I n this call, t e s t . t y p e = 1 gives a o n e sided test problem, t i m i n g = 1 produces equally spaced analyses and s f u defines the error spending function (that is, O ' B r i e n - F l e m i n g i n our example). A l l of the spending functions described in Section 5.2.1 as well as many other functions are implemented in the g s D e s i g n package. Finally, d e l t a defines the standardized treatment effect size for which the design is powered. Because of a different standardization used by the g s D e s i g n function, d e l t a has to be set as 0o/2 = 0.15; see Anderson (2010) for details. T h e next line prints a summary of g s d . O F using the p r i n t method a s s o c i ated with g s D e s i g n objects, R> g s d . O F One-sided group sequential 90 % power and 2.5 % Type Analysis

N 1 120 2 239 3 359 4 478

design with Error.

Z Nominal p 0.0000 4.05 0.0021 2.86 0.0097 2.34 0.0215 2.02

Total ++ alpha spending: O 'Brien-Fleming boundary Boundary crossing assume any cross Upper

boundary

probabilities stops the (power

Spend 0.0000 0.0021 0.0083 0.0145 0.0250

and

expected

trial

Type

Error)

sample

size

146

FURTHER TOPICS

Theta 0.00 0.15

Analysis 1 2 3 4 Total 0.000 0.0021 0.0083 0.0145 0.025 0.008 0.2850 0.4031 0.2040 0.900

EfN} 476 358

T h e upper table of the output contains for each of the k = 4 analyses the cumulative total sample size N, the (upper) rejection bound Z, the one-sided significance level Nominal p, and the error spent. I n this example, roughly 120 patients are required at each of the four stages to ensure a power of 9 0 % (recall that we asked for equally spaced interim analyses). T h i s leads to a maximum of 478 patients that are required for this design. For comparison, the sample size in a fixed design (without interim analyses) is 467. T h u s , the maximum sample size in the group sequential design is 1.02 times the sample size in a fixed-sample design. T h i s inflation factor relates the sample size of a group sequential design to its corresponding fixed design and can be obtained with g s D e s i g n by setting d e l t a = 0: R> g s d . 0 F 2 < - g s D e s i g n ( k = 4 , t e s t . t y p e = 1, + s f u = " O F " , alpha = 0.025, beta = 0 . 1 , + d e l t a = 0) R> g s d . 0 F 2 $ n . I [ 4 ] [1 ]

t i m i n g = 1,

1.02

Figure 5.5 plots the upper stopping boundaries [1]

4.05

2.86

2.34

2.02

against the cumulative sample sizes [1]

119

239

358

477

for the selected O'Brien-Fleming design. T h e area above the displayed curve is the rejection region: Whenever the test statistic crosses the boundary at an interim analysis, one can reject the null hypothesis and stop the trial early for success. Figure 5.5 was produced using the p l o t method associated w i t h the g s D e s i g n function; see below for further plotting capabilities. T h e inflation factor introduced above is independent of the standardized effect size, the test and the outcome of interest and serves as basis for s a m ple size calculations in group sequential designs. However, it relates to the maximum sample size and ignores the fact that the sample size is a random variable, as the study could be stopped early for success. T h e average s a m ple size A S N introduced i n Section 5.2.1 is a more realistic measure and is displayed in the bottom table of the g s d . O F summary output shown above. It displays the boundary crossing probabilities at each interim analysis and the expected sample size E ( N ) assuming any cross stops the trial under the null and the alternative hypothesis. I f there is no treatment effect, the a v erage sample size of 476 is slightly lower than the maximum sample size of 478 patients. However, under the alternative the advantage of group sequential designs becomes evident, as the average sample size is reduced to 358 patients, almost 2 5 % less than i n a fixed design. I t should be remembered,

GROUP S E Q U E N T I A L AND A D A P T I V E DESIGNS

147

4.05 4.0-

\ \

A 3.5

\ \

5 S

3 0

\ - v

E o

. . . .

. . Âť \ x

-z. '

2.5-

â&#x20AC;˘

/2.02

2.0-

N=120 i

N=239 t

150

200

250

N=359 i 300

350

N=478 I

400

i 450

Cumulative sample size

F i g u r e 5.5

Upper stopping boundaries for the O'Brien-Fleming design used in the numerical example.

however, that the computations assume normally distributed outcomes with known variance. These are approximate results that should be viewed with caution if the sample sizes are small. T h e previous considerations can be extended by using the g s P r o b a b i l i t y function. T h i s function computes boundary crossing probabilities and e x pected sample size of a design for arbitrary user-specified treatment effects, bounds, and interim analysis sample sizes. Accordingly, we can call R> g s d . 0 F 3 < - g s P r o b a b i l i t y ( t h e t a = gsd.0F$delta*seq(0,2,0.25), + d = gsd.OF) R> g s d . 0 F 3 One-sided group sequential 90 % power and 2.5 % Type Analysis

Nominal

design with Error. p

Spend

FURTHER TOPICS

148 1 2 3 4

120 239 359 478

4.05 2.86 2.34 2.02

0.0000 0.0021 0.0097 0.0215

Total ++ alpha spending: O'Brien-Fleming boundary Boundary crossing assume any cross

probabilities stops the

0.0000 0.0021 0.0083 0.0145 0.0250

and

expected

sample

size

trial

Upper

boundary (power or Type Analysis Theta 1 2 3 0. 0000 0. 0000 0. 0021 0. 0083 0. 0375 0. 0001 0. 0111 0. 0429 0. 0750 0. 0006 0. 0437 0. 1395 0. 1125 0. 0024 0. 1281 0. 2924 0. 1500 0. 0080 0. 2850 0. 4031 0. 1875 0. 022 7 0. 4910 0. 3752 0. 2250 0. 0558 0. 6745 0. 2428 0. 2625 0. 1188 0. 7648 0. 1123 0. 3000 0. 2202 0. 7416 0. 0378

Error)

4 0. 0145 0. 0700 0. 1811 0. 2569 0. 2040 0. 0931 0. 0251 0. 0041 0. 0004

Total 0. 025 0. 124 0. 365 0. 680 0. 900 0. 982 0. 998 1. 000 1. 000

E{N) 476 4 70 450 411 358 307 267 239 217

for the grid of 6 values R> [1] [9]

gsd.0F3$theta 0.0000 0.0375 0.3000

0.0750

0.1125

0.1500

0.1875

0.2250

0.2625

to obtain a better understanding of the operating characteristics for the s e lected design. I n addition, the g s D e s i g n provides extensive plotting capabilities. S e v eral plot types are available and each of them highlights a different aspect of the selected design. Figure 5.5 shows the default plot by displaying the O'Brien-Fleming stopping boundaries. We now present two further relevant plots. Figure 5.6 displays the cumulative boundary crossing probabilities (i.e., power) separately for the three interim analyses and the final analysis as a function of the treatment effects. Note that the x-axis is scaled relative to the effect size d e l t a for which the trial is powered. Finally, Figure 5.7 displays the average sample sizes by treatment difference (same scale for the rc-axis as in Figure 5.6). T h e horizontal line indicates the sample size of 467 for a comparable fixed design without interim analyses. T h e g s D e s i g n provides more functionalities than presented here. For e x ample, conditional power can be computed for evaluating interim trends in a group sequential design, but may also be used to adapt a trial design at an interim analysis using the methods of Miiller and Schafer (2001). T h e gsCP function provides the basis for applying these adaptive methods by computing

GROUP SEQUENTIAL AND A D A P T I V E DESIGNS

0.0

0.5

1.0

1.5

149

2.0

6/6!

F i g u r e 5.6

Cumulative boundary crossing probabilities for the O'Brien-Fleming design used in the numerical example.

the conditional probability of future boundary crossing given the interim a n a l ysis results. T h e g s D e s i g n package can also be used w i t h decision theoretic methods to derive optimal designs. Anderson (2010) describe, for example, the application of Bayesian computations to update the probability of a s u c cessful trial based on knowing a bound has not been crossed, but without knowledge of unblinded treatment results. We conclude this section by referring to the A G S D e s t package i n R(Hack and B r a n n a t h 2009), which allows the computation of repeated confidence intervals as well as confidence i n t e r vals based on the stagewise ordering in group sequential designs and adaptive group sequential designs; see Mehta, Bauer, Posch, and B r a n n a t h (2007) and Brannath, Mehta, and Posch (2009a) for the technical background.

FURTHER TOPICS

150

1 0.5

~T~

0.0

1 1.0

1 1.5

r~ 2.0

e/e.

F i g u r e 5.7

5.2.2

Average sample sizes by treatment difference for the O'Brien-Fleming design used in the numerical example. The horizontal line indicates the sample size of a comparable fixed design.

Adaptive

designs

Adaptive designs use accumulating data of an ongoing trial to decide on how to modify design aspects without undermining the validity and integrity of the trial. I f appropriate methods are used, adaptive designs provide more flexibility than the group sequential designs described in Section 5.2.1 while controlling the T y p e I error rate. For this reason, adaptive designs providing confirmatory evidence have gained much attention in recent years. Especially in the drug development area the expectation has arisen that carefully planned and conducted studies based on adaptive designs help increasing the efficiency by making better use of the observed data. There is a rich literature on such approaches for trials with a single null hypothesis, including p-value combination methods (Bauer and Kohne 1994),

GROUP S E Q U E N T I A L AND A D A P T I V E DESIGNS

151

self-designing trials (Fisher 1998), and methods based on the conditional e r ror function (Proschan and Hunsberger 1995; Muller and Schafer 2001). I n this section, we consider adaptive designs with multiple hypotheses. Bauer and Kieser (1999) proposed an analysis method for adaptive designs involving treatment selection at interim. Kieser, Bauer, and Lehmacher (1999) used a similar approach in the context of multiple outcome variables. Hommel (2001) subsequently formulated a general framework to select and add hypotheses in an adaptive design. Posch, Konig, Branson, B r a n n a t h , Dunger-Baldauf, and Bauer (2005) derived simultaneous confidence intervals and adjusted p-values for adaptive designs with treatment selection. I n the sequel, we follow the d e scription of Bretz, Schmidli, Konig, Racine, and Maurer (2006) and show how to test adaptively multiple hypotheses by applying p-value combination m e t h ods to closed test procedures. We refer to Bretz, Konig, B r a n n a t h , G l i m m , and Posch (2009a) and the references therein for details on adaptive designs p r o viding confirmatory evidence. Basic

theory

For simplicity, we start considering a single one-sided null hypothesis i f in a two-stage design, that is, with one single interim analysis. Based on the data from the first stage it is decided at the interim analysis, whether the study is continued (conducting the second stage) or not (early stopping either due to futility or due to early rejection of i f ) . I n case that one continues with the second stage, the final analysis at the end of the study combines the results of both stages. L e t pj denote the p-value for stage j = 1,2. Following Bauer and Kohne (1994), an adaptive design is specified as follows: (i) Define a test procedure for stage 1, determine the stopping rules for the interim decision and pre-specify a combination function C = C ( p i > P 2 ) of p i and p2 for the final analysis. (ii) Conduct stage 1 of the study, resulting in p i . (iii) Based on p i , decide whether to stop at interim (either rejecting or r e t a i n ing i f ) or to continue the study. (iv) I f the study is continued, use all information (also external to the study, if available) to design the second stage, such as reassessing the stage 2 sample size. (v) Conduct stage 2 of the study, resulting in p of P2 under i f .

w i t h p i being independent

(vi) Combine p i and p 2 using C and test i f by comparing C w i t h an a p p r o priate critical value. Adaptive designs offer a high degree of flexibility during the conduct of the trial. Among multistage designs, they require the least amount of decision rule pre-specification prior to a study. Furthermore, the total information available at the interim analysis can be used in designing the second stage.

152

FURTHER TOPICS

Different possibilities exist to combine the data from both stages. A common choice is Fisher's combination test which rejects H at the final stage if C(p p ) u

= P1P2 < c = e x p ( - X 4 i - « / 2 ) ,

where xt i-oc denotes the critical value of the x distribution with v degrees of freedom at level 1 — ct (Bauer and Kohne 1994). Assume that early stopping boundaries ctQ and ct\ are available such that (i) if p\ < ct\ the trial stops after the interim analysis with an early rejection of H, and (ii) if p\ > ctQ the trial stops after the interim analysis for futility (H is retained). I n order to maintain the T y p e I error rate at pre-specified level a simultaneously across both stages, ct\ is computed by solving ct\ + c(lnao — l n a i ) = ct for given ct and ctQ. Note that if OLQ = 1 no stopping for futility is foreseen and if ct\ = 0 no early rejection of H is possible. Another popular choice is the weighted inverse normal combination function 2

C(p p ) u

= 1 - $ [wi $

- 1

( l -pi)

+ w * 2

_ 1

-p )] 2

where W\ and w denote pre-specified weights such that wf-\-W2 = l and <I> denotes the cumulative distribution function of the standard normal d i s t r i b u tion (Lehmacher and Wassmer 1999; C u i , Hung, and Wang 1999). Note that for the one-sided test of the mean of normally distributed observations with known variance, the inverse normal combination test with pre-planned stagewise sample sizes r i i , n and weights w\ = ni/(ni 4- 722), w\ = n /(ni + n) is equivalent to a classical two-stage group sequential test if no adaptations are performed (the term in squared brackets is simply the standardized t o tal mean). T h u s , the quantities a i , a o and c required for the inverse normal method can be computed with standard software for group sequential trials; see Section 5.2.1. We now consider the case of testing m elementary null hypotheses H\,. -., H. To make the ideas concrete, we assume the comparison of m t r e a t ments with a control, although the methodology reviewed below holds more generally and covers many other applications; see Schmidli, Bretz, Racine, and Maurer (2006); Wang, O'Neill, and Hung (2007); and B r a n n a t h , Zuber, Branson, Bretz, Gallo, Posch, and Racine-Poon (2009b) for examples. L e t Hi : Qi < 0 , i = l , . . . , m , denote the related m one-sided null hypotheses, where Qi denotes the mean effect difference of treatment % against control. T h e general rule is to apply the closure principle from Section 2.2.3 by c o n structing all intersection hypotheses and to test each of them with a suitable combination test (Bauer and Kieser 1999; Kieser et al. 1999; Hommel 2001). Following the closure principle, a null hypothesis Hi is rejected if all intersection hypotheses contained in Hi are also rejected by their combination tests. Consider Figure 5.8 for a n example of testing adaptively m = 2 hypotheses. Let H\ and H denote the elementary hypotheses and H\ the single i n t e r section hypothesis to be tested according to the closure principle. L e t further Pij denote the one-sided p-value for hypothesis Hi,i € { 1 , 2 , 1 2 } , at stage j = 1,2. Finally, let C(pi,\,Pi ), i G { 1 , 2 , 1 2 } , denote the combination func2

GROUP S E Q U E N T I A L AND A D A P T I V E DESIGNS

153

tion C applied to the p-values pij from stage j = 1,2. Note that different combination functions as well as different stopping boundaries could be used within the closed hypotheses set (for simplicity we omit this generalization here). According to the closure principle, Hi (say) is rejected at familywise error rate a , if Hi and H12 are both rejected at level a. T h a t is, Hi is rejected if C ( p i i , p i 2 ) < Ci and C(pi2,i,Pi2,2) < C12, where c i and C12 are suitable critical values, as described above. )

)

C(Pi,l,Pi,2)

H12 = Hi n H2

Hi F i g u r e 5.8

Schematic diagram of the closure principle for testing adaptively two null hypotheses H i and H2 and their intersection.

An example I n this section we use a numerical example to illustrate the above methodology with the a s d package, which supports the design of adaptive trials in R. For details on a s d we refer to Parsons, Friede, Todd, Valdes-Marquez, C h a t away, Nicholas, and Stallard (2010) and the manual accompanying the package (Parsons 2010). For adaptive designs involving a single null hypothesis one can alternatively use the a d a p t T e s t package in R(Vandemeulebroecke 2009). T h e a s d . s i m function from the a s d package runs simulations for a trial design that tests a number of treatments against a single control. Test t r e a t ments are compared to the control treatment using the Dunnett test from Section 4.1. A n interim analysis is undertaken using an early outcome m e a sure, assumed to be normally distributed, and characterized by standardized treatment effects with variances assumed to be equal to one, for each t r e a t ment (and control). A decision is made on which of the treatments to take

154

FURTHER TOPICS

forward, using a pre-defined selection rule. D a t a are simulated for the final outcome measure, also characterized by standardized treatment effects, with variance assumed equal to one and a fixed correlation between the final and the early outcomes. D a t a from the interim and final analyses for the final o u t come measure are combined together using either the inverse normal or Fisher combination test and hypotheses are tested at the selected level. For illustration, suppose that we plan an adaptive trial design to compare 771 = 2 dose levels with placebo for a normal distributed outcome variable. A s sume that we target at 9 0 % power to detect a standardized treatment effect size of 0.3 to compare patients under treatment (one of the two dose levels) and control (placebo). For simplicity, we assume further that at the interim analysis we already observe the final outcome measure and we select the t r e a t ment with the larger observed effect size for the second stage. Assuming equal treatment effect sizes and a pre-planned group sample size of 110 patients per stage, we can call R> l i b r a r y ( " a s d " ) R> r e s <- a s d . s i m ( n s a m p = c ( 1 1 0 , 1 1 0 ) , e a r l y = c ( 0 . 3 , 0.3), + f i n a l = c ( 0 . 3 , 0 . 3 ) , nsim = 10000, c o r r = 1, s e l e c t = + ptest = c ( l , 2)) R> r e s

$count.total 1 2 n 10000 0 $select.total 1 2 n 4996 5004 $reject.total HI H2 n 4525 4496 $sim.reject Total n 9021

I n this call, e a r l y = c ( 0 . 3 , 0 . 3 ) and f i n a l = c ( 0 . 3 , 0 . 3 ) specify the vectors with the standardized treatment effects for the early and the final outcome variable, respectively. Further, the correlation c o r r = 1 between the early and final outcome measure reflects the fact that at the interim analysis we already observe the final outcome measure. T h e s e l e c t = 1 option ensures that we select the treatment with the larger observed effect size for the second stage. Finally, p t e s t = c ( l , 2) will count rejections for testing treatments 1 and 2 against control. We conclude from the output that each treatment has a chance of 50% to be selected for the second stage, as seen from r e s $ s e l e c t . t o t a l . F u r ther, r e s $ r e j e c t . t o t a l gives the individual power of about 4 5 % to reject

GROUP S E Q U E N T I A L AND A D A P T I V E DESIGNS

155

a selected treatment at study end and r e s $ s i m . r e j e c t gives the disjunctive power of about 90% to declare any of the two effective treatments significantly better than placebo (recall Section 2.1.1 for an overview of power in multiple hypotheses testing). I n practice, when designing a real clinical trial, uncertainty about the true effect sizes and other parameters exist. I n addition, the interim decision rule of selecting the better of the two doses may not apply because of unforeseen safety signals. Extensive clinical trial simulations have therefore to be conducted to investigate the operating characteristics of an adaptive design (Bretz and Wang 2010). To give an example of what could be done at the planning stage, we i n vestigate in Figure 5.9 the disjunctive power as a function of Q\ for different designs options: (A) select the best treatment ( s e l e c t = 1), (B) select both treatments ( s e l e c t = 2), and ( C ) select randomly one of the two treatments at the interim analysis ( s e l e c t = 5). Further design options are available with the a s d package but will not be considered here. Note that the total sample size n is not the same for the three options. For example, we have n = 3 x 110 + 3 x 110 = 660 for option ( B ) , but only n = 3 x 110 + 2 x 110 = 550 for option ( A ) . Although the unequal total sample sizes makes it difficult to compare the three options, Figure 5.9 reflects current practice of calculating sample sizes and is therefore of relevance. W i t h the methods implemented in a s d , sample size reallocation could be applied to ensure a constant total sample size. For illustration, we include in Figure 5.9 design option (D) that selects the best treatment at the interim analysis and performs a sample size reallocation for the second stage such that we have 3 3 0 / 2 = 165 patients for the two remaining treatment arms (selected treatment and control). We conclude from Figure 5.9 that design option ( C ) is considerably less powerful than the competing designs. T h i s is not surprising, as for small v a l ues of 6\ there is a chance that the truly inferior treatment is selected with option ( C ) . A s mentioned above, however, unforeseen safety signals may in practice lead to the selection of the treatment with the smaller observed i n terim effect, which may lead to a potentially substantial reduction in power, as seen in Figure 5.9. Options ( A ) and ( B ) have similar power performances, although option ( A ) saves about 16% of sample size, which will be perceived as an efficiency gain. A s expected, option (D) is more powerful t h a n option (A) because of the additional 110 patients in the second stage. Interestingly, comparing options ( B ) and (D) reveals that for a given constant total sample size selecting the better of the two treatments at the interim analysis leads to considerably higher power than continuing with both treatments in the second stage.

156

FURTHER TOPICS

0.00

F i g u r e 5.9

5.3

0.05

0.10

0.15

0.20

0.25

0.30

Disjunctive power for four design options: (A) select the best treatment without sample size reallocation, (B) select both treatments, (C) select randomly one of the two treatments, and (D) select the best treatment with sample size reallocation.

C o m b i n a t i o n of multiple comparison procedures w i t h modeling techniques

I n Section 4.3 we described a variety of powerful trend tests to detect significant dose response signals. Using these methods, we can assess whether or not changes in dose lead to desirable changes in an (efficacy or safety) outcome variable of interest. Once such a dose response signal has been shown (that is, the so-called proof-of-concept ( P o C ) has been established), the question of what comprises a "good" dose arises. To address this question, stepwise test procedures based on the closure principle can be used. To mention only a few papers, we refer the reader to Tamhane, Hochberg, and Dunnett (1996); Bauer (1997); H s u and Berger (1999); Tamhane, Dunnett, Green, and Wetherington

COMBINING MULTIPLE COMPARISONS W I T H MODELING

157

(2001); Bretz et al. (2003), and Strassburger et al. (2007), which all address different aspects of dose finding using multiple comparison procedures; see also Tamhane and Logan (2006) for a review. Multiple comparison procedures regard the dose as a qualitative factor and generally make few, if any, assumptions about the underlying dose response relationship. However, inferences about the target dose are restricted to the discrete, possibly small set of doses used in the trial. O n the other hand, m o d e l ing approaches can be used which assume a parametric (typically non-linear) functional relationship between dose and response. T h e dose is assumed to be a quantitative factor, allowing greater flexibility for estimating a target dose. B u t its validity depends on an appropriately pre-specified dose response model for the final analysis. I n this section we describe a hybrid methodology combining multiple comparison procedures with modeling techniques, denoted as M C P - M o d procedure (Bretz, Pinheiro, and Branson 2005). T h i s approach maintains the flexibility of modeling dose response relationships, while p r e serving robustness w i t h respect to model misspecification through the use of appropriate multiple comparison procedures. I n Section 5.3.1, we describe the M C P - M o d procedure in more detail. T o illustrate the methodology, we analyze in Section 5.3.2 a dose response study and describe the D o s e F i n d i n g package (Bornkamp, Pinheiro, and Bretz 2010), which allows one to design and analyze dose finding studies using the M C P - M o d procedure. 5.3.1

MCP-Mod:

Dose response analyses under model

uncertainty

T h e motivation for M C P - M o d is based on the work by Tukey, C i m i n e r a , and Heyse (1985), who recognized that the power of standard dose response trend tests depends on the (unknown) dose response relationship. T h e y proposed to simultaneously use several trend tests and subsequently adjust the resulting pvalues for multiplicity. Related ideas were also proposed by Westfall (1997) and Stewart and Ruberg (2000), who all use contrast tests and aim at letting the contrast coefficients mirror potential dose response shapes. Bretz, Pinheiro, and Branson (2004); Bretz et al. (2005) formalized these ideas, leading to the M C P - M o d procedure described below. We refer to Pinheiro, Bornkamp, and Bretz (2006a) for advice on design issues when planning a dose finding e x p e r iment using M C P - M o d and Dette, Bretz, Pepelyshev, and Pinheiro (2008) for related optimal designs under model uncertainty. T h i s section can be viewed as a continuation of the discussion started in Section 4.3, where we described the approach from Westfall (1997) and contrasted it with the trend tests from Williams (1971) and Marcus (1976). Note that the theoretical framework from Chapter 3 also applies to the M C P - M o d procedure, thus leading to a powerful method combining multiple comparisons and modeling i n general parametric models. We begin with a n overview of the M C P - M o d procedure in Figure 5.10, which has several steps. T h e procedure starts by defining a set AA of candidate

158

FURTHER TOPICS

models covering a suitable range of dose response shapes. E a c h of the dose response shapes in the candidate set is tested using optimal contrasts and employing a max-t test of the form (2.1) while controlling the familywise error rate. A dose response signal (i.e., P o C ) is established when at least one of the model tests is significant. Once P o C is verified, either a "best" model or a weighted average over the set of models corrsponding to the significant contrasts is used to estimate the dose response profile and the target doses of interest.

Set of candidate models M

Optimum contrast coefficients

Test for a significant dose response signal (PoC)

Model selection or averaging

Dose response and target dose estimation F i g u r e 5.10

Schematic overview of the MCP-Mod procedure.

Below, we assume that we observe a response y for a given set of p a r a l lel groups of subjects corresponding to doses cfe, cfe, • • . , dfc plus placebo d i , for a total of k arms. For the purpose of testing a dose response signal and estimating target doses, we consider the one-way layout (5.1)

Vij = Vdi+eij,

where fi^ = f(di) denotes the mean response at dose di for some dose r e sponse model / ( • ) , rii denotes the number of subjects allocated to dose di, n = Y^i=i i denotes the total sample size, and Sij ~ iV(0, a ) denotes the error term for subject j — 1 , . . . , n», within dose group i = 1 , . . . , k. Following Bretz et al. (2005), we note that most parametric dose response models / ( d , 0) used in practice can be written as n

f{d,0)

= 9 + 0

6 f{d,e*), 1

(5.2)

COMBINING MULTIPLE COMPARISONS W I T H MODELING

159

where f°(d,0*) denotes the standardized model, parameterized by the vector 0*. I n this parametrization, 9Q is a location and Q\ is a scale parameter, so only the parameter 0* determines the shape of the model function. For the d e r i v a tion of optimal contrasts further below it is sufficient to use the standardized model f° instead of the full model / , which motivates the reparametrization in (5.2). Table 5.6 summarizes a selection of linear and non-linear regression models frequently used to represent dose response relationships, together with their respective standardized versions. A common problem is the specification of the model contrasts based on knowledge about the standardized model p a r a m e ters 0* before study begins. Such best guesses ("guesstimates") are typically derived from available information of the expected percentage p* of the m a x imum response associated with a given dose d*. We refer to Pinheiro, Bretz, and Branson (2006b) for strategies to elicit the necessary information. Assume that a set M of M parameterized candidate models is given, with corresponding model functions / ( ( 2 , 0 ) , m = 1 , . . . , M , and parameters 0^ of the standardized models / ° , which determine the model shapes. E a c h of the dose response models in the candidate set is tested using a single contrast test statistic m

^=i ™gi_ c

l , . . . , M ,

where s = Y*,j—i yVj—i(ij ~ a ^ ) / ( ~~ * the pooled variance estimate and c i , . . . , Cmk denote contrast coefficients subject to ^2i=i m i = 0. A s d e scribed below, the contrast coefficients c i , . . . , c fc are chosen to maximize the power of detecting the m - t h model. E v e r y single contrast test thus t r a n s lates into a decision procedure to determine whether the given dose response shape is statistically significant, based on the observed data. It can be shown that the optimal contrast coefficients for detecting the m-th model depend only on the model shape, that is, only on the parameters of the standardized models f°. Letting 2

the i-th entry of the optimal contrast c

for detecting the shape m is propor-

tional to ™i(Mmi- A),

i =

where p, = ra ] C i = i Mmi i (Bretz et al. 2005; Bornkamp 2006). A unique r e p resentation of the optimal contrast can be obtained by imposing the constraint -1

T h e detection of a significant dose response signal is based on the max-t test statistic *max = m a x { £ i , . . . , £ M } Under the null hypothesis of no dose response effect, fjbd = ... = fid , and the distributional assumptions stated in (5.1) it follows that £ i , . . . , £ M are jointly x

f(d,e)

Name linear linlog quadratic emax logistic exponential sigEmax betaMod Table 5.6

+ 8d

+ 8\og(d + c)

log(d + c)

Eo + hd + fod* E + E d/(ED 0

max

+ E

+ d)

m a x

d/(ED

/ {1 + exp [(ED - d) /8}}

max

d /(ED%

+ E B(5 5 )(d/D)^(l

+ d)

Eb + J 5 i ( e x p ( d / ^ - l ) max

l / { l + exp[(ED o- d)/8}} exp(d/J) - 1

ÂŁ0 + E d /(EB

d + <ta if/?

d/D) > 6

B(8 8 )(d/D) >(lu

â&#x20AC;˘ d/D) > 6

Dose response models implemented in the DoseFinding package. For the beta model B(8 8 ) U

= (Si + 8 ) /(8i 8 ) 2

Sl+S2

and for the quadratic model 8 =

fo.

For the quadratic model the standardized model function is given for the concaveshaped form.

COMBINING MULTIPLE COMPARISONS W I T H MODELING

161

multivariate t distributed with n — k degrees of freedom and correlation matrix = (Pij)iji where p..

YA=l U jl/ l c

\Jlu=i

l/ i n

Ef=i ^JTH

Note that the description above parallels the development of multiple c o m parison procedures for the general linear models in Section 3.1, and, more broadly speaking, for the general parametric models in Section 3.2. Numerical integration methods to calculate the required multivariate t (or, if required, normal) probabilities are implemented in the m v t n o r m package (Hothorn et al. 2001); see Genz and Bretz (1999, 2002, 2009) for the mathematical details. Proof-of-concept is established if t > ui- , where ui-a denotes the multiplicity adjusted critical value at level 1 — a from the multivariate t d i s tribution. Furthermore, all dose response shapes with associated contrast test statistics larger than u\can be declared statistically significant at level 1 — a under a strong control of the familywise error rate. These models then form a reference set M* = { M i , . . . , M L } C M of L significant models. If no contrast test is statistically significant, the procedure stops indicating that a dose response relationship cannot be established from the observed data. If a significant dose response signal is established, the next step is to e s timate the dose response curve and the target doses of interest. T h i s can be achieved by either selecting a single model out of A4* or applying model a v eraging techniques to M*. A common target dose of interest is the minimum effective dose ( M E D ) , which is defined as the smallest dose ensuring a c l i n i cally relevant and statistically significant improvement over placebo (Ruberg 1995a). Formally, max

M E D = m i n { d e {d d \ u

: /(d) > /(di) + A } ,

where A is a given relevance threshold. A common estimate for the M E D is M E D = m i n { d e (d d ] u

: / ( d ) > / ( 0 ) + A , L(d) >

/(0)}

where / ( d ) is the predicted response at dose d, and U(d) and L(d) are the corresponding pointwise confidence intervals of level 1 — 27. T h e choice of 7 is not driven by the purpose of controlling T y p e I error rates, in contrast to the selection of a for controlling the familywise error rate in the P o C declaration. Other estimates are possible and we refer to Bretz et al. (2005) for further details. Several possibilities exist to select a single model from M* for the target dose estimation step. Because m a x - i tests have been used above to detect a significant dose response signal, a natural approach is to select the "most significant" model shape, that is, the one associated w i t h the largest contrast test statistic. Alternatively, standard information criteria like the A I C or B I C

FURTHER TOPICS

162

might be used. T h e estimate of the model function is obtained by maximizing the likelihood of the model with respect to its parameters 6. A n alternative to selecting a single dose response model is to apply model averaging and produce weighted estimates across all models in M* for a given quantity ip of interest. I n dose response analyses, the parameter tp could be either a target dose of interest (such as the M E D ) or the mean responses at specific doses d € [di, dk]> Buckland, B u r n h a m , and Augustin (1997) proposed using the weighted estimate i where tp has the same meaning under all models and ip£ is the estimate of ip under model t for given probabilities we. T h e idea is to use estimates for the final data analysis which rely on the averaged estimates across all L models. Buckland et al. (1997) proposed using of the weights exp(-^) £j=i

e x

p(—2-)

which are dependent on a common information criterion I C , such as A I C or B I C applied to each of the L models. 5.3.2

A dose response

example

To illustrate the M C P - M o d procedure we use the biom dose response data from Bretz et al. (2005). T h e data are from a randomized double-blind parallel group trial with a total of 100 patients allocated to either placebo or one of four active doses coded as 0.05, 0.20, 0.60, and 1, with 20 patients per group. T h e clinical threshold is A = 0.4, the significance level is a = 0.05 and the dose estimation model is selected according to the maximum test statistic. We use the D o s e F i n d i n g package to perform the computations. A detailed description of this package is given in Bornkamp, Pinheiro, and Bretz (2009). A s described in Section 5.3.1, the M C P - M o d procedure starts by defining a set A4 of candidate models covering a suitable range of dose response shapes. T h e candidate set of models needs to be constructed as a list, where the list elements should be named as the underlying dose response model function (see Table 5.6) and the individual list entries should correspond to the specified guesstimates. Suppose we want to include in our candidate set a linear model, an E model and a logistic model. Concluding from the standardized model functions in Table 5.6 we need to specify one value for the E model shape (for the E D 5 0 parameter), two values for the logistic model shape (for the E D 5 0 and 6 parameters) but no guesstimate for the linear model as its standardized model function does not contain an unknown parameter. These guesstimates are used below for the calculation of optimal contrast coefficients. Suppose our guesstimate for the E D parameter of the E model is 0.2, while the m

5 0

COMBINING MULTIPLE COMPARISONS W I T H MODELING

163

guesstimate for ( E D , ÂŁ ) for the logistic model is (0.4,0.09). T h e model list is then defined through 5 0

R> l i b r a r y ( " D o s e F i n d i n g " ) R> candMods < - l i s t ( l i n e a r = NULL, emax = 0 . 2 , + l o g i s t i c = c(0.25, 0.09)) We visualize the model shapes from the candidate set with the p l o t M o d e l s function. A s the candidate model shapes do not determine the location (baseline effect) and scale (maximum effect) of the model, we also need to specify those parameters v i a the base and maxEf f arguments. Using the candidate set candMods from above, we define a vector for the dose levels and then call plotModels, R> doses < - c ( 0 , 0 . 0 5 , 0 . 2 , 0 . 6 , 1) R> plotModels(candMods, d o s e s , base = 0 , maxEff = 1) see Figure 5.11 for the plot. After these preparations, we can now use the MCPMod function, which i m plements the full M C P - M o d procedure. T h u s , we can call R> d a t a ( " b i o m " , package = " D o s e F i n d i n g " ) R> r e s < - MCPMod(resp ~ d o s e , biom, candMods, + p V a l = TRUE, c l i n R e l = 0 . 4 )

alpha = 0.05,

to run the M C P - M o d procedure. I n the previous call the arguments to the MCPMod function are mostly self-explanatory; the p V a l = TRUE options s p e c i fies that multiplicity adjusted p-values for the max-t test should be calculated. A brief summary of the results is available v i a the p r i n t method for the MCPMod objects R> r e s MCPMod PoC (alpha = 0.05, one-sided): yes Model with highest t - s t a t i s t i c : emax Model used for dose estimation: emax Dose estimate: MED2, 80% 0.17

We conclude from the PoC line that there is significant dose response signal, which means that the max-t test is significant at the one-sided significance level a = 0.05. Additionally, the E contrast has the largest test statistic and consequently the E model was used for the dose estimation step. T h e M E D estimate is 0.17. A more detailed summary of the results is available v i a the summary method m

summary(res)

MCPMod

164

FURTHER TOPICS

tz CD CD

E "a> â&#x20AC;˘o o

Dose

F i g u r e 5.11

Model shapes for the selected candidate model set.

Input parameters: alpha =0.05 alternative: one.sided, one sided model selection: maxT c l i n i c a l relevance = 0.4 dose estimator: MED2 (gamma = 0.1) optimizer: nls Optimal Contrasts: linear emax 0 -0.431 -0.643 0.05 -0.378 -0.361 0.2 -0.201 0.061 0.6 0.271 0.413 1 0.743 0.530

logistic -0.478 -0.435 -0.147 0.519 0.540

COMBINING M U L T I P L E COMPARISONS W I T H M O D E L I N G Contrast linear emax logistic Multiple emax logistic linear Selected emax

Correlation: linear emax 1.000 0.912 0.912 1.000 0.945 0.956 Contrast Tvalue 3.46 3.23 2.97

logistic 0.945 0.956 1.000

Test: pValue 0.001 0.002 0.003

for dose

Parameter emax model: eO eMax 0.322 0. 746

165

estimation:

estimates: ed50 0.142

Dose estimate MED2, 80% 0.17

From the output above, we first obtain some information about important input parameters that were used when performing the M C P - M o d procedure. T h e n the output displays the optimal contrasts and their correlations. T h e contrast test statistics, their multiplicity adjusted p-values a n d the critical value are shown i n the next table. Finally, the output contains information about the fitted dose response model, its parameter estimates and the target dose estimate. T h e core results are of course the same as the results displayed earlier with the p r i n t method. A graphical display of the dose response model used for dose estimation can be obtained v i a the p l o t method for MCPMod objects, R> p l o t ( r e s , complData = TRUE, c l i n R e l = TRUE, C I = TRUE, + d o s e E s t = TRUE) The resulting plot is shown in Figure 5.12. T h e options complData, C I , c l i n R e l , and d o s e E s t specify, which of the following should be included i n the plot when set to TRUE: the full dose response dataset (instead of displaying only the group means), the confidence intervals for the mean of the function, the clinical relevance threshold and the dose estimate, respectively.

FURTHER TOPICS Responses

Model Predictions

0.8 Pointwise CI J

Dose

F i g u r e 5.12

Fitted model for the biom data.

â&#x20AC;˘

Estim. Dose L

Bibliography Agresti, A . , B i n i , M . , Bertaccini, B . , and R y u , E . (2008), "Simultaneous confidence intervals for comparing binomial parameters," Biometrics, 64, 1 2 7 0 ÂŹ 1275. Cited on p. 52. A l t m a n , D . G . (1991), Practical Statistics C h a p m a n & Hall. Cited on p. 3.

for

Medical

Research,

London:

Anderson, K . (2010), g s D e s i g n : Group Sequential Design, U R L h t t p : / / C R A N . R - p r o j e c t . o r g / p a c k a g e = g s D e s i g n , R package version 2.3. C i t e d on p. 140, 141, 145, 149. Armitage, R , McPherson, C . K . , and Rowe, B . C . (1969), "Repeated significance tests on accumulating data," Journal of the Royal Statistical Society, Series A, 132, 235-244. Cited on p. 142. Bartholomew, D . J . (1961), "Ordered tests in the analysis of variance," Biometrika, 48, 325-332. C i t e d on p. 103. Bates, D . and Sarkar, D . (2010), l m e 4 : Linear Mixed-Effects Models Using S4 Classes, U R L h t t p : / / C R A N . R - p r o j e c t . o r g / p a c k a g e = l m e 4 , R package version 0.999375-33. Cited on p. 126. Bauer, R (1997), " A note on multiple testing procedures in dose finding," Biometrics, 53, 1125-1128. Cited on p. 156. Bauer, R , Hackl, R , Hommel, G . , and Sonnemann, E . (1986), "Multiple testing of pairs of one-sided hypotheses," Metrika, 33, 121-127. C i t e d on p. 17. Bauer, P. and Kieser, M . (1999), "Combining different phases i n the development of medical treatments within a single t r i a l , " Statistics in Medicine, 18, 1833-1848. C i t e d on p. 151, 152. Bauer, P. and Kohne, K . (1994), " E v a l u a t i o n of experiments w i t h adaptive interim analyses," Biometrics, 50, 1029-1041. C i t e d on p. 150, 151, 152. Bauer, P., Rohmel, J . , Maurer, W . , and Hothorn, L . (1998), "Testing strategies in multi-dose experiments including active control," Statistics in Medicine, 17, 2133-2146. Cited on p. 28. Benjamini, Y . (2010), "Simultaneous and selective inference: current successes and future challenges," Biometrical Journal, (in press). C i t e d on p. 14. Benjamini, Y . and Hochberg, Y . (1995), "Controlling the false discovery rate: A practical and powerful approach to multiple testing," Journal of the Royal Statistical Society, Series B, 57, 289-300. C i t e d on p. 13, 61. 167

168

BIBLIOGRAPHY

Benjamini, Y . and Hochberg, Y . (2000), " O n the adaptive control of the false discovery rate in multiple testing with independent statistics," Journal of Educational and Behavioural Statistics, 25, 60-83. Cited on p. 35. Benjamini, Y . and Yekutieli, D . L . (2001), " T h e control of the false discovery rate in multiple testing under dependency," The Annals of Statistics, 29, 1165-1188. Cited on p. 61. Berger, R . L . (1982), "Multi-parameter hypothesis testing and acceptance s a m pling," Technometrics, 24, 295-300. Cited on p. 22. Bergmann, B . and Hommel, G . (1988), "Improvements of general multiple test procedures for redundant systems of hypotheses," in Multiple Hypothesenpriifung - Multiple Hypotheses Testing, eds. P. Bauer, G . Hommel, and E . Sonnemann, Heidelberg: Springer, pp. 100-115. Cited on p. 34, 95. Bernhard, G . (1992), Computergestutzte Durchfuhrung von multiplen Testprozeduren, Universitat Mainz, Germany, P h D thesis. Cited on p. 95. Berry, D . (2007), " T h e difficult and ubiquitous problems of multiplicities," Pharmaceutical Statistics, 6, 155â&#x20AC;&#x201D;160. Cited on p. xiv. Bofinger, E . (1987), "Step down procedures for comparison with a control," Australian Journal of Statistics, 29, 348-364. Cited on p. 81. Bonsch, D . , Lederer, T . , Reulbach, U . , Hothorn, T . , Kornhuber, J . , and B l e ich, S. (2005), "Joint analysis of the N A C P - R E P 1 marker within the a l pha synuclein gene concludes association with alcohol dependence," Human Molecular Genetics, 14, 967-971. Cited on p. 114. Bornkamp, B . (2006), Comparison of model-based and model-free approaches for the analysis of dose-response studies, Fachbereich Statistik, Universitat Dortmund, Germany, Diploma thesis. Cited on p. 159. Bornkamp, B . , Pinheiro, J . , and Bretz, F . (2009), " M C P M o d - an R package for the design and analysis of dose-finding studies," Journal of Statistical Software, 29, 1-23. Cited on p. 162. Bornkamp, B . , Pinheiro, J . , and Bretz, F . (2010), D o s e F i n d i n g : Planning and Analyzing Dose Finding Experiments, U R L h t t p : / / C R A N . R - p r o j e c t . o r g / p a c k a g e = D o s e F i n d i n g , R package version 0.2-1. Cited on p. 157. Brannath, W . and Bretz, F . (2010), "Shortcuts for locally consonant closed test procedures," Journal of the American Statistical Association, (in press). Cited on p. 20, 28, 135. Brannath, W . , Mehta, C . R . , and Posch, M . (2009a), " E x a c t confidence bounds following adaptive group sequential tests," Biometrics, 65, 539â&#x20AC;&#x201D;546. Cited on p. 149. Brannath, W . , Zuber, E . , Branson, M . , Bretz, F . , Gallo, P., Posch, M . , and Racine-Poon, A . (2009b), "Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology," Statistics in Medicine, 28, 1445-1463. Cited on p. 152.

BIBLIOGRAPHY

169

Bretz, F . (2006), " A n extension of the Williams trend test to general u n b a l anced linear models," Computational Statistics & Data Analysis, 50, 1 7 3 5 ¬ 1748. Cited on p. 100, 104, 106. Bretz, F . , Hayter, A . , and Genz, A. (2001), " C r i t i c a l point and power c a l c u l a tions for the studentized range test for generally correlated means," Journal of Statistical Computation and Simulation, 71, 85-99. Cited on p. 84. Bretz, F . and Hothorn, L . A . (2003), "Comparison of exact amd resampling based multiple test procedures," Communications in Statistics Computation and Simulation, 32, 461-473. Cited on p. 140. Bretz, F . , Hothorn, L . A . , and Hsu, J . C . (2003), "Identifying effective a n d / o r safe doses by stepwise confidence intervals for ratios," Statistics in Medicine, 22, 847-858. Cited on p. 30, 157. Bretz, F . , Hothorn, T . , and Westfall, P. (2008a), "Multiple comparison p r o cedures in linear models," in COMPSTAT 2008 - Proceedings in Computational Statistics, ed. P. Brito, Heidelberg: P h y s i c a Verlag, pp. 423-431. Cited on p. 41. Bretz, F . , H s u , J . C , Pinheiro, J . C , and L i u , Y . (2008b), "Dose finding - a challenge in statistics," Biometrical Journal, 50, 480-504. C i t e d on p. 100. Bretz, F . , Konig, F . , B r a n n a t h , W . , G l i m m , E . , and Posch, M . (2009a), " A d a p tive designs for confirmatory clinical trials," Statistics in Medicine, 28, 1 1 8 1 — 1217. Cited on p. 151. Bretz, F . , Maurer, W . , B r a n n a t h , W . , and Posch, M . (2009b), " A graphical approach to sequentially rejective multiple test procedures," Statistics in Medicine, 28, 586-604. Cited on p. 28, 35. Bretz, F . , Maurer, W . , and Gallo, P. (2009c), "Discussion of some controversial multiple testing problems in regulatory applications by H . M . J . Hung and S . J . W a n g , " Journal of Biopharmaceutical Statistics, 19, 25-34. Cited on p. 7. Bretz, F . , Maurer, W . , and Hommel, G . (2010), " T e s t and power considerations for multiple endpoint analyses using sequentially rejective graphical procedures," Statistics in Medicine, (in press). C i t e d on p. 16, 28. Bretz, F . , Pinheiro, J . C , and Branson, M . (2004), " O n a hybrid method in dose finding studies," Methods of Information in Medicine, 43, 457-461. Cited on p. 157. Bretz, F . , Pinheiro, J . C , and Branson, M . (2005), "Combining multiple c o m parisons and modeling techniques in dose-response studies," Biometrics, 61, 738-748. Cited on p. 157, 158, 159, 161, 162. Bretz, F . , Schmidli, H . , Konig, F . , Racine, A . , and Maurer, W . (2006), " C o n firmatory seamless phase I I / I I I clinical trials with hypotheses selection at interim: General concepts (with discussion)," Biometrical Journal, 48, 6 2 3 ¬ 634. Cited on p. 151.

BIBLIOGRAPHY

170

Bretz, F . and Wang, S. J . (2010), "Recent developments in adaptive designs. P a r t I I : Success probabilities and effect estimates for Phase I I I development programs v i a modern protocol design," Drug Information Journal, 44, 3 3 3 ÂŹ 342. Cited on p. 155. Brown, L . D . (1984), " A note on the T u k e y - K r a m e r procedure for pairwise comparisons of correlated means," in Design of Experiments: Ranking and Selection (Essays in Honour of Robert E. Bechhofer), eds. T . J . Santner and A . C . Tamhane, New York: Marcel Dekker, pp. 1-6. Cited on p. 84. Buckland, S. T . , B u r n h a m , K . P., and Augustin, N. H . (1997), "Model s e lection: A n integral part of inference," Biometrics, 53, 603-618. Cited on p. 162. Bullinger, L . , Dohner, K . , B a i r , E . , Frohlich, S., Schlenk, R . F . , T i b s h i r a n i , R., Dohner, H . , and Pollack, J . R . (2004), "Use of gene-expression profiling to identify prognostic subclasses in adult acute myloid leukemia," New England Journal of Medicine, 350, 1605-1616. Cited on p. 124. B u r m a n , C . F . , Sonesson, C , and Guilbaud, O. (2009), " A recycling framework for the construction of Bonferroni-based multiple tests," Statistics in Medicine, 28, 739-761. Cited on p. 28. Calian, V . , L i , D . , and H s u , J . C . (2008), "Partitioning to uncover conditions for permutation tests to control multiple testing error rates," Biometrical Journal, 50, 756-766. Cited on p. 134. Chevret, S. (2006), Statistical Methods for Dose Finding York: Wiley. Cited on p. 100.

Experiments,

New

Chuang-Stein, C , Anderson, K . , Gallo, P., and Collins, S. (2006), "Sample size re-estimation: A review and recommendations," Drug Information Journal, 40, 475-484. Cited on p. 140. C u i , L . , Hung, H . M . J . , and Wang, S. (1999), "Modification of sample size in group sequential clinical trials," Biometrics, 55, 321-324. Cited on p. 152. Dalgaard, P. (2002), Introductory on p. xv, 3.

Statistics with R, New York: Springer. Cited

Dalgaard, P. (2010), I S w R : Introductory Statistics with R , U R L h t t p : / / CRAN.R-pro j e c t . org/package=ISwR, R package version 2.0-5. Cited on p. 3. Dette, H . , Bretz, F . , Pepelyshev, A . , and Pinheiro, J . C . (2008), " O p t i m a l designs for dose finding studies," Journal of the American Statistical Association, 103, 1225-1237. Cited on p. 157. Dmitrienko, A . , Offen, W . W . , and Westfall, P. H . (2003), "Gatekeeping s t r a t e gies for clinical trials that do not require all primary effects to be significant," Statistics in Medicine, 22, 2387-2400. Cited on p. 28, 35. Dmitrienko, A . , Tamhane, A . C , and Bretz, F . (2009), Multiple Testing Problems in Pharmaceutical Statistics, B o c a Raton: Taylor and Francis. Cited on p. xiii.

BIBLIOGRAPHY

171

Dudoit, S. and van der L a a n , M . J . (2008), Multiple Testing Procedures with Applications to Genomics, New York: Springer. C i t e d on p. xiii, 12, 127, 137. Dunnett, C . W . (1955), " A multiple comparison procedure for comparing s e v eral treatments with a control," Journal of the American Statistical Association, 50, 1096-1121. Cited on p. 72. Dunnett, C . W . and Tamhane, A . C . (1991), "Step-down multiple tests for comparing treatments with a control in unbalanced one-way layouts," Statistics in Medicine, 10, 939-947. Cited on p. 78. E M E A (2002), C P M P points to consider on multiplicity issues i n clinical trials, European Agency for the Evaluation of Medicinal Products, London, U R L http://www.emea.europa.eu/pdfs/human/ewp/090899en.pdf. C i t e d on p. 9. E M E A (2009), C H M P guideline on clinical development of fixed c o m b i n a tion medicinal products, European Agency for the Evaluation of Medicinal Products, London, U R L h t t p : / / w w w . e m e a . e u r o p a . e u / p d f s / h u m a n / e w p / 0 2 4 0 9 5 e n f i n . p d f . C i t e d on p. 22. Everitt, B . S. and Hothorn, T . (2009), A Handbook of Statistical Analyses Using R, B o c a R a t o n , Florida: C h a p m a n &; H a l l / C R C Press, 2nd edition. Cited on p. xv. Finner, H . (1988), "Abgeschlossene multiple Spannweitentests," in Multiple Hypothesenprufung - Multiple Hypotheses Testing, eds. P. Bauer, G . H o m mel, and E . Sonnemann, Berlin: Springer, pp. 10-32. C i t e d on p. 93. Finner, H . (1999), "Stepwise multiple test procedures and control of directional errors," The Annals of Statistics, 27, 274-289. C i t e d on p. 17. Finner, H . , G i a n i , G . , and Strassburger, K . (2006), "Partitioning principle and selection of good populations," Journal of Statistical Planning and Inference, 136, 2053-2069. Cited on p. 30. Finner, H . and Gontscharuk, V . (2009), "Controlling the familywise error rate with plug-in estimator for the proportion of true null hypotheses," Journal of the Royal Statistical Society, Series B, 71, 1031-1048. C i t e d on p. 35. Finner, H . and Strassburger, K . (2002), " T h e partitioning principle: A powerful tool i n multiple decision theory," The Annals of Statistics, 30, 1194-1213. Cited on p. 11, 18, 29, 30. Finner, H . and Strassburger, K . (2006), " O n ^-equivalence with the best in fc-sample models," Journal of the American Statistical Association, 101, 737-746. Cited on p. 31. Fisher, L . D . (1998), "Self-designing clinical trials," Statistics 1551-1562. C i t e d on p. 151.

in Medicine,

17,

Gabriel, K . R . (1969), "Simultaneous test procedures - some theory of multiple comparisons," The Annals of Mathematical Statistics, 40, 224-250. Cited on p. 11, 20, 21.

172

BIBLIOGRAPHY

Garcia, A . L . , Wagner, K . , Hothorn, T . , Koebnick, C . , Zunft, H . J . F . , and Trippo, U . (2005), "Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths," Obesity Research, 13, 6 2 6 ÂŹ 634. Cited on p. 108. Genz, A . and Bretz, F . (1999), "Numerical computation of multivariate tprobabilities with application to power calculation of multiple contrasts," Journal of Statistical Computation and Simulation, 63, 361-378. C i t e d on p. 44, 51, 161. Genz, A . and Bretz, F . (2002), "Methods for the computation of multivariate ^-probabilities," Journal of Computational and Graphical Statistics, 11, 9 5 0 ÂŹ 971. Cited on p. 44, 51, 161. Genz, A . and Bretz, F . (2009), Computation of Multivariate Normal and t Probabilities, volume 195 of Lecture Note Series, Heidelberg: Springer. Cited on p. 44, 45, 51, 73, 84, 161. Genz, A . , Bretz, F . , and Hothorn, T . (2010), m v t n o r m : Multivariate Normal and t Distribution, U R L h t t p : / / C R A N . R - pro j e c t . o r g / p a c k a g e = r o b u s t b a s e , R package version 0.9-91. Cited on p. 63. Grechanovsky, E . and Hochberg, Y . (1999), "Closed procedures are better and often admit a shortcut," Journal of Statistical Planning and Inference, 76, 79-91. Cited on p. 28. Guilbaud, O. (2008), "Simultaneous confidence regions corresponding to Holm's step-down procedure and other closed-testing procedures," Biometrical Journal, 50, 678-692. Cited on p. 19, 34, 81. Guilbaud, O. (2009), "Alternative confidence regions for Bonferroni-based closed-testing procedures that are not alpha-exhaustive," Biometrical Journal, 51, 721-735. Cited on p. 19. Hack, N. and B r a n n a t h , W . (2009), A G S D e s t : Estimation in Adaptive Group Sequential Trials, U R L h t t p : / / C R A N . R - pro j e c t . org/package=AGSDest, R package version 1.0. Cited on p. 149. Hayter, A . J . (1984), " A proof of the conjecture that the T u k e y - K r a m e r m u l tiple comparisons procedure is conservative," The Annals of Statistics, 12, 61-75. Cited on p. 84. Hayter, A . J . (1989), "Pairwise comparisons of generally correlated means," Journal of the American Statistical Association, 84, 208-213. C i t e d on p. 84. Hayter, A . J . (1990), " A one-sided studentized range test for testing against ordered alternatives," Journal of the American Statistical Association, 85, 778-785. Cited on p. 85. Hayter, A . J . and H s u , J . C . (1994), " O n the relationship between stepwise decision procedures and confidence sets," Journal of the American Statistical Association, 89, 128-136. Cited on p. 18, 29. Heiberger, R . M . (2009), H H : Statistical Analysis and D a t a Display: Heiberger and Holland, U R L h t t p : / / C R A N . R - pro j e c t . org/package=HH, R package version 2.1-32. Cited on p. 65, 90.

BIBLIOGRAPHY

173

Heiberger, R . M . and Holland, B . (2004), Statistical Analysis and Data Display: An Intermediate Course with Examples in S-Plus, R, and SAS, New York: Springer. Cited on p. 89, 93. Heiberger, R . M . and Holland, B . (2006), "Mean-mean multiple comparison displays for families of linear contrasts," Journal of Computational and Graphical Statistics, 15, 937-955. Cited on p. 89, 91, 93. Herberich, E . (2009), Niveau und Giite simultaner parametrischer Inferenzverfahren, Institut fur Statistik, Universitat Mtinchen, Germany, Diploma thesis (in G e r m a n ) . C i t e d on p. 52. Hochberg, E . and Tamhane, A . C . (1987), Multiple Comparison New York: John Wiley Sz Sons. Cited on p. xiii, 11, 22, 41, 93.

Procedures,

Hochberg, Y . (1988), " A sharper Bonferroni procedure for multiple significance testing," Biometrika, 75, 800-802. Cited on p. 38. Holm, S. (1979a), " A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, 6, 65-70. C i t e d on p. 17, 19, 28, 31, 32, 33, 35. Holm, S. (1979b), A stagewise directional test based on t-statistics, Institute of Mathematics, Chalmers University of Technology, Gothenberg, Statistical Research Report 1979-3. Cited on p. 17. Hommel, G . (1983), "Tests of the overall hypothesis for arbitrary dependence structures," Biometrical Journal, 25, 423-430. C i t e d on p. 37. Hommel, G . (1988), " A stagewise rejective multiple test procedure based on a modified Bonferroni test," Biometrika, 75, 383-386. C i t e d on p. 36, 38. Hommel, G . (1989), " A comparison of two modified Bonferroni procedures," Biometrika, 76, 624-625. C i t e d on p. 38. Hommel, G . (2001), "Adaptive modifications of hypotheses after an interim analysis," Biometrical Journal, 43, 581-589. C i t e d on p. 151, 152. Hommel, G . and Bernhard, G . (1992), "Multiple hypotheses testing," in Computational Aspects of Model Choice, ed. J . Antoch, Heidelberg: P h y s i c a , pp. 211-235. Cited on p. 95. Hommel, G . and Bernhard, G . (1999), "Bonferroni procedures for logically related hypotheses," Journal of Statistical Planning and Inference, 82, 1 1 9 ÂŹ 128. C i t e d on p. 34, 95. Hommel, G . and Bretz, F . (2008), "Aesthetics and power considerations in multiple testing - a contradiction?" Biometrical Journal, 50, 657-666. C i t e d on p. 20, 34, 95. Hommel, G . , Bretz, F . , and Maurer, W . (2007), "Powerful short-cuts for m u l tiple testing procedures with special reference to gatekeeping strategies," Statistics in Medicine, 26, 4063-4073. Cited on p. 28, 35. Hommel, G . and Hoffmann, T . (1988), "Controlled uncertainty," in Multiple Hypothesenprufung - Multiple Hypotheses Testing, eds. P. Bauer, G . H o m mel, and E . Sonnemann, Heidelberg: Springer, pp. 154-161. C i t e d on p. 13.

BIBLIOGRAPHY

174

Hothorn, T . , Bretz, F . , and Genz, A . (2001), " O n multivariate t and Gaufi probabilities in R , " R News, 1, 27-29. Cited on p. 45, 161. Hothorn, T . , Bretz, F . , and Westfall, P. (2008), "Simultaneous inference in general parametric models," Biometrical Journal, 50, 346-363. Cited on p. 48, 49, 50, 69. Hothorn, T . , Bretz, F . , Westfall, P., Heiberger, R . M . , and Schutzenmeister, A . (2010a), m u l t c o m p : Simultaneous Inference for General Linear H y potheses, U R L h t t p : / / C R A N . R - p r o j e c t . org/package=multcomp, R p a c k age version 1.1-7. Cited on p. 53, 56, 66. Hothorn, T . , Hornik, K . , van de Wiel, M., and Zeileis, A . (2010b), c o i n : Conditional Inference Procedures in a Permutation Test Framework, U R L h t t p : / / C R A N . R - p r o j e c t . o r g / p a c k a g e = c o i n , R package version 1.0-11. Cited on p. 127. Hothorn, T . , Hornik, K . , van de Wiel, M. A . , and Zeileis, A . (2006), " A Lego system for conditional inference," The American Statistician, 60, 257-263. Cited on p. 118, 137. H s u , J . C . (1996), Multiple on p. xiii, 11, 41, 85.

Comparisons,

London: C h a p m a n Sz Hall.

Cited

Hsu, J . C . and Berger, R . L . (1999), "Stepwise confidence intervals without multiplicity adjustment for dose-response and toxicity studies," Journal of the American Statistical Association, 94, 468-482. Cited on p. 30, 156. H s u , J . C . and Peruggia, M . (1994), "Graphical representations of T u k e y ' s m u l tiple comparison method," Journal of Computational and Graphical Statistics, 3, 143-161. Cited on p. 89. Hsueh, H . , C h e n , J . J . , and Kodell, R . L . (2003), "Comparison of methods for estimating the number of true null hypotheses in multiplicity testing," Journal of Biopharmaceutical Statistics, 13, 675-689. Cited on p. 35. Hwang, I . K . , Shih, W . J . , and D e C a n i , J . S. (1990), " G r o u p sequential designs using a family of T y p e I error probability spending functions," Statistics in Medicine, 9, 1439-1445. Cited on p. 144. I C H (1998), I C H Topic E 9 : Notes for Guidance on Statistical Principles for Clinical Trials, International Conference on Harmonization, London, U R L http://www.emea.europa.eu/pdfs/human/ich/036396en.pdf. Cited on p. 8. Ihaka, R . and Gentleman, R. (1996), " R : A language for data analysis and graphics," Journal of Computational and Graphical Statistics, 5, 299-314. C i t e d on p. xiv. Jennison, C . and Turnbull, B . W . (2000), Group Sequential Methods with Applications to Clinical Trials, B o c a Raton: C h a p m a n and Hall. Cited on p. 141. Kieser, M., Bauer, P., and Lehmacher, W . (1999), "Inference on multiple endpoints in clinical trials with adaptive interim analyses," Biometrical Journal, 41, 261-277. Cited on p. 151, 152.

175

BIBLIOGRAPHY

Kleinbaum, D . G . , K u p p e r , L . L . , Muller, K . E . , and Nizam, A . (1998), Applied Regression Analysis and Other Multivariable Methods, North Scituate, M A : Duxbury Press. Cited on p. I l l , 113. K o r n , E . L . , Troendle, J . F . , McShane, L . M., and Simon, R . (2004), " C o n trolling the number of false discoveries: Application to high-dimensional genomic data," Journal of Statistical Planning and Inference, 124, 379-398. Cited on p. 14. K o t z , S., Balakrishnan, N., and Johnson, N . L . (2000), Continuous Multivariate Distributions. Volume 1: Models and Applications, New York: Wiley. Cited on p. 44. Kotz, S. and Nadarajah, S. (2004), Multivariate t Distributions and Their Applications, Cambridge: Cambridge University Press. Cited on p. 44. K r i s h n a , R . (2006), Dose Optimization forma Healthcare. Cited on p. 100.

in Drug Development,

New York: I n -

L a n , K . K . G . and DeMets, D . L . (1983), "Discrete sequential boundaries for clinical trials," Biometrika, 70, 659-663. Cited on p. 142. Laska, E . M . and Meisner, M . J . (1989), "Testing whether an identified t r e a t ment is best," Biometrics, 45, 1139-1151. Cited on p. 22. Lehmacher, W . and Wassmer, G . (1999), "Adaptive sample size calculations in group sequential trials," Biometrics, 55, 1286-1290. C i t e d on p. 152. Lehmann, E . L . (1957a), " A theory of some multiple decision problems I , " The Annals of Mathematical Statistics, 28, 1-25. C i t e d on p. 11. Lehmann, E . L . (1957b), " A theory of some multiple decision problems I I , " The Annals of Mathematical Statistics, 28, 547-572. Cited on p. 11. Lehmann, E . L . (1986), Testing Statistical Hypotheses, New York: Wiley. Cited on p. 18. Lehmann, E . L . and Romano, J . P. (2005), "Generalizations of the familywise error rate," The Annals of Statistics, 33, 1138-1154. Cited on p. 13. L i u , W . (2010), Simultaneous Inference and Francis. C i t e d on p. 12, 114.

for Regression,

B o c a R a t o n : Taylor

L i u , W . , Jamshidian, M . , and Zhang, Y . (2004), "Multiple comparison of s e v eral linear regression models," Journal of the American Statistical Association, 99, 395-403. Cited on p. 112. L i u , W . , Jamshidian, M . , Zhang, Y . , Bretz, F . , and H a n , X . (2007a), "Some new methods for the comparison of two linear regression models," Journal of Statistical Planning and Inference, 137, 57-67. C i t e d on p. 112, 113. L i u , Y . and H s u , J . (2009), "Testing for efficacy in primary and secondary endpoints by partitioning decision paths," Journal of the American Statistical Association, 104, 1661-1670. Cited on p. 30. L i u , Y . , H s u , J . C , and Ruberg, S. (2007b), " P a r t i t i o n testing in dose-response studies with multiple endpoints," Pharmaceutical Statistics, 6, 181-192. Cited on p. 30.

176

BIBLIOGRAPHY

Marcus, R. (1976), " T h e power of some tests of the equality of normal means against an ordered alternative," Biometrika, 63, 177-183. Cited on p. 103, 104, 105, 106, 157. Marcus, R . , Peritz, E . , and Gabriel, K . R. (1976), " O n closed testing p r o c e dures with special reference to ordered analysis of variance," Biometrika, 63, 655-660. Cited on p. 11, 23, 25, 78. Maurer, W . , Hothorn, L . , and Lehmacher, W . (1995), "Multiple comparisons in drug clinical trials and preclinical assays: a-priori ordered hypotheses," in Biometrie in der chemisch-pharmazeutischen Industrie, ed. J . Vollmar, Stuttgart: Fischer Verlag, pp. 3-18. Cited on p. 28. Maurer, W . and Mellein, B . (1988), " O n new multiple test procedures based on independent p-values and the assessment of their power," in Multiple Hypothesenprufung - Multiple Hypotheses Testing, eds. P. Bauer, G . Hommel, and E . Sonnemann, Heidelberg: Springer, pp. 48-66. C i t e d on p. 16. Mehta, C . Râ&#x20AC;&#x17E; Bauer, P., Posch, M . , and Brannath, W . (2007), "Repeated c o n fidence intervals for adaptive group sequential trials," Statistics in Medicine, 26, 5422-5433. C i t e d on p. 149. Miiller, H . H . and Schafer, H . (2001), "Adaptive group sequential designs for clinical trials: combining the advantages of adaptive and of classical group sequential approaches," Biometrics, 57, 886-891. Cited on p. 148, 151. Naik, U . D . (1975), "Some selection rules for comparing p processes w i t h a standard," Communications in Statistics, 4, 519-535. Cited on p. 78. O'Brien, P. C . and Fleming, T . R . (1979), " A multiple testing procedure for clinical trials," Biometrics, 5, 549-556. Cited on p. 141, 142, 144. Parsons, N . (2010), a s d : Simulations for Adaptive Seamless Designs, U R L h t t p : / / C R A N . R - p r o j e c t . org/package=asd, R package version 1.0. C i t e d on p. 140, 153. Parsons, N . , Friede, T . , Todd, S., Valdes-Marquez, E . , Chataway, J . , Nicholas, R . , and Stallard, N . (2010), " A n R package for implementing simulations for seamless phase I I / I I I clinical trials using early outcomes for treatment selection," (submitted). Cited on p. 153. Patel, H . I . (1991), "Comparison of treatments in a combination therapy t r i a l , " Journal of Biopharmaceutical Statistics, 1, 171-183. Cited on p. 23. Petrondas, D . A . and Gabriel, K . R . (1983), "Multiple comparisons by rerandomization," Journal of the American Statistical Association, 78, 949-957. Cited on p. 133. Piepho, H . - P . (2004), " A n algorithm for a letter-based representation of allpairwise comparisons," Journal of Computational and Graphical Statistics, 13, 456-466. Cited on p. 66, 87. Pinheiro, J . , Bornkamp, B . , and Bretz, F . (2006a), "Design and analysis of dose finding studies combining multiple comparisons and modeling procedures," Journal of Biopharmaceutical Statistics, 16, 639-656. C i t e d on p. 157.

BIBLIOGRAPHY

177

Pinheiro, J . , Bretz, F . , and Branson, M . (2006b), " A n a l y s i s of dose-response studies: Modeling approaches." in Dose Finding in Drug Development., ed. N. T i n g , New York: Springer Verlag, pp. 146 -171. C i t e d on p. 159. Pocock, S. J . (1977), " G r o u p sequential methods in the design and analysis of clinical trials," Biometrika, 64, 191-199. Cited on p. 141, 142, 144. Pouard, K . S., Gilbert, H . N., Ge, Y . , Taylor, S., and Dudoit, S. (2010), m u l t t e s t : Resampling-based Multiple Hypothesis Testing, U R L h t t p : / / CRAN . R - p r o j e c t . o r g / p a c k a g e = m u l t t e s t , R package version 2.4.0. Cited on p. 137. Posch, M . , Konig, F . , Branson, M., B r a n n a t h , W . , Dunger-Baldauf, C , and Bauer, P. (2005), "Testing and estimation in flexible group sequential designs with adaptive treatment selection," Statistics in Medicine, 24, 3697-3714. Cited on p. 151. Proschan, M . , L a n , K . K . G . , and Wittes, J . T . (2006), Statistical Monitoring of Clinical Trials - A Unified Approach, New York: Springer. Cited on p. 141. Proschan, M . A . and Hunsberger, S. A . (1995), "Designed extension of studies based on conditional power," Biometrics, 51, 1315-1324. C i t e d on p. 151. Ramsey, P. H . (1978), "Power differences between pairwise multiple c o m p a r isons," Journal of the American Statistical Association, 73, 479-487. C i t e d on p. 16. Romano, J . P. and Wolf, M . (2005), " E x a c t and approximate stepdown m e t h ods for multiple hypothesis testing," Journal of the American Statistical Association, 100, 94-108. C i t e d on p. 28. Rosenthal, R . and R u b i n , D . B . (1983), "Ensemble-adjusted p-values," chological Bulletin, 94, 540-541. Cited on p. 35.

Psy-

Rousseeuw, P., Croux, C . , Todorov, V . , Ruckstuhl, A . , Salibian-Barrera, M . , Verbeke, T . , and Maechler, M . (2009), r o b u s t b a s e : B a s i c Robust S t a t i s tics, U R L h t t p : / / C R A N . R - p r o j e c t . o r g / p a c k a g e = r o b u s t b a s e , R package version 0.5-0-1. C i t e d on p. 110. Rousseeuw, P. J . and Leroy, A . M . (2003), Robust Regression Detection, New York: John Wiley & Sons. C i t e d on p. 52.

and

Outlier

Roy, S. N . (1953), " O n a heuristic method of test construction and its use in multivariate analysis," Annals of Mathematical Statistics, 24, 220-238. Cited on p. 20. Roy, S. N . and Bose, R . C . (1953), "Simultaneous confidence interval e s t i m a tion," Annals of Mathematical Statistics, 24, 513-536. C i t e d on p. 20. Royen, T . (1991), "Generalized maximum range tests for pairwise comparisons of several populations," Biometrical Journal, 31, 905â&#x20AC;&#x201D;929. C i t e d on p. 95. Ruberg, S. J . (1995a), "Dose-response studies. I . Some design considerations," Journal of Biopharmaceutical Statistics, 5, 1-14. C i t e d on p. 100, 161.

178

BIBLIOGRAPHY

Ruberg, S. J . (1995b), "Dose-response studies. I I . Analysis and interpretation," Journal of Biopharmaceutical Statistics, 5, 15-42. Cited on p. 100. Salib, E . and Hillier, V . (1997), " A case-control study of smoking Alzheimer's disease," International Journal of Geriatric Psychiatry, 295-300. Cited on p. 118, 123.

and 12,

Samuel-Cahn, E . (1996), " I s the Simes' improved Bonferroni procedure c o n servative?" Biometrika, 83, 928-933. Cited on p. 37. Sarkar, S. and Chang, C . K . (1997), "Simes' method for multiple hypothesis testing with positively dependent test statistics," Journal of the American Statistical Association, 92, 1601-1608. Cited on p. 38. Sarkar, S. K . (1998), "Some probability inequalities for censored M T P 2 r a n dom variables: A proof of the Simes conjecture," The Annals of Statistics, 26, 494-504. Cited on p. 38. Sarkar, S. K . , Snapinn, S., and Wang, W . (1995), " O n improving the min test for the analysis of combination drug trials," Journal of Statistical Computation and Simulation, 51, 197-213, correction: 60, (1998) 180-181. Cited on p. 23. Schmidli, H . , Bretz, F . , Racine, A . , and Maurer, W . (2006), "Confirmatory seamless phase I I / I I I clinical trials with hypotheses selection at interim: Applications and practical considerations," Biometrical Journal, 48, 6 3 5 ÂŹ 643. Cited on p. 152. Schweder, T . and Spj0tvoll, E . (1982), "Plots of p-values to evaluate many tests simultaneously," Biometrika, 69, 493-502. Cited on p. 35. Searle, S. R . (1971), Linear Models, p. 41, 44.

New York: John W i l e y & Sons. Cited on

Seeger, P. (1968), " A note on a method for the analysis of significance en masse," Technometrics, 10, 586-593. Cited on p. 13. Senn, S. and Bretz, F . (2007), "Power and sample size when multiple endpoints â&#x20AC;˘ are considered," Pharmaceutical Statistics, 6, 161-170. Cited on p. 16. Serning, R . J . (1980), Approximation Theorems of Mathematical New York: John Wiley & Sons. Cited on p. 49, 50.

Statistics,

Shaffer, J . P. (1980), "Control of directional errors with stagewise multiple test procedures," The Annals of Statistics, 8, 1342-1347. Cited on p. 17. Shaffer, J . P. (1986), "Modified sequentially rejective multiple test procedures," Journal of the American Statistical Association, 81, 826-831. Cited on p. 28, 34, 63, 95, 99, 100. Sidak, Z. (1967), "Rectangular confidence regions for the means of multivariate normal distributions," Journal of the American Statistical Association, 62, 626-633. Cited on p. 34. Simes, R . J . (1986), " A n improved Bonferroni procedure for multiple tests of significance," Biometrika, 73, 751-754. Cited on p. 35, 37.

BIBLIOGRAPHY

179

Solorzano, E . and Spurrier, J . D . (1999), "One-sided simultaneous c o m p a r isons with more than one control," Journal of Statistical Computation and Simulation, 63, 37-46. Cited on p. 77. Sonnemann, E . (1982), "Allgemeine Losungen multipler Testprobleme," EDV in Medizin und Biologie, 13, 120-128, translated by H . Finner and reprinted in: Sonnemann, E . (2008), "General solutions to multiple testing problems," Biometrical Journal, 50, 641â&#x20AC;&#x201D;656. Cited on p. 11. Sonnemann, E . and Finner, H . (1988), "Vollstandigkeitssatze fur multiple T e s t probleme," in Multiple Hypothesenprufung - Multiple Hypotheses Testing, eds. P. Bauer, G . Hommel, and E . Sonnemann, Heidelberg: Springer, pp. 121-135. C i t e d on p. 20. Soric, B . (1989), "Statistical "discoveries" and effect size estimation," Journal of the American Statistical Association, 84, 608-610. C i t e d on p. 13. Spurrier, J . D . and Solorzano, E . (2004), "Multiple comparisons with more than one control," in Recent Developments in Multiple Comparison Procedures, eds. Y . Benjamini, F . Bretz, and S. Sarkar, volume 47 of IMS Lecture Notes - Monograph Series, pp. 119-128. Cited on p. 77. Stefansson, G . , K i m , W . C , and Hsu, J . C . (1988), " O n confidence sets in multiple comparisons," in Statistical Decision Theory and Related Topics IV, eds. S. S. G u p t a and J . O. Berger, New York: Academic Press, pp. 89-104. Cited on p. 11, 29. Stewart, W . H . and Ruberg, S. (2000), "Detecting dose-response w i t h c o n trasts," Statistics in Medicine, 19, 913-921. C i t e d on p. 157. Storey, J . D . (2002), " A direct approach to false discovery rates," Jomal of the Royal Statistical, Society B, 64, 479-498. C i t e d on p. 35. Storey, J . D . (2003), " T h e positive false discovery rate: A Bayesian interpretation and the g-value," The Annals of Statistics, 31, 2013-2035. C i t e d on p. 13. Strassburger, K . and Bretz, F . (2008), "Compatible simultaneous lower c o n fidence bounds for the Holm procedure and other Bonferroni based closed tests," Statistics in Medicine, 27, 4914-4927. C i t e d on p. 19, 28, 31, 34, 81. Strassburger, K . , Bretz, F . , and Finner, H . (2007), "Ordered multiple c o m p a r isons with the best and their applications to dose-response studies," Biometrics, 63, 1143-1151. C i t e d on p. 30, 157. Strassburger, K . , Bretz, F . , and Hochberg, Y . (2004), "Compatible confidence intervals for intersection union tests involving two hypotheses," i n Recent Developments in Multiple Comparison Procedures, eds. Y . Benjamini, F . Bretz, and S. Sarkar, Beachwood, Ohio: Institute of Mathematical S t a t i s tics, volume 47 of IMS Lecture Notes - Monograph Series, pp. 129-142. Cited on p. 19, 23, 30. Strasser, H . and Weber, C . (1999), " O n the asymptotic theory of permutation statistics," Mathematical Methods of Statistics, 8, 220-250. C i t e d on p. 137.

BIBLIOGRAPHY

180

Takeuchi, K . (1973), Studies in Some Aspects of Theoretical Foundations of Statistical Data Analysis, Tokyo: Toyo Keizai Shinposha, (in Japanese). Cited on p. 29. Takeuchi, K . (2010), " B a s i c ideas and concepts for multiple comparison p r o cedures," Biometrical Journal, (in press). Cited on p. 29. Tamhane, A . C , Dunnett, C . W . , Green, J . W . , and Wetherington, J . D . (2001), "Multiple test procedures for identifying the maximum safe dose," Journal of the American Statistical Association, 96, 835-843. C i t e d on p. 156. Tamhane, A . C , Hochberg, Y . , and Dunnett, C . W . (1996), "Multiple test procedures for dose finding," Biometrics, 52, 21â&#x20AC;&#x201D;37. Cited on p. 156. Tamhane, A . C . and Logan, B . (2006), "Multiple comparison procedures for dose response studies," in Dose Finding in Drug Development., ed. N . T i n g , New York: Springer Verlag, pp. 172-183. Cited on p. 157. T i n g , N . (2006), Dose Finding Cited on p. 100.

in Drug Development,

New York: Springer.

Tong, Y . L . (1980), Probability Inequalities in Multivariate York: Academic Press. Cited on p. 34.

Distributions,

Tong, Y . L . (1990), The Multivariate Verlag. Cited on p. 44.

New York: Springer

Normal Distribution,

New

Troendle, J . (1995), " A stepwsie resampling method of multiple hypothesis testing," Journal of the American Statistical Association, 90, 370-378. Cited on p. 127. Troendle, J . , K o r n , E . , and McShane, L . (2004), " A n example of slow convergence of the bootstrap in high dimensions," The American Statistician, 58, 25-29. C i t e d on p. 140. Tukey, J . W . (1953), T h e Problem of Multiple Comparisons, unpublished manuscript reprinted in: T h e Collected Works of John W . Tukey, Volume 8, 1994, H . I . B r a u n ( E d . ) , C h a p m a n and Hall, New York. C i t e d on p. 83, 100. Tukey, J . W . (1977), Exploratory Wesley. Cited on p. 53.

Data Analysis,

Reading, Mass: Addison-

Tukey, J . W . , Ciminera, J . L . , and Heyse, J . F . (1985), "Testing the statistical certainty of a response to increasing doses of a drug," Biometrics, 41, 2 9 5 ÂŹ 301. Cited on p. 157. van der L a a n , M . J . , Dudoit, S., and Pollard, K . S. (2004), "Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives," Statistical Applications in Genetics and Molecular Biology, 3(1). Cited on p. 14. Vandemeulebroecke, M . (2009), a d a p t T e s t : Adaptive Two-stage Tests, U R L h t t p - . / / C R A N . R - pro j e c t . o r g / p a c k a g e = a d a p t T e s t , R package version 1.0. Cited on p. 153.

181

BIBLIOGRAPHY Venables, W . N. and Ripley, B . D . (2002), Modem New York: Springer. Cited on p. 82, 83.

Applied Statistics

with S,

Victor, N. (1982), "Exploratory data-analysis and clinical research," Methods of Information in Medicine, 21, 53-54. Cited on p. 13. Wang, S. J . , O'Neill, R. T . , and Hung, H . M . J . (2007), "Approaches to e v a l u ation of treatment effect in randomized clinical trials with genomic subset," Pharmaceutical Statistics, 6, 227-244. Cited on p. 152. Wang, S. K . and T s i a t i s , A . A . (1987), "Approximately optimal one parameter boundaries for group sequential trials," Biometrics, 43, 193-199. Cited on p. 142, 143, 144. Wassmer, G . (1999), Statistische Testverfahren fur gruppensequentielle und adaptive Plane in klinischen Studien. Theoretische Konzepte und deren praktische Umsetzung mit SAS, K o l n : Verlag Alexander Monch. Cited on p. 141. Wassmer, G . (2009), " G r o u p sequential designs," in Encyclopedia of Clinical Trials, eds. R. D'Agostino, L . Sullivan, and J . Massaro, Hoboken: Wiley. Cited on p. 141. Westfall, P. (2005), "Combining p-values," in Encyclopedia of Biostatistics, eds. P. Armitage and T . Colton, Chichester: Wiley, pp. 987-991. Cited on p. 22. Westfall, P., Krishen, A . , and Young, S. S. (1998), " U s i n g prior information to allocate significance levels for multiple endpoints," Statistics in Medicine, 17, 2107-2119. C i t e d on p. 139. Westfall, P., Tobias, R . , and Bretz, F . (2000), E s t i m a t i n g directional error rates of stepwise multiple comparison methods using distributed computing and variance reduction, U R L h t t p : / / s u p p o r t . s a s . c o m / r n d / a p p / p a p e r s / d i r e c t i o n a l . p d f , Technical Report. Cited on p. 17. Westfall, P. H . (1997), "Multiple testing of general contrasts using logical c o n straints and correlations," Journal of the American Statistical Association, 92, 299-306. C i t e d on p. 51, 63, 95, 98, 99, 100, 103, 104, 157. Westfall, P. H . and Bretz, F . (2010), "Multiplicity in clinical trials," in Encyclopedia of Biopharmaceutical Statistics, ed. S. C . Chow, New York: Marcel Dekker I n c . , (in press). Cited on p. 6, 15. Westfall, P. H . and Krishen, A . (2001), "Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures," Journal of Statistical Planning and Inference, 99, 25-40. Cited on p. 28. Westfall, P. H . , Kropf, S., and Finos, L . (2004), "Weighted F W E - c o n t r o l l i n g methods in high-dimensional situations," in Recent Developments in Multiple Comparison Procedures, eds. Y . Benjamini, F . Bretz, and S. Sarkar, Beachwood, Ohio: Institute of Mathematical Statistics, volume 47 of IMS Lecture Notes - Monograph Series, pp. 143-154. C i t e d on p. 35.

182

BIBLIOGRAPHY

Westfall, P. H . and Tobias, R . D . (2007), "Multiple testing of general contrasts: Truncated closure and the extended Shaffer-Royen method," Journal of the American Statistical Association, 102, 487-494. Cited on p. 28, 34, 51, 63, 95. Westfall, P. H . , Tobias, R . D . , R o m , D . , Wolfinger, R . D . , and Hochberg, Y . (1999), Multiple Comparisons and Multiple Tests Using the SAS System, Cary, N C : S A S Institute Inc. Cited on p. xiii, x v i , 16, 71, 113, 136. Westfall, P. H . and Troendle, J . (2008), "Multiple testing with minimal a s sumptions," Biometrical Journal, 50, 745-755. Cited on p. 127, 130, 133. Westfall, P. H . and Young, S. S. (1993), Resampling-Based Multiple New York: Wiley. Cited on p. xiii, 18, 19, 28, 100, 101, 127, 133.

Testing,

Wiens, B . L . (2003), " A fixed sequence Bonferroni procedure for testing m u l tiple endpoints," Pharmaceutical Statistics, 2, 211-215. Cited on p. 28. Wiens, B . L . and Dmitrienko, A . (2005), " T h e fallback procedure for evaluating a single family of hypotheses," Journal of Biopharmaceutical Statistics, 15, 929-942. C i t e d on p. 28. Williams, D . A . (1971), " A test for difference between treatment means when several dose levels are compared with a zero dose control," Biometrics, 27, 103-117. C i t e d on p. 103, 104, 105, 106, 157. Wright, S. P. (1992), "Adjusted p-values for simultaneous inference," rics, 48, 1005-1013. C i t e d on p. 18.

Biomet-

Yohai, V . J . (1987), "High breakdown-point and high efficiency estimates for regression," The Annals of Statistics, 15, 642-665. Cited on p. 110. Zeileis, A . (2006), "Object-oriented computation of sandwich estimators," Journal of Statistical Software, 16, 1-16. Cited on p. 117.

Index Underlined page numbers for the subject items below refer to those places in the book where the associated topic is introduced a n d / o r discussed in detail.

Adaptive design, 150 p-value combination function, 151 a d a p t T e s t , 153 asd, 153 Fisher's product test, 151 Inverse normal combination test, 152 Operational definition, 151 Schematic diagram, 153 Adjusted p-value, 18 m u l t c o m p , 60 Adverse event analysis, see Multiple comparison problems All pairwise comparisons, see Multiple comparison problems, Tukey test Analysis of covariance, 41, 101 Analysis of variance, 41, 52 X test, 50, 60 F test, 50, 60, 83, 109, 111 Linear regression model, see Regression Mixed-effects model, 48, 52, 126 One-way layout, 53, 71, 115 Robust, 52, 110, 117 Two-way layout, 82 Average power, see Power 2

Bonferroni inequality, 31 Bonferroni test, 4, 17, 23, 31, 128 Adjusted p-value, 5, 31 Comparison to Simes test, 36 m u l t c o m p , 61 p . a d j u s t function, 31 Rejection region, 36 Simultaneous confidence intervals, 34, 65 Weighted, 35 Bootstrap testing, 140

Closed test procedure, 26, see also Step-down test, Step-up test Closed Dunnett test, see Step-down test Closed Tukey test, 93 Comparison to other methods, 99 m u l t c o m p , 98 Schematic diagram, 96 Truncated, 34, 95 Closure principle, 23, 78, 93, 127, 129, 134 Adjusted p-value, 26 Control of familywise error rate, 27 Operational definition, 26 Shortcut, 28, 79, 133 Simultaneous confidence intervals, 28 Visualization Parameter space, 23 Schematic diagram, 24, 25, 27, 33, 81, 96, 130, 153 Venn diagram, 24 Coherence, 20, 21, 24, 26 Combined error rate, 16 Comparisons with a control, see Multiple comparison problems, Dunnett test Conjunctive power, see Power Consonance, 20, 21 Contrasts, 42 General, 103 Modified Williams, see Trend test m u l t c o m p , 104 Orthogonal, 93 Other comparisons than contrasts, 43, 121, 126 Pairwise, 56, 75, 77, 101 Treatment contrasts in R, 57, 116, 122 183

184 Williams, see Trend test Dataset, xvi alpha, 115 alzheimer, 118 biom, 162 bodyfat, 108 c l i n i c a l , 124 immer, 82 l i t t e r , 101 recovery, 71 sbp, 113 thuesen, 3, 42 trees513, 125 warpbreaks, 53 Directional error, 6, 12, 16 Discrete data, see Sparse data Disjunctive power, see Power Dose response analysis, 99, 156, see also Trend test MCP-Mod procedure, 156 Dunnett test, 71, 101 Comparison to other methods, 81 Contrast matrix, 75 Correlation, 73 Distribution, 72 m u l t c o m p , 73 Null hypothesis, 72 Simultaneous confidence intervals, 74 Step-down, see Step-down test Step-up, see Step-up test Test statistic, 72 With covariates, 101 Effective number of tests, 128 Elementary null hypothesis, 11 Error rate, 11 Fallback procedure, 28 False discovery rate, 13 False negative, see Type I I error False positive, see Type I error Family of hypotheses, 13, 15 Familywise error rate, 13, 129 Fixed sequence test, 28 Free combinations, 19 Function

INDEX aov, xvi, 53, 55, 60, 73, 82, 101, 106, 115 asd.sim, 153 eld, 66, 88 contrMat, 55,104-106 glht, 5, 46, 53-60, 73-75, 77, 84-86, 90, 91, 106, 107, 109, 113, 116, 117, 119 glht.nunc, 90, 91, 93 glm, 55, 118 gsCP, 148 gsDesign, 145, 146 gsProbability, 147 independence_test, 138, 139 lm, xvi, 45, 55, 108, 113 lmer, 126 lmrob, 110 mcp, 55, 73, 77, 84 MCPMod, 163 model.tables, 83 nlme, xvi p.adjust, 31, 32, 39, 61, 63 pmvt, 45 sandwich, 117 survreg, 124 TukeyHSD, 85 uniroot, 46 Gatekeeping procedure, 28 General parametric model, 48 Adjusted p-value, 51 Simultaneous confidence intervals, 51 Generalized familywise error rate, 13 Generalized linear model, 48, 52 Global null hypothesis, 14, 21 Group sequential design, 141 A G S D e s t , 149 Average sample size, 142 Error spending function, 142 gsDesign, 145 O'Brien-Fleming boundaries, 142, 145 Pocock boundaries, 142 Wang-Tsiatis boundaries, 142 Hochberg procedure, 38 Adjusted p-value, 38 m u l t c o m p , 61

INDEX p.adjust function, 39 Holm procedure, 17, 19, 20, 32 Adjusted p-value, 32 Comparison to Hochberg procedure, 38 m u l t c o m p , 61 p . a d j u s t function, 32 Weighted, 35 Hommel procedure, 38 Comparison to Hochberg procedure, 39 multcomp, 61 p . a d j u s t function, 39 Individual power, see Power Intersection union test, 22 Letter display, see Tukey test Linear model, 42 Simultaneous confidence intervals, 44 Logical constraints, see Restricted combinations Many-to-one comparisons, see Multiple comparison problems, Dunnett test Marginal p-value, 4, 18 m u l t c o m p , 61 max-* test, 21, 78, 130, 133, see also Union intersection test Mean-mean comparison plot, 89 min test, 22, see also Intersection union test min-p test, 130 m u l t c o m p package, see also Package eld, 66 coef, 55 conf i n t , 64 contrMat, 55, 104 glht, 5, 46, 53 mcp, 55 modelparm, 55 plot, 64 p r i n t , 57 summary, 59 vcov, 55 Multiple comparison problems Adaptive design, 150, see also Adaptive design

185 Adverse event analysis, 128, 131, 137 All pairwise comparisons, 53, 82, 116, 124, see also Tukey test Comparisons with a control, 70, see also Dunnett test Comparisons with several controls, 77 Dose response, see Dose response analysis General contrasts, 103 Group sequential design, 141, see also Group sequential design Many-to-one comparisons, 72, see also Dunnett test Multiple endpoint analysis, 128, 139 Repeated hypothesis testing, 140 Safety analysis, 128, 131, 137 Simultaneous confidence bands, 111 Subgroup analysis, 118 Variable selection, 108 Multiple comparison procedure, 15 Construction methods, 20 Multiple endpoint analysis, see Multiple comparison problems Multiple testing, 7 Multiplicity problem, 1-3, 6 Multivariate t distribution, 43, 59, 72 Multivariate normal distribution, 59 asymptotic, 43, 50 One-sided studentized range test, see Tukey test p . a d j u s t function, 31, 61 Package, xvi a d a p t T e s t , 153, 181 A G S D e s t , 149, 172 asd, 140, 153, 155, 176 coin, 127, 137-139, 174 datasets, 53 D o s e F i n d i n g , 157, 160, 162, 168 gsDesign, 140, 141, 145, 148, 149, 167 H H , 65, 86, 90, 93, 173 I S w R , xvi, 3, 170 lme4, 126, 167 M A S S , xvi, 82 m u l t c o m p , xiv, xvi, xvii, 4-6, 28, 31, 34, 41, 45-48, 51, 53, 54, 59,

186 62-64, 66, 69-71, 73-75, 77, 78, 80, 81, 84, 85, 87, 93, 95, 98, 101, 103, 104, 106, 109, 113, 114, 116, 123, 174 multtest, 137, 177 m v t n o r m , 45, 63, 65, 161, 172 robust base, 110, 177 s a n d w i c h , 117 stats, 31, 61, 85 s u r v i v a l , 124 Partitioning principle, 18, 29 Visualization, 30 Per-comparison error rate, 12 Permutation test, 127 coin, 137 Fisher's exact test, 128, 131 Joint exchangeability assumption, 130 Multiple-sample problem, 127 Subset pivotality, 133, 134, 136 Two-sample problem, 129 Positive false discovery rate, 13 Power, 15 Average, 15 Conjunctive, 16 Disjunctive, 16, 155 Individual, 15 Proportion of false positives, 14 Regression, 42, 52 Logistic, 118, 126 Multiple linear, 108 Non-linear, 159 Simple linear, 3, 42, 45, 111 Repeated hypothesis testing, see Multiple comparison problems Reproducibility, 6 Resampling-based test, see Permutation test, Bootstrap test Restricted combinations, 19, 95, 135 Safety analysis, see Multiple comparison problems Selection effect, 6 Shaffer procedure, 19, 34, 95, 99 Comparison to other methods, 99 Shortcut procedure, see Closure principle

INDEX Sidak approach, 34, 122 Simes test, 35 Distributional assumptions, 37 Rejection region, 36 Simultaneous confidence bands, 111 Approximation, 113 Simultaneous confidence intervals, 18, 44, 81, 91, 121, 126 multcomp, 64 Single-step test, 17 Sparse data, 128, 130, 136 Step-down test, 17 Algorithm under free combinations, 79 Holm, see Holm procedure Shaffer, see Shaffer procedure Step-down Dunnett test, 77, 80, 103 Comparison to other methods, 81 multcomp, 80 Schematic diagram, 81 Step-up test, 17 Hochberg, see Hochberg procedure Step-up Dunnett test, 78 Stepwise test procedure, 17, 93 Strong control of error rates, 14 Studentized range test, see Tukey test Subgroup analysis, see Multiple comparison problems Subset pivotality, see Permutation test Survival analysis, 48, 52, 124 Treatment contrasts, see Contrasts Trend test, 103 Contrast test, 103 m u l t c o m p , 106 MCP-Mod procedure, 156 Modified Williams contrast test, 106 One-sided studentized range test, see Tukey test Williams contrast test, 104 Truncated closed test, see Closed test procedure Tukey test, 82, 116, 124 Closed Tukey test, see Closed test procedure Comparison to other methods, 99 Contrast matrix, 56, 116 Distribution, 84 Generalized Tukey test, 93

INDEX Hypotheses, 53, 83 Letter display, 66, 87 multcomp, 55, 84 One-sided studentized range test, 85 Simultaneous confidence intervals, 64, 86, 118 Test statistics, 59, 83 TukeyHSD function, 85 Under heteroscedasticity, 114 TukeyHSD function, see Tukey test Type I error, 7, 12, see also Error Rate Inflation, 1 Type I I error, 12, 15, see also Power Type I I I error, see Directional error Unadjusted p-value, see Marginal p-value Union intersection test, 20, 24 Variable selection, see Multiple comparison problems Weak control of error rates, 14

187