
1 1 A Course in Multiple Comparisons and Multiple Tests Peter H. Westfall, Ph.D. Professor of Statistics, Department of Inf. Systems and Quant. Sci. Texas Tech University

2 2 Learning Outcomes
- Elucidate reasons that multiple comparisons procedures (MCPs) are used, as well as their controversial nature
- Know when and how to use classical interval-based MCPs including Tukey, Dunnett, and Bonferroni
- Understand how MCPs affect power
- Elucidate the definition of closed testing procedures (CTPs)
- Understand specific types of CTPs, benefits and drawbacks
- Distinguish false discovery rate (FDR) from familywise error rate (FWE)
- Understand general issues regarding Bayesian MCPs

3 3 Outline of Material
- Introduction. Overview of Problems, Issues, and Solutions; Regulatory and Ethical Perspectives; Families of Tests; Familywise Error Rate; Bonferroni.
- Interval-Based Multiple Inferences in the standard linear models framework. One-way ANOVA and ANCOVA; Tukey, Dunnett, and Monte Carlo Methods; Adjusted p-values; general contrasts; Multivariate T distribution; Tight Confidence Bands; Treatment × Covariate Interaction; Subgroup Analysis.
- Power and Sample Size Determinations for multiple comparisons.
- Stepwise and Closed Testing Procedures I: P-value-Based Methods. Closure Method; Global Tests; Holm, Hommel, Hochberg and Fisher combined methods for p-values.
- Stepwise and Closed Testing Procedures II: Fixed Sequences, Gatekeepers and I-U tests. Fixed Sequence tests; Gatekeeper procedures; Multiple hypotheses in a gate; Intersection-union tests; with application to dose response, primary and secondary endpoints, bioequivalence and combination therapies.

4 4 Outline (Continued)
- Stepwise and Closed Testing Procedures III: Methods that use logical constraints and correlations. Lehmacher et al. Method for Multiple endpoints; Range-Based and F-based ANOVA Tests; Fisher’s protected LSD; Free and Restricted Combinations; Shaffer-Type Methods for dose comparisons and subgroup analysis.
- Multiple nonparametric and semiparametric tests: Bootstrap and Permutation-based Closed testing. PROC MULTTEST; examples with multiple endpoints, genetic associations, gene expression, binary data and adverse events.
- More complex models and FWE control: Heteroscedasticity, Repeated measures, and large sample methods. Applications: multiple treatment comparisons, crossover designs, logistic regression of cure rates.
- False Discovery Rate: Benjamini and Hochberg’s method; comparison with FWE-controlling methods.
- Bayesian methods: Simultaneous credible intervals; ranking probabilities and loss functions; PROC MIXED posterior sampling; Bayesian testing of multiple endpoints.
- Conclusion, discussion, references.

5 5 Sources of Multiplicity
- Multiple variables (endpoints)
- Multiple timepoints
- Subgroup analysis
- Multiple comparisons
- Multiple tests of the same hypothesis
- Variable and Model selection
- Interim analysis
- Hidden Multiplicity: File Drawers, Outliers

6 6 The Problem: “Significant” results may fail to replicate. Documented cases: Ioannidis (JAMA 2005)

7 7 An Example
- Phase III clinical trial
- Three arms – Placebo, AC, Drug
- Endpoints: Signs and symptoms
- Measured at weekly visits
- Baseline covariates

8 8 Example-Continued
‘Features’ displayed at trial conclusion:
- Trends
- Baseline adjusted comparisons of raw data
- Baseline adjusted % changes
- Nonparametric and parametric tests
- Specific endpoints and combinations of endpoints
- Particular week results
- AC and Placebo comparisons
Fact: The features that “look the best” are biased.

9 9 Example Continued – Feature Selection
- ‘Effect Size’ is a feature
- Effect size = (mean difference)/sd
- Dimensionless: .2 = ‘small’, .5 = ‘medium’, .8 = ‘large’
- Estimated effect sizes: $F_1, F_2, \ldots, F_k$
- What if you select $\max\{F_1, F_2, \ldots, F_k\}$ and publish it?

10 10 The Scientific Concern

11 11 Feature Selection Model
- Clinical Trials Simulation
- Real data used
- Conservative!
If you must know more:
- $F_j = \delta_j + \varepsilon_j$, $j = 1, \ldots, 20$.
- Error terms $\varepsilon_j$ are $N(0, 0.2^2)$.
- True effect sizes $\delta_j$ are $N(0.3, 0.1^2)$.
- Features $F_j$ are highly correlated.
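The selection effect in this model can be illustrated with a small simulation. The following is a minimal Python sketch, not the course's own simulation: the function name `simulate_selection`, the equicorrelation value `rho = 0.8` used to make the features "highly correlated", and the number of replications are all assumptions chosen for illustration.

```python
import random
import statistics

def simulate_selection(k=20, mu=0.3, tau=0.1, sigma=0.2, rho=0.8,
                       reps=5000, seed=1):
    """Simulate F_j = delta_j + eps_j and always report the largest F_j.

    Returns (mean true effect of the selected feature,
             mean estimated effect of the selected feature).
    rho is an assumed common correlation among the error terms.
    """
    rng = random.Random(seed)
    selected_true, selected_est = [], []
    for _ in range(reps):
        shared = rng.gauss(0.0, 1.0)  # common component inducing correlation
        best_est, best_true = None, None
        for _ in range(k):
            delta = rng.gauss(mu, tau)  # true effect size
            z = rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            f = delta + sigma * z       # estimated effect size
            if best_est is None or f > best_est:
                best_est, best_true = f, delta
        selected_true.append(best_true)
        selected_est.append(best_est)
    return statistics.mean(selected_true), statistics.mean(selected_est)

true_mean, est_mean = simulate_selection()
print(f"selected feature: mean true effect {true_mean:.3f}, "
      f"mean estimate {est_mean:.3f}")
```

Under these assumptions the published maximum overstates the true effect of the selected feature on average, which is the bias the slides describe.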

12 12 Key Points:
(i) Multiplicity invites Selection
(ii) Selection has an EFFECT, just like effects due to
- Treatment
- Confounding
- Learning
- Nonresponse
- Placebo

13 13 Published Guidelines
- ICH Guidelines
- CPMP Points to consider
- CDRH Statistical Guidance
- ASA Ethical Guidelines

14 14 Regulatory/Journal/Ethical/Professional Concerns
- Replicability (good science)
- Fairness
Regulatory report: The drug company reported efficacy at p=.047. We repeated the analysis in several different ways that the company might have done. In 20 re-analyses of the data, 18 produced p-values greater than .05. Only one of the 20 re-analyses produced a p-value smaller than .047.

15 15 Multiple Inferences: Notation
- There is a “family” of k inferences
- Parameters are $\theta_1, \ldots, \theta_k$
- Null hypotheses are $H_{01}: \theta_1 = 0, \ldots, H_{0k}: \theta_k = 0$

16 16 Comparisonwise Error Rate (CER)
- Intervals: $\mathrm{CER}_j = P(\text{Interval}_j \text{ incorrect})$
- Tests: $\mathrm{CER}_j = P(\text{Reject } H_{0j} \mid H_{0j} \text{ is true})$
- Usually $\mathrm{CER} = \alpha = .05$

17 17 Familywise Error Rate (FWE)
- Intervals: FWE = 1 - P(all intervals are correct)
- Tests: FWE = P(reject at least one true null)

18 18 False Discovery Rate
- FDR = E(proportion of rejections that are incorrect)
- Let R = total # of rejections
- Let V = # of erroneous rejections
- FDR = E(V/R) (0/0 defined as 0)
- FWE = P(V > 0)

19 19 Bonferroni Method
- Identify Family of inferences
- Identify number of elements (k) in the Family
- Use $\alpha/k$ for all inferences.
Ex: With k = 36, p-values must be less than 0.05/36 ≈ 0.0014 to be “significant”

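20 20 FWE Control for Bonferroni
$$\mathrm{FWE} = P(p_{0j_1} \le .05/36 \text{ or } \ldots \text{ or } p_{0j_m} \le .05/36 \mid H_{0j_1}, \ldots, H_{0j_m} \text{ true})$$
$$\le P(p_{0j_1} \le .05/36) + \cdots + P(p_{0j_m} \le .05/36) = (.05)m/36 \le .05$$
The bound uses $P(A \cup B) \le P(A) + P(B)$ for any events A and B.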
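As a numerical companion to the bound above, here is a small Python sketch comparing the familywise error rate with and without the Bonferroni adjustment for k = 36. It assumes all 36 nulls are true and the tests are independent, which is an illustrative assumption rather than anything the Bonferroni argument requires; the helper name `fwe_independent` is hypothetical.

```python
def fwe_independent(alpha_per_test, k):
    """FWE for k independent tests of true nulls, each run at alpha_per_test."""
    return 1.0 - (1.0 - alpha_per_test) ** k

k, alpha = 36, 0.05
unadjusted = fwe_independent(alpha, k)      # every test at .05
bonferroni = fwe_independent(alpha / k, k)  # every test at .05/36
union_bound = k * (alpha / k)               # Boole's inequality bound

print(f"unadjusted FWE  = {unadjusted:.4f}")
print(f"Bonferroni FWE  = {bonferroni:.4f}")
print(f"union bound     = {union_bound:.4f}")
```

The Bonferroni FWE sits just below the union bound of .05, while the unadjusted FWE is far above it.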

21 21 “Families” in clinical trials (1)
Efficacy
- Main Interest (Primary & Secondary): Approval and Labeling depend on these. Tight FWE control needed.
- Lesser Interest: Depending on goals and reviewers, FWE controlling methods might be needed.
- Supportive Tests: mostly descriptive. FWE control not needed.
- Exploratory Tests: investigate new indications; future trials needed to confirm; do what makes sense.
Safety
- Serious and known treatment-related AEs: FWE control not needed.
- All other AEs: Reasonable to control FWE (or FDR).
(1) Westfall, P. and Bretz, F. (2003). Multiplicity in Clinical Trials. Encyclopedia of Biopharmaceutical Statistics, second edition, Shein-Chung Chow, ed., Marcel Dekker Inc., New York.

22 22 Classical Single-Step Testing and Interval Methods to Control FWE
- Simultaneous confidence intervals
- Adjusted p-values
- Dunnett method
- Tukey’s method
- Simulation-based methods for general comparisons

23 23 “Specificity” and “Sensitivity”
If you want...                                 ...then use
Estimates of effect sizes & error margins      Simultaneous Confidence Intervals
Confident inequalities                         Stepwise or closed tests
Overall Test                                   F-test, O’Brien, etc.

24 24 The Model
$Y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2 I)$
- Includes ANOVA, ANCOVA, regression
- For group comparisons, covariate adjustment
- Not valid for survival analysis, binary data, multivariate data

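25 25 Example: Pairwise Comparisons against Control
Goal: Estimate all mean differences from control and provide simultaneous 95% error margins (balanced case, n per group):
$$\bar{y}_i - \bar{y}_0 \pm c_\alpha\, \hat{\sigma}\sqrt{2/n}$$
What $c_\alpha$ to use?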

26 26 Comparison of Critical Values

27 27 Results - Dunnett
The GLM Procedure
Dunnett's t Tests for gain
NOTE: This test controls the Type I experimentwise error for comparisons of all treatments against a control.
Alpha                             0.05
Error Degrees of Freedom          21
Error Mean Square
Critical Value of Dunnett's t
Minimum Significant Difference
Comparisons significant at the 0.05 level are indicated by ***.
Columns: g Comparison | Difference Between Means | Simultaneous 95% Confidence Limits
(two comparisons are flagged ***)

28 28 $c_\alpha$ is the $1-\alpha$ quantile of the distribution of $\max_i |Z_i - Z_0| / (2\chi^2/df)^{1/2}$, called Dunnett’s two-sided range distribution.

29 29 Adjusted p-Values
Definition: Adjusted p-value = smallest FWE at which the hypothesis is rejected,
or
the FWE for which the confidence interval has “0” as a boundary.

30 30 Adjusted p-values for Dunnett
proc glm data=tox;
   class g;
   model gain=g;
   lsmeans g / adjust=dunnett pdiff;
run;

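31 31 Example: All Pairwise Comparisons
Goal: Estimate all mean differences and provide simultaneous 95% error margins (balanced case, n per group):
$$\bar{y}_i - \bar{y}_{i'} \pm c_\alpha\, \hat{\sigma}\sqrt{2/n}$$
What $c_\alpha$ to use?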

32 32 Comparison of Critical Values

33 33 Tukey Comparisons Alpha= 0.05 df= 21 MSE= Critical Value of Studentized Range= Minimum Significant Difference= Means with the same letter are not significantly different. Tukey Grouping Mean N G A A A A A A A A A A A A A

34 34 Tukey Adjusted p-Values General Linear Models Procedure Least Squares Means Adjustment for multiple comparisons: Tukey G GAIN Pr > |T| H0: LSMEAN(i)=LSMEAN(j) LSMEAN i/j

35 35 Tukey Simultaneous Intervals Simultaneous Simultaneous Lower Difference Upper Confidence Between Confidence i j Limit Means Limit

36 36 $c_\alpha$ is $(1/\sqrt{2}) \times$ {the $1-\alpha$ quantile of the distribution of $\max_{i,i'} |Z_i - Z_{i'}| / (\chi^2/df)^{1/2}$}, which is called the Studentized range distribution.

37 37 Unbalanced Designs and/or Covariates
- Tukey method is conservative when the design is unbalanced and/or there are covariates; otherwise exact
- Dunnett method is conservative when there are covariates; otherwise exact
- “Conservative” means {True FWE} < {Nominal FWE}; also means “less powerful”

38 38 Tukey-Kramer Method for all pairwise comparisons
Let $c_\alpha$ be the critical value for the balanced case using Tukey’s method and the correct df.
Intervals are
$$\bar{y}_i - \bar{y}_{i'} \pm c_\alpha\, \hat{\sigma}\sqrt{1/n_i + 1/n_{i'}}$$
Conservative (Hayter, 1984 Annals)

39 39 Exact Method for General Comparisons of Means

40 40 Multivariate T-Distribution Details

41 41 Calculation of “Exact” $c_\alpha$
- Edwards and Berry: Simple simulation
- Hsu and Nelson: Factor analytic control variate (better)
- Genz and Bretz: Integration using lattice methods (best)
Even with simple simulation, the value $c_\alpha$ can be obtained with reasonable precision.
Edwards, D., and Berry, J. (1987). The efficiency of simulation-based multiple comparisons. Biometrics, 43.
Hsu, J.C. and Nelson, B.L. (1998). Multiple comparisons in the general linear model. Journal of Computational and Graphical Statistics, 7.
Genz, A. and Bretz, F. (1999). Numerical Computation of Multivariate t Probabilities with Application to Power Calculation of Multiple Contrasts. J. Stat. Comp. Simul., 63.

42 42 Example: ANCOVA with two covariates
- Y = Diastolic BP
- Group = Therapy (Control, D1, D2, D3)
- X1 = Baseline Diastolic BP
- X2 = Baseline Systolic BP
Goal: Compare all therapies, controlling for baseline
proc glm data=research.bpr;
   class therapy;
   model dbp10 = therapy dbp7 sbp7;
   lsmeans therapy / pdiff cl adjust=simulate(nsamp= cvadjust seed= report);
run; quit;

43 43 Results From ANCOVA Source DF Type III SS Mean Square F Value Pr > F THERAPY DBP <.0001 SBP Least Squares Means for Effect THERAPY Difference Simultaneous 95% Between Confidence Limits for i j Means LSMean(i)-LSMean(j) Note: “4” is control

44 44 Details for Quantile Simulation Random number seed Comparison type All Sample size Target alpha 0.05 Accuracy radius (target) Accuracy radius (actual) 437E-7 Accuracy confidence 99% Simulation Results Estimated 99% Confidence Method 95% Quantile Alpha Limits Simulated Tukey-Kramer Bonferroni Sidak GT Scheffe T NOTE: PROCEDURE GLM used: real time seconds

45 45 Results from ANCOVA-Dunnett H0:LSMean= Control THERAPY DBP10 LSMEAN Pr > |t| Dose Dose Dose Placebo Least Squares Means for Effect THERAPY Difference Simultaneous 95% Between Confidence Limits for i j Means LSMean(i)-LSMean(j)

46 46 Details for Quantile Simulation- Dunnett Random number seed Comparison type Control, two-sided Sample size Target alpha 0.05 Accuracy radius (target) Accuracy radius (actual) 139E-7 Accuracy confidence 99% Simulation Results Estimated 99% Confidence Method 95% Quantile Alpha Limits Simulated Dunnett-Hsu, two-sided Bonferroni Sidak GT Scheffe T NOTE: PROCEDURE GLM used: real time seconds

47 47 More General Inferences Question: For what values of the covariate is treatment A better than treatment B?

48 48 Discussion of (Treatment × Covariate) Interaction Example

49 49 The GLIMMIX Procedure Computes MC-exact simultaneous confidence intervals and adjusted p-values for any set of linear functions in a linear model

50 50 GLIMMIX syntax
proc glimmix data=research.tire;
   class make;
   model cost = make mph make*mph;
   estimate "10" make 1 -1 make*mph ,
            "15" make 1 -1 make*mph ,
            "20" make 1 -1 make*mph ,
            "25" make 1 -1 make*mph ,
            "30" make 1 -1 make*mph ,
            "35" make 1 -1 make*mph ,
            "40" make 1 -1 make*mph ,
            "45" make 1 -1 make*mph ,
            "50" make 1 -1 make*mph ,
            "55" make 1 -1 make*mph ,
            "60" make 1 -1 make*mph ,
            "65" make 1 -1 make*mph ,
            "70" make 1 -1 make*mph
            / adjust=simulate(nsamp= report) cl;
run;

51 51 Output from PROC GLIMMIX
Simultaneous intervals are Estimate ± (critical value) × StdErr
Columns: Label | Estimate | StdErr | tValue | AdjLower | AdjUpper
Bonferroni critical value is $t_{16,\ .05/(2 \times 13)}$ =

52 52 Other Applications of Linear Combinations
Multiple Trend Tests
- (0,1,2,3), (0,1,2,4), (0,4,6,7) (carcinogenicity)
- (0,0,1), (0,1,1), (0,1,2) (recessive/dominant/ordinal genotype effects)
Subgroup Analysis
- Subgroups define linear combinations (more on next slide)

53 53 Subgroup Analysis Example
Data: $Y_{ijkl}$, where i = Trt, Cntrl; j = Old, Yng; k = GoodInit, PoorInit.
Model: $Y_{ijkl} = \mu_{ijk} + \varepsilon_{ijkl}$, where $\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}$
Subgroup Contrasts:
             μ111   μ112   μ121   μ122   μ211   μ212   μ221   μ222
Overall        ¼      ¼      ¼      ¼     -¼     -¼     -¼     -¼
Older          ½      ½      0      0     -½     -½      0      0
Younger        0      0      ½      ½      0      0     -½     -½
GoodInit       ½      0      ½      0     -½      0     -½      0
PoorInit       0      ½      0      ½      0     -½      0     -½
OldGood        1      0      0      0     -1      0      0      0
OldPoor        0      1      0      0      0     -1      0      0
YoungGood      0      0      1      0      0      0     -1      0
YoungPoor      0      0      0      1      0      0      0     -1

54 54 Subgroup Analysis Results Label Estimate StdErr tValue Probt Adjp AdjLower AdjUpper Overall I Older I Younger I GoodInitHealth I PoorInitHealth I OldGood I OldPoor I YoungGood I YoungPoor I (SAS code available upon request)

55 55 Summary
- Include only comparisons of interest.
- Utilize correlations to be less conservative.
- The critical values can be computed exactly only in balanced ANOVA for all pairwise comparisons, or in unbalanced ANOVA for comparisons with control.
- Simulation-based methods are “exact” if you let the computer run for a while. This is my general recommendation.

56 56 Power Analysis
- Sample size - Design of study
- Power is less when you use multiple comparisons ⇒ larger sample sizes
- Many power definitions
- Bonferroni & independence are convenient (but conservative) starting points

57 57 Power Definitions
- “Complete Power” = P(Reject all $H_{0i}$ that are false)
- “Minimal Power” = P(Reject at least one $H_{0i}$ that is false)
- “Individual Power” = P(Reject a particular $H_{0i}$ that is false)
- “Proportional Power” = Average proportion of false $H_{0i}$ that are rejected

58 58 Power Calculations
Example: $H_1$ and $H_2$ powered individually at 50%; $H_3$ and $H_4$ powered individually at 80%; all tests independent.
- Complete Power = P(reject $H_1$ and $H_2$ and $H_3$ and $H_4$) = .5 × .5 × .8 × .8 = 0.16
- Minimal Power = P(reject $H_1$ or $H_2$ or $H_3$ or $H_4$) = 1 - P(“accept” $H_1$ and $H_2$ and $H_3$ and $H_4$) = 1 - (1-.5)(1-.5)(1-.8)(1-.8) = 0.99
- Individual Power = P(reject $H_3$ (say)) = 0.80 (depends on the test)
- Proportional Power = (.5 + .5 + .8 + .8)/4 = 0.65
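The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation for independent tests; the variable names are hypothetical.

```python
from math import prod

# Individual powers of H1..H4 from the example; tests assumed independent
powers = [0.5, 0.5, 0.8, 0.8]

complete = prod(powers)                      # reject every false null
minimal = 1 - prod(1 - p for p in powers)    # reject at least one false null
proportional = sum(powers) / len(powers)     # average proportion rejected

print(f"complete={complete:.2f} minimal={minimal:.2f} "
      f"proportional={proportional:.2f}")
```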

59 59 Sample Size for Adequate Individual Power - Conservative Estimate

60 60 Individual power of two-tail two-sample Bonferroni t-tests
%let MuDiff = 5;     /* Smallest meaningful difference MUx-MUy that you want to detect */
%let Sigma  = 10.0;  /* A guess of the population std. dev. */
%let alpha  = .05;   /* Familywise Type I error probability of the test */
%let k      = 4;     /* Number of tests */
options ls=76;
data power;
   cer = &alpha/&k;
   do n = 2 to 100 by 2;                        /* n = sample size for each group */
      df    = n + n - 2;
      ncp   = (&MuDiff)/(&Sigma*sqrt(2/n));      /* The noncentrality parameter */
      tcrit = tinv(1-cer/2, df);                 /* The critical t value */
      power = 1 - probt(tcrit, df, ncp) + probt(-tcrit, df, ncp);
      output;
   end;
run;
proc print data=power; run;
proc plot data=power;
   plot power*n / vpos=30;
run;

61 61 Graph of Power Function
[Plot of power (vertical axis, 0.0 to 1.0) against n (horizontal axis); annotation: n=92 for 80% power]

62 62 %IndividualPower macro*
- Uses PROBMC and PROBT (noncentral)
- Assumes that you want to use the single-step (confidence interval based) Dunnett (one- or two-sided) or Range (two-sided) test
- Less conservative than Bonferroni
- Conservative compared to stepwise procedures
%IndividualPower(MCP=DUNNETT2, g=4, d=5, s=10);
*Westfall et al. (1999), Multiple Comparisons and Multiple Tests Using SAS

63 63 %IndividualPower Output

64 64 More general Power - Simulate!
Invocation:
%SimPower(method = dunnett, TrueMeans = (10, 10, 13, 15, 15), s = 10, n = 87, seed = 12345);
Output:
Method=DUNNETT, Nominal FWE=0.05, nrep=1000
True means = (10, 10, 13, 15, 15), n=87, s=10
Quantity              Estimate   ---95% CI----
Complete Power                   (0.260,0.316)
Minimal Power                    (0.913,0.945)
Proportional Power               (0.633,0.669)
True FWE                         (0.011,0.027)
Directional FWE                  (0.011,0.027)

65 65 Concluding Remarks - Power
- Need a bigger n
- Like to avoid bigger n (see sequential, gatekeepers methods later)
- Which definition?
- Bonferroni and independence useful
- Simulation useful, especially for the more complex methods that follow

66 66 Closed and Stepwise Testing Methods I: Standard P-Value Based Methods
If you want...                                 ...then use
Estimates of effect sizes & error margins      Simultaneous Confidence Intervals
Confident inequalities                         Stepwise or closed tests: Holm’s Method, Hommel’s Method, Hochberg’s Method, Fisher Combination Method
Overall Test                                   F-test, O’Brien, etc.

67 67 Closed Testing Method(s)
- Form the closure of the family by including all intersection hypotheses.
- Test every member of the closed family by a (suitable) $\alpha$-level test. (Here, $\alpha$ refers to comparisonwise error rate.)
- A hypothesis can be rejected provided that its corresponding test is significant at level $\alpha$ and every other hypothesis in the family that implies it is rejected by its $\alpha$-level test.

68 68 Closed Testing – Multiple Endpoints
$H_0: \delta_1=\delta_2=\delta_3=\delta_4=0$
$H_0: \delta_1=\delta_2=\delta_3=0$;  $H_0: \delta_1=\delta_2=\delta_4=0$;  $H_0: \delta_1=\delta_3=\delta_4=0$;  $H_0: \delta_2=\delta_3=\delta_4=0$
$H_0: \delta_1=\delta_2=0$;  $H_0: \delta_1=\delta_3=0$;  $H_0: \delta_1=\delta_4=0$;  $H_0: \delta_2=\delta_3=0$;  $H_0: \delta_2=\delta_4=0$;  $H_0: \delta_3=\delta_4=0$
$H_0: \delta_1=0$, p = ;  $H_0: \delta_2=0$, p = ;  $H_0: \delta_3=0$, p = ;  $H_0: \delta_4=0$, p =
Where $\delta_j$ = mean difference, treatment - control, endpoint j.

69 69 Closed Testing – Multiple Comparisons
$\mu_1=\mu_2=\mu_3=\mu_4$
$\mu_1=\mu_2=\mu_3$;  $\mu_1=\mu_2=\mu_4$;  $\mu_1=\mu_3=\mu_4$;  $\mu_2=\mu_3=\mu_4$;  $\{\mu_1=\mu_2,\ \mu_3=\mu_4\}$;  $\{\mu_1=\mu_3,\ \mu_2=\mu_4\}$;  $\{\mu_1=\mu_4,\ \mu_2=\mu_3\}$
$\mu_1=\mu_2$;  $\mu_1=\mu_3$;  $\mu_1=\mu_4$;  $\mu_2=\mu_3$;  $\mu_2=\mu_4$;  $\mu_3=\mu_4$
Note: Logical implications imply that there are only 14 nodes, not $2^6 - 1 = 63$ nodes.

70 70 Control of FWE with Closed Tests
Suppose $H_{0j_1}, \ldots, H_{0j_m}$ all are true (unknown to you which ones).
{Reject at least one of $H_{0j_1}, \ldots, H_{0j_m}$ using CTP} $\subseteq$ {Reject $H_{0j_1} \cap \ldots \cap H_{0j_m}$}
Thus, P(reject at least one of $H_{0j_1}, \ldots, H_{0j_m}$ | $H_{0j_1}, \ldots, H_{0j_m}$ all are true) $\le$ P(reject $H_{0j_1} \cap \ldots \cap H_{0j_m}$ | $H_{0j_1}, \ldots, H_{0j_m}$ all are true) $= \alpha$

71 71 Examples of Closed Testing Methods
When the Composite Test is...       Then the Closed Method is...
Bonferroni MinP                     Holm’s Method
Resampling-Based MinP               Westfall-Young method
Simes                               Hommel’s method
O’Brien                             Lehmacher’s method
Simple or weighted test             Fixed sequence test (a-priori ordered)
…                                   …

72 72 P-value Based Methods
- Test global hypotheses using p-value combination tests
- Benefit: fewer model assumptions; only need to say that the p-values are valid
- Allows for models other than homoscedastic normal linear models (like survival analysis).

73 73 Holm’s Method is Closed Testing Using the Bonferroni MinP Test
Reject $H_{0j_1} \cap H_{0j_2} \cap \ldots \cap H_{0j_m}$ if $\min(p_{0j_1}, p_{0j_2}, \ldots, p_{0j_m}) \le \alpha/m$.
Or, reject $H_{0j_1} \cap H_{0j_2} \cap \ldots \cap H_{0j_m}$ if $p^* = m \times \min(p_{0j_1}, p_{0j_2}, \ldots, p_{0j_m}) \le \alpha$.
(Note that $p^*$ is a valid p-value for the joint null, comparable to the p-value for Hotelling’s $T^2$ test.)

74 74 Holm’s Stepdown Method
$H_0: \delta_1=\delta_2=\delta_3=\delta_4=0$: minp= , p*=
$H_0: \delta_1=\delta_2=\delta_3=0$: minp= , p*= ;  $H_0: \delta_1=\delta_2=\delta_4=0$: minp= , p*= ;  $H_0: \delta_1=\delta_3=\delta_4=0$: minp=.0121, p*= ;  $H_0: \delta_2=\delta_3=\delta_4=0$: minp= , p*=
$H_0: \delta_1=\delta_2=0$: minp= , p*= ;  $H_0: \delta_1=\delta_3=0$: minp= , p*= ;  $H_0: \delta_1=\delta_4=0$: minp= , p*= ;  $H_0: \delta_2=\delta_3=0$: minp= , p*= ;  $H_0: \delta_2=\delta_4=0$: minp= , p*= ;  $H_0: \delta_3=\delta_4=0$: minp= , p*=
$H_0: \delta_1=0$, p = ;  $H_0: \delta_2=0$, p = ;  $H_0: \delta_3=0$, p = ;  $H_0: \delta_4=0$, p =
Where $\delta_j$ = mean difference, treatment - control, endpoint j.

75 75 Shortcut For Holm’s Method
Let $H_{(1)}, \ldots, H_{(k)}$ be the hypotheses corresponding to $p_{(1)} \le \ldots \le p_{(k)}$.
– If $p_{(1)} \le \alpha/k$, reject $H_{(1)}$ and continue, else stop and retain all $H_{(1)}, \ldots, H_{(k)}$.
– If $p_{(2)} \le \alpha/(k-1)$, reject $H_{(2)}$ and continue, else stop and retain $H_{(2)}, \ldots, H_{(k)}$.
– …
– If $p_{(k)} \le \alpha$, reject $H_{(k)}$.

76 76 Adjusted p-values for Closed Tests
- The adjusted p-value for $H_{0j}$ is the maximum of all p-values over all relevant nodes
- In the previous example, $p_{A(1)} = 0.0484$, $p_{A(2)} = 0.0484$, $p_{A(3)} = 0.0484$, $p_{A(4)} =$
- General formula for Holm: $p_{A(j)} = \max_{i \le j}\,(k-i+1)\,p_{(i)}$.
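The general formula for Holm adjusted p-values can be sketched in Python as follows. The function name `holm_adjust` is hypothetical, and the four raw p-values are an assumption: they are the leading entries of the Lehmacher output slide later in the deck, taken here to be the raw endpoint p-values. With them, the sketch returns 0.0484 for the three smallest p-values, matching the values quoted above.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values: p_A(j) = max_{i<=j} (k-i+1) p_(i)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * k
    running_max = 0.0
    for rank, idx in enumerate(order):                # rank 0 = smallest p-value
        running_max = max(running_max, (k - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)         # cap at 1
    return adjusted

# Assumed raw p-values for endpoints 1..4 (see lead-in)
raw = [0.0121, 0.0142, 0.1986, 0.0191]
print([round(p, 4) for p in holm_adjust(raw)])
```

The result is returned in the original endpoint order, not in sorted order.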

77 77 Worksheet For Holm’s Method

78 78 Simes’ Test for Global Hypotheses
- Uses all p-values $p_1, p_2, \ldots, p_m$, not just the MinP
- Simes’ test rejects $H_{01} \cap H_{02} \cap \ldots \cap H_{0m}$ if $p_{(j)} \le j\alpha/m$ for at least one j.
- The p-value for the joint test is $p^* = \min_j \{(m/j)\,p_{(j)}\}$
- This p-value is never larger than $m \times$ MinP
- Type I error at most $\alpha$ under independence or positive dependence of p-values

79 79 Rejection Regions
[Figure: rejection regions in the unit square of $(p_1, p_2)$ for m = 2, with axis marks at 0, $\alpha/2$, $\alpha$ and 1]
Under independence:
P(Simes Reject) = $1 - (1-\alpha/2)^2 + (\alpha/2)^2 = \alpha$
P(Bonferroni Reject) = $1 - (1-\alpha/2)^2$

80 80 Hommel’s Method (Closed Simes)
$H_0: \delta_1=\delta_2=\delta_3=\delta_4=0$: p*=
$H_0: \delta_1=\delta_2=\delta_3=0$: p*= ;  $H_0: \delta_1=\delta_2=\delta_4=0$: p*= ;  $H_0: \delta_1=\delta_3=\delta_4=0$: p*= ;  $H_0: \delta_2=\delta_3=\delta_4=0$: p*=
$H_0: \delta_1=\delta_2=0$: p*= ;  $H_0: \delta_1=\delta_3=0$: p*= ;  $H_0: \delta_1=\delta_4=0$: p*= ;  $H_0: \delta_2=\delta_3=0$: p*= ;  $H_0: \delta_2=\delta_4=0$: p*= ;  $H_0: \delta_3=\delta_4=0$: p*=
$H_0: \delta_1=0$, p = ;  $H_0: \delta_2=0$, p = ;  $H_0: \delta_3=0$, p = ;  $H_0: \delta_4=0$, p =
Where $\delta_j$ = mean difference, treatment - control, endpoint j.

81 81 Adjusted P-values for Hommel’s Method
- Again, take the maximum p-value over all hypotheses that imply the given one.
- In the previous example, the Hommel adjusted p-values are $p_{A(1)} = 0.0287$, $p_{A(2)} = 0.0287$, $p_{A(3)} = 0.0382$, $p_{A(4)} =$
- These adjusted p-values are never larger than the Holm step-down adjusted p-values.

82 82 Adjusted P-values for Hommel’s Method
- They are maxima over relevant nodes
- In example, Hommel adjusted p-values are $p_{A(1)} = 0.0287$, $p_{A(2)} = 0.0287$, $p_{A(3)} = 0.0382$, $p_{A(4)} =$
- {Hommel adjusted p-value} ≤ {Holm adjusted p-value}
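A short Python sketch of the Simes global p-value, next to the Bonferroni MinP p-value it improves on. The function names are hypothetical and the four p-values are assumed example inputs (the leading entries of the Lehmacher output slide later in the deck).

```python
def simes_pvalue(pvals):
    """Simes p-value for the intersection null: min_j (m/j) * p_(j)."""
    m = len(pvals)
    return min(m / j * p for j, p in enumerate(sorted(pvals), start=1))

def bonferroni_minp_pvalue(pvals):
    """Bonferroni MinP p-value for the intersection null: m * min(p), capped at 1."""
    return min(1.0, len(pvals) * min(pvals))

raw = [0.0121, 0.0142, 0.1986, 0.0191]   # assumed example p-values
print(f"Simes p* = {simes_pvalue(raw):.4f}, "
      f"Bonferroni p* = {bonferroni_minp_pvalue(raw):.4f}")
```

Because the j = 1 term of the Simes minimum equals m times the smallest p-value, the Simes p-value can never exceed the Bonferroni MinP p-value.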
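The "maximum over relevant nodes" rule can be sketched by brute force in Python: test every intersection hypothesis with Simes' test and, for each elementary hypothesis, keep the largest p-value among the nodes that contain it. This enumerates all subsets, so it is only practical for small families; production implementations use shortcuts. The function names are hypothetical and the raw p-values are an assumption (the leading entries of the Lehmacher output slide later in the deck). With them, the sketch gives 0.02865 for the two smallest p-values and 0.0382 for the third smallest, in line with the rounded values 0.0287 and 0.0382 quoted in the slides.

```python
from itertools import combinations

def simes_pvalue(pvals):
    """Simes p-value for an intersection null: min_j (m/j) * p_(j)."""
    m = len(pvals)
    return min(m / j * p for j, p in enumerate(sorted(pvals), start=1))

def closed_adjusted_pvalues(pvals, local_test):
    """Adjusted p-value of H_i = max local-test p-value over all nodes containing i."""
    k = len(pvals)
    adjusted = [0.0] * k
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            p_node = local_test([pvals[i] for i in subset])
            for i in subset:
                adjusted[i] = max(adjusted[i], p_node)
    return adjusted

raw = [0.0121, 0.0142, 0.1986, 0.0191]   # assumed raw p-values, endpoints 1..4
hommel = closed_adjusted_pvalues(raw, simes_pvalue)
print([round(p, 5) for p in hommel])
```

Passing a different `local_test` (for example a Bonferroni MinP p-value) to the same closure routine yields the corresponding closed method.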

83 83 Hochberg’s Method A conservative but simpler approximation to Hommel’s method {Hommel adjusted p-value} ≤ {Hochberg adjusted p-value} ≤ {Holm adjusted p-value}

84 84 Hochberg’s Shortcut Method
Let $H_{(1)}, \ldots, H_{(k)}$ be the hypotheses corresponding to $p_{(1)} \le \ldots \le p_{(k)}$.
– If $p_{(k)} \le \alpha$, reject all $H_{(j)}$ and stop, else retain $H_{(k)}$ and continue.
– If $p_{(k-1)} \le \alpha/2$, reject $H_{(1)}, \ldots, H_{(k-1)}$ and stop, else retain $H_{(k-1)}$ and continue.
– …
– If $p_{(1)} \le \alpha/k$, reject $H_{(1)}$.
Adjusted p-values are $p_{A(j)} = \min_{i \ge j}\,(k-i+1)\,p_{(i)}$.
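The Hochberg adjusted p-value formula can be sketched in Python by stepping up from the largest p-value and keeping a running minimum. The function name `hochberg_adjust` is hypothetical and the raw p-values are assumed example inputs (the leading entries of the Lehmacher output slide later in the deck).

```python
def hochberg_adjust(pvals):
    """Hochberg step-up adjusted p-values: p_A(j) = min_{i>=j} (k-i+1) p_(i)."""
    k = len(pvals)
    # Visit hypotheses from the largest p-value to the smallest
    order = sorted(range(k), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * k
    running_min = 1.0
    for step, idx in enumerate(order, start=1):   # step equals k - i + 1
        running_min = min(running_min, step * pvals[idx])
        adjusted[idx] = running_min
    return adjusted

raw = [0.0121, 0.0142, 0.1986, 0.0191]   # assumed example p-values
print([round(p, 4) for p in hochberg_adjust(raw)])
```

For these inputs the three smallest p-values all receive 0.0382, which lies between the Hommel value (0.0287) and the Holm value (0.0484) quoted in the slides.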

85 85 Worksheet for Hochberg’s Method

86 86 Comparison of Adjusted P-Values p-Values Stepdown Test Raw Bonferroni Hochberg Hommel

87 87 Fisher Combination Test for Independent p-Values
Reject $H_{01} \cap H_{02} \cap \ldots \cap H_{0m}$ if $-2\sum_i \ln(p_i) > \chi^2(1-\alpha,\ 2m)$
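A Python sketch of the Fisher combination p-value using only the standard library. It relies on the closed-form upper tail of a chi-square distribution with an even number of degrees of freedom; the function names and the example p-values are hypothetical, and the p-values are assumed independent.

```python
from math import exp, log

def chi2_sf_even_df(x, m):
    """P(chi-square with 2m df > x); closed form valid for even df only."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, m):
        term *= half / i          # (x/2)^i / i!
        total += term
    return exp(-half) * total

def fisher_combination_pvalue(pvals):
    """p-value of Fisher's statistic -2 * sum(ln p_i), referred to chi-square(2m)."""
    stat = -2.0 * sum(log(p) for p in pvals)
    return chi2_sf_even_df(stat, len(pvals))

# Hypothetical p-values from three non-overlapping subgroups
print(f"{fisher_combination_pvalue([0.08, 0.11, 0.06]):.4f}")
```

Rejecting when this p-value is at most alpha is equivalent to the rule on the slide.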

88 88 Example: Non-Overlapping Subgroup* p-values The Multtest Procedure p-Values Stepdown Fisher Test Raw Bonferroni Hochberg Hommel Combination *Non-overlapping is required by the independence assumption.

89 89 Power Comparison
Liptak test stat: $T = \sum \Phi^{-1}(p_i) = \sum Z_i$

90 90 Concluding Notes
- Closed testing more powerful than single-step ($\alpha/m$ rather than $\alpha/k$).
- P-value based methods can be used whenever p-values are valid
- Dependence issues:
  - MinP (Holm) conservative
  - Simes (Hommel, Hochberg) less conservative, rarely anti-conservative
  - Fisher combination, Liptak require independence

91 91 Closed and Stepwise Testing Methods II: Fixed Sequences and Gatekeepers
Methods Covered:
- Fixed Sequences (hierarchical endpoints, dose response, non-inferiority/superiority)
- Gatekeepers (primary and secondary analyses)
- Multiple Gatekeepers (multiple endpoints & multiple doses)
- Intersection-Union tests*
* Doesn’t really belong in this section

92 92 Fixed Sequence Tests
- Pre-specify $H_1, H_2, \ldots, H_k$, and test in this sequence, stopping as soon as you fail to reject.
- No $\alpha$-adjustment is necessary for individual tests.
Applications:
- Dose response: High vs. Control, then Mid vs. Control, then Low vs. Control
- Primary endpoint, then Secondary endpoint

93 93 Fixed Sequence as a Closed Procedure
$H_{123}: \delta_1=\delta_2=\delta_3=0$   Rej if $p_1 \le .05$
$H_{12}: \delta_1=\delta_2=0$   Rej if $p_1 \le .05$
$H_{13}: \delta_1=\delta_3=0$   Rej if $p_1 \le .05$
$H_{23}: \delta_2=\delta_3=0$   Rej if $p_2 \le .05$
$H_1: \delta_1=0$   Rej if $p_1 \le .05$
$H_2: \delta_2=0$   Rej if $p_2 \le .05$
$H_3: \delta_3=0$   Rej if $p_3 \le .05$
Resulting rules:
- Rej $H_1$ if $p_1 \le .05$
- Rej $H_2$ if $p_1 \le .05$ and $p_2 \le .05$
- Rej $H_3$ if $p_1 \le .05$ and $p_2 \le .05$ and $p_3 \le .05$

94 94 A Seemingly Reasonable But Incorrect Protocol
1. Test Dose 2 vs Pbo, and Dose 3 vs Pbo using the Bonferroni method (0.025 level).
2. Test Dose 1 vs Pbo at the unadjusted 0.05 level only if at least one of the first two tests is significant at the 0.025 level.
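The resulting fixed sequence rules amount to a simple loop: test each hypothesis at the full level in the pre-specified order and stop at the first non-rejection. A minimal Python sketch, with a hypothetical function name and hypothetical p-values:

```python
def fixed_sequence(pvals_in_order, alpha=0.05):
    """Return reject/retain decisions for hypotheses tested in a pre-specified order."""
    decisions = []
    still_testing = True
    for p in pvals_in_order:
        reject = still_testing and p <= alpha   # each test uses the full alpha
        decisions.append(reject)
        still_testing = reject                  # stop at the first failure
    return decisions

# Hypothetical p-values for High, Mid, Low dose vs. control, then a later test
print(fixed_sequence([0.01, 0.04, 0.20, 0.001]))
```

Note that the last hypothesis is retained even though its own p-value is tiny, because the sequence stopped at the third test.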

95 95 The problem: FWE can exceed the nominal .05 level (for example, when one of Dose 2 and Dose 3 is highly effective while the other, and Dose 1, have no effect).
Moral: Caution needed when there are multiple hypotheses at some point in the sequence.

96 96 Correcting the Incorrect Protocol: Use Closure
Where $p_{ij} = 2\min(p_i, p_j)$

97 97 References – Fixed Sequence and Gatekeeper Tests
1. Bauer, P. (1991). Multiple Testing in Clinical Trials. Statistics in Medicine, 10.
2. O’Neill, R.T. (1997). Secondary endpoints cannot be validly analyzed if the primary endpoint does not demonstrate clear statistical significance. Controlled Clinical Trials, 18:550.
3. D’Agostino, R.B. (2000). Controlling alpha in clinical trials: the case for secondary endpoints. Statistics in Medicine, 19:763.
4. Chi, G.Y.H. (1998). Multiple testings: multiple comparisons and multiple endpoints. Drug Information Journal, 32:1347S-1362S.
5. Bauer, P., Röhmel, J., Maurer, W., Hothorn, L. (1998). Testing strategies in multi-dose experiments including active control. Statistics in Medicine, 17:2133.
6. Westfall, P.H. and Krishen, A. (2001). Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures. Journal of Statistical Planning and Inference, 99.
7. Chi, G. “Clinical Benefits, Decision Rules, and Multiple Inferences.”
8. Dmitrienko, A., Offen, W. and Westfall, P. (2003). Gatekeeping strategies for clinical trials that do not require all effects to be significant. Stat Med., 22.
9. Chen, X., Luo, X., Capizzi, T. (2005). The application of enhanced parallel gatekeeping strategies. Stat Med., 24.
10. Dmitrienko, A., Molenberghs, G., Chuang-Stein, C. and Offen, W. (2005). Analysis of Clinical Trials Using SAS: A Practical Guide. SAS Press.
11. Wiens, B. and Dmitrienko, A. (2005). The fallback procedure for evaluating a single family of hypotheses. J Biopharm Stat., 15(6).
12. Dmitrienko, A., Wiens, B. and Westfall, P. (2006). Fallback Tests in Dose Response Clinical Trials. J Biopharm Stat, 16.

98 98 Intersection-Union (IU) Tests
Union-Intersection (UI): Nulls are intersections, alternatives are unions.
$H_0$: {$\delta_1 = 0$ and $\delta_2 = 0$} vs. $H_1$: {$\delta_1 \ne 0$ or $\delta_2 \ne 0$}
Intersection-Union (IU): Nulls are unions, alternatives are intersections.
$H_0$: {$\delta_1 = 0$ or $\delta_2 = 0$} vs. $H_1$: {$\delta_1 \ne 0$ and $\delta_2 \ne 0$}
IU is NOT a closed procedure. It is just a single test of a different kind of null hypothesis.

99 99 Applications of I-U
Bioequivalence: the “TOST” test
- Test 1. $H_{01}: \delta \le -\delta_0$ vs. $H_{A1}: \delta > -\delta_0$
- Test 2. $H_{02}: \delta \ge \delta_0$ vs. $H_{A2}: \delta < \delta_0$
- Can test both at $\alpha = .05$, but must reject both.
Combination Therapy:
- Test 1. $H_{01}: \mu_{12} \le \mu_1$ vs. $H_{A1}: \mu_{12} > \mu_1$
- Test 2. $H_{02}: \mu_{12} \le \mu_2$ vs. $H_{A2}: \mu_{12} > \mu_2$
- Can test both at $\alpha = .05$, but must reject both.

100 100 Control of Type I Error for IU tests
Suppose $\delta_1 = 0$ (so the IU null holds, whatever the value of $\delta_2$). Then
P(Type I error) = P(Reject $H_0$)   (1)
= P($p_1 \le .05$ and $p_2 \le .05$)   (2)
$\le$ min{P($p_1 \le .05$), P($p_2 \le .05$)}   (3)
= .05.   (4)
Note: The inequality at (3) becomes an approximate equality when $p_2$ is extremely noncentral.
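The FWE inflation of the incorrect protocol can be illustrated with a small Python simulation of one unfavorable configuration. Everything about the configuration is an assumption made for illustration: Dose 3 is taken to be so effective that its test always rejects, Dose 2 and Dose 1 are true nulls, and their p-values are simulated as independent uniforms (in a real trial the comparisons share a placebo group and are correlated). The function name `simulate_fwe` is hypothetical.

```python
import random

def simulate_fwe(reps=200_000, alpha=0.05, seed=2024):
    """Estimate the FWE of the two-step protocol under the assumed configuration."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(reps):
        p2 = rng.random()   # Dose 2 vs placebo: true null
        p3 = 0.0            # Dose 3 vs placebo: overwhelmingly effective (assumed)
        p1 = rng.random()   # Dose 1 vs placebo: true null
        reject2 = p2 <= alpha / 2
        reject3 = p3 <= alpha / 2
        # Step 2: Dose 1 is tested at the full alpha once the gate is open
        reject1 = (reject2 or reject3) and p1 <= alpha
        errors += reject2 or reject1   # any rejection of a true null
    return errors / reps

exact = 1 - (1 - 0.025) * (1 - 0.05)   # 0.07375 under independence
print(f"simulated FWE = {simulate_fwe():.4f}, exact = {exact:.5f}")
```

Under these assumptions the familywise error rate is about .074, above the nominal .05.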
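In code, an intersection-union decision reduces to requiring every component test to reject at the unadjusted level, so the overall p-value is the largest component p-value. A minimal Python sketch with hypothetical function names and hypothetical TOST p-values:

```python
def iu_pvalue(pvals):
    """p-value of an intersection-union test: the largest component p-value."""
    return max(pvals)

def iu_reject(pvals, alpha=0.05):
    """Reject the union null only if every component test rejects at level alpha."""
    return all(p <= alpha for p in pvals)

# Hypothetical one-sided p-values from the two TOST tests
tost = [0.012, 0.034]
print(iu_pvalue(tost), iu_reject(tost))
```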

101 101 Concluding Notes: Fixed Sequences and Gatekeepers
- Many times, no adjustment is necessary at all!
- Other times you can gain power by specifying gatekeeping sequences
- However, you must clearly state the method and follow the rules
- There are many “incorrect” no-adjustment methods; use caution

102 102 Closed and Stepwise Testing Methods III: Methods that Use Logical Constraints and Correlations
Methods                            Application
Lehmacher et al.                   Multiple endpoints
Westfall-Tobias-Shaffer-Royen      General contrasts

103 103 Lehmacher et al. Method
- Use O’Brien test at each node (incorporates correlations)
- Do closed testing
Note: Possibly no adjustment whatsoever; possibly big adjustment

104 104 Calculations for Lehmacher’s Method
proc standard data=research.multend1 mean=0 std=1 out=stdzd;
   var Endpoint1-Endpoint4;
run;
data combine;
   set stdzd;
   H1234 = Endpoint1+Endpoint2+Endpoint3+Endpoint4;
   H123  = Endpoint1+Endpoint2+Endpoint3;
   H124  = Endpoint1+Endpoint2+Endpoint4;
   H134  = Endpoint1+Endpoint3+Endpoint4;
   H234  = Endpoint2+Endpoint3+Endpoint4;
   H12   = Endpoint1+Endpoint2;
   H13   = Endpoint1+Endpoint3;
   H14   = Endpoint1+Endpoint4;
   H23   = Endpoint2+Endpoint3;
   H24   = Endpoint2+Endpoint4;
   H34   = Endpoint3+Endpoint4;
   H1    = Endpoint1;
   H2    = Endpoint2;
   H3    = Endpoint3;
   H4    = Endpoint4;
run;
proc ttest;
   class treatment;
   var H1234 H123 H124 H134 H234 H12 H13 H14 H23 H24 H34 H1 H2 H3 H4;
   ods output ttests=ttests;
run;

105 105 Output For Lehmacher’s Method Obs Variable Method Variances tValue DF Probt 1 H1234 Pooled Equal H123 Pooled Equal H124 Pooled Equal H134 Pooled Equal H234 Pooled Equal H12 Pooled Equal H13 Pooled Equal H14 Pooled Equal H23 Pooled Equal H24 Pooled Equal H34 Pooled Equal H1 Pooled Equal H2 Pooled Equal H3 Pooled Equal H4 Pooled Equal p A1 = max(0.0121, , , , , , , ) = p A2 = max(0.0142, , , , , , , ) = p A3 = max(0.1986, , , , , , , ) = p A4 = max(0.0191, , , , , , , ) =

106 106 Free and Restricted Combinations If truth of some null hypotheses logically forces other nulls to be true, the hypotheses are restricted. Examples Multiple Endpoints, one test per endpoint - free All Pairwise Comparisons - restricted

107 107 Pairwise Comparisons, 3 Groups
$H_0: \mu_1=\mu_2=\mu_3$
$H_0: \{\mu_1=\mu_2,\ \mu_1=\mu_3\}$;  $H_0: \{\mu_1=\mu_2,\ \mu_2=\mu_3\}$;  $H_0: \{\mu_1=\mu_3,\ \mu_2=\mu_3\}$
$H_0: \mu_1=\mu_2$;  $H_0: \mu_1=\mu_3$;  $H_0: \mu_2=\mu_3$
Note: The entire middle layer is not needed! Fisher protected LSD valid!

108 108 Pairwise Comparisons, 4 Groups
$\mu_1=\mu_2=\mu_3=\mu_4$
$\mu_1=\mu_2=\mu_3$;  $\mu_1=\mu_2=\mu_4$;  $\mu_1=\mu_3=\mu_4$;  $\mu_2=\mu_3=\mu_4$;  $\{\mu_1=\mu_2,\ \mu_3=\mu_4\}$;  $\{\mu_1=\mu_3,\ \mu_2=\mu_4\}$;  $\{\mu_1=\mu_4,\ \mu_2=\mu_3\}$
$\mu_1=\mu_2$;  $\mu_1=\mu_3$;  $\mu_1=\mu_4$;  $\mu_2=\mu_3$;  $\mu_2=\mu_4$;  $\mu_3=\mu_4$
Note: Logical implications imply that there are only 14 nodes, not $2^6 - 1 = 63$ nodes. Also, Fisher protected LSD not valid.

109 109 Restricted Combinations Multipliers (Shaffer* Method 1; Modified Holm)
*Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures. JASA 81, 826-831.

110 110 Shaffer’s (1) Adjusted p-values

111 111 Westfall/Tobias/Shaffer/Royen* Method
- Uses actual distribution of MinP instead of conservative Bonferroni approximation
- Closed testing incorporating logical constraints
- Hard-coded in PROC GLIMMIX
- Allows arbitrary linear functions
*Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method. Journal of the American Statistical Association, 102.

112 112 Application of Truncated Closed MinP to Subgroup Analysis
Compare Treatment with control as follows:
- Overall
- In the Older Patients subgroup
- In the Younger Patients subgroup
- In patients with better initial health subgroup
- In patients with poorer initial health subgroup
- In each of the four (old/young) × (better/poorer) subgroups
9 tests overall (but better: 1 gatekeeper + 8 follow-up)

113 113 Analysis File
ods output estimates=estimates_logicaltests;
proc glimmix data=research.respiratory;
   class Treatment AgeGroup InitHealth;
   model score = Treatment AgeGroup InitHealth
                 Treatment*AgeGroup Treatment*InitHealth AgeGroup*InitHealth;
   estimate "Overall"        treatment 4 -4 treatment*Agegroup treatment*InitHealth (divisor=4),
            "Older"          treatment 2 -2 treatment*Agegroup treatment*InitHealth (divisor=2),
            "Younger"        treatment 2 -2 treatment*Agegroup treatment*InitHealth (divisor=2),
            "GoodInitHealth" treatment 2 -2 treatment*Agegroup treatment*InitHealth (divisor=2),
            "PoorInitHealth" treatment 2 -2 treatment*Agegroup treatment*InitHealth (divisor=2),
            "OldGood"        treatment 1 -1 treatment*Agegroup treatment*InitHealth ,
            "OldPoor"        treatment 1 -1 treatment*Agegroup treatment*InitHealth ,
            "YoungGood"      treatment 1 -1 treatment*Agegroup treatment*InitHealth ,
            "YoungPoor"      treatment 1 -1 treatment*Agegroup treatment*InitHealth
            / adjust=simulate(nsamp= report seed=12321) upper stepdown(type=logical report);
run;
proc print data=estimates_logicaltests noobs;
   title "Subgroup Analysis Results – Truncated Closure";
   var label estimate Stderr tvalue probt Adjp;
run;

114 114 Results – Truncated Closure Subgroup Analysis Results adjp_ adjp_ Label Estimate StdErr tValue Probt logical interval Overall Older Younger GoodInitHealth PoorInitHealth OldGood OldPoor YoungGood YoungPoor The adjusted p-values for the stepdown tests are mathematically smaller than those of the simultaneous interval-based tests,

115 115 Example: Stepwise Pairwise vs. Control Testing Teratology data set Observations are litters Response variable = litter weight Treatments: 0,5,50,500. Covariates: Litter size, Gestation time

116 116 Analysis File proc glimmix data=research.litter; class dose; model weight = dose gesttime number; estimate "5 vs 0" dose , "50 vs 0" dose , "500 vs 0" dose / adjust=simulate(nsample= report) stepdown(type=logical); run; quit;

117 117 Results Estimates with Simulated Adjustment Standard Label Estimate Error DF t Value Pr > |t| Adj P 5 vs vs vs Note: 50-0 and not significant at.10 with regular Dunnett

118 118 Concluding Notes: More power is available when combinations are restricted. Power of closed tests can be improved using correlation and other distributional characteristics

119 119 Nonparametric Multiple Testing Methods Overview: Use nonparametric tests at each node of the closure tree Bootstrap tests Rank-based tests Tests for binary data

120 120 Bootstrap MinP Test (Semi-Parametric Test) The composite hypothesis $H_1 \cap H_2 \cap \dots \cap H_k$ may be tested using the p-value $p^* = P(\mathrm{MinP} \le \mathrm{minp} \mid H_1 \cap H_2 \cap \dots \cap H_k)$. Westfall and Young (1993) show how to obtain p* by bootstrapping the residuals in a multivariate regression model, and how to obtain all p*’s in the closure tree efficiently.

121 121 Multivariate Regression Model (Next Five slides are from Westfall and Young, 1993)

122 122 Hypotheses and Test Statistics

123 123 Joint Distribution of the Test Statistics

124 124 Testing Subset Intersection Hypotheses Using the Extreme Pivotals

125 125 Exact Calculation of p K Bootstrap Approximation:

126 126 Bootstrap Tests (PROC MULTTEST)
Each row is the null hypothesis $H_0$ that the listed components are all 0.
Components   min p     p*
1,2,3,4      .0121     = .0379
1,2,3        .0121     < .0379
1,2,4        .0121     < .0379
1,3,4        .0121     < .0379
2,3,4        .0142     = .0351
1,2          .0121     < .0379
1,3          .0121     < .0379
1,4          .0121     < .0379
2,3          .0142     < .0351
2,4          .0142     < .0351
3,4          .0191     = .0355
1            p =       < .0379
2            p =       < .0351
3            p =       = .1991
4            p =       < .0355
$p^* = P(\mathrm{MinP} \le \mathrm{minp} \mid H_0)$ (computed using bootstrap resampling). (Recall, for Bonferroni, $p^* = k \cdot \mathrm{MinP}$.)

127 127 Permutation Tests for Composite Hypotheses $H_{0K}$ Joint p-value = proportion of the $n!/(n_T!\,n_C!)$ permutations for which $\min_{i \in K} P_i^* \le \min_{i \in K} p_i$.
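The joint p-value can be computed exactly for a small two-group data set by enumerating every relabeling. The sketch below is illustrative only: it assumes a two-sided statistic based on the absolute difference in means for each endpoint (the slides do not specify one), the data are made up, and `permutation_minp` is a hypothetical name.

```python
from itertools import combinations

def permutation_minp(treated, control):
    """Exact MinP permutation test. Rows are subjects, columns are endpoints.

    Returns (observed min p, joint p-value for the intersection hypothesis).
    """
    data = treated + control
    n, n_t, k = len(data), len(treated), len(data[0])
    n_c = n - n_t
    # All n!/(n_T! n_C!) ways to choose which subjects are labeled "treated".
    assignments = list(combinations(range(n), n_t))
    totals = [sum(row[j] for row in data) for j in range(k)]

    def stats(assign):
        out = []
        for j in range(k):
            s_t = sum(data[i][j] for i in assign)
            out.append(abs(s_t / n_t - (totals[j] - s_t) / n_c))
        return out

    all_stats = [stats(a) for a in assignments]
    eps = 1e-12  # guards against floating-point ties

    def pvalues(s):
        # Permutation p-value per endpoint: share of relabelings at least as extreme.
        return [sum(other[j] >= s[j] - eps for other in all_stats) / len(all_stats)
                for j in range(k)]

    all_minp = [min(pvalues(s)) for s in all_stats]
    observed_minp = all_minp[0]  # combinations() lists the observed labeling first
    joint_p = sum(m <= observed_minp + eps for m in all_minp) / len(all_minp)
    return observed_minp, joint_p

# Made-up data: 4 treated and 4 control subjects, 3 endpoints.
treated = [[12, 5, 7], [14, 6, 9], [11, 7, 8], [15, 5, 6]]
control = [[9, 6, 7], [8, 5, 8], [10, 4, 6], [7, 6, 9]]

if __name__ == "__main__":
    print(permutation_minp(treated, control))
```

The joint p-value always lies between the observed min p and the Bonferroni value k times min p, so it is never more conservative than Bonferroni.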

128 128 Problem; Simplification Problem: There are $2^k - 1$ subsets K to be tested. This might take a while... Simplification: You need only test k of the $2^k - 1$ subsets! Why? Because $P(\min_{i \in K} P_i^* \le c) \le P(\min_{i \in K'} P_i^* \le c)$ when $K \subseteq K'$. Significance for most lower order subsets is determined by significance of higher order subsets.

129 129 MULTTEST PROCEDURE Tests only the needed subsets (k, not $2^k - 1$). Samples from the permutation distribution. Only one sample is needed, not k distinct samples, if the joint distribution of minP is identical under $H_K$ and $H_S$. (Called the “subset pivotality” condition by Westfall and Young, 1993; valid under location shift and other models.)

130 130 Great Savings are Possible with Exact Permutation Tests! Why? Suppose you test $H_{12 \dots k}$ using MinP. The joint p-value is $p^* = P(\mathrm{MinP} \le \mathrm{minp}) \le P(P_1 \le \mathrm{minp}) + P(P_2 \le \mathrm{minp}) + \dots + P(P_k \le \mathrm{minp})$. Many summands can be zero, others much less than minp.

131 131 Stepdown Stepdown Variable Contrast Raw Bonferroni Permutation ae1 t vs c ae2 t vs c ae3 t vs c ae4 t vs c ae5 t vs c ae6 t vs c ae7 t vs c ae8 t vs c ae9 t vs c ae10 t vs c ae11 t vs c ae12 t vs c ae13 t vs c ae14 t vs c ae15 t vs c ae16 t vs c ae17 t vs c ae18 t vs c ae19 t vs c ae20 t vs c ae21 t vs c ae22 t vs c ae23 t vs c ae24 t vs c ae25 t vs c ae26 t vs c ae27 t vs c ae28 t vs c Multiple Binary Adverse Events

132 132 Example: Genetic Associations Phenotype: 0/1 (diseased or not). Sample $n_1$ from diseased, $n_2$ from not diseased. Compare 100’s of genotype frequencies (using dominant and recessive codings) for diseased and non-diseased using multiple Fisher exact tests.

133 133 PROC MULTTEST Code
proc multtest data=research.gen stepperm n=20000 out=pval hommel fdr;
  class y;
  test fisher(d1-d100 r1-r100);
  contrast "dis v nondis" -1 1;
run;
proc sort data=pval; by raw_p; run;
proc print data=pval;
  var _var_ raw_p stppermp hom_p fdr_p;
  where raw_p < .05;
run;

134 134 Results from PROC MULTTEST Obs _var_ raw_p stppermp hom_p fdr_p 1 r r d r d r r r

135 135 Application - Gene Expression Group 1: Acute Myeloid Leukemia (AML), n 1 =11 Group 2: Acute Lymphoblastic Leukemia (ALL), n 2 =27 Data: OBS TYPE G1 G2 G3 … G AML (Gene expression levels) 2 AML … … … … 11 AML 12 ALL … … 38 ALL

136 136 PROC MULTTEST code for exact* closed testing
proc multtest data=research.leuk noprint out=adjp holm fdr stepperm n=1000;
  class type; /* AML or ALL */
  test mean (gene1-gene7123);
  contrast 'AML vs ALL' -1 1;
run;
proc sort data=adjp(where=(raw_p le .0005)); by raw_p;
proc print; var _var_ raw_p stpbon_p fdr_p stppermp; run;
* modulo Monte Carlo error

137 137 PROC MULTTEST Output (1 hour on 2.8 GHz Xeon for 200,000 samples)

138 138 Subset Pivotality, PROC MULTTEST MULTTEST requires the “subset pivotality” condition, which identifies the cases where resampling under the global null is valid.
Valid cases: Multivariate Regression Model (location-shift); multivariate permutation multiple comparisons, one test per variable, assuming a model with exchangeable subsets.
Not valid with: permutation multiple comparisons within a variable with three or more groups; heteroscedasticity.
Closed testing “by hand” works regardless.

139 139 Summary: Nonparametric Closed Tests Nonparametric closed tests are simple, in principle. Robustness gains and power advantages are possible.

140 140 Further Topics: More Complex Situations for FWE Control Heteroscedasticity Repeated Measures Large Sample Methods

141 141 Heteroscedasticity in MCPs Extreme Example:
data het;
  do g = 1 to 5;
    do rep = 1 to 10;
      input y @@; output;
    end;
  end;
datalines;
;
proc glm; class g; model y = g; lsmeans g/adjust=tukey pdiff; run; quit;

142 142 Least Squares Means for effect g Pr > |t| for H0: LSMean(i)=LSMean(j) Adjustment for Multiple Comparisons: Tukey-Kramer i/j <.0001 < <.0001 < <.0001 < <.0001 <.0001 < <.0001 <.0001 < Level of y g N Mean Std Dev RMSE = 6.17 Bad Results from Heteroscedastic Data

143 143 proc glimmix data=het; if (g > 3) then y2=y/20; else y2=y; /* overcomes scaling problem */ class g; model y2 = g/noint ddfm=satterth; random _residual_ / group=g ; estimate '1 -2' g , '1 -3' g , '1 -4' g , '1 -5' g , '2 -3' g , '2 -4' g , '2 -5' g , '3 -4' g , '3 -5' g , '4 -5' g /adjust=simulate(nsamp = ) stepdown(type=logical) adjdfe=row; run; Approximate Solution for Heteroscedasticity Problem

144 144 Estimates with Simulated Adjustment Standard Label Estimate Error DF t Value Pr > |t| Adj P < < < Heteroscedastic Results Notes: Approximation 1: df’s Approximation 2: Covariance matrix involving all comparisons is approximate 1,2,3 different, 4-5 not. (sensible)

145 145 Repeated Measures and Multiple Comparisons Usually considered quite complicated (wave hands, use Bonferroni). PROC GLIMMIX provides a viable solution. The method is approximate because of its df approximation, and because it treats estimated variance ratios as known.

146 146 Multiple Comparisons with Mixed Model
data Halothane;
  do Dog = 1 to 19;
    do Treatment = 'HA','LA','HP','LP';
      input Rate @@; output;
    end;
  end;
datalines;
;
Crossover study: Dog heart rates. H,L = CO2 High/Low; A,P = Halothane absent/present. Source: Johnson and Wichern, Applied Multivariate Statistical Analysis, 5th ed., Prentice Hall

147 147 GLIMMIX code for analyzing all pairwise comparisons, main effects, and interactions simultaneously proc glimmix data=halothane order=data; class treatment dog; model rate = treatment/ddfm=satterth; random treatment/ subject=dog type=chol v=1 vcorr=1; estimate 'HA - LA' treatment , 'HA - HP' treatment , 'HA - LP' treatment , 'LA - HP' treatment , 'LA - LP' treatment , 'HP - LP' treatment , 'Co2 ' treatment (divisor=2), 'Halothane' treatment (divisor=2), 'Interaction' treatment /adjust=simulate(nsamp= ) stepdown(type=logical) adjdfe=row;

148 148 Estimates with Simulated Adjustment Standard Label Estimate Error DF t Value Pr > |t| Adj P HA - LA HA - HP <.0001 <.0001 HA - LP <.0001 <.0001 LA - HP < LA - LP <.0001 <.0001 HP - LP Co Halothane <.0001 <.0001 Interaction Results

149 149 Cure Rates Example: Multiple Comparisons of Odds Questions: (1) Multiple comparisons of cure rates for the Treatments (3 comparisons) (2) Comparison of cure rates for Complicated vs Uncomplicated Diagnosis.

150 150 Method Use the estimated parameter vector and associated estimate of covariance matrix from PROC GLIMMIX Treat the estimated (asymptotic) covariance matrix as known Simulate critical values and p-values (MinP-based) from the multivariate normal distribution instead of the Multivariate T distribution Controls FWE asymptotically under correct logit model

151 151 Results Estimates with Simulated Adjustment Standard Label Estimate Error DF t Value Pr > |t| Adj P A-B Infty A-C Infty B-C Infty 4.94 <.0001 <.0001 Comp-Uncomp Infty

152 152 Summary Classic, FWE-controlling MCPs that incorporate alternative covariance structures and non-normal distributions are easy using PROC GLIMMIX. However, be aware of approximations  Plug-in variance/covariance estimates  df

153 153 Further Topics: False Discovery Rate FDR = E(proportion of rejections that are incorrect) Let R = total # of rejections Let V = # of erroneous rejections FDR = E(V/R) (0/0 defined as 0). FWE = P(V>0)

154 154 Example 30 independent tests: 20 null hypotheses are true with $p_j \sim U(0,1)$; 10 are extremely alternative with $p_j = 0$. Decision rule: Reject $H_{0j}$ if $p_j \le 0.05$. Then:
$CER_j = P(\text{reject } H_{0j} \mid H_{0j} \text{ true}) = 0.05$
$FWE = P(\text{reject one or more of the 20}) = 1 - (.95)^{20} = 0.64$
$FDR = E\{V/(V+10)\}$, where $V \sim \mathrm{Bin}(20, .05)$
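The quantities in this example can be evaluated directly from the binomial distribution of V. The sketch below is a plain-Python check under the slide's assumptions (20 true nulls rejected independently with probability 0.05, 10 alternatives always rejected); the variable names are illustrative.

```python
from math import comb

def binom_pmf(v, n, p):
    """P(V = v) for V ~ Bin(n, p)."""
    return comb(n, v) * p**v * (1 - p)**(n - v)

n_null, n_alt, cutoff = 20, 10, 0.05

# FWE: probability that at least one of the 20 true nulls is rejected.
fwe = 1 - (1 - cutoff)**n_null

# FDR: expected share of false rejections among all V + 10 rejections.
fdr = sum(v / (v + n_alt) * binom_pmf(v, n_null, cutoff)
          for v in range(n_null + 1))

if __name__ == "__main__":
    print(f"FWE = {fwe:.4f}, FDR = {fdr:.4f}")
```

Under these assumptions the FDR comes out at roughly 0.084, far below the FWE of about 0.64, which is the contrast the example is built to show.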

155 155 Benjamini and Hochberg’s FDR-Controlling Method Let $H_{(1)},\dots,H_{(k)}$ be the hypotheses corresponding to $p_{(1)} \le \dots \le p_{(k)}$.
– If $p_{(k)} \le \alpha$, reject all $H_{(j)}$ and stop, else continue.
– If $p_{(k-1)} \le (k-1)\alpha/k$, reject $H_{(1)},\dots,H_{(k-1)}$ and stop, else continue.
– …
– If $p_{(1)} \le \alpha/k$, reject $H_{(1)}$.
Adjusted p-values: $p_{A(j)} = \min_{j \le i} (k/i)\, p_{(i)}$.
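The step-up rule translates into a short loop that starts at the largest p-value and stops at the first one that meets its critical point. This is a minimal sketch with a hypothetical function name and made-up p-values, not the PROC MULTTEST implementation.

```python
def bh_reject(pvals, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])  # order[0] has the smallest p
    for rank in range(k, 0, -1):                      # start from the largest p-value
        if pvals[order[rank - 1]] <= rank * alpha / k:
            # Reject this hypothesis and every one with a smaller p-value.
            return sorted(order[:rank])
    return []

if __name__ == "__main__":
    print(bh_reject([0.01, 0.02, 0.03, 0.50]))
```

With these four made-up p-values the largest fails its critical point of 0.05, the third passes 3(0.05)/4 = 0.0375, so the three smallest are rejected together.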

156 156 Comparison with Hochberg’s Method A step-up procedure, like Hochberg’s method. Adjusted p-values are $p_{A(j)} = \min_{j \le i} (k/i)\, p_{(i)}$. Recall for Hochberg’s method, $p_{A(j)} = \min_{j \le i} (k-i+1)\, p_{(i)}$. FDR adjusted p-values are uniformly smaller since $k/i \le k-i+1$. B-H FDR method uses Simes’ critical points.
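Because the two methods differ only in the multiplier applied to $p_{(i)}$, one generic step-up routine can produce both sets of adjusted p-values. The sketch below uses hypothetical function names and made-up p-values; adjusted values are capped at 1.

```python
def step_up_adjust(pvals, multiplier):
    """Step-up adjusted p-values: p_A(j) = min over i >= j of multiplier(i, k) * p_(i)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    adjusted = [0.0] * k
    running = 1.0  # running minimum, which also caps adjusted p-values at 1
    for rank in range(k, 0, -1):
        idx = order[rank - 1]
        running = min(running, multiplier(rank, k) * pvals[idx])
        adjusted[idx] = running
    return adjusted

def bh_adjust(pvals):
    """Benjamini-Hochberg: multiplier k / i."""
    return step_up_adjust(pvals, lambda i, k: k / i)

def hochberg_adjust(pvals):
    """Hochberg: multiplier k - i + 1."""
    return step_up_adjust(pvals, lambda i, k: k - i + 1)

if __name__ == "__main__":
    p = [0.01, 0.02, 0.03, 0.50]
    print(bh_adjust(p))
    print(hochberg_adjust(p))
```

The inequality $k/i \le k-i+1$ holds because $(k-i+1)i - k = (k-i)(i-1) \ge 0$ for $1 \le i \le k$, so the BH value can never exceed the Hochberg value for the same hypothesis.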

157 157 Critical Values – FDR vs FWE

158 158 Comments on FDR Control
Considered better for large numbers of tests since FWE is inconsistent.
Is adaptive.
Has a loose Bayesian correspondence.
Easy to misinterpret the results: Given 10 FDR<.10 rejections in a given study, it is tempting to claim that only one can be in error (in an “average” sense). However, this is incorrect, as $E(V/R \mid R>0)$ can exceed $\alpha$.

159 159 Further Topics: Bayesian Methods Simultaneous Credible Intervals Probabilities of ranking Loss function approach Posterior probabilities of null hypotheses

160 160 Bayes/Frequentist Comparisons

161 161 Simultaneous Credible Intervals Create intervals $I_i$, one for each parameter of interest, so that P(every parameter lies in its interval $I_i$ | Data) = .95.
Implementation in Westfall et al (1999) assumes
– Variance components model (includes regular GLM and heteroscedastic GLM as special cases)
– Jeffreys’ priors on variances (vague)
– Flat prior on means (also vague)
Uses PROC MIXED to obtain sample (assume i.i.d.) from posterior distribution.
Uses %BayesIntervals to obtain simultaneous credible intervals.

162 162 Bayesian Simultaneous Conf. Band Obs _NAME_ Lower Upper 1 diff diff diff diff diff diff diff diff diff diff diff diff diff

163 163 Bayes/Frequentist Correspondence From Westfall, P.H. (2005). Comment on Benjamini and Yekutieli, ‘False Discovery Rate Adjusted Confidence Intervals for Selected Parameters,’ Journal of the American Statistical Association 100,

164 164 Bayesian Probabilities for Rankings Suppose you observe $\mathrm{Ave}_1 > \mathrm{Ave}_2 > \dots > \mathrm{Ave}_k$. What is the probability that $\mu_1 > \mu_2 > \dots > \mu_k$? Bayesian Solution: Calculate proportion of posterior samples for which the ranking holds.
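Given posterior draws of the means, the ranking probability is a simple proportion. In the sketch below the draws come from independent normal posteriors with made-up means and a common standard deviation; that is an assumption for illustration only, since the slides obtain the draws from PROC MIXED. The function name is hypothetical.

```python
import random

def ranking_probability(draws, order):
    """Share of posterior draws in which mu[order[0]] > mu[order[1]] > ... holds."""
    hits = sum(
        all(d[order[i]] > d[order[i + 1]] for i in range(len(order) - 1))
        for d in draws
    )
    return hits / len(draws)

if __name__ == "__main__":
    random.seed(12321)
    # Assumed posteriors: independent normals with made-up means, sd = 1.
    post_means = [10.0, 9.0, 7.5]
    draws = [[random.gauss(m, 1.0) for m in post_means] for _ in range(20000)]
    print(ranking_probability(draws, order=[0, 1, 2]))
```

The same function answers questions such as "is formulation 5 the best?" by counting draws in which that mean is the largest, which only needs a different ordering test.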

165 165 Results: Comparing Formulations Solution for Fixed Effects Standard Effect formulation Estimate Error formulation A formulation B formulation C formulation D formulation E The MEANS Procedure Variable N Mean ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ rank_observed_means Mean5_best Mean1_best Mean2_best Mean3_best E-6 Mean4_best ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

166 166 Waller-Duncan Loss Function Approach Let $\delta_{ij} = \mu_i - \mu_j$. Let $L_{ij}(\delta_{ij})$ denote the loss of declaring $\mu_i > \mu_j$. Let $L_{i \sim j}(\delta_{ij})$ denote the loss of declaring $\mu_i$ n.s. different from $\mu_j$. W-D loss functions*: $L_{i \sim j}(\delta_{ij}) = |\delta_{ij}|$; $L_{ij}(\delta_{ij}) = -k\,\delta_{ij}$ for $\delta_{ij} < 0$, and $= 0$ otherwise. * Equivalent form; see Hochberg and Tamhane (1987, ). See Pennello, G The k-ratio multiple comparisons Bayes rule for the balanced two-way design. Journal of the American Statistical Association 92:

167 167

168 168 Implementation Waller – Duncan in PROC GLM More general: Simulate from posterior pdf of the  ij, calculate all three losses, average, and choose decision with smallest average loss.
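The simulation-based rule can be sketched as follows. The code assumes the loss for declaring "less than" is the mirror image of the loss for declaring "greater than", and uses k = 100 as the ratio; both are assumptions made for illustration, and the function names are hypothetical. The input is a list of posterior draws of the difference between two means.

```python
def average_losses(delta_draws, k=100.0):
    """Average posterior loss for each of the three possible decisions.

    delta_draws: posterior draws of delta = mu_i - mu_j.
    """
    n = len(delta_draws)
    return {
        # Declaring mu_i > mu_j costs k * |delta| whenever delta is negative.
        "GT": sum(-k * d if d < 0 else 0.0 for d in delta_draws) / n,
        # Declaring "not significantly different" costs |delta|.
        "NS": sum(abs(d) for d in delta_draws) / n,
        # Mirror image (assumed): declaring mu_i < mu_j costs k * delta when delta > 0.
        "LT": sum(k * d if d > 0 else 0.0 for d in delta_draws) / n,
    }

def decide(delta_draws, k=100.0):
    """Choose the decision with the smallest average loss."""
    losses = average_losses(delta_draws, k)
    return min(losses, key=losses.get)

if __name__ == "__main__":
    print(decide([2.0, 3.0, 2.5]))   # all draws clearly positive
    print(decide([-0.01, 0.01]))     # draws straddle zero
```

A directional claim wins only when almost all of the posterior mass sits on one side of zero, because a wrong direction is penalized k times as heavily as a missed difference.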

169 169 Sample Output The MEANS Procedure Variable N Mean Std Error ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Loss1LT Loss1NS Loss1GT ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Decision:  1 >  3 The MEANS Procedure Variable N Mean Std Error ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Loss1LT Loss1NS Loss1GT ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Decision:  1 ~  5

170 170 Bayesian Multiple Testing
Frequentist univariate testing: Calculate p-value = P(data more extreme | $H_0$).
Bayesian univariate testing: Calculate P($H_0$ is true | Data).
Frequentist multiple testing: if $H_{01}, H_{02}, \dots, H_{0k}$ are all true (or if many are true) then we get a small p-value by chance alone, so use a more conservative rule.
Bayesian multiple testing: Express the doubt about many or all $H_{0i}$ being true using a prior distribution; use this to calculate posterior probabilities P($H_{0i}$ is true | Data).

171 171 Bayesian Multiple Testing: Methodology Find the posterior probability for each of the $2^k$ models where $\delta_i$ is either $= 0$ or $\ne 0$. Then $P(\delta_i = 0 \mid Z)$ = (Sum of posterior probs for all $2^{k-1}$ models where $\delta_i = 0$) / (Sum of posterior probs for all $2^k$ models).
Gopalan, R., and Berry, D.A. (1998), Bayesian Multiple Comparisons Using Dirichlet Process Priors, Journal of the American Statistical Association 93,
Gönen, M., Westfall, P.H. and Johnson, W.O. (2003). "Bayesian multiple testing for two-sample multivariate endpoints," Biometrics 59,
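The marginalization step can be sketched in a few lines once the model posterior probabilities are available. The sketch below does not compute those probabilities (that is the hard part, handled by the cited papers); it only sums them. The encoding of a model as a tuple of 0/1 flags and the numbers used are made up for illustration.

```python
def posterior_null_probs(model_post):
    """Marginal P(effect i is 0 | data) from posterior weights over 2^k models.

    model_post maps a tuple of flags (0 = effect is zero, 1 = effect is nonzero)
    to that model's posterior weight, normalized or not.
    """
    total = sum(model_post.values())
    k = len(next(iter(model_post)))
    return [
        sum(w for model, w in model_post.items() if model[i] == 0) / total
        for i in range(k)
    ]

if __name__ == "__main__":
    # Made-up posterior probabilities for k = 2 hypotheses (4 models).
    model_post = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
    print(posterior_null_probs(model_post))
```

Dividing by the total means unnormalized weights, such as prior times marginal likelihood, can be passed in directly.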

172 172 The %BayesTests Macro: Priors You can specify your level of prior doubt about individual hypotheses. You can specify either (i) P(H 0i is true) or (ii) P(H 0i is true, all i), or both. You can specify (iii) the degree of prior correlation among the individual hypotheses. Specify two out of three of (i), (ii), and (iii). The third is determined by the other two. Specify prior expected effect sizes and prior variances of effect sizes (default: mean effect size is 2.5, variance= 2.)

173 173 The %BayesTests Macro: Data Assumptions, Inputs, and Outputs Assume: tests are free combinations (e.g.,multiple endpoints); MANOVA; Large Samples. Inputs: t-statistics and their (conditional) large-sample correlation matrix (this is the partial correlation matrix in the case of multiple endpoints); priors. Outputs: Posterior probabilities P(H 0i is true | Data).

174 174 %BayesTests Example: Multiple Endpoints in Panic Disorder Study proc glm data=research.panic; class TX; model AASEVO PANTOTO PASEVO PHCGIMPO = TX; estimate "Treatment vs Control" TX 1 -1; manova h=TX / printe; ods output Estimates =Estimates PartialCorr=PartialCorr; run; %macro Estimates; use Estimates; read all var {tValue} into EstPar; use PartialCorr; read all var {AASEVO, PANTOTO, PASEVO, PHCGIMPO} into cov; %mend; %BayesTests(rho=.5,Pi0 =.5);

175 175 Output from %BayesTests

176 176 The Effect of Prior Correlation: Borrowing Strength

177 177 The Bayesian Multiplicity Effect If the multiple comparisons concern, “What if many or all nulls are true” is valid, the Bayesian must attach a higher probability to P(H 0i is true, all i). Here is the result of setting P(H 0i is true, all i) =.5. “Right” answers, See Westfall, P.H., Krishen,A. and Young, S.S.(1998). "Using Prior Information to Allocate Significance Levels for Multiple Endpoints," Statistics in Medicine 17,

178 178 Summary: Bayesian Methods Several Bayesian MCPs are available! Intervals Tests Rankings Decision theory Other current research: FDR – Bayesian connection (genetics) Mixture models and Bayesian MCPs (variable selection)

179 179 Discussion Good methods and software are available You can’t use the excuse “I don’t have to use MCPs because there is no good method available” This brings us back to the $100,000,000 question: “When should we use MCPs/MTPs”?

180 180 When Should You Adjust? A Scientific View When there is substantial doubt concerning the collection of hypotheses tested When you data snoop When you play “pick the winner” When conclusions require joint validity

181 181 But What “Family” Should I Use? The set over which you play “pick the winner” The set of conclusions requiring joint validity Not always well-defined Better to decide at design stage or simply to “frame the discussion”

182 182 Multiplicity Invites Selection; Selection has an Effect Variability, probability theory, VERY relevant.

183 183 Final Words: $\alpha/k$

184 184 References: Books Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. John Wiley, New York. Hsu, J.C. (1996). Multiple Comparisons: Theory and Methods, Chapman and Hall, London. Westfall, P.H., and Young, S.S. (1993) Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley, New York. Westfall, P.H., Tobias, R.D., Rom, D., Wolfinger, R.D., and Hochberg, Y. (1999). Multiple Comparisons and Multiple Tests Using the SAS® System, Cary, NC: SAS Institute Inc. Westfall, P.H. and Tobias, R. (2000). Exercises to Accompany "Multiple Comparisons and Multiple Tests Using the SAS ® System", Cary, NC: SAS Institute Inc.

185 185 References: Journal Articles
Bauer, P.; George Chi; Nancy Geller; A. Lawrence Gould; David Jordan; Surya Mohanty; Robert O'Neill; Peter H. Westfall (2003). Industry, Government, and Academic Panel Discussion on Multiple Comparisons in a “Real” Phase Three Clinical Trial. Journal of Biopharmaceutical Statistics, 13(4),
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A new and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57,
Berger, J. O. and Delampady, M. (1987). Testing precise hypotheses. Statistical Science 2,
Cook, R.J. and Farewell, V.T. (1996). Multiplicity considerations in the design and analysis of clinical trials. JRSS-A 159,
Dmitrienko, A., Offen, W. and Westfall, P. (2003). Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine 22,
Gönen, M., Westfall, P.H. and Johnson, W.O. (2003). "Bayesian multiple testing for two-sample multivariate endpoints," Biometrics 59,
Hellmich M, Lehmacher W. Closure procedures for monotone bi-factorial dose-response designs. Biometrics 2005;61:
Koyama, T., and Westfall, P.H. (2005). Decision-Theoretic Views on Simultaneous Testing of Superiority and Noninferiority, Journal of Biopharmaceutical Statistics 15,
Lehmacher W., Wassmer G., Reitmeir P.: Procedures for Two-Sample Comparisons with Multiple Endpoints Controlling the Experimentwise Error Rate. Biometrics, 1991, 47:
Marcus, R., Peritz, E. and Gabriel, K.R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63,
Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association 81, 826-831.
Westfall, P.H. (1997). "Multiple Testing of General Contrasts Using Logical Constraints and Correlations," Journal of the American Statistical Association 92,
Westfall, P.H. and Wolfinger, R.D. (1997). "Multiple Tests with Discrete Distributions," The American Statistician 51, 3-8.
Westfall, P.H., Johnson, W.O., and Utts, J.M. (1997). A Bayesian perspective on the Bonferroni adjustment. Biometrika 84,
Westfall, P.H. and Wolfinger, R.D. (2000). "Closed Multiple Testing Procedures and PROC MULTTEST." SAS Observations, July,
Westfall, P.H., Ho, S.-Y., and Prillaman, B.A. (2001). "Properties of multiple intersection-union tests for multiple endpoints in combination therapy trials," Journal of Biopharmaceutical Statistics 11,
Westfall, P.H. and Krishen, A. (2001). "Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures," Journal of Statistical Planning and Inference 99,
Westfall, P. and Bretz, F. (2003). Multiplicity in Clinical Trials. Encyclopedia of Biopharmaceutical Statistics, second edition, Shein-Chung Chow, ed., Marcel Dekker Inc., New York, pp
Westfall, P.H., Zaykin, D.V., and Young, S.S. (2001). Multiple tests for genetic effects in association studies. Methods in Molecular Biology, vol. 184: Biostatistical Methods, pp Stephen Looney, Ed., Humana Press, Totowa, NJ.
Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method, Journal of the American Statistical Association 102:

