1
July 2006, SAMSI
Some Thoughts on Replicability in Science
Yoav Benjamini, Tel Aviv University
www.math.tau.ac.il/~ybenja

2
YB SAMSI 06
Based on joint work with
- Ilan Golani, Department of Zoology, Tel Aviv University
- Greg Elmer, Neri Kafkafi, Behavioral Neuroscience Branch, National Institute on Drug Abuse/IRP, Baltimore, Maryland
- Dani Yekutieli, Anat Sakov, Ruth Heller, Rami Cohen, Department of Statistics, Tel Aviv University
- Dani Yekutieli, Yosi Hochberg, Department of Statistics, Tel Aviv University

3
Outline of Lecture
1. Prolog
2. The replicability problems in behavior genetics
3. Addressing strain*lab interaction
4. Addressing multiple endpoints
5. The replicability problems in Medical Statistics
6. The replicability problems in Functional Magnetic Resonance Imaging (fMRI)
7. Epilog

4
1. Prolog
J. W. Tukey's last paper (with Jones and Lewis) was an entry on Multiple Comparisons for the International Encyclopedia of Statistics. It started with a general discussion noting that multiple comparisons addresses "a diversity of issues … that tend to be important, difficult, and often unresolved":
- Multiple comparisons; multiple determinations
- Selection of one or more candidates
- Selection of variables; selecting their transformations; etc.
( … his usual advice: there need not be a single best…)

5
The Mixed Puzzle
Then the Encyclopedia entry covered two issues in detail:
- The False Discovery Rate (FDR) approach in pairwise comparisons
- The random effects vs fixed effects ANOVA

6
"Two alternatives, 'fixed' and 'variable', are not enough. A good way to provide a reasonable amount of realism is to define 'c' by
appropriate error term = f-error term + c [ r-error term - f-error term ]
… It pays then to learn as much as possible about values of c in the real world…"
But what does that have to do with Multiple Comparisons?

7
2. Behavior genetics
- Study the genetics of behavioral traits: hearing, sight, smell, alcoholism, locomotion, fear, exploratory behavior
- Compare behavior between inbred strains, crosses, knockouts…
- Number of behavioral endpoints: ~200 and growing
The entry Tukey wrote was about Replicability

8
The search for replicable scientific methods
Fisher's The Design of Experiments (1935):
"In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us statistically significant results." (p. 14)
I.e., the significance level is interpreted directly in terms of replications of the experiment. The discussion motivates the inclusion of more extreme results in the rejection region.

9
Replicability
"Behavior Genetics in Transition" (Mann, Science 94)
- "…jumping too soon to discoveries…" (and press discoveries) raises the issue of replicability.
- Mann identifies statistics as a major source of troubles, yet did not mention the two main themes we'll address.
The common cry: lack of standardization (e.g. Koln 2000)

10
Does it work?
- Crabbe et al (Science 1999): experiment at 3 labs
- In spite of strict standardization, they found: a Strain effect, a Lab effect, and a Lab*Strain interaction
From their conclusions: "Thus, experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory."
"…differences between labs… can contribute to failures to replicate results of Genetic Experiments" Wahlsten (2001)

11
A concrete example: exploratory behavior
- NIH: Phenotyping Mouse Behavior. High throughput screening of mutant mice.
- Comparing between 8 inbred strains of mice
- Dr. Ilan Golani, TAU; Dr. Elmer, MPRC; Dr. Kafkafi, NIDA
- Behavior Tracking

13
Using sophisticated data analytic tools we get for segment acceleration (log-transformed):

14
The display supporting this claim for Distance Traveled (m)

15
The statistical analysis supporting this claim for prop. of time in center (logit):

Source       df    MSE     F      p-value
Strain        7    102.5   44.8   0.00001
Lab           2    6.35    2.77   0.065
Lab*Strain   14    6.87    3.00   0.00028
Residuals   264    2.29

And it is a common problem:

16
Kafkafi & YB et al, PNAS 05

17
Our statistical diagnosis of the replicability problem
- Part I. Using the wrong yardstick for variability: fixed model analysis treating lab effects as fixed
- Part II. Multiplicity problems: many endpoints; repeated testing (screening)
(Kafkafi & YB et al, PNAS 05)

18
3. Part I: The mixed model
- The existence of a Lab*Strain interaction does not diminish the credibility of a behavioral endpoint; in this sense it is not a problem
- This interaction should be recognized as a fact of life
- The interaction's size is the right yardstick against which genetic differences should be compared
Statistically speaking: Lab is a random factor, as is its interaction with strain. A mixed model should be used (rather than a fixed one).

19
The formal Mixed Model
$Y_{LSI}$ is the value of an endpoint for Laboratory L and Strain S; the index I represents the repetition within each group.
$$Y_{LSI} = G_S + a_L + b_{L*S} + \epsilon_{LSI}$$
- $G_S$ is the strain effect, which is considered fixed
- $a_L \sim N(0, \sigma^2_{LAB})$ is the laboratory random effect
- $b_{L*S} \sim N(0, \sigma^2_{LAB*STRAIN})$ is the interaction random effect
- $\epsilon_{LSI} \sim N(0, \sigma^2)$ is the individual variability

20
Implications of the Mixed Model: Technically

Source       df    MSE     F      p-value
Strain        7    102.5   44.8   0.00001
Lab           2    6.35    2.77   0.09
Lab*Strain   14    6.87    3.00   0.00028
Residuals   264    2.29

Mixed-model overlay on the table: 0.9, 14.8, 0.0028, 0.43
Estimates of $\sigma^2_{LAB}$ and $\sigma^2_{LAB*STRAIN}$
The threshold for significant strain differences can be much higher
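The change of yardstick can be sketched numerically. The snippet below is a minimal illustration, assuming a balanced design in which the mixed model tests Strain and Lab against the Lab*Strain mean square instead of the residual mean square; the dictionary and function names are hypothetical. It uses only the mean squares from the table, and the resulting mixed-model ratios (about 14.9 and 0.92) are close to the 14.8 and 0.9 shown on the slide.

```python
# Mean squares from the ANOVA table on the slide (prop. of time in center, logit scale).
mean_squares = {"Strain": 102.5, "Lab": 6.35, "Lab*Strain": 6.87, "Residuals": 2.29}

def f_ratio(effect, error_term):
    """F statistic: the effect's mean square divided by the chosen error mean square."""
    return mean_squares[effect] / mean_squares[error_term]

# Fixed-model yardstick: every effect is compared with the residual mean square.
fixed_f = {name: f_ratio(name, "Residuals") for name in ("Strain", "Lab", "Lab*Strain")}

# Mixed-model yardstick (balanced design assumed): Strain and Lab are compared
# with the Lab*Strain mean square; the interaction is still tested against residuals.
mixed_f = {
    "Strain": f_ratio("Strain", "Lab*Strain"),
    "Lab": f_ratio("Lab", "Lab*Strain"),
    "Lab*Strain": f_ratio("Lab*Strain", "Residuals"),
}

for name in fixed_f:
    print(f"{name:<11} fixed F = {fixed_f[name]:6.2f}   mixed F = {mixed_f[name]:6.2f}")
```

The degrees of freedom of the denominator also drop from 264 to 14 under the mixed model, which is a second reason the threshold for significance rises.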

21
Implications of the Mixed Model: Practically
1. For screening new mutants, significance is assessed against the Lab and Lab*Strain variance components plus $\sigma^2/n$
2. For screening new mutants vs a locally measured background, significance is assessed against the Lab*Strain variance component plus $\sigma^2 (1/n + 1/m)$
Unfortunately, even as sample sizes increase the interaction term does not disappear

22
Implications of the Mixed Model
3. For developing new endpoints
A single lab cannot suffice for the development of a new endpoint: no yardstick is available. Thus it is the developer's responsibility to offer estimates of interaction variability and put them in a public dataset (such as Jackson Laboratory).

23
Significance of 8 Strain differences

Behavioral Endpoint               Fixed     Mixed
Prop. Lingering Time              0.00001   0.0029
# Progression segments            0.00001   0.0068
Median Turn Radius (scaled)       0.00001   0.0092
Time away from wall               0.00001   0.0108
Distance traveled                 0.00001   0.0144
Acceleration                      0.00001   0.0146
# Excursions                      0.00001   0.0178
Time to half max speed            0.00001   0.0204
Max speed wall segments           0.00001   0.0257
Median Turn rate                  0.00001   0.0320
Spatial spread                    0.00001   0.0388
Lingering mean speed              0.00001   0.0588
Homebase occupancy                0.001     0.0712
# stops per excursion             0.0028    0.1202
Stop diversity                    0.027     0.1489
Length of progression segments    0.44      0.5150
Activity decrease                 0.67      0.8875

24
In summary of Part I
Practically, the threshold for making discoveries, in all aspects, is set at a higher level.
Is this a drawback? It is a way to weed out non-replicable differences.

25
What about the warning: "Never use a random factor unless your levels are a true random sample"?
- We do not agree: replicability in a new lab is at least partially captured by a random effect model.
- Revisit Jones, Lewis and Tukey (2002)

26
"Two alternatives, 'fixed' and 'variable', are not enough. A good way to provide a reasonable amount of realism is to define 'c' by
appropriate error term = f-error term + c [ r-error term - f-error term ]
so that 'everything fixed' corresponds to c=0 and 'random' is a particular case of c=1. It pays then to learn as much as possible about values of c in the real world"
- c=0.5: fixed columns and illustrative weights
- c=1.6: illustrative columns
A challenge: can we estimate c?
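To make the quoted formula concrete, here is a small sketch that interpolates between a fixed-model and a random-model error term. Plugging in the residual and Lab*Strain mean squares from the ANOVA table earlier in the talk is an assumption made purely for illustration; the slides do not apply the formula to those numbers, and the function name is hypothetical.

```python
def blended_error_term(f_error, r_error, c):
    """Error term = f-error + c * (r-error - f-error), as in the quoted formula."""
    return f_error + c * (r_error - f_error)

# Illustrative inputs taken from the ANOVA table earlier in the talk (an assumption
# about how the formula would be applied, not something the slides do):
ms_strain = 102.5      # mean square for Strain
f_error = 2.29         # fixed-model error term: residual mean square
r_error = 6.87         # random-model error term: Lab*Strain mean square

for c in (0.0, 0.5, 1.0, 1.6):
    error = blended_error_term(f_error, r_error, c)
    print(f"c = {c:3.1f}  error term = {error:5.2f}  F for Strain = {ms_strain / error:6.2f}")
```

With c=0 the error term is the fixed-model one, with c=1 the random-model one, and values in between or beyond give intermediate or more conservative yardsticks.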

27
Significance of 8 Strain differences

Behavioral Endpoint               Mixed
Prop. Lingering Time              0.0029
# Progression segments            0.0068
Median Turn Radius (scaled)       0.0092
Time away from wall               0.0108
Distance traveled                 0.0144
Acceleration                      0.0146
# Excursions                      0.0178
Time to half max speed            0.0204
Max speed wall segments           0.0257
Median Turn rate                  0.0320
Spatial spread                    0.0388
Lingering mean speed              0.0588
Homebase occupancy                0.0712
# stops per excursion             0.1202
Stop diversity                    0.1489
Length of progression segments    0.5150
Activity decrease                 0.8875

Should we believe all p-values ≤ 0.05? Not necessarily - Beware of Multiplicity!

28
4. Part II: The Multiplicity Problem
- The more statistical tests in a study, the larger the probability of making a type I error
- Stricter control means less power to discover a real effect
Traditional approaches
- "Don't worry, be happy": conduct each test at the usual .05 level (e.g. Ioannidis PLoS paper)
- "Panic!": control the prob. of making even a single type I error in the entire study at the usual level (e.g. Bonferroni)
Panic causes severe loss of power to discover in large problems

29
Significance of 8 Strain differences (the same 17 mixed-model p-values as on slide 27, from 0.0029 for Prop. Lingering Time to 0.8875 for Activity decrease)
Should we believe all p-values ≤ 0.05? Not necessarily - Beware of Multiplicity!
Should we use Bonferroni? 0.05*1/17 = .0029
Panic causes severe loss of power to discover in large problems

30
"Genetic dissection of complex traits: guidelines for interpreting…" (Lander and Kruglyak)
- "Adopting too lax a standard guarantees a burgeoning literature of false positive linkage claims, each with its own symbol… Scientific disciplines erode their credibility when a substantial proportion of claims cannot be replicated…"
- "On the other hand, adopting too high a hurdle for reporting results runs the risk that a nascent field will be stillborn."
Is there an in-between approach?

31
The False Discovery Rate (FDR) criterion
The FDR approach takes seriously the concern of Lander & Kruglyak. The error in the entire study is measured by
Q = the proportion of false discoveries among the discoveries (= 0 if none found),
and FDR = E(Q).
- If nothing is real, controlling the FDR at level q guarantees that the probability of making even one false discovery is less than q. This is why we choose usual levels of q, say 0.05.
- But otherwise there is room for improving detection power.
- This error rate is scalable, adaptive, and economically interpretable.

32
Our motivating work was Soriç (JASA 1989): if we use size 0.05 tests to decide upon statistical discoveries, then there is danger that a large part of science is not true.
We define
Q = V/R if R > 0
Q = 0 if R = 0
Soriç used E(V)/R for his demonstrations.
More recently, Ioannidis (PLoS Medicine 05) just repeated the argument using the Positive Predictive Value (PPV), PPV = 1 - Q, stating that "most published research findings are false". For the demonstration in his model he used PPV = 1 - E(V)/E(R) = 1 - FDR.
Control of FDR assures large PPV under most of Ioannidis' scenarios (except biases such as omission, publication, and interest).

33
Significance of 8 Strain differences

Behavioral Endpoint               Mixed     FDR threshold
Prop. Lingering Time              0.0029    0.05*1/17 = .0029
# Progression segments            0.0068
Median Turn Radius (scaled)       0.0092
Time away from wall               0.0108
Distance traveled                 0.0144
Acceleration                      0.0146
# Excursions                      0.0178
Time to half max speed            0.0204
Max speed wall segments           0.0257    0.05*9/17 = 0.0265
Median Turn rate                  0.0320    0.05*10/17 = 0.0294
Spatial spread                    0.0388    0.05*11/17 = 0.0323
Lingering mean speed              0.0588
Homebase occupancy                0.0712
# stops per excursion             0.1202
Stop diversity                    0.1489    0.05*15/17
Length of progression segments    0.5150    0.05*16/17
Activity decrease                 0.8875    0.05*17/17
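The comparison in this table can be reproduced with a few lines of code. The sketch below applies three rules to the 17 mixed-model p-values: no adjustment at 0.05, Bonferroni at 0.05/17, and the linear step-up (BH) procedure at q = 0.05. The function names are hypothetical and the implementation is a plain illustration of the standard procedures, not code from the talk.

```python
# Mixed-model p-values for the 17 endpoints, as listed in the table.
p_values = [0.0029, 0.0068, 0.0092, 0.0108, 0.0144, 0.0146, 0.0178, 0.0204, 0.0257,
            0.0320, 0.0388, 0.0588, 0.0712, 0.1202, 0.1489, 0.5150, 0.8875]

def bonferroni_rejections(pvals, alpha=0.05):
    """Count hypotheses with p <= alpha / m."""
    m = len(pvals)
    return sum(p <= alpha / m for p in pvals)

def bh_rejections(pvals, q=0.05):
    """Linear step-up (BH): find the largest k with p_(k) <= k*q/m; reject the k smallest."""
    m = len(pvals)
    k = 0
    for i, p in enumerate(sorted(pvals), start=1):
        if p <= i * q / m:
            k = i          # keep the largest rank that meets its threshold
    return k

unadjusted = sum(p <= 0.05 for p in p_values)
print("unadjusted at 0.05:", unadjusted)
print("Bonferroni:", bonferroni_rejections(p_values))
print("BH at q = 0.05:", bh_rejections(p_values))
```

On these values the unadjusted rule flags 11 endpoints, Bonferroni 1, and the step-up procedure 9, since 0.0257 is the largest p-value falling under its threshold 0.05*9/17.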

34
Addressing multiplicity by controlling the FDR
FDR control is a very active area of current research, mainly because of its scalability (even into millions…):
- types of dependency
- resampling procedures
- FDR adjusted p-values
- adaptive procedures
- Bayesian interpretations
- related error rates
- model selection

35
Is there a mixed-multiplicity connection?
Recall that in the puzzling entry to the Encyclopedia only two issues were addressed in detail:
- The FDR approach in pairwise comparisons
- The random effects vs fixed effects problem

36
The mixed-multiplicity connection
- In the fixed framework we selected three labs and made our analysis as if this is our entire world of reference
- When this is not the case, as when the experiment is repeated in a different lab, the fixed point of view is overly optimistic
- This is also an essence of the multiplicity problem: say, selecting the maximal difference (with the smallest p-value) and treating it as if it is our only comparison
In both cases, conclusions from a naïve point of view have too great a chance to be non-replicable

37
5. Replicability in Medical Research
Hormone therapy in postmenopausal women
- A very large and long randomized controlled study (Women's Health Initiative; Rossouw, Anderson, Prentice, LaCroix, JAMA, 2002)
- The study was not performed for drug approval
- It was stopped before completion because expected effects were reversed
- Bonferroni-adjusted and marginal (nominal) CIs were reported
- The conclusions were contradictory: the decision to stop the trial was based on the marginal CIs

38
The editorial
"The authors present both nominal and rarely used adjusted CIs to take into account multiple testing, thus widening the CIs. Whether such adjustment should be used has been questioned, ..." (Fletcher and Colditz, 2002)
Our puzzle: US and European regulatory bodies require adjusting results in clinical trials for multiplicity. So, is the statement true?
Check with the flagship of medical research: a small meta-analysis of methods (with Rami Cohen)

39
Sampling the New England J of Medicine
- Period: 3 half-years, 2000, 2002, 2004
- All articles of length > 6 pages, containing "p=" at least once
- Sample of 20 from each period: 60 articles
- No differences between periods; results are reported pooled over periods
- 44/60 reported clinical trial results

40
How was multiplicity addressed?

Type of Correction                          # of Articles
Bonferroni                                  6 (2)
O'Brien-Fleming                             1
Hochberg                                    1
Holm                                        1
Lan-DeMets                                  1
More than 3 SD                              1
Primary at .04; two secondary at .0175      1
None                                        47
(out of 60 articles)

41
Success: All studies define primary endpoints

42
Multiple endpoints
- No article had a single endpoint
- Only 2 articles corrected for multiple endpoints
- 80% define a single primary endpoint
- In many cases there is no clear distinction between primary and secondary endpoints
Even when a correction was made, it was adjusted for a partial list. (Note: Rami vs Yoav)

43
Multiple Conf. Intervals: two different concerns
- The effect of Simultaneity: Pr(all intervals cover their parameters) < 0.95. The goal of simultaneous CIs, such as Bonferroni-adjusted CIs, is to assure that Pr(all cover) ≥ 0.95.
- The effect of Selection: when only a subset of the parameters is selected for highlighting, for example the significant findings, even the average coverage is < 0.95.

44
Implications of selection on average coverage
- 2/11 do not cover with no selection
- 2/3 do not cover when selecting significant coefficients
(YB & Yekutieli 05: FDR ideas for confidence intervals)
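The selection effect is easy to see in a small simulation. The sketch below is only an illustration under stated assumptions: every parameter has the same hypothetical true value 0.5, each estimate is normal with unit standard error, and "selection" means keeping the intervals that exclude zero. None of these settings come from the talk.

```python
import random

random.seed(0)                      # fixed seed so the run is repeatable
TRUE_EFFECT = 0.5                   # hypothetical common true parameter (assumption)
Z = 1.96                            # half-width of a marginal 95% CI with unit SE
N = 20000                           # number of simulated parameters

covered_all = selected = covered_selected = 0
for _ in range(N):
    estimate = random.gauss(TRUE_EFFECT, 1.0)
    covers = abs(estimate - TRUE_EFFECT) <= Z       # does the CI contain the truth?
    significant = abs(estimate) > Z                 # CI excludes zero
    covered_all += covers
    if significant:
        selected += 1
        covered_selected += covers

coverage_all = covered_all / N
coverage_selected = covered_selected / selected
print(f"coverage over all intervals:      {coverage_all:.3f}")
print(f"coverage over selected intervals: {coverage_selected:.3f}")
```

Coverage over all intervals stays near the nominal 95%, while coverage among the intervals selected for being significant falls well below it, which is the phenomenon the slide's 2/11 versus 2/3 counts illustrate.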

45
So what?
At the MCP2005 conference in Shanghai, the head of the statistical unit at the American FDA brought amazing numbers: more than half of the Phase III studies fail to show the effect they were designed to show.
Is it at least partly because clinical trials are analyzed loosely in terms of multiplicity before standing up to the regulatory agencies, and thus their results are not replicable?
More comments in view of Ioannidis' paper at a later time

46
6. Functional Magnetic Resonance Imaging (fMRI)
- Study of the functioning brain: where is the brain active when we perform a mental task?
- Example

48
The unit of data is the volume pixel (voxel): 64 x 64 per slice x 16 slices
A comparison of the experimental factor per voxel: inference on ~64K voxels

49
Assuring replicability in fMRI Analysis: Part II
Multiplicity was addressed early on
1. Controlling the prob. of making a false discovery even in a single voxel:
   1. FWE control using random field theory of level sets (Worsley & Friston 95, Adler's theory)
   2. FWE control with resampling (Nichols & Holmes 96)
   3. Extra power by limiting # voxels tested using Regions Of Interest (ROI) from an independent study
2. Genovese, Lazar & Nichols (02) introduced FDR into fMRI analysis. FDR voxel-based analysis is available in most software packages, e.g. Brain Voyager, SPM, fMRIstat

50
fMRI: More to do on the multiplicity front
- Working with regions rather than voxels:
  - Utilizing the fact that activity is in regions
  - Defining appropriate FDR on regions
  - Trimming non-active voxels from active regions
- Using adaptive FDR procedures
- etc.
Pacifico et al (04); Heller et al (05); Heller & YB (06)

51
The Good the Bad and the Ugly

53
Assuring replicability in fMRI Analysis: Part I
- Initially, results were reported for each subject separately.
- Then fixed model ANOVA was used to analyze the multiple subjects, with a within-subject yardstick for variability only.
- Concern about between-subject variability was raised more recently. Mixed models analysis, with random effects for subjects, is now available in the main software tools. This is called multi-subject analysis.
- Obviously, the number of degrees of freedom is much smaller and the variability is larger.

54
[Figure panels: multi-subject analysis using random effects; single subject]

55
Tricks of the trade
Using correlation at first session as pilot, at q1 = 2/3; then at second session, at q2 = .075

58
Why is Random Effects (mixed model) analysis insensitive?
1. Variability between subjects about location of activity
2. Task specific variability of location between subjects
3. Variability about the size of the activity
4. Problems in mapping different subjects to a single map of the brain
5. Use of uniform smoothing across the brain to solve problem (3) reduces the signal per voxel
6. Pattern of change in signal along time differs between individuals

59
Epilog
- The Fixed-vs-Random debate is fierce in the community of neuroimagers. Acceptance to the best journals seems to depend on chance: will the article meet a "Random Effects Referee"?
- One can read that "the multi-subject analysis is less sensitive, so results were not corrected for multiplicity"
- Researchers sometimes resort to more questionable ways to control for multiplicity and across-subject variability, e.g. Conjunction Analysis

60
Epilog
Using the statistic $T_{vi}$ to test for each subject i: $H_{0vi}$: there is no effect at voxel v for subject i.
Conjunction analysis: intersect the individual subjects' maps.
- Friston, Worsley and others compare $T_v = \min_{1 \le i \le n} T_{vi}$ to a (lower) random-field-theory-based threshold.
- Nichols et al (05): the complete null is tested at each voxel, so a rejection merely indicates that for at least one subject there is an effect at the voxel. (IS IT ENOUGH TO ASSURE REPLICABILITY?)
- Instead, $T_v$ should be compared to the regular threshold to test the hypothesis that all subjects have an effect, and then multiplicity strictly controlled.
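The minimum-statistic rule can be sketched in a few lines. The data, the threshold value, and the function names below are all hypothetical, chosen only to show the mechanics of comparing the minimum over subjects with the regular single-test threshold; multiplicity control across voxels is deliberately left out of this sketch.

```python
# Hypothetical t-like statistics for 3 voxels (rows) and 4 subjects (columns).
subject_stats = [
    [4.1, 3.8, 5.0, 4.4],   # effect in every subject
    [4.9, 0.3, 5.2, 0.1],   # effect in only some subjects
    [0.2, 0.5, 0.1, 0.4],   # no effect
]
REGULAR_THRESHOLD = 3.1     # hypothetical single-test threshold (assumption)

def min_statistic(stats_per_subject):
    """T_v = min over subjects i of T_vi."""
    return min(stats_per_subject)

def all_subjects_active(stats_per_subject, threshold=REGULAR_THRESHOLD):
    """Declare 'every subject shows an effect' only if the minimum clears the
    regular threshold, i.e. each subject's own test rejects."""
    return min_statistic(stats_per_subject) >= threshold

conjunction_map = [all_subjects_active(row) for row in subject_stats]
print(conjunction_map)
```

The second voxel shows why the choice of threshold matters: two subjects respond strongly, but the minimum statistic is small, so the claim that all subjects have an effect is not supported there.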

61
Epilog
Friston et al (05) object to this proposal because of the loss of power; they suggest testing "there is an effect in at least u out of n subjects" as the alternative, and then strictly controlling multiplicity*
It is clear that a compromise is needed that addresses both replicability (multiplicity and between-subject variability) and sensitivity.
Is this the case where Tukey's ideas about 0

62
In summary
Assuring the replicability of the results of an experiment is at the heart of the scientific dogma.
Watch out for two statistical dangers to replicability:
- Ignoring the variability in those selected to be studied, thus using the wrong yardstick for variability
- Selecting to emphasize your best results
The second problem emerges naturally when multiple inferences are made and multiplicity is ignored.

63
The FDR website: www.math.tau.ac.il/~ybenja

65
Further details about the failures to adjust out of 60 articles

66
The False Discovery Rate (FDR) criterion
Benjamini and Hochberg (95)
- R = # rejected hypotheses = # discoveries
- V of these may be in error = # false discoveries
The error (type I) in the entire study is measured by
$Q = V/R$ if $R > 0$, and $Q = 0$ if $R = 0$,
i.e. Q is the proportion of false discoveries among the discoveries (0 if none found).
FDR = E(Q)
Does it make sense?

67
Does it make sense?
- Inspecting 20 features: 1 false among 20 discovered is bearable; 1 false among 2 discovered is unbearable. This error rate is adaptive and also has an economic interpretation.
- Inspecting 100 features: the above remains the same. So this error rate is also scalable.
- If nothing is real, controlling the FDR at level q guarantees that the probability of making even one false discovery is less than q. This is why we choose usual levels of q, say 0.05.
- But otherwise there is room for improving detection power:

68
FDR controlling procedures
Linear step-up procedure (BH, FDR)
Let $P_i$ be the observed p-value of a test for $H_i$, $i = 1, 2, \dots, m$.
- Order the p-values: $P_{(1)} \le P_{(2)} \le \dots \le P_{(m)}$
- Let $k = \max\{ i : P_{(i)} \le (i/m)\, q \}$
- Reject $H_{(1)}, H_{(2)}, \dots, H_{(k)}$

69
FDR control of the Linear Step-Up Procedure (BH)
Suppose $m_0 \le m$ of the hypotheses are true.
- If the test statistics are independent, or positive dependent: $FDR \le (m_0/m)\, q$
- In general: $FDR \le (m_0/m)\, q\, (1 + 1/2 + \dots + 1/m)$
(slide annotation: normally distributed)
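The bound for independent test statistics can be checked empirically with a short simulation. Everything in this sketch is a hypothetical setup chosen for illustration: 20 independent one-sided z-tests, 15 true nulls, and alternatives shifted by 3 standard deviations. The helper `bh_reject` is an illustrative implementation of the linear step-up procedure, not code from the talk.

```python
import random
from statistics import NormalDist

random.seed(1)
M, M0, Q = 20, 15, 0.05        # hypotheses, true nulls, FDR level (hypothetical setup)
SHIFT = 3.0                    # mean of the test statistic under the alternatives (assumption)
REPS = 2000
std_normal = NormalDist()

def bh_reject(pvals, q):
    """Indices rejected by the linear step-up procedure at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= rank * q / m:
            k = rank
    return set(order[:k])

total_q = 0.0
for _ in range(REPS):
    # Independent one-sided z-tests: the first M0 hypotheses are true nulls.
    z = [random.gauss(0.0 if j < M0 else SHIFT, 1.0) for j in range(M)]
    pvals = [1.0 - std_normal.cdf(x) for x in z]
    rejected = bh_reject(pvals, Q)
    false_discoveries = sum(1 for j in rejected if j < M0)
    total_q += false_discoveries / len(rejected) if rejected else 0.0

fdr_estimate = total_q / REPS
print(f"estimated FDR = {fdr_estimate:.4f}, bound (m0/m)*q = {M0 / M * Q:.4f}")
```

The average of Q over the repetitions estimates the FDR; in this setup it should land near (m0/m) q = 0.0375, below the nominal q = 0.05.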
