Basic statistics European Molecular Biology Laboratory Predoc Bioinformatics Course 17 th Nov 2009 Tim Massingham,
Introduction Basic statistics What is a statistical test Count data and Simpson’s paradox A Lady drinking tea and statistical power Thailand HIV vaccine trial and missing data Correlation Nonparametric statistics Robustness and efficiency Paired data Grouped data Multiple testing adjustments Family Wise Error Rate and simple corrections More powerful corrections False Discovery Rate
What is a statistic Anything that can be measured Calculated from measurements Branches of statistics Frequentist Neo-Fisherian Bayesian Lies Damn lies (and each of these can be split further) “If your experiment needs statistics, you ought to have done a better experiment.” Ernest Rutherford “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” H.G.Wells “He uses statistics as a drunken man uses lamp-posts, for support rather than illumination.” Andrew Lang (+others) Classical statistics
Repeating the experiment 1,000 times 10,000 times100 times Initial experiment Imagine repeating experiment, some variation in statistic
Frequentist inference Repeated experiments at heart of frequentist thinking Have a “null hypothesis” Null distribution What distribution of statistic would look like if we could repeatedly sample LikelyUnlikely Compare actual statistic to null distribution Classical statistics Finding statistics for which the null distribution is known
Anatomy of a Statistical Test Value of statistic Density of null distribution P-value of test Probability of observing statistic or something more extreme Equal to area under the curve Density measures relative probability Total area under curve equals exactly one
Some antiquated jargon Critical values Dates back to when we used tables Old way Calculate statistic Look up critical values for test Report “significant at 99% level” Or “rejected null hypothesis at ….” Size Critical value Pre-calculated values (standard normal distribution)
Power Probability of correctly rejecting null hypothesis (for a given size) Null distributionAn alternative distribution Power changes with size Reject nullAccept null
Confidence intervals Use same construction to generate confidence intervals Confidence interval = region which excludes unlikely values For the null distribution, the “confidence interval” is the region which we accept the null hypothesis. The tails where we reject the null hypothesis are the critical region
Count data
(Almost) the simplest form of data we can work with Each experiment gives us a discrete outcome Have some “null” hypothesis about what to expect YellowBlackRedGreenOrange Observed Expected20 Example: Are all jelly babies equally liked by PhD students?
Chi-squared goodness of fit test YellowBlackRedGreenOrange Observed Expected20 Summarize results and expectations in a table jelly_baby <- c( 19, 27, 15, 22, 17) expected_jelly <- c(20,20,20,20,20) chisq.test( jelly_baby, p=expected_jelly, rescale.p=TRUE ) Chi-squared test for given probabilities data: jelly_baby X-squared = 4.4, df = 4, p-value = pchisq(4.4,4,lower.tail=FALSE) [1]
Chi-squared goodness of fit test jelly_baby <- c( 19, 27, 15, 22, 17) expected_jelly <- c(20,20,20,20,20) chisq.test( jelly_baby, p=expected_jelly, rescale.p=TRUE ) Chi-squared test for given probabilities data: jelly_baby X-squared = 4.4, df = 4, p-value = What’s this? YellowBlackRedGreenOrange Observed ? How much do we need to know to reconstruct table? Number of samples Any four of the observations or equivalent, ratios for example
More complex models Specifying the null hypothesis entirely in advance is very restrictive YellowBlackRedGreenOrange Observed Expected20 4 df 0 df Allowed expected models that have some features from data e.g. Red : Green ratio Each feature is one degree of freedom YellowBlackRedGreenOrange Observed Expected :23.8 = 15: = 40 4 df 1 df
Example: Chargaff’s parity rule Chargaff’s 2 nd Parity Rule In a single strand of dsDNA, %A≈%T and %C≈%G ACGT Helicobacter pylori From data, %AT = 61% %CG=39% Apply Chargaff to get %{A,C,G,T} ACGT Proportion Number Null hypothesis has one degree of freedom Alt. hypothesis has three degrees of freedom Difference: two degrees of freedom
Contingency tables PieFruit Custardac Ice-creambd Observe two variables in pairs - is there a relationship between them? Silly example: is there a relationship desserts and toppings? A test of row / column independence Real example McDonald-Kreitman test - Drosophila ADH locus mutations BetweenWithin Nonsynonymous72 Synonymous1742
Contingency tables A contingency table is a chi-squared test in disguise PieFruit Custard p Ice-cream (1-p) q(1-q)1 Null hypothesis: rows and columns are independent Multiply probabilities Pie & Custard Pie & Ice- cream Fruit & Custard Fruit & Ice- cream Observedabcd Expectedn p qn (1-p) qn p (1-q)n (1-p)(1-q) p q p (1-q) (1-p) q(1-p)(1-q) × n
Contingency tables Pie & Custard Pie & Ice- cream Fruit & Custard Fruit & Ice- cream Observedabcn - a - b - c Expectedn p qn (1-p) qn p (1-q)n (1-p)(1-q) Observed:three degrees of freedom(a, b & c) Expected:two degrees of freedom(p & q) In general, for a table with r rows and c columns Observed:r c - 1 degrees of freedom Expected:(r-1) + (c-1) degrees of freedom Difference:(r-1)(c-1)degrees of freedom Chi-squared test with one degree of freedom
Bisphenol A Bisphenol A is an environmental estrogen monomer Used to manufacture polycarbonate plastics lining for food cans dental sealants food packaging Many in vivo studies on whether safe: could polymer break down? Is the result of the study independent of who performed it? F vom Saal and C Hughes (2005) An Extensive New Literature Concerning Low-Dose Effects of Bisphenol A Shows the Need for a New Risk Assessment. Environmental Health Perspectives 113(8):928 HarmfulNon-harmful Government9410 Industry011
Bisphenol A HarmfulNon-harmful Government % Industry0119.6% 81.7%18.2%115 Observed table HarmfulNon-harmful Government % Industry % 81.7%18.2% Expected table E.g × ×115 = 85.0 Chi-squared statistic = Test with 1 d.f.p-value = 3.205e-12
Bisphenol A Association measure Discovered that we have dependence How strong is it? Coefficient of association for 2×2 table Chi-squared statistic Number of observations Number between 0 = independent to 1 = complete dependence For the Bisphenol A study data Should test really be one degree of freedom? Reasonable to assume that government / industry randomly assigned? Perhaps null model only has one degree of freedom pchisq( , df=1, lower.tail=FALSE) [1] e-12 pchisq( , df=2, lower.tail=FALSE) [1] e-11
Simpson’s paradox Famous example of Simpson’s paradox C R Charig, D R Webb, S R Payne, and J E Wickham (1986) Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. BMJ 292: 879–882 Compare two treatments for kidney stones open surgery percutaneous nephrolithotomy (surgery through a small puncture) SuccessFail open perc. neph % success 83% success Percutaneous nephrolithotomy appears better (but not significantly so, p-value 0.15)
Simpson’s paradox SuccessFail open81687 perc. neph SuccessFail open perc. neph Small kidney stones Large kidney stones Missed a clinically relevant factor: the size of the stones The order of treatments is reversed 93% success 87% success p-value % success 69% success p-value 0.55 Combined Open78% Perc. neph.83%
Simpson’s paradox SuccessFail open81687 prec. neph SuccessFail open prec. neph Small kidney stones Large kidney stones What’s happened? SmallLarge Failure of randomisation (actually an observational study) Small and large stones have a different prognosis (p-value<1.2e-7) 88% success total 72% success total Open Prec. neph 93% success 87% success 73% success 69% success
A Lady Tasting Tea
The Lady Tasting Tea “A LADY declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup … Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. The subject has been told in advance of what the test will consist, namely that she will be asked to taste eight cups, that these shall be four of each kind, and that they shall be presented to her in a random order.” Fisher, R. A. (1956) Mathematics of a Lady Tasting Tea Eight cups of tea Exactly four one way and four the other The subject knows there are four of each The order is randomised Guess teaGuess milk Tea first??4 Milk first??4 448
Fisher’s exact test Guess teaGuess milk Tea first??4 Milk first??4 448 Looks like a Chi-squared test But the experiment design fixes the marginal totals Eight cups of tea Exactly four one way and four the other The subject knows there are four of each The order is randomised Fisher’s exact test gives exact p-values with fixed marginal totals Often incorrectly used when marginal totals not known
Sided-ness Not interested if she can’t tell the difference Guess teaGuess milk Tea first404 Milk first Guess teaGuess milk Tea first044 Milk first Two possible ways of being significant, only interested in one Exactly right Exactly wrong
Sided-ness tea <- rbind( c(0,4), c(4,0) ) tea [,1] [,2] [1,] 0 4 [2,] 4 0 fisher.test(tea)$p.value [1] fisher.test(tea,alternative="greater")$p.value [1] 1 fisher.test(tea,alternative="less")$p.value [1] More correctMore wrong Only interested in significantly greater Just use area in one tail
Statistical power Are eight cups of tea enough? Guess teaGuess milk Tea first404 Milk first A perfect score, p-value Guess teaGuess milk Tea first314 Milk first Better than chance, p-value
Statistical power Guess teaGuess milk Tea firstp n(1-p) nn Milk first(1-p) np nn nn2n Assume the lady correctly guesses proportion p of the time % % % e-54e-5 To investigate simulate 10,000 experiments calculate p-value for experiment take mean
Thailand HIV vaccine trials
Thailand HIV vaccine trial News story from end of September 2009 “Phase III HIV trial in Thailand shows positive outcome” 16,402 heterosexual volunteers tested every six months for three years Sero +veSero -ve Control Vaccine fisher.test(hiv) Fisher's Exact Test for Count Data data: hiv p-value = alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: sample estimates: odds ratio ,395 Rerks-Ngarm, Pitisuttithum et al. (2009) New England Journal of Medicine /NEJMoa
Thailand HIV vaccine trial "Oh my God, it's amazing. " … "The significance that has been established in this trial is that there is a 5% chance that this is a fluke. So we are 95% certain that what we are seeing is real and not down to pure chance. And that's great." Significant? Would you publish? Should you publish? Study was randomized (double-blinded) Deals with many possible complications male / female balance between arms of trial high / low risk life styles volunteers who weren’t honest about their sexuality (or changed their mind mid-trial) genetic variability in population (e.g. CCL3L1) incorrect HIV test / samples mixed-up vaccine improperly administered
Thailand HIV vaccine trial "Oh my God, it's amazing. " … "The significance that has been established in this trial is that there is a 5% chance that this is a fluke. So we are 95% certain that what we are seeing is real and not down to pure chance. And that's great." Significant? Would you publish? Should you publish? Initially published data based on modified Intent To Treat Intent To Treat People count as soon as they are enrolled in trial modified Intent To Treat Excluded people found to sero +ve at beginning of trial Per-Protocol Only count people who completed course of vaccines A multiple testing issue? How many unsuccessful HIV vaccine trials have there been? One or more and these results are toast.
Missing data Some people go missing during / drop out of trials This could be informative E.g. someone finds out they have HIV from another source, stops attending check-ups Double-blinded trials help a lot Extensive follow-up, hospital records of death etc Missing Completely At Random Missing data completely unrelated to trial Missing At Random Missing data can be imputed Missing Not At Random Missing data informative about effects in trial Volunteer 1----?? Volunteer
Correlation
Correlation? Wikimedia, public domain Various types of data and their correlation coefficients
Correlation does not imply causation If A and B are correlated then one or more of the following are true A causes B B causes A A and B have a common cause (might be surprising) Do pirates cause global warming? Pirates: R. Matthews (2001) Storks delivery Babies (p=0.008) Teaching Statistics 22(2):36-38
Pearson correlation coefficient Standard measure of correlation “Correlation coefficient” Measure of linear correlation, statistic belongs to [-1,+1] 0independent 1 perfect positive correlation -1 perfect negative correlation cor.test( gene1, gene2, method="pearson") Pearson's product-moment correlation data: gene1 and gene2 t = , df = 22808, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: sample estimates: cor Measure of correlation Roughly proportion of variance of gene1 explained by gene2 (or vice versa)
Pearson: when things go wrong Pearson’s correlation test Correlation = (p-value = ) Correlation = (p-value = ) Correlation = (p-value = 5.81e-06) Correlation = (p-value = 6.539e-08) A single observation can change the outcome of many tests Pearson correlation is sensitive to outliers 200 observations from normal distribution x ~ normal(0,1)y ~ normal(1,3)
Spearman’s test Nonparametric test for correlation Doesn’t assume data is normal Insensitive to outliers Coefficient has roughly same meaning Replace observations by their ranks Ranks Raw expression cor.test( gene1, gene2, method="spearman") Spearman's rank correlation rho data: gene1 and gene2 S = , p-value < 2.2e-16 alternative hypothesis: true rho is not equal to 0 sample estimates: rho
Comparison Look at gene expression data cor.test(lge1,lge2,method="pearson") Pearson's product-moment correlation data: lge1 and lge2 t = , df = 22808, p-value < 2.2e-16 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: sample estimates: cor cor.test(lge1,lge2,method="spearman") Spearman's rank correlation rho data: lge1 and lge2 S = , p-value < 2.2e-16 alternative hypothesis: true rho is not equal to 0 sample estimates: rho cor.test(lge1,lge2,method="kendall") Kendall's rank correlation tau data: lge1 and lge2 z = , p-value < 2.2e-16 alternative hypothesis: true tau is not equal to 0 sample estimates: tau
Comparison Spearman and Kendall are scale invariant Log - Log scale Pearson0.97 Spearman0.98 Kendall0.90 Normal - Log scale Pearson0.56 Spearman0.98 Kendall0.90 Normal - Normal scale Pearson0.99 Spearman0.98 Kendall0.90