3 Factors to consider
- Number of observations
- Magnitude of effect
- Technical considerations
- Biological variability
- Biological common sense
4 The problem of power…
- Ideally want to cover every cytosine (CpG)
- Have to correct for the number of tests
- There’s no way you’ll collect enough data to analyse each C and have p-values which survive multiple testing correction
- Stats have to find a way to work round this
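To put a number on the correction burden, here is a minimal sketch of two common multiple testing corrections. The CpG count of about 28 million is an assumed, approximate human-genome figure used for illustration, and the p-values are hypothetical; neither comes from the slides.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff under Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Assumption: roughly 28 million CpGs (approximate, for illustration only).
N_CPGS = 28_000_000
print("Bonferroni cutoff per CpG:", bonferroni_threshold(0.05, N_CPGS))

# Hypothetical raw p-values for four tests.
print("BH-adjusted:", benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

With one test per CpG the per-test cutoff becomes tiny, which is why the per-cytosine counts available in a typical experiment rarely survive correction.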
5 Maximising power
Options:
- Analyse in windows
- Pre-filter
- Hierarchical or adaptive filtering
6 Window sizes
Small windows:
- Good resolution
- Specific biological effects
- High multiple testing correction (MTC) burden
- Few observations
- High p-values
Large windows:
- Lots of data
- High statistical power
- Low MTC burden
- Low p-values
- Effect averaging
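The trade-off can be seen by pooling the same calls at two window sizes. This is a rough sketch: the `(position, meth, unmeth)` tuple layout and the example values are hypothetical, chosen only to show the mechanics.

```python
from collections import defaultdict


def window_counts(calls, window_size):
    """Pool per-CpG counts into fixed, non-overlapping windows.

    calls: iterable of (position, meth_count, unmeth_count).
    Returns {window_start: (meth_total, unmeth_total)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for pos, meth, unmeth in calls:
        start = (pos // window_size) * window_size
        totals[start][0] += meth
        totals[start][1] += unmeth
    return {start: tuple(counts) for start, counts in sorted(totals.items())}


# Hypothetical per-CpG calls.
calls = [(105, 3, 1), (180, 2, 2), (420, 0, 4), (990, 1, 3), (1510, 4, 0)]

small = window_counts(calls, 500)   # more windows, fewer calls in each
large = window_counts(calls, 2000)  # fewer windows, more calls in each
print(len(small), "tests:", small)
print(len(large), "tests:", large)
```

Larger windows mean fewer tests and more observations per test, at the cost of averaging away local effects.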
7 Simple Statistical Approach
Is the proportion of methylated calls different between two samples, given the number of observations?
Table columns: Meth count A, Unmeth count A, Meth count B, Unmeth count B, % change, Significant?
Table values: 2100No20019851.550756011Probably
8 Contingency tests
- Chi-square / G-test / Fisher’s exact test
- Differ only at low observations
- Significant changes require enough observations that any of these should give the same answer
- Operates on single replicates
- Technical measure of difference
2x2 table cells: Meth A, Unmeth A, Meth B, Unmeth B
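As an illustration of how these tests behave, the sketch below computes a Pearson chi-square p-value (no continuity correction) and a two-sided Fisher's exact p-value for a 2x2 table using only the standard library. The counts are hypothetical, and a real analysis would normally use an established statistics package instead.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square p-value (1 d.f.) for the table [[a, b], [c, d]].

    Rows are samples A and B; columns are methylated and unmethylated calls.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # a margin is empty, so there is nothing to compare
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the same table.

    Sums the probabilities of all tables with the observed margins that
    are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {
        x: math.comb(row1, x) * math.comb(row2, col1 - x) / denom
        for x in range(lo, hi + 1)
    }
    cutoff = probs[a] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))


# Hypothetical counts: same proportions, 100x the observations.
for table in [(3, 1, 1, 3), (300, 100, 100, 300)]:
    print(table, chi_square_2x2(*table), fisher_exact_2x2(*table))
```

At low counts the two p-values differ noticeably; at high counts both are vanishingly small, so the choice of test no longer changes the conclusion.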
10 Biological considerations
- Minimum relevant effect size?
  - Balance power vs change
  - What makes biological sense (what would you follow up?)
- Minimum coverage worth testing
  - No point testing poorly covered regions
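These two considerations can be sketched as a coverage check applied before testing and an effect-size check applied to what remains. The thresholds (10 calls, 20 percentage points) and the region counts are hypothetical placeholders, not recommendations from the slides.

```python
MIN_COVERAGE = 10   # hypothetical: minimum calls per sample worth testing
MIN_DIFF = 20.0     # hypothetical: minimum change in percentage points


def has_enough_coverage(meth_a, unmeth_a, meth_b, unmeth_b,
                        min_coverage=MIN_COVERAGE):
    """True if both samples have at least min_coverage calls."""
    return min(meth_a + unmeth_a, meth_b + unmeth_b) >= min_coverage


def methylation_difference(meth_a, unmeth_a, meth_b, unmeth_b):
    """Absolute methylation difference in percentage points.

    Assumes both samples have at least one call.
    """
    pct_a = 100 * meth_a / (meth_a + unmeth_a)
    pct_b = 100 * meth_b / (meth_b + unmeth_b)
    return abs(pct_a - pct_b)


# Hypothetical regions: (meth A, unmeth A, meth B, unmeth B).
regions = {
    "r1": (2, 1, 0, 3),       # too few calls to be worth testing
    "r2": (50, 50, 52, 48),   # well covered, but a tiny change
    "r3": (80, 20, 30, 70),   # well covered, large change
}

testable = {name: c for name, c in regions.items() if has_enough_coverage(*c)}
relevant = [name for name, c in testable.items()
            if methylation_difference(*c) >= MIN_DIFF]
print(sorted(testable), relevant)
```

Only the coverage check reduces the number of tests up front; the effect-size cutoff is better read as a filter on which hits are worth following up.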
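12 Distribution of methylation
- The chi-square test relies on a normal approximation to the count data, and methylation data isn’t normally distributed
- Methylation levels are bounded between 0 and 100% and tend to pile up near the extremes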
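13 Beta binomial distribution
- More relevant statistics than chi-square
- Treats methylated counts as binomial draws whose underlying methylation level itself varies according to a beta distribution
- Need to fit a custom model to the actual data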
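A small sketch of why the beta-binomial is the more relevant model: for the same mean, it allows far more spread in the counts than a plain binomial. The shape parameters `ALPHA` and `BETA` and the coverage `N` are arbitrary illustrative values, not fitted to any data.

```python
import math


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(k methylated calls out of n) under a beta-binomial model."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))


def binomial_pmf(k, n, p):
    """P(k methylated calls out of n) under a plain binomial model."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def variance(pmf_values):
    """Variance of a count distribution given as [P(0), P(1), ...]."""
    mean = sum(k * p for k, p in enumerate(pmf_values))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf_values))


N = 20                   # hypothetical coverage at one position
ALPHA, BETA = 2.0, 2.0   # hypothetical shape parameters (mean level 0.5)

bb = [beta_binomial_pmf(k, N, ALPHA, BETA) for k in range(N + 1)]
bn = [binomial_pmf(k, N, 0.5) for k in range(N + 1)]
print("binomial variance:", variance(bn))
print("beta-binomial variance:", variance(bb))
```

The extra variance is the biological variability between samples that a simple contingency test ignores; fitting the model to real data means estimating the shape parameters, which this sketch does not attempt.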
14 Implications of a beta distribution
- Many summaries assume normality
  - Mean
  - Standard deviation
  - Boxplots
- None of these is strictly appropriate when looking at methylation data
15 Dealing with replicates
- Simple approach
  - Merge data from replicates together
  - Single test, high power
  - Post-hoc test for consistency
- Explicitly account for batch effects
  - Logistic regression
  - Measures batch effects and excludes them from the final significance calculation
- Work with methylation values
  - Normalise percentage methylation values
  - Use conventional statistics (t-tests etc.) for comparing groups
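The simple approach can be sketched as pooling counts per group and then checking that the replicates agree on the direction of change. The consistency rule used here (every replicate of one group sits above every replicate of the other) is one possible post-hoc check chosen for illustration, and all counts are hypothetical.

```python
def merge_replicates(replicates):
    """Sum (meth, unmeth) counts across the replicates of one group."""
    return (sum(m for m, _ in replicates), sum(u for _, u in replicates))


def fraction(counts):
    """Methylated fraction of one (meth, unmeth) pair; needs coverage > 0."""
    meth, unmeth = counts
    return meth / (meth + unmeth)


def consistent_direction(group_a, group_b):
    """True if all replicates of one group lie above all of the other."""
    a = [fraction(r) for r in group_a]
    b = [fraction(r) for r in group_b]
    return min(a) > max(b) or max(a) < min(b)


# Hypothetical (meth, unmeth) counts, three replicates per group.
group_a = [(40, 10), (35, 15), (45, 5)]
group_b = [(20, 30), (25, 25), (15, 35)]
group_c = [(90, 10), (10, 40), (10, 40)]  # one replicate drives the total

print(merge_replicates(group_a), merge_replicates(group_b),
      consistent_direction(group_a, group_b))
print(merge_replicates(group_c), merge_replicates(group_b),
      consistent_direction(group_c, group_b))
```

The merged counts would go into a single contingency test; the second comparison shows why a consistency check matters, since a pooled difference can come from a single outlying replicate.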
16 Hierarchical testing
- Test larger regions
  - Windows / features etc.
- Take significant hits and subdivide
  - Smaller windows
  - Individual CpGs
- Correct only for these tests
- Assemble hits together to make up DMRs
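A two-stage version of this procedure might look like the sketch below. It takes precomputed p-values as input and uses a Bonferroni cutoff within each stage for simplicity; the function name, data layout and p-values are all hypothetical, and the final step of assembling hits into DMRs is left out.

```python
def hierarchical_hits(window_p, cpg_p, alpha=0.05):
    """Two-stage testing: windows first, then CpGs inside significant windows.

    window_p: {window_id: p_value} for the large regions.
    cpg_p:    {window_id: {position: p_value}} for CpGs in each window.
    Returns (significant_windows, cpg_hits, number_of_stage2_tests).
    """
    # Stage 1: correct only for the number of windows tested.
    stage1_cutoff = alpha / len(window_p) if window_p else 0.0
    sig_windows = [w for w, p in window_p.items() if p <= stage1_cutoff]

    # Stage 2: only CpGs inside significant windows are tested, so only
    # those tests count towards the second correction.
    candidates = {
        (w, pos): p for w in sig_windows for pos, p in cpg_p[w].items()
    }
    n_stage2 = len(candidates)
    hits = sorted(
        key for key, p in candidates.items() if p <= alpha / n_stage2
    )
    return sig_windows, hits, n_stage2


# Hypothetical p-values.
window_p = {"w1": 0.001, "w2": 0.4, "w3": 0.02}
cpg_p = {
    "w1": {100: 0.004, 150: 0.3, 180: 0.02},
    "w2": {600: 0.0001},
    "w3": {1100: 0.5},
}
print(hierarchical_hits(window_p, cpg_p))
```

Note that the CpG in `w2` is never tested despite its small p-value, because its window failed the first stage; that is the price of the reduced correction burden.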
17 Hierarchical testing
[Figure: three Genome / CGI panels, with X marks]
- Statistically ‘creative’ solution to not having enough data
18 Methylation statistics packages
- swDMR (Perl/R package): sliding window DMR finding (choose between t_test, Kolmogorov, Fisher, ChiSquare, Wilcoxon for n = 2; ANOVA, Kruskal for n ≥ 3)
- methylKit* (R package by A. Akalin et al.): sliding window, Fisher’s exact test or logistic regression. Adjusts p-values to q-values using the SLIM method.
- bsseq* (R/Bioconductor by K.D. Hansen): implements the BSmooth smoothing algorithm. Numerous CpG-wise t-tests and a p-value cutoff to define DMRs. Outperforms Fisher’s exact test. Requires biological replicates for DMR detection.
- BiSeq* (R/Bioconductor by K. Hebestreit et al.): beta regression model; impractical for very large data other than RRBS or targeted BS-Seq.
- RnBeads* (R package by F. Mueller et al.): works for 450K arrays, BS-Seq, MeDIP or MBD-Seq data.
- DMAP* (C command line tool by P. Stockwell et al.): RRBS fragment or fixed window approach; Fisher’s exact test, chi-squared or ANOVA.
- RADMeth (C++ command line tool by E. Dolzhenko and A.D. Smith): beta-binomial regression analysis to find DMCs or DMRs; local likelihood; adjusts for neighbouring CpGs.
- MOABS* (C++ command line tool by D. Sun et al.): beta-binomial hierarchical model to capture sampling and biological variation; Credible Methylation Difference (CDIF), a single metric that combines biological and statistical significance.
- ComMet (Y. Saito et al., 2014): part of the Bisulfighter suite; DMR detection based on hidden Markov models (HMMs) that enable automated adjustment of DMC chaining criteria. Does not require biological replicates.
- DSS (R/Bioconductor by Feng et al., 2014): constructs a genome-wide prior distribution for the beta-binomial dispersion. Bayesian hierarchical model to detect differentially methylated loci.
- more appearing every other week…
* interface well with
19 Tool comparison
bsseq
  Statistical test: sample-wise smoothing, then group differences via CpG-wise t-tests (p-value cutoff to define adjacent CpG sites as DMRs)
  Suitable for: WGBS; not designed for targeted BS-Seq or RRBS
  Implementation: R package/Bioconductor
  Notes: outperforms Fisher’s exact test; intended to compare 2 groups; replicates required
BiSeq
  Statistical test: define CpG clusters, smooth methylation data, model and test group effect (fitting a beta regression model to smoothed methylation levels and testing for a group effect using the Wald test), hierarchical testing procedure on CpG clusters, then define DMR boundaries
  Suitable for: RRBS; targeted BS-Seq; for WGBS very computationally intensive
  Implementation: R package/Bioconductor
  Notes: not limited to 2 groups
methylKit
  Statistical test: models CpG methylation within a logistic regression; sliding linear model (SLIM) to correct for multiple testing
  Suitable for: (e)RRBS
  Implementation: R package
* WGBS = whole genome BS-Seq; (e)RRBS = (enhanced) reduced representation BS-Seq
20 bsseq – for whole genome BS-Seq
- Smooths low-coverage BS-Seq data first to get reliable semi-local methylation estimates
- Not suitable for captured or restricted data
- After smoothing, it uses biological replicates to estimate biological variation and identify differentially methylated regions (DMRs)
- Smoothing is suitable even for a single sample
- Works for CpG context in humans; will probably not scale to 2x585M Cs in non-CG context
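To give a feel for what smoothing low-coverage data does, the sketch below uses a coverage-weighted kernel average with tricube weights. This is a simplified stand-in chosen for illustration, not the BSmooth algorithm that bsseq implements; the bandwidth and the example calls are hypothetical.

```python
def smooth_methylation(calls, bandwidth):
    """Coverage-weighted kernel smoothing of methylation levels.

    calls: list of (position, meth_count, coverage).
    Returns [(position, smoothed_fraction)] using tricube weights, so
    nearby and well-covered CpGs contribute most to each estimate.
    """
    smoothed = []
    for pos, _, _ in calls:
        num = den = 0.0
        for other, meth, cov in calls:
            dist = abs(other - pos) / bandwidth
            if dist < 1:
                weight = (1 - dist ** 3) ** 3
                num += weight * meth
                den += weight * cov
        smoothed.append((pos, num / den if den else float("nan")))
    return smoothed


# Hypothetical calls: a single unmethylated read between two
# well-covered, highly methylated neighbours.
calls = [(100, 8, 10), (150, 0, 1), (200, 9, 10)]
for pos, level in smooth_methylation(calls, bandwidth=1000):
    print(pos, round(level, 3))
```

The single read at position 150 would give a raw estimate of 0% methylation; after smoothing, its estimate is dominated by the better-covered neighbours, which is the sense in which smoothing yields semi-local estimates.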