ARIES Methylation Pre-processing and Clean-up
Geoff Woodward
Overview
- Initial QC
- Normalisation
- Batch Correction
- Data
- MWAS (Methylome-Wide Association Study)
- Results
Initial QC
- Probe p-value: confidence in detection against background (-ve controls)
- Overall QC indicator: high background, low signal, poor stringency
Initial QC: Control Probes
Mixture of sample-dependent and sample-independent controls
Sample independent:
- Staining (Biotin/DNP)
- Hybridisation (synthetic target)
- Extension (hairpin)
Sample dependent:
- Bisulfite conversion (HindIII site)
- G/T mismatch (non-spec.)
- Specificity & Non-polymorphic
- Negative
Initial QC: LIMS
LIMS Control Dashboard
- Real time
- Jscript/JSON
- Zoom & scroll
- All Illumina control probes, +ve & -ve
- Area / Max / Median / Min
Initial QC: MDS
- Start pre-processing
- What’s affecting the data?
- Failures, controls
Initial QC: MDS
- Remove controls/failures
- Remove sex chromosomes
Sample Confirmation
Genotyping:
- 65 SNP probes
- K-means clustering
- Call genotype
- Cross-reference with SNP data
- Calculate % match
Fully automated in pipeline
Stored in LIMS
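The genotype check above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: the function names, the starting centres (0, 0.5, 1), the 0/1/2 genotype coding and the toy values are all assumptions.

```python
import numpy as np

def call_genotypes(betas, n_iter=20):
    """Cluster one SNP probe's beta values into three groups with a simple
    1-D k-means. Codes 0/1/2 stand for the low, middle and high cluster
    (hypothetically AA/AB/BB); the starting centres are an assumption."""
    betas = np.asarray(betas, dtype=float)
    centres = np.array([0.0, 0.5, 1.0])
    for _ in range(n_iter):
        # assign each sample to its nearest centre
        labels = np.argmin(np.abs(betas[:, None] - centres[None, :]), axis=1)
        # move each centre to the mean of its members (skip empty clusters)
        for k in range(3):
            if np.any(labels == k):
                centres[k] = betas[labels == k].mean()
    return labels

def percent_match(called, reference):
    """Percentage of samples whose called genotype equals the reference call."""
    called = np.asarray(called)
    reference = np.asarray(reference)
    return 100.0 * np.mean(called == reference)

# Toy data: six samples at one SNP probe, plus made-up reference genotypes
betas = [0.03, 0.05, 0.48, 0.52, 0.95, 0.97]
reference_calls = [0, 0, 1, 1, 2, 1]

calls = call_genotypes(betas)
match = percent_match(calls, reference_calls)
print(calls, round(match, 1))
```

In practice the match would be computed across all SNP probes for a sample, and a low % match would flag a possible sample mix-up.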
Normalisation
Why?
- Cancer vs. control: not required
- More sensitive differences...
Quantile?
- Rank & scale according to a reference distribution (average)
- Not appropriate: Type I & II assays differ
  - Medians at opposite ends of the β scale
  - SD (across replicates) smaller in Type I probes
  - Interrogate different subsets of the genome
    - Type II: greater proportion in open-sea
    - Type I: greater proportion in gene promoters
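For reference, plain quantile normalisation ("rank & scale according to the average distribution") can be sketched as below. This is a generic illustration with made-up numbers, not any package's implementation; ties are broken arbitrarily here.

```python
import numpy as np

def quantile_normalise(x):
    """Force every column (sample) of a probes x samples matrix onto the
    same reference distribution: the mean of the sorted columns."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)        # sort order within each sample
    ranks = np.argsort(order, axis=0)    # rank of every probe within its sample
    reference = np.sort(x, axis=0).mean(axis=1)  # average sorted distribution
    return reference[ranks]              # replace each value by the reference value at its rank

# Toy matrix: 4 probes x 3 samples
x = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.5, 6.0],
              [4.0, 2.0, 8.0]])
normalised = quantile_normalise(x)
print(normalised)
```

After this step every sample has an identical distribution, which is why applying it across Type I and Type II probes together is a problem when the two assay types genuinely differ.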
Normalisation: Method 1
Subset Within Array Normalisation, SWAN (minfi)
To address differences in distribution:
- No. of CpGs in probe body indicates density/location
- Distributions more similar within these groups
Approach:
- Reference quantiles: N random Type I & II probes selected for each group
- Split meth/unmeth channels
- Linear interpolation to fit probes to the reference
- Doesn’t treat Type I & II separately, BUT does decrease the difference
Normalisation: Method 2
Touleimat & Tost
To address differences:
- CpG region: Shore / Shelf / Island / Open-sea
- Treat Type I & II separately
Approach: reference quantiles
- Type I used as “anchors” for each region
- More reliable / lower SD: estimate target quantiles
- Fit Type II to target
Normalisation: Method 3
Dasen (wateRmelon), under review
Separate QN of intensities:
- methylated Type I
- unmethylated Type I
- methylated Type II
- unmethylated Type II
Both directions
Normalisation: Comparison
wateRmelon metrics: Imprinted DMRs
- 237 probes within iDMRs
- iDMRs expected to be 50% methylated
- SE = SD / √N
  - SD of all 237 probes
  - N = number of samples

Method   iDMR SE
Raw      0.00431
Dasen    0.00241
Tost     0.00214
Swan     0.00428
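The SE measure above is simple to compute. A minimal sketch, assuming the SD is pooled over all iDMR probe values in all samples (the function name and toy numbers are illustrative, not wateRmelon's code):

```python
import numpy as np

def idmr_se(betas_idmr):
    """SE = SD / sqrt(N) for a probes x samples matrix of iDMR betas.
    Assumption: SD is taken over every value in the matrix; N is the
    number of samples (columns)."""
    betas_idmr = np.asarray(betas_idmr, dtype=float)
    n_samples = betas_idmr.shape[1]
    sd = np.nanstd(betas_idmr, ddof=1)   # sample SD, ignoring missing values
    return sd / np.sqrt(n_samples)

# Toy example: 2 probes x 2 samples, scattered around 50% methylation
toy = np.array([[0.4, 0.6],
                [0.5, 0.5]])
se = idmr_se(toy)
print(se)
```

A smaller value means the iDMR probes sit more tightly around their expected level after normalisation.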
Normalisation: Comparison
SNP probes
- 63 highly polymorphic SNP probes
- K-means clustering into 3 genotypes
- SE-like measure for each group

Method   AA          AB          BB
Raw      9.025e-05   1.910e-04   5.145e-05
Dasen    1.669e-04   2.047e-04   2.321e-05
Tost     8.253e-05   5.242e-04   1.541e-04
Swan     NA          NA
Normalisation: Comparison
wateRmelon metrics: X-Chromosome Inactivation
- 11,232 probes
- T-test all probes for sex differences
- ROC analysis using the p-value for sex difference
- Metric is 1 - AUC: 0 is the perfect predictor and the best sex separation

Method   X-Inact.
Raw      0.0947
Dasen    0.0889
Tost     0.0892
Swan     0.4952
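The 1 - AUC figure can be illustrated with a small sketch. It assumes the ROC asks how well the sex-difference p-value picks out X-chromosome probes; the p-values and labels below are invented, and this is not the wateRmelon implementation.

```python
import numpy as np

def auc(scores, is_positive):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive has the higher score; ties count as half."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos = scores[is_positive]
    neg = scores[~is_positive]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical per-probe p-values from a t-test for sex differences
p_values = np.array([1e-8, 1e-6, 0.03, 0.2, 0.5, 0.9])
# Hypothetical flag: is the probe on the X chromosome?
on_x = np.array([True, True, False, True, False, False])

# Smaller p-value = stronger evidence, so score with -log10(p)
metric = 1.0 - auc(-np.log10(p_values), on_x)
print(metric)
```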
Comparison: Density Plots
Metrics are great, but how do they really affect the data?
All / Type I / Type II
Comparison: Density Plots
Normalised distributions
All / Type I / Type II
Comparison: Scatter Plot
"Pepsi Plot": you’ll see why!
Raw (x) vs. Normalised (y)
Type I / Type II
SWAN / Tost / Dasen
Comparison: Scatter Plot
Batch Correction: Exp. Design
Bisulphite Conversion
- Excess of samples > 48
- Redundant controls
- QC and PCR
MSA4 Plate
- Well dictates chip position (robot)
- Randomised
- Min. 4 of each time point
- Max. 1 control
- Mix of gender
Infinium 450k Chips
- 12 arrays per chip
- Throughput doubled
Batch Correction: Metadata
LIMS tracking
- Every process
- All consumables
  - ~20, from formamide to hyb. buffers
  - > 1000 used so far!
- All equipment
  - Fridge / centrifuge / PCR block
Batch Correction: ComBat
What are we seeing?
- Bisulphite batch
Correction
- Many algorithms available
  - SVD / SVA / DWD
  - Gene expression
- ComBat
  Chen C, Grennan K, Badner J, Zhang D, Gershon E, et al. (2011) Removing Batch Effects in Analysis of Expression Microarray Data: An Evaluation of Six Batch Adjustment Methods. PLoS ONE 6(2): e17238. doi:10.1371/journal.pone.0017238
Empirical Bayesian framework
- Create a model matrix
- Supply batch variable
- Standardise gene-wise
  - Least squares approach
- Fits L/S model: find priors
- Adjust to empirical parametric priors
Batch Correction: Example data
- Batch correct Tost-normalised data
  - Use M values
  - Convert back to β
- Values can escape the 0-1 limit
  - Scale 0.02% of probes
  - Distribution unaffected
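The β to M and M to β conversions mentioned above use the standard logit relation, M = log2(β / (1 - β)). A minimal sketch with illustrative function names and values (the batch-correction step itself is not shown):

```python
import numpy as np

def beta_to_m(beta):
    """M = log2(beta / (1 - beta)); defined only for 0 < beta < 1."""
    beta = np.asarray(beta, dtype=float)
    return np.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Inverse transform: beta = 2^M / (2^M + 1)."""
    m = np.asarray(m, dtype=float)
    return 2.0 ** m / (2.0 ** m + 1.0)

betas = np.array([0.1, 0.5, 0.8])
m_values = beta_to_m(betas)       # correction would be applied on this scale
round_trip = m_to_beta(m_values)  # then converted back to beta
print(m_values, round_trip)
```

M values are unbounded, which suits methods that assume roughly normal data; β = 0.5 corresponds to M = 0.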
Batch Correction: BEFORE
Batch Correction: AFTER
Datasets
ARIES pre-release:
- Filtered probes
- SNP probes

Age group   n
Cord        584
F7          598
TF3 (15)    64
F17         280
Antenatal   394
FOM         329
MWAS
Choice of servers:
- Epi-garrod
- BlueCrystal
Epi-garrod
- Request an account via IT Services for: epi-garrod.bris.ac.uk
- Relatively quiet server in the dept.
- No queuing system
  - Check htop before running jobs
  - Cord data requires ~15% RAM
Epi-garrod
Data:
- SAN, accessible from multiple servers
- /mnt/sscm3/ARIES_DATA/…
Permissions for this folder:
- You must be a member of the aries group
BlueCrystal
- Request an account via: https://www.acrc.bris.ac.uk/login-area/apply.cgi
- Queuing handled
- Data: /gpfs/cluster/smed/alspac-shared/aries/…
- Again, permissions required: member of the aries group
Files
- ALN_dasen_<<time_code>>_betas.Rdata
- ALN_tost_<<time_code>>_betas.Rdata
- <<time_code>>_manifest.Rdata
- fdata.Rdata
- MWAS.r
ALN_dasen_<<time_code>>_betas.Rdata
<<time_code>>_manifest.Rdata
Fdata_new.RData
CpGassoc
CRAN: http://cran.r-project.org/web/packages/CpGassoc/index.html
- Tests for association between an independent variable and methylation
- Option to include additional covariates
- Assesses significance with:
  - Holm (step-down Bonferroni)
  - FDR methods
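To show what the two significance adjustments do, here is a minimal sketch of Holm and of Benjamini-Hochberg (one common FDR method). It is a generic illustration with made-up p-values, not CpGassoc's code or API.

```python
import numpy as np

def holm(p):
    """Holm step-down adjusted p-values."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # multiply the k-th smallest p by (m - k), then enforce monotonicity
    adj_sorted = np.maximum.accumulate((m - np.arange(m)) * p[order])
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

def benjamini_hochberg(p):
    """Benjamini-Hochberg FDR-adjusted p-values."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # running minimum from the largest p-value downwards
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

p_values = np.array([0.01, 0.04, 0.03, 0.005])
holm_adj = holm(p_values)
fdr_adj = benjamini_hochberg(p_values)
print(holm_adj, fdr_adj)
```

Holm controls the family-wise error rate and is the stricter of the two; FDR methods tolerate a set proportion of false positives among the hits.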
MWAS.r
MWAS.r continued...
MWAS.r continued...
Manhattan / QQ
Replicated the results of the following study:
- 450K Epigenome-Wide Scan Identifies Differential DNA Methylation in Newborns Related to Maternal Smoking during Pregnancy. Bonnie R. Joubert, et al.
- Gene hits: GFI1, AHRR, MYO1G, CYP1A1
"CYP1A1 plays a key role in the aryl hydrocarbon receptor signaling pathway, which mediates the detoxification of the components of tobacco smoke." - Joubert, et al.
Results file
BlueCrystal .bashrc
Any Questions?