1. Interpreting rich epigenomic datasets

1 1. Interpreting rich epigenomic datasets

2 Interpreting chromatin states
[Figure: chromatin state annotation table. Column labels: %Genome, CpG, TSS, hiCpG-TSS, loCpG-TSS, TES, Transcribed, Expression, Conservation, DNase, Lamina, ZNF, Repeats, L1 repeat, Alu repeat.]

3 How many states are meaningful: agreement between cell types
[Figure: ratio vs. background for the comparisons H1-H9, H9-H1, H1/9-IMR90, IMR90-H1, IMR90-H9, and Background.]
Distinctions remain recoverable between cell types, even after chromatin states (IMR90-H1-H9).

4 Preferential enhancer-promoter interactions
[Figure panels: IMR90, same chromosome interactions; IMR90, different chromosomes; H1, same chromosome; H1, different chromosomes. State groups: Transcribed, Enhancer, Off, Prom. State labels: Transcribed 3', Transcribed 5', Transcribed strong, Transcribed weak, Transcribed enhancer, Enhancer poised, Enhancer Active, Strongest Enhancer, Strong Enhancer, Weak Enhancer, Low signal, Heterochromatin, Repressed, Bivalent promoter, Active Promoter.]
Different enhancer states show different interactions.
Enhancers, transcribed regions, and promoters interact.
Inactive regions show fewer interactions overall (both to active states and to each other).
H3K9me3 states interact between chromosomes in ES cells.

5 2. Prioritizing experiments

6 Ever-expanding dimensions of epigenomics
Thousands of whole-genome datasets (axes: chromatin marks, cell types).
Additional dimensions: environment, genotype, disease, gender, stage, age.
Today: cell-type and chromatin-mark dimensions.
Next: personal epigenomes (genotype/phenotype); complete matrix of conditions, individuals, alleles.

7 Prioritize experiments for additional cell types
2 methods (Method 1, Method 2): based on unique information; based on chromatin state recovery.
(1) Quantify state recovery using subsets of marks.
(2) Capture additional information from mark intensity.
→ Beyond marks: trade-offs of more cell types vs. greater depth.

8 Method 1 example: Rank chromatin marks for a new cell type
[Figure (IMR90, using all marks): marks ordered from hardest to predict (→ prioritize these marks?) to easiest to predict (redundant). Axes: Mark, Prediction Error².]
Hardest marks to predict using all other IMR90 marks: H3K3me3, etc.
These match the marks usually identified as the most useful: a good metric?
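
The ranking described on this slide can be sketched as a leave-one-mark-out regression: each mark is predicted from all the others, and the marks are sorted by squared error. This is only a minimal illustration, assuming a bins-by-marks signal matrix and a plain least-squares fit evaluated in-sample. The function name `rank_marks_by_prediction_error`, the mark labels, and the toy data are hypothetical and not taken from the presentation.

```python
import numpy as np


def rank_marks_by_prediction_error(signal, mark_names):
    """Return (mark, mean squared error) pairs, hardest to predict first.

    signal: array of shape (n_bins, n_marks), one column per mark.
    Each mark is regressed on all other marks plus an intercept.
    """
    signal = np.asarray(signal, dtype=float)
    n_bins, n_marks = signal.shape
    errors = []
    for j in range(n_marks):
        y = signal[:, j]
        others = np.delete(signal, j, axis=1)
        design = np.column_stack([others, np.ones(n_bins)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        mse = float(np.mean((design @ coef - y) ** 2))
        errors.append((mark_names[j], mse))
    return sorted(errors, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): mark B is a noisy copy of A, mark C is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)
ranking = rank_marks_by_prediction_error(np.column_stack([a, b, c]), ["A", "B", "C"])
for mark, mse in ranking:
    print(f"{mark}: squared error {mse:.3f}")
```

In this toy setting the independent mark C comes out as hardest to predict, while the redundant pair A and B are easy to recover from each other.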

9 Method 2 example: Rank additional marks for existing cell type
Extend IMR90 set beyond initial 22 marks.
22 marks common with CD4T data: H2AK5ac, H3K27ac, H3K27me3, H3K9me3, H2BK120ac, H3K4ac, H3K36me3, H4K20me1, H2BK12ac, H3K9ac, H3K4me1, H2BK20ac, H4K5ac, H3K4me2, H3K14ac, H4K8ac, H3K4me3, H3K18ac, H4K91ac, H3K79me1, H3K23ac, H3K79me2.
19 marks only in CD4T data: H2AK9ac, H2BK5me1, H3K9me2, CTCF, H2BK5ac, H3K27me1, H3R2me1, H2AZ, H3K36ac, H3K27me2, H3R2me2, PolII, H4K12ac, H3K36me1, H4K20me3, H4K16ac, H3K79me3, H4R3me2.

10 3. Completing epigenomes computationally
Chromatin mark imputation

11 Predicting signal for missing marks
Question: can we predict the signal intensity of one mark given other sets of marks?
Datasets used: H1, IMR90 (+H9, K562, GM12878, HSMM).
Methodological decisions: focus on the common set of marks; downsample one replicate to 10 million reads; split reads equally between training and test data; bin the genome into 2 kb bins.
Model/metrics: use a linear regression model for predictions; use squared error loss on mark signal as the objective.
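
The steps listed on this slide (downsample, split reads, bin at 2 kb, fit a linear model, score by squared error) can be sketched as below. This is a minimal sketch under stated assumptions: reads are reduced to integer positions on a single toy chromosome, there is one predictor mark, and every function name and the simulated data are hypothetical. The slide does not say how the original pipeline was implemented.

```python
import numpy as np

BIN_SIZE = 2000  # 2 kb bins, as stated on the slide


def downsample(reads, n_reads, rng):
    """Randomly keep at most n_reads read positions (no replacement)."""
    reads = np.asarray(reads)
    if len(reads) <= n_reads:
        return reads
    return rng.choice(reads, size=n_reads, replace=False)


def split_reads(reads, rng):
    """Shuffle the reads and split them into two equal halves."""
    shuffled = rng.permutation(reads)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:2 * half]


def bin_counts(reads, chrom_length, bin_size=BIN_SIZE):
    """Count reads falling in each fixed-width bin of one chromosome."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    return np.bincount(np.asarray(reads) // bin_size, minlength=n_bins).astype(float)


def fit_linear(X, y):
    """Least-squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # the last entry is the intercept


def predict(X, coef):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


# Toy data (hypothetical): both marks are enriched in the first half
# of a 200 kb chromosome, on top of a uniform background.
rng = np.random.default_rng(1)
chrom_length = 200_000


def toy_reads():
    enriched = rng.integers(0, chrom_length // 2, size=4000)
    background = rng.integers(0, chrom_length, size=1000)
    return np.concatenate([enriched, background])


tracks = {}
for mark in ("predictor", "target"):
    reads = downsample(toy_reads(), 4000, rng)  # 10 million reads on the slide
    train, test = split_reads(reads, rng)
    tracks[mark] = (bin_counts(train, chrom_length), bin_counts(test, chrom_length))

X_train, X_test = (t[:, None] for t in tracks["predictor"])
y_train, y_test = tracks["target"]
coef = fit_linear(X_train, y_train)
test_error = squared_error(y_test, predict(X_test, coef))
baseline_error = squared_error(y_test, np.full_like(y_test, y_train.mean()))
print(f"test error {test_error:.1f} vs. mean-only baseline {baseline_error:.1f}")
```

Fitting on one half of the reads and scoring on the other keeps read-sampling noise from being counted as predictive signal.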

12 Eg: Predicting H3K9ac signal
Mark, coefficient (in slide order): H3K56ac 0.32; H3K4me3 0.29; H3K4ac 0.22; H3K4me2 0.15; H3K27ac 0.14; H2AK5ac; H4K8ac; H3K23ac 0.13; H3K14ac; H3K79me2 0.12; H4K5ac 0.06; H3K36me3 0.04; H4K91ac 0.01; H3K4me1 -0.01; H3K18ac; H3K27me3 -0.02; H4K20me1 -0.04; H2BK120ac -0.05; H3K9me3; Input -0.07; H2BK15ac -0.1; H3K79me1 -0.15; H2BK12ac; H2BK20ac -0.22; Intercept -0.16.
[Figure tracks: H3K9ac predicted; H3K9ac true.]
How good is the prediction? How similar is it to other marks? How does it compare to a biological replicate?
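
One way to frame the three questions on this slide is to score every candidate track against the true signal with the same metric. The sketch below uses Pearson correlation as that metric, which is an assumption, since the slide does not name one. The helper names and all track values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def compare_to_truth(true_signal, candidates):
    """Rank candidate tracks by their correlation with the true signal."""
    scores = {name: pearson(true_signal, track) for name, track in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-bin values, not data from the presentation.
true_signal = [0.0, 1.0, 4.0, 9.0, 3.0, 0.5]
candidates = {
    "predicted": [0.2, 1.1, 3.5, 8.0, 3.2, 0.4],
    "replicate": [0.1, 0.9, 4.2, 9.1, 2.8, 0.6],
    "other mark": [5.0, 4.0, 1.0, 0.5, 2.0, 4.5],
}
ranking = compare_to_truth(true_signal, candidates)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

A biological replicate gives a practical ceiling: a prediction that correlates with the truth about as well as the replicate does is hard to improve on.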

13 Impute missing datasets / predict new cell types
Predict a missing mark from many others. Predict many marks in a new cell type.
[Figure panels: prediction of K27ac, K9ac, K4me1… in GM from DNase; prediction of H3K4me1 from DNase across cell types.]
Use mark correlations to predict missing datasets as matrices become denser.
Applications: (1) prediction in difficult-to-access conditions; (2) detecting failed experiments/replicates; (3) finding unexpected differences between prediction and raw data.

14 4. Allele-specific chromatin marks

15 Known imprinted genes confirm allele specific methodology
Map to phased GM12878 haplotypes. Count maternal vs. paternal reads.
Validation: known imprinted genes are allelic; X-inactivation affects only one chromosome.
Requires sufficient SNPs and sufficient reads for significance.
Discover allelic genes genome-wide → aggregate by gene / chromatin state.
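
The counting step above can be illustrated with an exact binomial test on aggregated maternal and paternal read counts. This is a sketch under assumptions: the slide does not state which test was used, and the function names, the `alpha` and `min_reads` thresholds, and the gene labels are all hypothetical. No multiple-testing correction is applied here.

```python
from math import comb


def binomial_p_value(maternal, paternal):
    """Two-sided exact binomial test against a 50/50 maternal/paternal split."""
    n = maternal + paternal
    if n == 0:
        return 1.0
    weights = [comb(n, k) for k in range(n + 1)]
    observed = weights[maternal]
    # Sum the probability of every outcome no more likely than the observed one.
    tail = sum(w for w in weights if w <= observed)
    return tail / 2 ** n


def allelic_genes(snp_counts, alpha=0.01, min_reads=10):
    """Aggregate per-SNP read counts by gene and test each gene.

    snp_counts: iterable of (gene, maternal_reads, paternal_reads).
    Genes with fewer than min_reads total reads are skipped.
    """
    totals = {}
    for gene, mat, pat in snp_counts:
        m, p = totals.get(gene, (0, 0))
        totals[gene] = (m + mat, p + pat)
    results = {}
    for gene, (m, p) in totals.items():
        if m + p < min_reads:
            continue  # too few reads to reach significance
        p_value = binomial_p_value(m, p)
        results[gene] = {
            "maternal": m,
            "paternal": p,
            "p_value": p_value,
            "allelic": p_value < alpha,
        }
    return results


# Hypothetical counts at heterozygous SNPs, two SNPs per gene where available.
snp_counts = [
    ("GENE_A", 18, 2), ("GENE_A", 15, 1),   # strongly maternal
    ("GENE_B", 10, 9), ("GENE_B", 8, 11),   # balanced
    ("GENE_C", 3, 0),                       # too few reads
]
results = allelic_genes(snp_counts)
for gene, row in results.items():
    print(gene, row)
```

Summing reads over all SNPs in a gene is what lets a gene with several weakly covered SNPs reach significance; the same aggregation could be keyed on chromatin state instead of gene.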

16 Allelic activity supported by many marks, Pol2, TFs
Includes X-inactivated paternal chromosome genes

17 Genome-wide correlations for pairs of marks
Aggregate signal across chromatin states. Active marks are positively correlated; H3K27me3 is negatively correlated. Zoom in on individual examples.

18 Active/repressive marks on paternal/maternal alleles
Active transcription of the paternal chromosome. Repressive marks on the maternal chromosome. Pol2 reads on the paternal chromosome.
Strong repressive signal (K27me3): reads mostly maternal.
Strong active signal (K79me2 tx): reads mostly paternal.

19 Allele-specific chromatin marks: cis-vs-trans effects
Maternal and paternal GM12878 genomes sequenced. Map reads to the phased genome, handling SNPs and indels. Correlate activity changes with sequence differences.

20 5. Linking enhancers to promoters using many cell types

21 Power should increase with additional cell types
[Figure labels: Chromatin State, Gene expression.]
Chance of spurious correlation decreases with additional cell types.

22 Power to predict links increases with more cell types
True enhancers show an excess of high correlations. The number of non-random links can be estimated at any FDR. The number of non-random links increases linearly with the number of cell types. 30 cell types: 15,000 links.
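
The idea of estimating non-random links at a given FDR can be sketched by comparing observed enhancer-gene correlations against a shuffled background. This is a minimal sketch under assumptions: the slide does not say how the background was built, so shuffling cell-type labels is a stand-in, and `estimate_links`, the threshold, and the simulated activity vectors are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def estimate_links(pairs, threshold, n_shuffles=50, seed=0):
    """Count links above a correlation threshold and estimate their FDR.

    pairs: list of (enhancer_activity, gene_expression), each a list with
    one value per cell type. The null is built by shuffling cell types.
    """
    rng = random.Random(seed)
    observed = sum(pearson(e, g) >= threshold for e, g in pairs)
    null_hits = 0
    for _ in range(n_shuffles):
        for e, g in pairs:
            shuffled = list(g)
            rng.shuffle(shuffled)
            null_hits += pearson(e, shuffled) >= threshold
    expected_null = null_hits / n_shuffles
    fdr = min(1.0, expected_null / observed) if observed else 1.0
    return {
        "observed": observed,
        "expected_null": expected_null,
        "non_random": observed - expected_null,
        "fdr": fdr,
    }


# Simulated data (hypothetical): 20 truly linked pairs out of 100, 12 cell types.
rng = random.Random(42)
n_cell_types = 12
pairs = []
for i in range(100):
    enhancer = [rng.gauss(0, 1) for _ in range(n_cell_types)]
    if i < 20:
        gene = [a + 0.3 * rng.gauss(0, 1) for a in enhancer]
    else:
        gene = [rng.gauss(0, 1) for _ in range(n_cell_types)]
    pairs.append((enhancer, gene))

result = estimate_links(pairs, threshold=0.8)
print(result)
```

With more cell types, a shuffled pair is less likely to clear the same threshold by chance, so more links survive at a fixed FDR.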

23 Visualizing 10,000s predicted enhancer-gene links
Overlapping regulatory units, both few and many. Both upstream and downstream elements linked. Enhancers correlate with sequence constraint.

24 6. Disease enrichments across 1000s of enhancers

25 Full T1D association spectrum → 1000s of causal SNPs
Rank all SNPs by P-value. Find chromatin states with enrichment in high ranks. Signal spans 1000s of SNPs.
[Figure labels: GM12878 (lymphoblastoid), K562 (myelogenous leukemia); GM12878 enhancer enrichment now seen.]
Cell type specific: GM and K562 enhancers. Chromatin state specific: enhancers/promoters.
Could bias in array design contribute to these enrichments? → Evaluate all 1000 Genomes SNPs by imputing those in LD.
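
The rank-and-enrich procedure on this slide can be sketched as follows: sort SNPs by P-value, take the top-ranked ones, and compare each chromatin state's count there with what its genome-wide share would predict. This is a simplified illustration with a single rank cutoff; `top_rank_enrichment`, the cutoff, and the toy SNP list are hypothetical.

```python
from collections import Counter


def top_rank_enrichment(snps, top_k):
    """Compare each state's count among the top_k SNPs with its expectation.

    snps: list of (p_value, chromatin_state) tuples.
    Returns {state: {"observed", "expected", "excess", "fold"}}.
    """
    ranked = sorted(snps, key=lambda snp: snp[0])
    totals = Counter(state for _, state in ranked)
    hits = Counter(state for _, state in ranked[:top_k])
    summary = {}
    for state, total in totals.items():
        expected = top_k * total / len(ranked)
        observed = hits[state]
        summary[state] = {
            "observed": observed,
            "expected": expected,
            "excess": observed - expected,
            "fold": observed / expected,
        }
    return summary


# Toy association results (hypothetical): enhancer SNPs crowd the top ranks.
snps = [
    (1e-8, "enhancer"), (1e-6, "enhancer"), (1e-5, "enhancer"), (0.2, "enhancer"),
    (1e-4, "other"), (0.1, "other"), (0.3, "other"),
    (0.5, "other"), (0.7, "other"), (0.9, "other"),
]
summary = top_rank_enrichment(snps, top_k=4)
for state, row in summary.items():
    print(state, row)
```

Repeating the calculation over a range of `top_k` values would show how far down the ranking the enrichment persists, which is the sense in which a signal can span thousands of SNPs.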

26 Imputing SNPs in LD → stronger cell/state separation
[Figure panels: enhancers across cell types; chromatin states in GM12878.]
Enhancers: 2049 (excess 392), 1940 distinct loci (R² < 0.8). Promoters: 462 (excess 81). Transcribed: 4740 (excess 522). Repressed: (excess 76). Insulator: 240 (excess 23). Other: 21k (deplete 1093).
Excess of 30,000 SNPs. 2049 enhancers (excess 392), mostly found in independent loci (1730 with R² < 0.2).
→ Systematically measure their regulatory contributions.
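
Counting "distinct loci" at an R² cutoff can be illustrated with greedy LD pruning: walk through the SNPs and keep one only if its R² with every SNP already kept is below the threshold. The slide does not say how its locus counts were obtained, so this is a generic technique shown under that assumption. The function name, SNP identifiers, and R² values are hypothetical.

```python
def independent_loci(snps, r2, threshold):
    """Greedily keep SNPs whose R² with every kept SNP is below threshold.

    snps: SNP identifiers in priority order (for example by P-value).
    r2: dict mapping frozenset({snp_a, snp_b}) to their pairwise R²;
        pairs missing from the dict are treated as unlinked (R² = 0).
    """
    kept = []
    for snp in snps:
        if all(r2.get(frozenset((snp, other)), 0.0) < threshold for other in kept):
            kept.append(snp)
    return kept


# Hypothetical SNPs and pairwise LD values.
snps = ["snp1", "snp2", "snp3", "snp4"]
r2 = {
    frozenset(("snp1", "snp2")): 0.9,
    frozenset(("snp1", "snp3")): 0.3,
    frozenset(("snp3", "snp4")): 0.1,
}
loose = independent_loci(snps, r2, threshold=0.8)
strict = independent_loci(snps, r2, threshold=0.2)
print(len(loose), "loci at R² < 0.8:", loose)
print(len(strict), "loci at R² < 0.2:", strict)
```

A stricter threshold merges more SNPs into the same locus, which is why the count at R² < 0.2 is smaller than the count at R² < 0.8.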

