Integrative analysis of genomic and epigenomic data

Manolis Kellis, RC1

Acknowledgements: Jason Ernst, Brad Bernstein, Pouya Kheradpour, Noam Shoresh, Chuck Epstein, Tarjei Mikkelsen

Integrative analysis of genomic / epigenomic data
Defining chromatin states
Biologically meaningful mark combinations
Characterizing chromatin states
Application to genome annotation
Chromatin state dynamics
Cluster genes of common behavior
Infer cell-type activators / repressors
Linking enhancers to promoter regions

Chromatin signatures for genome annotation
Challenges: dozens of marks; complex combinatorics; diversity and dynamics
Histone code hypothesis: distinct function for distinct combinations of marks? Both additive and combinatorial effects. How do we find biologically relevant ones?
Unsupervised approach: probabilistic model; explicit combinatorics

Cartoon Illustration of ChromHMM
Figure (cartoon): a stretch of DNA containing an enhancer, a transcription start site, and a transcribed region, divided into 200bp intervals.
Observed chromatin marks (K4me3, K36me3, K4me1, K27ac) are called based on a Poisson distribution.
Each interval is assigned its most likely hidden state (states 1-6 in the cartoon); each state is shown with its high-probability chromatin marks (example emission probabilities: 0.8, 0.9, 0.7).
Each state: vector of emissions, vector of transitions
All probabilities are learned from the data
Ernst et al., in preparation
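
The two steps in the cartoon (calling marks per 200bp interval, then finding the most likely hidden state path) can be sketched as below. This is a minimal illustration, not the ChromHMM implementation: the Poisson cutoff of 1e-4, the two toy states, and all probabilities are assumed values, and emissions are modeled as independent Bernoulli variables per mark.

```python
import math

def poisson_tail(lam, t):
    """P(X >= t) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(t))
    return 1.0 - cdf

def call_threshold(lam, p_cutoff=1e-4):
    """Smallest read count t with P(X >= t) < p_cutoff (cutoff is an assumed value)."""
    t = 1
    while poisson_tail(lam, t) >= p_cutoff:
        t += 1
    return t

def binarize(counts, lam, p_cutoff=1e-4):
    """Call a mark present (1) in a bin when its count reaches the threshold."""
    t = call_threshold(lam, p_cutoff)
    return [1 if c >= t else 0 for c in counts]

def log_emission(obs, probs):
    """Log-probability of a binary mark vector under independent Bernoulli emissions."""
    return sum(math.log(p if o else 1.0 - p) for o, p in zip(obs, probs))

def viterbi(observations, init, trans, emit):
    """Most likely state path; all probabilities must lie strictly between 0 and 1."""
    n_states = len(init)
    score = [math.log(init[s]) + log_emission(observations[0], emit[s])
             for s in range(n_states)]
    back = []
    for obs in observations[1:]:
        prev, pointers, score = score, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + log_emission(obs, emit[s]))
        back.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy model (hypothetical numbers): marks are (K4me3, K36me3);
# state 0 is promoter-like, state 1 is transcribed-like.
init = [0.5, 0.5]
trans = [[0.7, 0.3], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.8]]
obs = [(1, 0), (1, 0), (0, 1), (0, 1)]
path = viterbi(obs, init, trans, emit)
```

In this toy run the path switches from the promoter-like state to the transcribed-like state where the observed marks change. In a real model the emission and transition values would be learned from the data rather than set by hand.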

Application of ChromHMM to 41 chromatin marks in CD4+ T-cells (Barski’07, Wang’08)
State classes: promoter, transcribed, active intergenic, repressed, repetitive
Data sources: chromatin marks from Barski et al., Cell 2007 and Wang et al., Nature Genetics 2008; DNaseI hypersensitivity from Boyle et al., Cell 2008; expression data from Su et al., PNAS 2004; lamina data from Guelen et al., Nature 2008

State transition matrix
The full transition matrix of the Hidden Markov Model. Each row corresponds to the state being transitioned from and each column to the state being transitioned to. Each entry is the probability, when in the row's state, of transitioning to the column's state. The grid shows that the transition matrix is relatively sparse.
Enables separation of distinct sub-groups within each class
Reveals transitions between different groups
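
As a small illustration of the properties described above, the sketch below checks that a matrix is row-stochastic, measures how sparse it is, and lists the likely successors of a state. The 4-state matrix and the cutoffs (`eps`, `min_prob`) are hypothetical, chosen only for the example.

```python
def is_row_stochastic(matrix, tol=1e-9):
    """Every row holds non-negative probabilities that sum to 1."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def sparsity(matrix, eps=0.01):
    """Fraction of entries below eps (an assumed cutoff for 'effectively zero')."""
    entries = [p for row in matrix for p in row]
    return sum(p < eps for p in entries) / len(entries)

def likely_successors(matrix, state, min_prob=0.05):
    """States that `state` transitions to with probability >= min_prob, excluding itself."""
    return [(j, p) for j, p in enumerate(matrix[state])
            if j != state and p >= min_prob]

# Hypothetical 4-state matrix: rows = state transitioned from, columns = state transitioned to.
toy = [
    [0.90, 0.10, 0.00, 0.00],
    [0.05, 0.90, 0.05, 0.00],
    [0.00, 0.00, 0.95, 0.05],
    [0.00, 0.00, 0.02, 0.98],
]
```

Blocks of non-zero entries in such a matrix correspond to groups of states that transition mainly among themselves, which is one way the sub-groups mentioned on the slide become visible.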

(1) Promoter Associated States: Positional and functional properties
Figure: fold enrichment vs. distance to nearest TSS for the promoter-associated states.
Table: fold enrichment (corrected p-value) per GO category for states 3-8. Values are listed in source order; some fold values are missing from the transcript and appear as a bare p-value.
Cell Cycle Phase: 2.10 (2x10^-7); (1); 1.61 (0.001); (1); 1.15 (1); 1.51 (1)
Embryonic Development: 1.24 (1); (9x10^-23); 1.07 (1); 0.85 (1); 0.54 (1); 1.00 (1)
Chromatin: 1.20 (1); 0.48 (1); 2.2 (1.4x10^-7); 1.64 (1)
Response to DNA Damage Stimulus: 0.35 (1); 1.55 (0.074); (6.5x10^-11); (1.0x10^-4); 0.84 (1)
RNA Processing: 0.49 (1); 0.26 (1); 1.31 (1); 1.91 (4.2x10^-11); (8.7x10^-24); (3.0x10^-4)
T cell Activation: 0.77 (1); 0.88 (1); 1.27 (1); 0.70 (1); 0.79 (1); (2x10^-7)
Distinct positional enrichments: marks can recruit initiation factors; act of transcription reinforces marks
Distinct functional enrichments: epigenetic memory of activation history
Much richer epigenetic vocabulary
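
The "fold enrichment (corrected p-value)" entries can be illustrated with the sketch below. The slide does not say which test or correction was used, so the hypergeometric tail and the Bonferroni correction here are assumptions, and the gene counts in the example are made up.

```python
from math import comb

def fold_enrichment(k, n, K, N):
    """(fraction of the state's n genes in the category) / (fraction of all N genes in it)."""
    return (k / n) / (K / N)

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N, of which K are in the category."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total

def bonferroni(p, n_tests):
    """Multiply by the number of tests and cap at 1 (assumed correction method)."""
    return min(1.0, p * n_tests)

# Hypothetical counts: 20 of the 100 genes in a state fall in a category
# that holds 500 of 10,000 genes overall; 36 tests (6 categories x 6 states).
fe = fold_enrichment(20, 100, 500, 10000)
p_corrected = bonferroni(hypergeom_tail(20, 100, 500, 10000), n_tests=36)
```

Capping the corrected value at 1 would be consistent with the many "(1)" entries in the table, although the slide does not confirm that this is how they arose.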

(2) Actively Transcribed States: Diverse marks, expression/position biases
Figure panels: number of genes and fold enrichment for TSS-associated states, transcription elongation states, and exon-associated states; fold enrichment around the transcription end site.
No single mark uniquely defines transcribed states
Associated with active/repressed, expression levels
Distinct for start/elongation, short/long, exon/intron
Specific combination defines transcription end sites
Highly specific combination marks ZNF genes

(2) Actively Transcribed States: Recovery of highly specific KAP1 combinations
“The achievement of the repressed state by wild-type KAP1 involves decreased recruitment of RNA polymerase II, reduced levels of histone H3 K9 acetylation and H3K4 methylation, an increase in histone occupancy, enrichment of trimethyl histone H3K9, H3K36, and histone H4K20 …”

(3) Active intergenic states: Distinct TF/motif enrichments
Conserved motif enrichments/depletions (Pouya)

(3) Active intergenic states: Long-range predictive power
Enhancer state predictive of expression level
Different intergenic states, different distal activity
Distinguish active from less active enhancers
Figure: pairwise state enrichments after a 10kb gap
Enhancer states indeed distant from promoters
Overlap between promoter and transcribed states

(4 & 5) Intergenic and Large Scale Repressed States
Figure panels: repeat family enrichments for the repetitive states; transition matrix for the large-scale repressed states.
Distinct enrichments with lamina-associated regions: constitutive vs. facultative heterochromatin
Distinct response to HDAC inhibitors: state 44 acetylated, suggesting active acetylation turnover
Distinct sequence signatures: state 46 CAn/TGn/CATGn low-complexity repeat
Distinct enrichments for distinct classes of repeat elements, distinct epigenetic marks
Importance of jointly observing the entire vector of marks: repetitive signal would otherwise overwhelm the others' signal

Functional enrichments enable annotation of 51 distinct states

Apply genome wide to find novel genes, enhancers, insulators
Figure: enrichments of the states in each chromosome band (chromosomes 1-22, X, Y); band coordinates were obtained from the UCSC Genome Browser (Kent et al., 2002).
The satellite-enriched states (47-51) are enriched in centromeric regions of the chromosomes.
There are specific chromosome bands where states 41 and 42 have the dominant enrichment signal.
The zinc-finger-enriched state (state 28) is enriched on chromosome 19.
The unmappable state (state 40) is enriched in gapped regions at the beginning of several chromosomes.

Discovery power for promoters, transcripts
Figure panels: ROC curves (true positive rate vs. false positive rate) for TSS (left) and transcribed genes (right).
(Left) The blue curve is a receiver operating characteristic (ROC) curve for coverage of bins containing a TSS when states are ordered by their fold enrichment for a TSS (5, 7, 6, 4, 8, 3, 1, 2, 9, 10, 11, 21, 45, 20, etc.). The green curve is based on ordering the k-means clusters. The red triangles are based on the individual input marks. The purple curve is based on a logistic regression classifier whose features were the ln(x+1)-transformed raw tag counts in a bin for each mark; no spatial information was given. Results for the classifier are based on five-fold cross-validation. The TR-IRLS implementation of logistic regression was used with the default settings except for the cgdeveps parameter (its value is not preserved in this transcript).
(Right) The same plot as on the left, but for RefSeq transcribed regions as opposed to RefSeq TSSs.
Significantly outperforms single marks
Similar power to supervised learning approach
CAGE experiments give possible upper bound
References: Komarek, P. and Moore, A.W. Making Logistic Regression A Core Data Mining Tool With TR-IRLS. IEEE ICDM 2005. Carninci, P. Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genetics 38 (2006).
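
The state-ordering ROC described above can be sketched as follows: states are ranked by fold enrichment for a TSS and added one at a time, and each addition gives one (false positive rate, true positive rate) point. The three states and their bin counts are hypothetical, and the function names are introduced only for this illustration.

```python
def order_by_fold_enrichment(state_bins, state_hits):
    """Rank states by fold enrichment of TSS-containing bins, highest first."""
    base = sum(state_hits.values()) / sum(state_bins.values())
    return sorted(state_bins,
                  key=lambda s: (state_hits[s] / state_bins[s]) / base,
                  reverse=True)

def roc_points(state_bins, state_hits, order):
    """Cumulative (FPR, TPR) as states are added in the given order."""
    total_pos = sum(state_hits.values())
    total_neg = sum(state_bins.values()) - total_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for s in order:
        tp += state_hits[s]
        fp += state_bins[s] - state_hits[s]
        points.append((fp / total_neg, tp / total_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical counts: bins per state, and bins per state that contain a TSS.
state_bins = {"A": 100, "B": 100, "C": 800}
state_hits = {"A": 80, "B": 15, "C": 5}
order = order_by_fold_enrichment(state_bins, state_hits)
points = roc_points(state_bins, state_hits, order)
```

A single input mark gives only one operating point under this scheme, which matches the slide's use of triangles for individual marks and curves for the state and cluster orderings.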

State annotation reveals new protein-coding genes
Transcribed/promoter states enriched in novel protein-coding exons
Likely to represent short single-exon genes (promoter states)
Likely to represent low-expression genes (repressed states)

When novel transcribed regions lack protein signatures: 2,000 large intergenic non-coding RNAs (lincRNAs)
Computational signal (H3K4me3 - H3K36me3): chromatin signature of promoter and transcribed regions; evolutionary signature is not protein-coding
Experimental confirmation: produce RNA molecules; exon/intron structures
Evolutionary confirmation: exons are conserved; promoters are conserved; regulation is conserved
Experimental follow-up: they play diverse roles in chromatin regulation
Mikkelsen et al. 2007; Guttman, Lin, Kellis, Regev, Rinn, Lander, Nature, Feb 2009

Combine chromatin signatures and regulatory motifs: new developmental enhancers in human and fly
Visel, Pennacchio, Rubin, Ren, Nature 2008; Zeitlinger et al., Genes & Development 2007
Chromatin signatures and evolutionary signatures are predictive of enhancers
Experimental techniques developed for inferring expression domains
Large-scale databases mapping every element to its expression pattern
Ability to test new patterns / artificial elements in fly and mouse embryos

Shedding light on GWAS disease SNPs
State enrichments for SNPs and meta-study database of GWAS hits
rs in Chr2 intergenic region 40kb 3’ of IKZF2 (lymphocyte development)
Strongest disease association with numerous inflammations (Gudbjartsson09)
Strong hit for state 33, while the surrounding region is unenriched (states 37 and 41-43)

Application to ENCODE datasets
with Brad Bernstein, Noam Shoresh, Chuck Epstein
Chromatin modification marks (Bernstein): cell-type-specific genome annotation
TF binding data (Snyder): interpretation
ENCODE reference cell types: 11 chromatin marks, 8 cell types
Diverse additional functional datasets

Assessing predictive value for subset of marks
Figure: recovery of states with an increasing number of marks, under a greedy ordering of marks (state inferred with subset of marks vs. state inferred with all 41 marks).
Figure: state confusion matrix with the 11 ENCODE marks (vs. state inferred with all 41 marks).
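
A greedy ordering of marks can be sketched as forward selection: at each step, add the mark that most improves recovery of the full-model states. The slide does not define the recovery measure, so the one below is an assumed proxy (each bin is assigned the majority full-model state among bins sharing its pattern on the chosen marks), and the six toy bins are invented.

```python
from collections import Counter, defaultdict

def recovery(bins, states, subset):
    """Fraction of bins whose full-model state is recovered from the subset's mark pattern."""
    groups = defaultdict(Counter)
    for marks, state in zip(bins, states):
        groups[tuple(marks[m] for m in subset)][state] += 1
    correct = sum(c.most_common(1)[0][1] for c in groups.values())
    return correct / len(bins)

def greedy_order(bins, states, mark_names):
    """Forward selection: repeatedly add the mark giving the best recovery so far."""
    chosen, remaining, trace = [], list(mark_names), []
    while remaining:
        best = max(remaining, key=lambda m: recovery(bins, states, chosen + [m]))
        chosen.append(best)
        remaining.remove(best)
        trace.append((best, recovery(bins, states, chosen)))
    return trace

# Hypothetical binarized bins and the states a full model assigned to them.
MARKS = ["H3K4me3", "H3K36me3", "H3K27me3"]
bins = [
    {"H3K4me3": 1, "H3K36me3": 0, "H3K27me3": 0},
    {"H3K4me3": 1, "H3K36me3": 0, "H3K27me3": 0},
    {"H3K4me3": 0, "H3K36me3": 1, "H3K27me3": 0},
    {"H3K4me3": 0, "H3K36me3": 1, "H3K27me3": 0},
    {"H3K4me3": 0, "H3K36me3": 0, "H3K27me3": 1},
    {"H3K4me3": 0, "H3K36me3": 0, "H3K27me3": 0},
]
states = ["promoter", "promoter", "transcribed", "transcribed", "repressed", "unmarked"]
trace = greedy_order(bins, states, MARKS)
```

Because adding a mark only refines the grouping of bins, recovery under this proxy never decreases along the ordering, which gives the kind of rising curve the figure describes.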

Comparing chromatin states across cell types
Figure: pairwise state fold enrichments between K562, HUVEC, and NHEK, with the proportion of the genome in each state.
CTCF island state (state 9) highly stable across cell types
Note: it would be nice to point to a more variable mark too; CTCF is a pretty obvious thing and should not be the only thing pointed to.
TODO: add interpretation lines to the 10-state model
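
A pairwise state fold enrichment between two cell types can be computed as the observed co-occurrence of a state pair divided by what independence would predict. The sketch below assumes both cell types are segmented over the same bins; the state labels and the four-bin example are hypothetical.

```python
from collections import Counter

def pairwise_fold_enrichment(states_a, states_b):
    """Observed / expected frequency of each (state in A, state in B) pair over shared bins."""
    n = len(states_a)
    joint = Counter(zip(states_a, states_b))
    freq_a, freq_b = Counter(states_a), Counter(states_b)
    return {
        (a, b): (joint[(a, b)] / n) / ((freq_a[a] / n) * (freq_b[b] / n))
        for a in freq_a for b in freq_b
    }

# Hypothetical per-bin state calls in two cell types (9 = CTCF island,
# 1 = active promoter, 7 = unmodified).
cell_a = [9, 9, 1, 7]
cell_b = [9, 9, 7, 1]
enrich = pairwise_fold_enrichment(cell_a, cell_b)
```

A stable state shows high enrichment on the diagonal (same state in both cell types), while off-diagonal enrichment, such as active promoter in one cell type and unmodified in the other, points to cell-type-specific regions.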

GO enrichment for TSSs in the active promoter state (1) in NHEK and the unmodified state (7) in HUVEC (cell types shown: K562, HUVEC, NHEK)
GO category                    P-value
ectoderm development           2.90E-09
epidermis development          1.80E-08
keratinocyte differentiation   3.00E-06
tissue development             3.20E-06
cell adhesion                  1.90E-05

GO enrichment for TSSs in the active promoter state (1) in HUVEC and the unmodified state (7) in NHEK (cell types shown: K562, HUVEC, NHEK)
GO category                    P-value
blood vessel development       2.60E-05
vasculature development        3.00E-05
angiogenesis                   3.50E-05
blood vessel morphogenesis     1.20E-04

Clusters of genes with coherent marks across entire length
Tracks: CTCF, H3K27ac, H3K9ac, H3K4me3, H3K4me2, H3K4me1, H3K9me1, H4K20me1, Pol2, K36me3, K27me3; HMM 0, HMM 1, HMM 2, HMM 3
Clusters shown: Cluster 10 and Cluster 14
GO enrichments, first table (GO category, P-value): Immune response, 2x10^-63; Leukocyte activation, 5x10^-32; Lymphocyte activation, 6x10^-32
GO enrichments, second table (GO category, P-value): Cell adhesion, 1x10^-13; Ectoderm development, 1x10^-9; Extracellular region part, 3x10^-8

Signatures of activators and repressors
Figure: four panels plotting TF expression (2^-2 to 2^2) against motif enrichment (2^-4 to 2^4), for active states and for repressed states.
Activator signature (+): TF expressed → motif enriched; TF not expressed → motif depleted
Repressor signature (-): TF expressed → motif depleted; TF not expressed → motif enriched
Panel annotation (appears twice): TF expr → no motif; if motif → no expr
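
The signatures above amount to asking whether a TF's expression and its motif's enrichment in active states move together or in opposite directions across cell types. A minimal sketch under assumed inputs: both quantities are given as fold values relative to a baseline, the score is the mean product of their log2 values, and the `margin` cutoff is arbitrary.

```python
import math

def signature_score(tf_expression, motif_enrichment):
    """Mean product of log2 TF expression and log2 motif enrichment across cell types."""
    pairs = list(zip(tf_expression, motif_enrichment))
    return sum(math.log2(e) * math.log2(m) for e, m in pairs) / len(pairs)

def classify(score, margin=0.5):
    """Positive agreement suggests an activator, negative a repressor (margin is assumed)."""
    if score > margin:
        return "activator-like"
    if score < -margin:
        return "repressor-like"
    return "unclear"

# Hypothetical fold values in three cell types.
expression = [4.0, 0.25, 1.0]
enriched_when_expressed = [8.0, 0.125, 1.0]   # motif follows expression
depleted_when_expressed = [0.125, 8.0, 1.0]   # motif opposes expression

activator_call = classify(signature_score(expression, enriched_when_expressed))
repressor_call = classify(signature_score(expression, depleted_when_expressed))
```

The same score computed on repressed states would be expected to flip sign for a given TF, which is how the active-state and repressed-state panels complement each other.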

Example of activator and repressor
State legend: 0 = “Off” state; 5, 6 = “Enhancer” states; 9 = “On” state
HNF: activator in HepG2
CREB: repressor in GM
Figure axes: expression (2^-2 to 2^2) vs. fold enrichment (2^-4 to 2^4)

Linking candidate enhancers to correlated target genes
1. Search for coherent changes between gene expression and chromatin marks at distant loci (10kb)
2. Combine two vectors: the expression vector for each gene, and the vector of mark intensities at the distal locus (marks combined based on enhancer emissions)
3. High correlation → enhancer/target link
Figure: candidate TM4SF1 enhancer (10kb)
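
The three steps can be sketched as below: marks at a distal locus are combined into one signal using per-mark weights, that signal is correlated with a gene's expression across cell types, and high correlations become candidate links. Pearson correlation, the 0.8 cutoff, the weights, and all numbers are assumptions for illustration; the slide does not specify them.

```python
import math

def combine_marks(intensities, weights):
    """Weighted sum of mark intensities per cell type (weights stand in for enhancer emissions)."""
    marks = list(weights)
    n = len(intensities[marks[0]])
    return [sum(weights[m] * intensities[m][i] for m in marks) for i in range(n)]

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def link_enhancers(expression, candidates, min_r=0.8):
    """Keep candidate loci whose combined signal correlates with expression at >= min_r."""
    links = [(name, pearson(expression, signal)) for name, signal in candidates.items()]
    return sorted([l for l in links if l[1] >= min_r], key=lambda l: l[1], reverse=True)

# Hypothetical data across four cell types.
expression = [1.0, 5.0, 2.0, 8.0]
weights = {"H3K4me1": 0.8, "H3K27ac": 0.9}
candidates = {
    "enh_A": combine_marks({"H3K4me1": [0.5, 2.0, 1.0, 3.5],
                            "H3K27ac": [0.2, 2.5, 0.5, 4.0]}, weights),
    "enh_B": [3.0, 1.0, 3.5, 0.5],
}
links = link_enhancers(expression, candidates)
```

With only a handful of cell types, high correlations can arise by chance, so comparing against control regions, as on the next slide, is what gives the links their meaning.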

Predictive power of distal enhancer regions
Figure: correlation of mark intensity with expression for individual regions (sorted by rank); series: 10kb upstream, 100kb upstream, 10kb/100kb controls
At least 100 regions with >80% correlation

Defining chromatin states
Biologically meaningful mark combinations
Characterizing chromatin states
Application to genome annotation
Chromatin state dynamics
Cluster genes of common behavior
Infer cell-type activators / repressors
Linking enhancers to promoter regions
Where to next?

Where to next?
Technology development
Data production
Data dissemination and visualization
Overlaying and combining datasets
Integrative data analysis
Biological discovery and understanding
