Analysis of ChIP-Seq Data

Analysis of ChIP-Seq Data Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers

Analysis of ChIP-Seq Data Genomic Data Analysis Course Moscow July 2013 Mark Reimers, Ph.D

What Are the Questions? Where are histone modifications? Where do TFs bind to DNA? Where do miRNAs or RNA-binding proteins (RNABPs) bind to 3’ UTRs? How different is binding between samples? The focal questions are mostly about where. The assays are not yet good enough to ask how much.

Why ChIP-Seq? ChIP-Seq is ideal (and is now the standard method) for mapping locations where regulatory proteins bind on DNA. Typically there are ‘only’ 2,000 - 20,000 active binding sites, each with a footprint of ~200-400 base pairs. Similarly, ChIP-Seq is fairly efficient for mapping uncommon histone modifications and for RNA polymerase occupancy, because the genomic regions occupied are very narrow.

Chromatin Immuno-Precipitation: Chromatin Immuno-Precipitation (ChIP) is a method for selecting DNA fragments near specific proteins or specific histone modifications. From Massie, EMBO Reports, 2008

Chromatin Immuno-precipitation: Proteins are cross-linked to DNA by formaldehyde or by UV light (NB: proteins are even more linked to each other than to DNA). DNA is fragmented. Antibodies are introduced (NB: cross-linking may disrupt epitopes). Antibodies are pulled out (often on magnetic beads). DNA is released and sequenced.

CLIP-Seq – A Related Assay: Cross-linking immuno-precipitation (CLIP)-Seq is used to map locations of RNA-binding proteins on mRNA. Even miRNA binding can be mapped indirectly by CLIP-Seq, with antibodies raised to Argonaute, an miRNA accessory protein.

What ChIP-Seq Data Look Like: Note that Input also has peaks, in many of the same locations. (Draw out.) From Rozowsky et al, Nature Biotech 2009

The Value of Controls: ChIP vs. Control Reads. NB: Non-specific enrichment depends on protocol. Need controls for every batch run. Red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation.

Goals of Analysis: Identify genomic regions (‘peaks’) where a TF binds or histones are modified. Quantify and compare levels of binding or histone modification between samples. Characterize the relationships among chromatin state and gene expression or splicing.

General Characteristics of ChIP-Seq Data: Fragments are quite large relative to binding sites of TFs. ChIP-exo (ChIP followed by exonuclease treatment) can trim reads to within a smaller number of bases of the binding site. Histone modifications cover broader regions of DNA than TFs. Histone modification measures often undulate, following well-positioned nucleosomes.

ChIP Reads Pile Up in ‘Peaks’ at TF Binding Sites on Alternate Strands Indicate

ChIP-Seq for Transcription Factors: Typically several thousand distinct peaks across the genome. Not clear how many of the lower peaks represent low-affinity binding sites. From Rozowsky et al, Nature Biotech 2009

ChIP-Seq for Polymerase: Fine mapping of Pol2 occupancy shows peaks at the 5’ and 3’ ends of genes. From Rahl et al, Cell 2010

ChIP-Seq Histone Modifications: Many histone modifications extend over longer stretches rather than forming peaks. Different modifications may have different profiles. Not clear how to compare them.

Issues in Analysis of ChIP-Seq Data: Many false positive peaks. How to use controls in data analysis. How to count reads starting at the same locus. What are appropriate controls? Naked DNA, untreated chromatin, IgG. Some DNA regions are not uniquely identifiable (‘mappability’). How to compare different samples? Overlap between the results of different peak-finding algorithms is often poor.

Mapability Issues: Many TFBS and histone modifications lie in low-complexity or repeat regions of DNA. With short reads (under 75 bp) that contain some errors, it may not be possible to uniquely identify (map) the locus of origin of a read. UCSC provides a set of mapability tracks: select Mapping and Sequencing Tracks, then select Mapability. Tracks are available for 35-, 40-, 50- and 70-mer mapability (some with different error allowances).
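
A toy sketch of the idea behind such tracks, assuming a simplified score of 1 / (number of exact occurrences of the k-mer that starts at each position). The function name `kmer_mappability` is hypothetical, and the sketch ignores the reverse strand and mismatches, which a genome-scale track would need to handle.

```python
from collections import Counter


def kmer_mappability(seq, k):
    """Per-position score: 1 / (occurrences of the k-mer starting there).

    A score of 1.0 means the k-mer is unique in `seq`; lower scores mean
    a read of length k from that position could map to several places.
    Assumption: exact matches on the forward strand only.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[km] for km in kmers]


# Toy sequence with a repeated 4-mer (ACGT occurs at positions 0 and 4).
scores = kmer_mappability("ACGTACGTTT", 4)
```

Longer k generally raises the scores, which is why the tracks are offered for several read lengths.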

END for Seq Analysis

ChIP-Seq for Histone Modifications: Various histone modifications characterize different regulatory states.

Exon Peaks

Intronic Peaks: My guess is that there is an alternate initiation site for this gene.

Intergenic H3K4me3 Peaks: Peak of H3K4me3 in a region not annotated by RefSeq. Corresponds to an unknown transcriptionally active region (TAR) in cerebellum (annotated by AceView).

H3K4me3 vs Gene Expression: 91.5% of expressed genes have H3K4me3 peaks. Biological significance? Genes with peaks but no expression may be poised. Expressed genes with no peaks may reflect failure of the peak-finder. H3K4me3 peaks at TSS are peaks within 1 kb of the TSS.
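
A minimal sketch of the "within 1 kb of the TSS" criterion, assuming all coordinates lie on a single chromosome and that each peak is represented by its summit position. The function name and the toy gene names are hypothetical.

```python
import bisect


def genes_with_tss_peak(tss_positions, peak_summits, max_dist=1000):
    """Return names of genes with at least one peak summit within
    `max_dist` bases of the TSS.

    tss_positions: dict mapping gene name -> TSS coordinate
    peak_summits:  iterable of summit coordinates (same chromosome)
    """
    summits = sorted(peak_summits)
    hits = []
    for name, tss in tss_positions.items():
        # First summit at or to the right of the window's left edge.
        j = bisect.bisect_left(summits, tss - max_dist)
        if j < len(summits) and summits[j] <= tss + max_dist:
            hits.append(name)
    return hits


tss = {"geneA": 5000, "geneB": 20000, "geneC": 40000}
summits = [4200, 21500, 39000]
marked = genes_with_tss_peak(tss, summits)
fraction_marked = len(marked) / len(tss)
```

Comparing `marked` against a list of expressed genes would give a percentage of the kind quoted on the slide.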

CLIP-Seq for RNA Binding Proteins Argonaute high throughput sequencing of cross-linking immunoprecipitation (HITS-CLIP) protocol, also known as CLIP-seq. Thomson D W et al. Nucl. Acids Res. 2011;39:6845-6853 © The Author(s) 2011. Published by Oxford University Press.

The ENCODE Project: Comprehensive characterization of chromatin state and locations of some TFs and other DNA-binding proteins (e.g. Pol2) across various conditions in human cell lines and in mouse tissues. So far no normal human tissues, which likely have rather different epigenetic marks.

Demo: ENCODE Data at UCSC

ChIP-Seq Demo

Peak Calling

Goals of Peak-Calling: Identify discrete locations where a particular protein binds. Often applied more generally (and IMHO poorly) to identification of short regions with a particular histone modification.

Issues in Peak-Calling: Background of random genomic reads is not uniform; it is affected by CG content and other factors. Most good algorithms try to estimate a local background. Local background has peaks too!

Peak-Finding - Simple: Extend tags; sum overlaps at each base. Find center of each discrete cluster. Issues: Are clusters discrete? How much to extend? Fragment size is unclear.
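
The simple approach can be sketched as follows. This is a minimal illustration, not any particular tool's implementation: the function names, the 200 bp extension, and the depth cutoff are all assumptions chosen for the toy example.

```python
from collections import defaultdict


def coverage_from_tags(tags, extend=200):
    """Extend each tag `extend` bases in its reading direction and
    sum the overlaps at each base.

    tags: list of (start, strand) with strand '+' or '-'.
    """
    cov = defaultdict(int)
    for pos, strand in tags:
        if strand == '+':
            lo, hi = pos, pos + extend          # extend to the right
        else:
            lo, hi = pos - extend + 1, pos + 1  # extend to the left
        for b in range(max(lo, 0), hi):
            cov[b] += 1
    return cov


def find_clusters(cov, min_depth=3):
    """Return (start, end, center) for each run of consecutive bases
    with depth >= min_depth; center is the middle of the deepest stretch."""
    kept = sorted(b for b, d in cov.items() if d >= min_depth)
    runs, run = [], []
    for b in kept:
        if run and b != run[-1] + 1:
            runs.append(run)
            run = []
        run.append(b)
    if run:
        runs.append(run)
    out = []
    for run in runs:
        top = max(cov[b] for b in run)
        top_pos = [b for b in run if cov[b] == top]
        out.append((run[0], run[-1], (top_pos[0] + top_pos[-1]) // 2))
    return out


tags = [(100, '+'), (110, '+'), (120, '+'),
        (380, '-'), (390, '-'), (400, '-')]
clusters = find_clusters(coverage_from_tags(tags, extend=200))
```

Changing `extend` moves both the cluster boundaries and the center, which is the "how much to extend?" issue in concrete form.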

Peak Finding – Better: Tags starting on opposite strands are likely to start at opposite ends of precipitated fragments. Identifying the cross-over point leads to better accuracy.
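
One way to make this concrete is to find the offset that best aligns plus-strand tag starts with minus-strand tag starts, and to place the cross-over point midway between the two strand clusters. The sketch below assumes exact-position matching and a single binding site per cluster; both function names are hypothetical.

```python
from collections import Counter


def estimate_strand_shift(plus_starts, minus_starts, max_shift=500):
    """Return the shift d (0..max_shift) that maximizes the number of
    (plus, minus) tag pairs with minus_start == plus_start + d.
    Ties go to the smallest shift."""
    plus_counts = Counter(plus_starts)
    minus_counts = Counter(minus_starts)
    best_shift, best_score = 0, -1
    for d in range(max_shift + 1):
        score = sum(n * minus_counts.get(p + d, 0)
                    for p, n in plus_counts.items())
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift


def crossover_point(plus_starts, minus_starts):
    """Midpoint between the mean plus-strand and mean minus-strand tag
    starts of one cluster (assumes a single binding site)."""
    mean_plus = sum(plus_starts) / len(plus_starts)
    mean_minus = sum(minus_starts) / len(minus_starts)
    return (mean_plus + mean_minus) / 2


plus = [100, 105, 110]
minus = [250, 255, 260]
shift = estimate_strand_shift(plus, minus)
site = crossover_point(plus, minus)
```

With real data the tag positions are noisy, so a smoothed or binned version of the same comparison would be a more realistic choice.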

Issue: How to Identify a Cluster. Background varies, often related to local CG content and chromatin state. Need a statistical test for excess counts above local background. Usually done by binning counts into 1 kb bins across the genome.
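
A minimal sketch of such a test, assuming read counts per bin follow a Poisson distribution under the null. The choice of Poisson, the scaling of control counts by library size, the `alpha` cutoff, and the function names are all assumptions made for illustration.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly from the tail."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    if term == 0.0:
        # Underflow: k is far out in one tail or the other.
        return 0.0 if k > lam else 1.0
    total, i = 0.0, k
    while True:
        total += term
        i += 1
        term *= lam / i
        if i > lam and term <= total * 1e-16:
            break
    return min(total, 1.0)


def enriched_bins(chip, control, alpha=1e-5):
    """Indices of bins whose ChIP count is significantly above background.

    Background rate per bin = max(genome-wide ChIP mean,
    control count scaled to the ChIP library size).
    Assumes `control` has a non-zero total.
    """
    scale = sum(chip) / sum(control)
    genome_lam = sum(chip) / len(chip)
    hits = []
    for i, (c, b) in enumerate(zip(chip, control)):
        lam = max(genome_lam, b * scale)
        if poisson_upper_tail(c, lam) < alpha:
            hits.append(i)
    return hits


chip_counts = [2, 3, 2, 30, 2, 3, 2, 2]      # toy counts per 1 kb bin
control_counts = [2, 2, 2, 3, 2, 2, 2, 2]
significant = enriched_bins(chip_counts, control_counts)
```

Testing every bin in the genome means many tests, so in practice the cutoff would need a multiple-testing correction.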

Control Reads Show Peaks Also From Rozowsky et al Nature Biotech 2009

Cause of Variation in Read Density: In a study of FoxA1 binding, even control reads were enriched near FoxA1 binding sites! Probably due to open chromatin near FoxA1 binding sites. Plot: density of control-channel reads around FoxA1 sites. Courtesy Shirley Liu

Peak Finding by MACS: Smart peak imputation estimate. Uses read directions. Empirical estimate of fragment length. Local frequency estimate: using control, if available; using wide estimate, otherwise. Not using sequence.

MACS Workflow: Key innovation is to estimate fragment length empirically. If there is no control sample, MACS estimates background from the median of ChIP bin counts. No use of CG content.
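
A local background rate of this general kind can be sketched by taking the largest of a genome-wide mean and the means over several windows centered on the bin of interest. This is an illustration of the idea only, not MACS's code: the function name and the window widths (given in bins on each side) are placeholders.

```python
def local_lambda(background, i, widths=(1, 5, 10)):
    """Conservative background rate for bin `i`.

    background: per-bin read counts, from the control sample if one is
                available, otherwise from the ChIP sample itself.
    widths:     window half-widths in bins (hypothetical defaults).
    Returns the maximum of the genome-wide mean and each window mean,
    so a locally elevated background raises the bar for calling a peak.
    """
    best = sum(background) / len(background)
    for w in widths:
        lo = max(0, i - w)
        hi = min(len(background), i + w + 1)
        best = max(best, sum(background[lo:hi]) / (hi - lo))
    return best


counts = [1, 1, 1, 1, 9, 9, 9, 1, 1, 1, 1, 1]
lam_in_hotspot = local_lambda(counts, 5, widths=(1, 2))
lam_elsewhere = local_lambda(counts, 0, widths=(1, 2))
```

Taking the maximum rather than a single window makes the estimate less sensitive to the choice of window size.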

The Value of Controls: ChIP vs. Control Reads. Red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation.
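
The FDR calculation from control peaks can be sketched as a simple ratio, assuming the same peak caller and score scale are applied to both the ChIP and the control sample. The function name and the toy scores are hypothetical.

```python
def empirical_fdr(chip_scores, control_scores, threshold):
    """Estimated false discovery rate at a score threshold:
    (# control peaks passing) / (# ChIP peaks passing), capped at 1.

    Assumes peaks called in the control approximate the false
    positives expected among the ChIP peaks.
    """
    n_chip = sum(s >= threshold for s in chip_scores)
    n_ctrl = sum(s >= threshold for s in control_scores)
    if n_chip == 0:
        return 0.0
    return min(1.0, n_ctrl / n_chip)


chip_peak_scores = [50, 40, 30, 20, 12, 10, 8, 5]
control_peak_scores = [11, 9, 6, 4]
fdr_strict = empirical_fdr(chip_peak_scores, control_peak_scores, 10)
fdr_loose = empirical_fdr(chip_peak_scores, control_peak_scores, 5)
```

Raising the threshold trades fewer called peaks for a lower estimated FDR.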

Issue: Fragment Lengths. Puzzle: Fragments from sonication are expected to be between 200 and 500 bp, but the empirically estimated fragment size is ~100 bp. Shirley Liu’s explanation: preferential fragmentation near the TF.

Peak Calling Demo

Quantitative Comparison of ChIP Data