Analysis of ChIP-Seq Data

1 Analysis of ChIP-Seq Data
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers

2 Analysis of ChIP-Seq Data
Genomic Data Analysis Course Moscow July 2013 Mark Reimers, Ph.D

3 What Are the Questions? Where are histone modifications?
Where do TFs bind to DNA? Where do miRNAs or RNABPs bind to 3’ UTRs? How different is binding between samples? The focal questions are mostly ‘where?’ We are not yet good enough to ask ‘how much?’

4 Why ChIP-Seq? ChIP-Seq is ideal (and is now the standard method) for mapping locations where regulatory proteins bind on DNA. Typically ‘only’ 2, ,000 active binding sites with footprint ~ base pairs. Similarly, ChIP-Seq is fairly efficient for mapping uncommon histone modifications and for RNA Polymerase occupancy, because the genomic regions occupied are very narrow.

5 Chromatin Immuno-Precipitation
Chromatin Immuno-Precipitation (ChIP) is a method for selecting fragments from DNA near specific proteins or specific histone modifications. From Massie, EMBO Reports, 2008

6 Chromatin Immuno-precipitation
Proteins are cross-linked to DNA by formaldehyde or by UV light (NB: proteins are even more linked to each other than to DNA). DNA is fragmented. Antibodies are introduced (NB: cross-linking may disrupt epitopes). Antibodies are pulled out (often on magnetic beads). DNA is released and sequenced.

7 CLIP-Seq – A Related Assay
Cross-linking immuno-precipitation (CLIP)-Seq is used to map locations of RNA-binding proteins on mRNA. Even miRNA binding can be mapped indirectly by CLIP-Seq with antibodies raised to Argonaute, an miRNA accessory protein.

8 What ChIP-Seq Data Look Like
Note that Input also has peaks… in many of the same locations. Draw out. From Rozowsky et al., Nature Biotech 2009

9 The Value of Controls: ChIP vs. Control Reads
NB: Non-specific enrichment depends on protocol. Controls are needed for every batch run. Red dots are windows containing ChIP peaks, and black dots are windows containing control peaks used for FDR calculation.
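One simple way to turn control peaks into an FDR estimate is sketched here. It assumes every window has a peak score in both the ChIP and the control sample, and that the FDR at a given threshold is approximated by the ratio of control peaks to ChIP peaks passing it. The function name and the toy scores are hypothetical and do not come from the slides.

```python
def empirical_fdr(chip_scores, control_scores, threshold):
    """Estimate FDR as (# control peaks >= threshold) / (# ChIP peaks >= threshold)."""
    n_chip = sum(s >= threshold for s in chip_scores)
    n_ctrl = sum(s >= threshold for s in control_scores)
    if n_chip == 0:
        # No ChIP peaks are called at this threshold, so nothing can be a false call.
        return 0.0
    # Cap at 1.0 because a rate above 100% is not meaningful.
    return min(1.0, n_ctrl / n_chip)


# Toy peak scores (hypothetical).
chip = [50, 42, 30, 22, 18, 12, 9, 8]
ctrl = [20, 11, 9, 7]

fdr_loose = empirical_fdr(chip, ctrl, threshold=10)   # 2 control vs 6 ChIP peaks
fdr_strict = empirical_fdr(chip, ctrl, threshold=25)  # 0 control vs 3 ChIP peaks
```

Raising the threshold trades fewer called peaks for a lower estimated FDR, which is why a matched control is needed for every batch.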

10 Goals of Analysis Identify genomic regions (‘peaks’) where TFs bind or histones are modified. Quantify and compare levels of binding or histone modification between samples. Characterize the relationships among chromatin state and gene expression or splicing.

11 General Characteristics of ChIP-Seq Data
Fragments are quite large relative to binding sites of TFs. ChIP-exo (ChIP followed by exonuclease treatment) can trim reads to within a smaller number of bases. Histone modifications cover broader regions of DNA than TFs. Histone modification measures often undulate following well-positioned nucleosomes.

12 ChIP Reads Pile Up in ‘Peaks’ at TF Binding Sites on Alternate Strands

13 ChIP-Seq for Transcription Factors
Typically several thousand distinct peaks across the genome. It is not clear how many of the lower peaks represent low-affinity binding sites. From Rozowsky et al., Nature Biotech 2009

14 ChIP-Seq for Polymerase
Fine mapping of Pol2 occupancy shows peaks at 5’ and 3’ ends From Rahl et al Cell 2010

15 ChIP-Seq Histone Modifications
Many histone modifications are over longer stretches rather than peaks. May have different profiles. Not clear how to compare.

16 Issues in Analysis of ChIP-Seq Data
Many false positive peaks. How to use controls in data analysis? How to count reads starting at the same locus? What are appropriate controls: naked DNA, untreated chromatin, IgG? Some DNA regions are not uniquely identifiable (‘mappability’). How to compare different samples? Overlap between the results of different peak-finding algorithms is often poor.

17 Mappability Issues Many TFBS and histone modifications lie in low-complexity or repeat regions of DNA. With short reads (under 75 bp), with some errors, it may not be possible to uniquely identify (map) the locus of origin of a read. UCSC provides a set of mappability tracks: select Mapping and Sequencing Tracks, then select Mappability. Tracks cover 35, 40, 50 & 70-mer mappability (some with different error allowances).
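The idea behind a k-mer mappability track can be illustrated with a toy sketch. It assumes exact matching on the forward strand only and a tiny made-up sequence; real tracks also consider the reverse strand and allow mismatches. The function names are hypothetical.

```python
from collections import Counter


def kmer_mappability(genome, k):
    """For each start position, report True if the k-mer starting there occurs once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


def mappable_fraction(genome, k):
    """Fraction of start positions whose k-mer is unique in the genome."""
    flags = kmer_mappability(genome, k)
    return sum(flags) / len(flags)


toy_genome = "ACGTACGTTT"  # contains the repeat "ACGT" twice

short_reads = kmer_mappability(toy_genome, 4)  # the two ACGT starts are ambiguous
long_reads = kmer_mappability(toy_genome, 6)   # longer k-mers span the repeat
```

In this toy case 4-mers starting at the repeat cannot be placed uniquely, while every 6-mer can, which mirrors why mappability is reported separately for each read length.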

18 END for Seq Analysis

19 ChIP-Seq for Histone Modifications
Various histone modifications characterize different regulatory states

20 Exon Peaks

21 Intronic Peaks My guess is that there is an alternate initiation site for this gene

22 Intergenic H3K4me3 Peaks Peak of H3K4me3 in a region not annotated by RefSeq. Corresponds to unknown TAR in cerebellum (annotated by AceView).

23 H3K4me3 vs Gene expression
91.5% of expressed genes have H3K4me3 peaks. Biological significance? Genes with peaks but no expression: maybe poised. Expressed genes with no peaks: maybe failure of peak-finder. H3K4me3 peaks at TSS: peaks within 1 kb of the TSS.
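A figure such as the fraction of genes with a peak at the TSS could be computed along the following lines. This is a minimal sketch that assumes a single chromosome, peaks given as (start, end) intervals, one TSS coordinate per gene, and the 1 kb window from the slide. The gene names and coordinates are invented for illustration.

```python
def genes_with_tss_peak(tss_by_gene, peaks, window=1000):
    """Map each gene to True if any peak lies within `window` bp of its TSS."""
    result = {}
    for gene, tss in tss_by_gene.items():
        # A peak counts if the TSS falls inside the peak padded by `window` bp.
        result[gene] = any(
            start - window <= tss <= end + window for start, end in peaks
        )
    return result


# Hypothetical TSS positions and peak intervals on one chromosome.
tss_by_gene = {"geneA": 5000, "geneB": 20000, "geneC": 42000}
peaks = [(4500, 5300), (20900, 21400), (60000, 60500)]

hits = genes_with_tss_peak(tss_by_gene, peaks)
fraction_with_peak = sum(hits.values()) / len(hits)
```

Genes that are expressed but come out False here would be the candidates for a peak-finder failure mentioned above.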

24 CLIP-Seq for RNA Binding Proteins
Argonaute high throughput sequencing of cross-linking immunoprecipitation (HITS-CLIP) protocol, also known as CLIP-seq. Thomson D W et al. Nucl. Acids Res. 2011;39: © The Author(s) Published by Oxford University Press.

25 The ENCODE Project Comprehensive characterization of chromatin state and locations of some TFs and other DNA-binding proteins (e.g. Pol2) across various conditions in human cell lines and in mouse tissues. So far no normal human tissues, which likely have rather different epigenetic marks.

26 Demo: ENCODE Data at UCSC

27 ChIP-Seq Demo

28 Peak Calling

29 Goals of Peak-Calling Identify discrete locations where a particular protein binds. Often applied more generally (and IMHO poorly) to identification of short regions with a particular histone modification.

30 Issues in Peak-Calling
Background of random genomic reads is not uniform: it is affected by CG content and other factors. Most good algorithms try to estimate a local background. Local background has peaks too!

31 Peak-Finding - Simple Extend tags; sum overlaps at each base
Find center of each discrete cluster. Issues: Are clusters discrete? How much to extend? Fragment size is unclear.
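The simple approach (extend tags, sum overlaps at each base) can be sketched as below. The sketch assumes 0-based tag positions marking the 5' end of each read, strands given as '+' or '-', and a fixed extension length chosen by the user. All names and the toy tags are hypothetical.

```python
def extended_coverage(tags, chrom_len, extend=200):
    """Per-base coverage after extending each tag `extend` bp in its 3' direction.

    tags: iterable of (position, strand), position being the 0-based 5' end.
    """
    diff = [0] * (chrom_len + 1)  # difference array: +1 at start, -1 past the end
    for pos, strand in tags:
        if strand == "+":
            start, end = pos, min(chrom_len, pos + extend)
        else:
            start, end = max(0, pos - extend + 1), pos + 1
        diff[start] += 1
        diff[end] -= 1
    coverage, running = [], 0
    for delta in diff[:chrom_len]:
        running += delta
        coverage.append(running)
    return coverage


# Two toy tags on opposite strands whose extensions overlap at base 6.
cov = extended_coverage([(2, "+"), (10, "-")], chrom_len=20, extend=5)
summit = cov.index(max(cov))
```

The choice of `extend` is exactly the open issue on this slide: too short and the two strands never overlap, too long and neighbouring clusters merge.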

32 Peak Finding – Better Tags starting on opposite strands are likely to start at opposite ends of precipitated fragments. Identifying the cross-over point leads to better accuracy.
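One way to exploit the strand offset is to look for the shift that best aligns forward-strand tag starts with reverse-strand tag starts. The sketch below uses exact position matching on toy data, which is cruder than the smoothed profiles a real peak caller would use. The function name and the numbers are assumptions for illustration.

```python
from collections import Counter


def estimate_strand_shift(fwd_starts, rev_starts, max_shift=300):
    """Return the shift d (0..max_shift) that maximizes the number of
    forward/reverse tag pairs with rev - fwd == d."""
    rev_counts = Counter(rev_starts)
    best_shift, best_score = 0, -1
    for d in range(max_shift + 1):
        # Counter returns 0 for positions with no reverse-strand tag.
        score = sum(rev_counts[f + d] for f in fwd_starts)
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift


# Toy data: reverse-strand tags sit 150 bp downstream of forward-strand tags.
fwd = [100, 105, 110, 500, 503]
rev = [250, 255, 260, 650, 653]

shift = estimate_strand_shift(fwd, rev)
# The binding site is then placed midway between the paired tag starts.
site_estimates = [f + shift // 2 for f in fwd]
```

Shifting each forward tag by half the estimated offset, and each reverse tag by the same amount in the other direction, places both strands' pile-ups on the cross-over point.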

33 Issue: How to Identify a Cluster
Background varies, often related to local CG content and chromatin state. Need statistical test for excess counts above local background. Usually done by binning counts into 1 kb bins across the genome.
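A test for excess counts in bins might look like the following sketch. It assumes a Poisson model for the count in each bin, with the expected count taken from a control sample in the same bin (floored at 1 to avoid a zero rate). The Poisson model, the floor, and all names are assumptions made for illustration, not something the slides specify.

```python
import math


def bin_counts(starts, bin_size, n_bins):
    """Count read start positions falling in each fixed-width bin."""
    counts = [0] * n_bins
    for s in starts:
        b = s // bin_size
        if 0 <= b < n_bins:
            counts[b] += 1
    return counts


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), as 1 minus the summed lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def enriched_bins(chip_counts, control_counts, alpha=1e-3):
    """Indices of bins whose ChIP count is unexpectedly high given the control."""
    hits = []
    for i, (c, bg) in enumerate(zip(chip_counts, control_counts)):
        lam = max(bg, 1)  # hypothetical floor on the local background rate
        if poisson_sf(c, lam) < alpha:
            hits.append(i)
    return hits


chip = bin_counts([10, 999, 1000, 2500], bin_size=1000, n_bins=3)
called = enriched_bins([3, 20, 4], [3, 4, 5])
```

Computing the upper tail as one minus the lower tail loses precision for extremely small p-values; a library routine would be preferable when those need to be ranked.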

34 Control Reads Show Peaks Also
From Rozowsky et al Nature Biotech 2009

35 Cause of Variation in Read Density
In a study of FoxA1 binding, even control reads were enriched near the FoxA1 binding site! Probably due to open chromatin near the FoxA1 binding site. Figure: density of control channel reads around the FoxA1 site. Courtesy Shirley Liu

36 Peak Finding by MACS Smart peak imputation estimate
Uses read directions. Empirical estimate of fragment length. Local frequency estimate: using control, if available; using wide estimate, otherwise. Not using sequence.

37 MACS Workflow Key innovation is to estimate fragment length empirically. If no control sample, MACS estimates background from median of ChIP bin counts. No use of CG content.
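The local background idea can be illustrated with a rough sketch. This is not MACS code. It assumes binned counts, takes the expected count for a bin as the largest of the genome-wide mean and the means of a few windows around that bin, and falls back to the median of the ChIP bin counts when no control is supplied, as the slide describes. The window sizes and function names are hypothetical.

```python
from statistics import median


def background_counts(chip_counts, control_counts=None):
    """Use control bin counts if available, else a flat median-of-ChIP background."""
    if control_counts is not None:
        return list(control_counts)
    return [median(chip_counts)] * len(chip_counts)


def local_lambda(bg_counts, i, half_widths=(1, 5, 10)):
    """Expected count for bin i: the largest of the genome-wide mean and
    the means of windows extending `w` bins to each side of bin i."""
    candidates = [sum(bg_counts) / len(bg_counts)]
    for w in half_widths:
        window = bg_counts[max(0, i - w): i + w + 1]
        candidates.append(sum(window) / len(window))
    return max(candidates)


control = [1, 1, 1, 1, 10, 1, 1, 1, 1]  # toy control with its own local peak
lam_at_peak = local_lambda(control, 4)
lam_at_edge = local_lambda(control, 0)
flat_background = background_counts([2, 3, 50, 4, 3])
```

Taking the maximum makes the test conservative where the control itself piles up, which addresses the earlier point that local background has peaks too.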

38 The Value of Controls: ChIP vs. Control Reads
Red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation

39 Issue: Fragment Lengths
Puzzle: Fragments from sonication are expected to be between 200 and 500 bp, but the empirically estimated fragment size is ~ 100. Shirley Liu’s explanation: preferential fragmentation near TF.


41 Peak Calling Demo

42 Quantitative Comparison of ChIP Data
