Peak Calling for ChIP-Seq data Larry Meyer UCSC Bioinformatics Dept. BME 230 January 11, 2011.

ChIP-Seq
ChIP == Chromatin Immunoprecipitation
Seq == Sequencing

Histones
Histones are the proteins that make up the nucleosome.
DNA is organized spatially by wrapping around nucleosomes.
Post-translational modifications of the histone proteins affect the state of the bound DNA. Some examples:
  – H3K9ac: acetylation of the 9th residue of histone H3 (a lysine); this mark is associated with open chromatin and transcription initiation
  – H3K36me3: trimethylation of lysine 36 of histone H3; associated with actively transcribed DNA

Transcription Factors
Proteins that regulate the transcription of genes by binding to DNA in the promoters of the regulated genes.
TFs often exhibit DNA sequence specificity; e.g. the TATA-box binding factor usually binds to TATAAA.
Transcription Factor Binding Sites (TFBS): the sites where TFs bind.

High-Throughput Sequencing
Solexa/Illumina – sequencing by synthesis. Read length: 25-50+ bp. 1 Gbp/run
SOLiD – ligation sequencing. Read length: 35+ bp. 1 Gbp/run
Helicos – single molecule (no PCR step)
454 – pyrosequencing; much longer reads, but rarely used in these applications (more expensive; has trouble with homopolymers); BME Prof. Nader Pourmand helped invent this technique

ChIP-Seq Applications
Open Chromatin (DNaseI hypersensitivity and FAIRE)
Methylation (Methyl-Seq)
Histone Modifications
Transcription Factor Binding Sites (TFBS)
RNA-Seq (expression)

ChIP-Seq

Raw Data => Aligned Reads
The sequencer provides reads (aka “tags”).
Approx. 10 million reads per lane/experiment in Illumina 3G sequencer.
These are aligned to the appropriate reference genome (e.g. hg18) using an aligner designed to handle short reads (e.g. MAQ).

Strand Agnostic
The ChIP-Seq assay is strand agnostic (i.e. either strand is equally likely to be sequenced).
Reads start at the 5’ end of fragments.

Extend Aligned Reads
Use the bedExtendRanges utility to extend mapped tags (shown in red) to the average DNA fragment size (which was chosen before sequencing using a gel).
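The extension step can be sketched in R as follows. This is only an illustration of the idea, not the bedExtendRanges implementation: the helper name extend_reads, the data frame columns (chrom, start, end, strand), the BED-style coordinates and the 200 bp fragment size are all assumptions made for the example.

```r
# Hypothetical helper: extend each aligned read to an assumed fragment length.
# Coordinates are assumed to be BED-style (0-based start, exclusive end).
extend_reads <- function(reads, fragment_size) {
  plus <- reads$strand == "+"
  # A + strand read begins at the fragment's 5' end, so it grows to the right;
  # a - strand read grows to the left from its end coordinate.
  new_start <- ifelse(plus, reads$start, reads$end - fragment_size)
  new_end   <- ifelse(plus, reads$start + fragment_size, reads$end)
  # Do not run off the start of the chromosome (clipping at the far end of
  # the chromosome is omitted in this sketch).
  reads$start <- pmax(new_start, 0)
  reads$end   <- new_end
  reads
}

reads <- data.frame(
  chrom  = c("chr1", "chr1"),
  start  = c(1000, 1500),
  end    = c(1036, 1536),
  strand = c("+", "-"),
  stringsAsFactors = FALSE
)
extended <- extend_reads(reads, fragment_size = 200)
print(extended)
```

Because the assay is strand agnostic, the direction of extension depends on the strand each read mapped to.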

Aligned Reads => Read Density
Read density == number of DNA fragments overlapping a given genomic coordinate.
This can be calculated with the bedItemOverlapCount utility.
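A small R sketch of the read density calculation, for intuition only. It is not how bedItemOverlapCount is implemented; the function name read_density, the toy fragments and the BED-style coordinates are assumptions.

```r
# Count how many fragments overlap each base of a small region.
# Fragments are assumed to be BED-style: 0-based start, exclusive end.
read_density <- function(starts, ends, region_start, region_end) {
  n <- region_end - region_start
  # Difference array: +1 where a fragment begins, -1 just past where it ends.
  delta <- numeric(n + 1)
  s <- pmin(pmax(starts, region_start), region_end) - region_start
  e <- pmin(pmax(ends, region_start), region_end) - region_start
  for (i in seq_along(s)) {
    delta[s[i] + 1] <- delta[s[i] + 1] + 1
    delta[e[i] + 1] <- delta[e[i] + 1] - 1
  }
  # The running sum gives the number of fragments covering each base.
  cumsum(delta)[seq_len(n)]
}

depth <- read_density(starts = c(2, 4, 4), ends = c(6, 8, 5),
                      region_start = 0, region_end = 10)
print(depth)
```

The difference-array approach touches each fragment only twice, which matters when there are millions of reads.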

ENCODE Broad Histone tracks: H3K9ac and H3K36me3 in cell line GM12878, in a 50 kb window

Control/Input
Same ChIP-Seq protocol, but without the ChIP enrichment step
Done by all ENCODE (modENCODE?) labs
Most peak callers use the input signal
Some labs also do a naked DNA input (no crosslinking) – I’m not going to cover that here

Why do you need Control/Input?
Need to control for artifacts of the enrichment assay and the sequencing technology.
Input is often higher in the area of interest (e.g. because of open chromatin), which can lead to false positives if the input is only used as a whole-genome average rather than locally.

Control and H3K9ac signal track from Broad for GM12878 cell line Note that (a) the control signal is higher than the enriched signal and (b) we are nowhere near a gene start. I have no idea what is causing this huge peak.

Peak Callers
Convert raw signal into “significant” regions
A ubiquitous task in bioinformatics:
  – CpG Islands
  – High GC Content
  – Genes
  – Copy Number Variation
  – Repeats

Simple Peak Caller
Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing, Robertson et al., Nature Methods, 2007
No input/control runs (because of cost?)
Assumes uniform background (more on this assumption later)

The null model for the background distribution is a Poisson distribution with the λ parameter estimated from the number of reads per base (aka the sequencing depth).
Recall that the Poisson is a discrete distribution with a single parameter λ; here’s the pmf and cdf:
  pmf: P(X = k) = λ^k e^(-λ) / k!
  cdf: P(X <= k) = sum over i = 0..k of λ^i e^(-λ) / i!
We use this distribution to estimate the false discovery rate (FDR) for peaks of a given height.
In Robertson’s data, the average number of reads/base is 0.95, so our estimate is λ = 0.95.
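As a quick check of the Poisson pmf and cdf, the R sketch below writes both out from their definitions and compares them with the built-in dpois and ppois. The helper names pois_pmf and pois_cdf are made up for this example; λ = 0.95 is the value quoted for Robertson's data.

```r
lambda <- 0.95  # reads per base, as estimated from Robertson's data

# Poisson pmf written out from its definition: lambda^k * exp(-lambda) / k!
pois_pmf <- function(k, lambda) lambda^k * exp(-lambda) / factorial(k)
# The cdf is the running sum of the pmf from 0 up to k.
pois_cdf <- function(k, lambda) sum(pois_pmf(0:k, lambda))

k <- 0:5
print(cbind(k, by_hand = pois_pmf(k, lambda), builtin = dpois(k, lambda)))
print(c(by_hand = pois_cdf(3, lambda), builtin = ppois(3, lambda)))
```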

Robertson et al. observed 11,004 regions with a read density of >= 11.
Under the background/null model, we expect the number of such peaks to be: Pr(X >= 11) * size of human genome
In R syntax: (1 - ppois(10, 0.95)) * 3.08e9 = 18.4
So we expect 18.4 bases to have a read density >= 11 by chance under the null model.
FDR(Max Height >= 11) = FP/(FP+TP) = 18.4/11004 ≈ 0.0017
We call peaks at this FDR, i.e. peaks with a read density >= 11.
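The same arithmetic as a runnable R sketch. The numbers come from the slide; the variable names are made up, and lower.tail = FALSE is used here as an alternative to writing 1 - ppois(...).

```r
lambda      <- 0.95     # average reads per base (from the slide)
genome_size <- 3.08e9   # bases in the human genome (from the slide)
observed    <- 11004    # regions with read density >= 11 (from the slide)

# P(X >= 11) is the upper tail beyond 10; lower.tail = FALSE avoids the
# rounding error of computing 1 - ppois(...) for very small tails.
p_tail <- ppois(10, lambda, lower.tail = FALSE)

expected_by_chance <- p_tail * genome_size
fdr <- expected_by_chance / observed

cat(sprintf("expected by chance: %.1f\n", expected_by_chance))
cat(sprintf("FDR: %.4f\n", fdr))
```

Raising or lowering the height cutoff (the 10 in ppois) shows how quickly the expected number of chance peaks changes.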

Modern Peak Callers
Point/punctate peaks (e.g. TF ChIP-Seq)
  – MACS (Shirley Liu @ Harvard)
  – PeakSeq (Joel Rozowsky @ Yale)
  – QuEST (Rick Myers @ HudsonAlpha)
Broad peaks (e.g. Open Chromatin)
  – Unnamed peak caller (Broad Institute)
  – F-Seq (Duke)

MACS
Model-based Analysis of ChIP-Seq (MACS), Zhang et al., Genome Biology, 2008
MACS estimates average peak width by taking advantage of the fact that the sequencer sequences from the 5’ end of the Watson and Crick strands.
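A rough R sketch of the idea, under simplifying assumptions. MACS builds its model from many peaks across the genome; this toy example instead simulates a single binding site and uses the median 5’ position on each strand as a stand-in for the center of that strand's tag cluster. All names and numbers are made up for illustration.

```r
set.seed(1)
site          <- 5000  # simulated binding site position (made up)
fragment_size <- 200   # simulated fragment length (made up)

# Fragments are centered near the site. A + strand read reports the fragment's
# left end; a - strand read reports its right end (both are 5' ends).
centers    <- site + round(rnorm(2000, sd = 30))
on_plus    <- rep(c(TRUE, FALSE), times = 1000)
five_prime <- ifelse(on_plus, centers - fragment_size / 2,
                              centers + fragment_size / 2)

# Estimated fragment length: offset between the two strand-specific clusters.
d_est <- median(five_prime[!on_plus]) - median(five_prime[on_plus])
print(d_est)
```

The + strand tags pile up upstream of the site and the - strand tags downstream, so the offset between the two piles recovers the fragment length.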

MACS (cont.)
The empirically determined d value sets the sliding window size used for peak detection.
Input tags from the neighborhood of a given peak are used to parameterize a Poisson distribution that models the background and determines peak significance:
  λ_local = max(λ_WG, λ_1k, λ_5k, λ_10k)
where λ_WG is the whole-genome rate and λ_1k, λ_5k, λ_10k are the rates in 1 kb, 5 kb and 10 kb windows.

Our Peak Caller
Model the background with a Poisson distribution; use the control data set to estimate λ in 1 kb windows:
  λ_1k = average read density in 1 kb window
  λ_WG = average read density in whole genome
  λ_local = max(λ_WG, λ_1k)
Use a very conservative p-value: 1e-9
Accept peaks with height >= F^-1(1 - p, λ_local), where F^-1 is the inverse Poisson cdf (qpois in R)
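The steps above could be sketched in R roughly as follows. This runs on simulated per-base densities, not real data, and is not the homework solution: the variable names, the two-window toy genome and the planted peak are all assumptions.

```r
# Toy per-base read densities for control and sample, two 1 kb windows.
set.seed(2)
window_size <- 1000
control <- c(rpois(window_size, 2),   # quiet window
             rpois(window_size, 8))   # window with elevated background
sample_density <- c(rpois(window_size, 2), rpois(window_size, 8))
sample_density[1500] <- 40            # plant one tall peak (made up)

p <- 1e-9
lambda_wg <- mean(control)
window_id <- (seq_along(control) - 1) %/% window_size + 1
lambda_1k <- as.numeric(tapply(control, window_id, mean))
lambda_local <- pmax(lambda_wg, lambda_1k)

# Height cutoff per window: the Poisson quantile at 1 - p.
cutoff <- qpois(p, lambda_local, lower.tail = FALSE)

is_peak <- sample_density >= cutoff[window_id]
print(cutoff)
print(which(is_peak))
```

Taking the max with λ_WG keeps the cutoff from becoming too permissive in windows where the control happens to have very few reads.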

Poisson Example λ = 10, with two p-value cutoffs

R code for Poisson example
# Write the plot to a JPEG file (bitmap() needs Ghostscript installed).
bitmap(file = "sample.jpg", type = "jpeg", res = 256)
lambda <- 10
# Heights at which the upper tail drops to each p-value.
liberalCutoff <- qpois(1 - 1e-2, lambda)
conservativeCutoff <- qpois(1 - 1e-9, lambda)
# Step plot of the Poisson pmf for heights 0..40.
plot(0:40, dpois(0:40, lambda), type = 's', xlab = 'heights',
     ylab = "probability", ylim = c(0, 0.15))
leg <- character(2)
leg[1] <- sprintf("liberal cutoff (1e-2) = %d", liberalCutoff)
leg[2] <- sprintf("conservative cutoff (1e-9) = %d", conservativeCutoff)
legend("topleft", leg, text.col = c("green", "red"))
# Mark both cutoffs and the baseline.
abline(v = liberalCutoff, col = 'green')
abline(v = conservativeCutoff, col = 'red')
abline(h = 0)
dev.off()

PeakSeq PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls, Rozowsky et al., Nature Methods, 2009

Average signal height at TSS for all samples (including input)

PeakSeq (cont.)
First pass: look for peaks in the sample (using randomized copies of the sample to calculate significance).
Second pass: the input counts for a given chromosome are normalized to the sample. The input is then used in discrete windows (1 Mb) for a second significance filtering step.

Broad Institute – KISS Peak Calling
Genome-wide maps of chromatin state in pluripotent and lineage-committed cells, Mikkelsen et al., Nature, 2007
Similar to the previously described input-less algorithm, but uses 1 kb windows to calculate the background Poisson distribution
They think using background/input doesn’t help

F-Seq
F-Seq: a feature density estimator for high-throughput sequence tags, Boyle et al., Bioinformatics, 2008
The raw density signal is converted to a processed signal using a kernel density estimate (Parzen window method).
Significance is determined using the standard deviations of these densities.
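A minimal R sketch of the kernel density idea, using base R's density() on simulated tag positions. This is not F-Seq itself: the bandwidth, the grid, and the rule of calling anything more than k = 4 standard deviations above the mean density are assumptions chosen only to make the example concrete.

```r
set.seed(3)
# Toy tag positions: uniform background plus a cluster near 6000 (made up).
tags <- c(runif(300, 0, 10000), rnorm(200, mean = 6000, sd = 50))

# Gaussian kernel density estimate evaluated on a regular grid.
kde <- density(tags, bw = 100, kernel = "gaussian",
               from = 0, to = 10000, n = 1001)

# Assumed rule: keep grid points whose density is more than k standard
# deviations above the mean density (k = 4 is an arbitrary choice here).
k <- 4
threshold <- mean(kde$y) + k * sd(kde$y)
enriched <- kde$x[kde$y > threshold]
print(range(enriched))
```

A wider bandwidth smooths the signal more, which suits broad features such as open chromatin better than punctate TF peaks.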

Homework
At least one partner in each group should already know C/C++
I will have office hours on several Wednesdays before the homework is due; also please feel free to email me questions
Please start early!

Acknowledgements
Jim Kent, Kate Rosenbloom, Tim Dreszer – UCSC Browser Staff
Tarjei Mikkelsen – Broad Institute
