Analyzing ChIP-seq data

Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore

Transcriptional Control (I)

Transcriptional Control (II)

Protein-DNA binding sites
Binding sites usually consist of 5-12 bases (up to 30 bp). The binding-site sequence preference of a protein factor is not exact; it may be represented as a weight matrix.
Figure: a transcription factor and its binding site on the sequence AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA.
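To make the weight-matrix idea concrete, here is a minimal sketch that builds a log-odds matrix from a few aligned sites and scans the sequence shown on the slide. The example sites (an E-box-like CACGTG family), the pseudocount, and the uniform background are assumptions chosen for illustration; they are not taken from the presentation.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return (start, score) pairs."""
    width = len(pwm)
    return [
        (start, sum(pwm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

# Sequence from the slide; the aligned sites below are hypothetical.
sequence = "AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA"
example_sites = ["CACGTG", "CACGTG", "CACGTT", "CATGTG"]

pwm = build_pwm(example_sites)
best_start, best_score = max(scan(sequence, pwm), key=lambda hit: hit[1])
print(best_start, sequence[best_start:best_start + len(pwm)], round(best_score, 2))
```

Because the matrix stores a score for every base at every position, near-matches to the consensus still receive a graded score rather than being rejected outright, which is the point of using a weight matrix instead of an exact string.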

Question
Can we identify where the transcription factors bind on the genome? Can we identify the binding motifs of the transcription factors?

Technology: ChIP experiment
A chromatin immunoprecipitation (ChIP) experiment detects the interaction between a protein (transcription factor) and DNA.

Technology: ChIP-seq
1. Sonication + ChIP
2. ChIP-sequencing + mapping to the reference genome
3. Peak detection (the mapped reads contain both noise and peaks)

ChIP-seq data
Pipeline: tag mapping, then peak calling (CCAT), then motif scanning (CentDist).

CCAT: A peak finding method

ChIP-seq peak finders
ChIP-seq is becoming the mainstream approach for genome-wide studies of protein-DNA interactions, histone modifications and DNA methylation patterns. Many tools have been proposed for ChIP-seq analysis (e.g., PeakFinder, MACS, SISSRs, PeakSeq, CisGenome).

Aim
Contributions of CCAT:
- How to estimate the noise in a ChIP-seq library?
- How to perform a more accurate FDR estimation?
We hope to show that CCAT can identify weak binding sites that cannot be discovered by existing methods.

ChIP-seq model (linear signal-noise model)
Figure: our sample library, with the binding regions marked.

How to identify binding sites with the help of a control library?
Our sample library (N=27); control library (M=14). At this site, the sample library has 3-fold more reads than the control library. Hence, we predict that this is a binding site.

What happens if we cannot correctly estimate the noise?
Our sample library (N=27); control library (M=28). When the control library has almost the same size as the sample library, we fail to identify this binding site.

How to estimate noise? (I)
If we know the list of background regions R, the noise rate α can be estimated from the read counts that fall in R. Our sample library (N=27); control library (M=28). In this example, we estimate α = 7/14 × 28/27.

How to estimate noise? (II)
Given some initial guess of α, we can predict the list of background regions R. Our sample library (N=27); control library (M=28). In this example, if α = 1, the predicted background regions are the regions with #sample_reads < 27/28 × #ctrl_reads.
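The two building blocks above can be sketched in a few lines. The formula used here, α = (sample reads in R / control reads in R) × (M / N), is inferred from the worked value 7/14 × 28/27 on the slide and should be read as an assumption, as should the per-region read counts, which are hypothetical numbers chosen only so that the totals match N=27 and M=28. Exact fractions are used so the comparison against the threshold is not affected by floating-point rounding.

```python
from fractions import Fraction

def estimate_alpha(sample, control, background):
    """Noise rate from background regions (formula inferred from the slide's example)."""
    n_total, m_total = sum(sample), sum(control)
    sample_bg = sum(sample[i] for i in background)
    control_bg = sum(control[i] for i in background)
    return Fraction(sample_bg, control_bg) * Fraction(m_total, n_total)

def predict_background(sample, control, alpha):
    """Regions where #sample_reads < alpha * (N / M) * #ctrl_reads."""
    n_total, m_total = sum(sample), sum(control)
    scale = alpha * Fraction(n_total, m_total)
    return [i for i, (s, c) in enumerate(zip(sample, control)) if s < scale * c]

# Hypothetical per-region counts: three quiet regions and one enriched region.
sample_counts = [2, 2, 3, 20]    # N = 27
control_counts = [4, 4, 6, 14]   # M = 28

background = predict_background(sample_counts, control_counts, alpha=1)
alpha = estimate_alpha(sample_counts, control_counts, background)
print(background, alpha, float(alpha))
```

With these counts the first pass marks the three quiet regions as background and yields α = 7/14 × 28/27, the value quoted on the slide.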

How to estimate noise? (III)
Input: ChIP library and control library.
1. Set α = 1.
2. Iterate until α has stabilized:
   a. Estimate the background regions R.
   b. Predict α from the regions R.

Spike-in Simulation
Two spike-in datasets: Nanog and H3K4me3.
Strategy for generating the Nanog spike-in dataset: determine the spike-in regions from Nanog 1 and retrieve the spike-in reads from Nanog 2; the background noise in the ChIP library comes from control 1, and the noise in the control library comes from control 2.
Spike-in datasets generated from:
library ID   antibody   # of uniquely mapped reads*   reference
control 1    GFP        3.83M                         Chen et al., 2008, Cell
control 2    WCE        6.76M                         Unpublished
Nanog 1      Nanog      6.03M                         Marson et al., 2008, Cell
Nanog 2                 8.42M
H3K4me3 1    H3K4me3    6.94M
H3K4me3 2               8.85M                         Mikkelsen et al., 2007, Nature

Spike-in Simulation
Convergence is fast: the noise rate converges in about 5 iterations.
The noise rate estimation is accurate: the relative error is below 5%.

FDR estimation
Given a list of candidate sites ranked by some scoring function, our aim is to determine the cutoff threshold such that FDR < 0.05. If the threshold is too loose, we get more noise. If the threshold is too stringent, we miss the weak peaks. To identify the weak peaks, we need an accurate FDR estimation.

Methods for estimating FDR
A number of methods exist for determining the cutoff:
- Binomial p-value, e.g., with Benjamini-Hochberg (B-H) correction (Benjamini & Hochberg, 1995; Rozowsky et al., 2009) or Storey's method (Storey, 2002; Nix et al., 2008)
- Empirical p-value, e.g., eFDR (Nix et al., 2008)
- Library swapping, proposed by Zhang et al. (2008)
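As an illustration of the first method in the list, the sketch below computes a one-sided binomial p-value for a single candidate region. The null model is an assumption made for this example: each read in the region is taken to come from the sample library with probability N / (N + M), where N and M are the library sizes. The function name and the read counts are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def binomial_enrichment_pvalue(sample_reads, control_reads, n_sample_total, n_control_total):
    """P(X >= sample_reads) for X ~ Binomial(sample_reads + control_reads, N / (N + M))."""
    n = sample_reads + control_reads
    p = n_sample_total / (n_sample_total + n_control_total)
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(sample_reads, n + 1)
    )

# Hypothetical regions in libraries of size N=27 and M=28.
enriched = binomial_enrichment_pvalue(9, 3, 27, 28)
balanced = binomial_enrichment_pvalue(6, 6, 27, 28)
print(f"enriched region: {enriched:.4f}, balanced region: {balanced:.4f}")
```

The resulting p-values would then be passed to a correction such as B-H before choosing a cutoff.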

Is the binomial p-value good?
The observed background variation differs from what the binomial model predicts. Reason: the wet-lab noise is not uniformly distributed across the genome. The binomial p-value is not good enough.

Library swapping
Figure labels: N reads from ChIP library; N reads from control library; N sample reads and N control reads (for each of the two peak-calling runs); ChIP sites; control sites; determine empirical cutoff.
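A minimal sketch of how an empirical cutoff could be derived from the two site lists follows. It assumes a peak caller has already been run twice, once normally (giving scored ChIP sites) and once with the libraries swapped (giving scored control sites), and that the empirical FDR at a cutoff is the number of control sites at or above it divided by the number of ChIP sites at or above it. The function name and all scores are hypothetical.

```python
def empirical_fdr_cutoff(chip_scores, control_scores, target=0.05):
    """Return the loosest score cutoff whose empirical FDR is below target, or None."""
    best = None
    # Walk candidate cutoffs from the most stringent to the loosest.
    for cutoff in sorted(set(chip_scores), reverse=True):
        n_chip = sum(score >= cutoff for score in chip_scores)
        n_control = sum(score >= cutoff for score in control_scores)
        if n_control / n_chip < target:
            best = cutoff
    return best

# Hypothetical scores from the normal run and from the swapped run.
chip_site_scores = [50, 40, 30, 20, 10, 5]
control_site_scores = [12, 4]

cutoff = empirical_fdr_cutoff(chip_site_scores, control_site_scores)
print(cutoff)
```

Here the cutoff settles at 20: lowering it to 10 would admit one control site among five ChIP sites, an estimated FDR of 0.2.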

More on library swapping
Library swapping works well in most cases. However, as mentioned by Zhang et al., the estimated FDR can be biased in some cases. We found that the bias arises because the method does not take the noise rate into account.

Modified library swapping
Figure labels: N reads from ChIP library; N reads from control library; N sample reads and N control reads (for each of the two peak-calling runs); ChIP sites; control sites; determine empirical cutoff.

Spike-in Simulation
Figures: FDR estimation for Nanog; FDR estimation for H3K4me3.
Library swapping has the best FDR estimation.

Application to mESC H3K4me3 data
ChIP library: Mikkelsen et al., 2007, Nature. Control library: Chen et al., 2008, Cell.
Normalized difference score (Nix et al., 2008, BMC Bioinformatics).
Distinct chromatin features are associated with strong and weak H3K4me3 sites.
FDR: CCAT 0.02; PeakSeq 0.05.
qPCR validation.

Application to mESC H3K36me3 data
Comparison of 8176 novel regions to RefSeq, Ensembl, and MGC gene annotation.

Motif scanning for ChIP-seq data

Advantages of ChIP-seq
ChIP-seq allows us to precisely map genome-wide binding sites for any TF with a validated antibody. It offers two advantages:
- More candidate binding sites (known as peaks)
- Higher resolution (the main motif is usually located within +/- 100bp of the peak)

How to find motifs in ChIP-seq data?
Input: a set of peaks.
1. Select the high-intensity peaks.
2. For every selected peak, extract the DNA sequence in, say, the +/-200bp region around the peak.
3. Perform motif finding on the selected DNA sequences.
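The selection and extraction steps can be sketched as follows. The peak representation (chromosome, summit position, intensity), the in-memory genome dictionary, and the function name are assumptions made for this example; in practice the sequences would usually come from a FASTA file and the output would be handed to a motif finder.

```python
def extract_peak_sequences(peaks, genome, top_n, flank=200):
    """Return the +/- flank sequence around the summits of the top_n strongest peaks."""
    strongest = sorted(peaks, key=lambda peak: peak[2], reverse=True)[:top_n]
    sequences = []
    for chrom, summit, _intensity in strongest:
        chrom_seq = genome[chrom]
        # Clip the window at the chromosome ends.
        start = max(0, summit - flank)
        end = min(len(chrom_seq), summit + flank + 1)
        sequences.append(chrom_seq[start:end])
    return sequences

# Hypothetical toy genome (1000 bp) and peaks: (chromosome, summit, intensity).
toy_genome = {"chr1": "ACGT" * 250}
toy_peaks = [("chr1", 500, 10.0), ("chr1", 50, 99.0), ("chr1", 900, 1.0)]

peak_sequences = extract_peak_sequences(toy_peaks, toy_genome, top_n=2)
print([len(seq) for seq in peak_sequences])
```

The peak near the chromosome start yields a shorter window because the region is clipped rather than padded.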

Apply this approach to the AR dataset in the LNCaP cell line
LNCaP cell line: DHT treated for 2 hr, ChIP-ed with AR antibody.
MACS reports 58788 binding sites.
We use 600 vertebrate PWMs (145 clusters) from TRANSFAC.
We run CEAS and Core-TF on the top 10000 sites.
Window size: 200, 400, 1000.
For Core-TF, we try both a random background and a promoter background.

Motif scanning result (top 20 results)
There are 7 known co-TFs of AR. Out of these 7 known co-factors, 5 are discovered by CoreTF and CEAS.
Figure labels: CoreTF 200bp; CEAS 400bp; GATA, CEBP, NKX, OCT, ETS, AR, FOX, NF1.

Detail of motif scanning result
Methods (columns): CORE_TF prombg 200, CORE_TF prombg 400, CORE_TF prombg 1000, CORE_TF randbg 200, CORE_TF randbg 400, CORE_TF randbg 1000, CEAS 200, CEAS 400, CEAS 1000.
Motif ranks, listed per motif in the order recovered from the slide:
AR: 2, 6, 1
CEBP: 12, 16, 25, 20, 15, 7
ETS: 64, 61, 66, 37, 47
FOX:
GATA: 10, 13, 14
NF1: 40, 60, 70, 21, 31, 3
NKX: 11, 5, 4
OCT: 8, 19, 26
AP4: 65
AUC per method, in column order: 0.91, 0.8917, 0.8742, 0.9358, 0.9375, 0.9208, 0.6854, 0.6875, 0.625

ChIP-seq protocol revisited
From the empirical study of Qi et al. (2006), we know that the length of ChIP fragments follows a gamma distribution.
Figure labels: sonication; immunoprecipitation.

ChIPed motifs show center enrichment around AR peaks
Due to the ChIP-seq protocol, we expect the correct motif to show center enrichment in the frequency graph. We assume that noise such as CG bias is uniformly distributed, so if the motif is not real, its 1st derivative will be near zero. The frequency graph shows that AR has center enrichment, while the velocity graph shows that AR is not noise.
Figures: AR motif distribution around AR peaks; velocity distribution for the AR motif.
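The frequency graph and its first derivative can be illustrated with a small sketch. It assumes that motif hits are given as signed distances (in bp) from the peak center, bins their absolute values, and takes the difference between neighbouring bins as the "velocity". The bin size, maximum distance, and both sets of distances are hypothetical; this is not the scoring function used by CENTDIST.

```python
def frequency_and_velocity(distances, max_dist=500, bin_size=100):
    """Bin |distance to peak center| and return (frequencies, first differences)."""
    n_bins = max_dist // bin_size
    counts = [0] * n_bins
    for distance in distances:
        distance = abs(distance)
        if distance < max_dist:
            counts[distance // bin_size] += 1
    total = sum(counts)
    freq = [count / total for count in counts] if total else [0.0] * n_bins
    velocity = [freq[i + 1] - freq[i] for i in range(n_bins - 1)]
    return freq, velocity

# Hypothetical motif-hit distances from the peak center, in bp.
center_enriched = [0, 10, -20, 30, 50, -60, 120, -150, 250, 420]
uniform_noise = [50, 150, 250, 350, 450]

enriched_freq, enriched_velocity = frequency_and_velocity(center_enriched)
noise_freq, noise_velocity = frequency_and_velocity(uniform_noise)
print(enriched_freq, enriched_velocity)
print(noise_freq, noise_velocity)
```

For the center-enriched hits the frequency drops steeply away from the center, so the first difference is clearly negative near the peak; for the uniformly spread hits every bin has the same frequency and the first difference is zero.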

Co-motifs show center enrichment around peaks
Since co-regulating factors are expected to co-occur in close proximity, we expect co-motifs to show center enrichment around peaks as well. For example, NF1 is a known co-motif of AR, and we observe center enrichment of the NF1 motif around the AR peaks.

Center distribution score
We define a score function based on the frequency graph and the velocity graph. Features:
- We do not require a background model.
- We learn the window size automatically.
- We learn the PWM score cutoff.

Automatically learn the parameters of the frequency graph
Figure: V$AR_02.

Non-co-motifs do not show center enrichment around peaks
The two figures on this slide verify this.

CENTDIST workflow

CENTDIST
Based on the center enrichment of the TF motifs relative to the peaks, we derive a method called CENTDIST. CENTDIST measures the center enrichment with a Z-score and then reports the TFs ranked by that score.

Can CENTDIST find known co-motifs of AR?
All known co-motifs of AR show good center enrichment. Note that although the Oct1 motif does not show good enrichment around the peaks in the frequency graph, it shows good enrichment in the 1st and 2nd order derivatives.

CENTDIST vs CEAS vs CORE_TF
Methods (columns): CENTDIST, CORE_TF prombg 200, CORE_TF prombg 400, CORE_TF prombg 1000, CORE_TF randbg 200, CORE_TF randbg 400, CORE_TF randbg 1000, CEAS 200, CEAS 400, CEAS 1000.
Motif ranks, listed per motif in the order recovered from the slide:
AR: 1, 2, 6
CEBP: 14, 12, 16, 25, 20, 15, 7
ETS: 9, 64, 61, 66, 37, 47
FOX:
GATA: 10, 13
NF1: 11, 40, 60, 70, 21, 31, 3
NKX: 8, 5, 4
OCT: 19, 26
AP4: 65
AUC per method, in column order: 0.9683, 0.91, 0.8917, 0.8742, 0.9358, 0.9375, 0.9208, 0.6854, 0.6875, 0.625

Can CENTDIST identify a novel factor?
AP4 is ranked 21 by CENTDIST. Core-TF and CEAS rank AP4 low, since AP4 is not highly enriched around the peaks.

Validation of AP4

Validation of AP4
To be unbiased, we performed an AP4 ChIP-seq experiment. 38% of the AP4 peaks overlap with AR peaks.
Venn diagram counts: AR only 62768; shared 2296; AP4 only 3786.
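The 38% figure can be checked against the three counts on the slide. The assignment of the numbers to the Venn regions (AR only, shared, AP4 only) follows their order in the transcript and is an assumption; the arithmetic below is consistent with that reading.

```python
# Assumed reading of the Venn counts: AR only, shared, AP4 only.
ar_only, shared, ap4_only = 62768, 2296, 3786

ap4_total = shared + ap4_only
fraction_of_ap4 = shared / ap4_total
print(f"{ap4_total} AP4 peaks, {fraction_of_ap4:.1%} of them overlap AR peaks")
```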

Validation of AP4
We also checked the microarray expression data. The result suggests that AP4 may co-localize with AR to directly up-regulate the transcription of androgen target genes.

Validation using ChIP-seq from ES cells
CENTDIST performs better than CEAS and Core-TF in most cases (it has the highest AUC for 10 of the 12 factors below).
Factor    CENTDIST   CEAS       Core-TF
Nanog     0.9647     0.7346     0.7549
Oct4      0.9133     0.825      0.7508
Sox2      0.9499     0.8765     0.6939
Stat3     0.9309     0.7492     0.7308
Smad1     0.8483     0.8803     0.7048
P300      0.9234     0.8098     0.719
KLF4      0.8432     0.6864     0.8015
ESRRB     0.8622     0.9744     0.9295
Cmyc      0.9776     0.8401     0.9237
Nmyc      0.9334     0.5235     0.9107
ZFX       0.9545     0.5373     0.9221
E2F1      0.9529     0.5349     0.9351
AVG AUC   0.921192   0.747667   0.814733
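The averages in the last row, and the claim that CENTDIST does better in most cases, can be recomputed directly from the AUC values on the slide. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# AUC values from the slide: (CENTDIST, CEAS, Core-TF) per factor.
auc = {
    "Nanog": (0.9647, 0.7346, 0.7549),
    "Oct4": (0.9133, 0.825, 0.7508),
    "Sox2": (0.9499, 0.8765, 0.6939),
    "Stat3": (0.9309, 0.7492, 0.7308),
    "Smad1": (0.8483, 0.8803, 0.7048),
    "P300": (0.9234, 0.8098, 0.719),
    "KLF4": (0.8432, 0.6864, 0.8015),
    "ESRRB": (0.8622, 0.9744, 0.9295),
    "Cmyc": (0.9776, 0.8401, 0.9237),
    "Nmyc": (0.9334, 0.5235, 0.9107),
    "ZFX": (0.9545, 0.5373, 0.9221),
    "E2F1": (0.9529, 0.5349, 0.9351),
}
methods = ("CENTDIST", "CEAS", "Core-TF")

mean_auc = {
    method: sum(row[i] for row in auc.values()) / len(auc)
    for i, method in enumerate(methods)
}
centdist_wins = sum(1 for row in auc.values() if row[0] == max(row))

for method in methods:
    print(f"{method}: {mean_auc[method]:.6f}")
print(f"CENTDIST has the highest AUC for {centdist_wins} of {len(auc)} factors")
```

The two factors where CENTDIST is not the best are Smad1 and ESRRB, where CEAS has the higher AUC.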

p300
CENTDIST has the potential to find enhancer factors using p300 ChIP-seq.
Known cofactors   CENTDIST   CORE_TF   CEAS
Sox               5          3         8
Oct               1          2         6
Nanog             33         49        107
Stat              40         63        137
ERE
CP2               4          91        12
E2F               18         116       7
AVG RANK: 16, 47

Discussion
- CentDist can find motifs which are marginally over-represented.
- CentDist can detect the window size.
- CentDist does not require a background model.

Acknowledgement
Bioinformatics: Guoliang Li Pramila Charlie Lee Han Xu Fabi Kuan Hon Loh Chang Cheng Wei Gao Song Chandana Rikky Zhang Zhi Zhou
Sequencing: Wei Chialin Handoko Lusy Sequencing team
Cancer Biology: Edwin Cheung Pau You Fu