ChIP-seq hands-on Iros Barozzi, Campus IFOM-IEO (Milan) Saverio Minucci, Gioacchino Natoli Labs.

Slides:



Advertisements
Similar presentations
Methods to read out regulatory functions
Advertisements

Reference mapping and variant detection Peter Tsai Bioinformatics Institute, University of Auckland.
ChIP-seq analysis Ecole de bioinformatique AVIESAN – Roscoff, Jan 2013.
High Throughput Sequencing
Introduction to Short Read Sequencing Analysis
Data Analysis for High-Throughput Sequencing
Introduction to computational genomics – hands on course Gene expression (Gasch et al) Unit 1: Mapper Unit 2: Aggregator and peak finder Solexa MNase Reads.
High Throughput Sequencing
Before we start: Align sequence reads to the reference genome
NGS Analysis Using Galaxy
Whole Exome Sequencing for Variant Discovery and Prioritisation
Expression Analysis of RNA-seq Data
Mapping protein-DNA interactions by ChIP-seq Zsolt Szilagyi Institute of Biomedicine.
Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.
Introduction to Short Read Sequencing Analysis
RNAseq analyses -- methods
Massive Parallel Sequencing
High Throughput Sequence (HTS) data analysis 1.Storage and retrieving of HTS data. 2.Representation of HTS data. 3.Visualization of HTS data. 4.Discovering.
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics Lab v1 | Saurabh Sinha1 Powerpoint by Casey Hanson.
NGS data analysis CCM Seminar series Michael Liang:
Next Generation DNA Sequencing
Chromatin Immunoprecipitation DNA Sequencing (ChIP-seq)
EDACC Primary Analysis Pipelines Cristian Coarfa Bioinformatics Research Laboratory Molecular and Human Genetics.
NIH Extracellular RNA Communication Consortium 2 nd Investigators’ Meeting May 19 th, 2014 Sai Lakshmi Subramanian – (Primary
I519 Introduction to Bioinformatics, Fall, 2012
Chip – Seq Peak Calling in Galaxy Lisa Stubbs Chip-Seq Peak Calling in Galaxy | Lisa Stubbs | PowerPoint by Casey Hanson.
BRUDNO LAB: A WHIRLWIND TOUR Marc Fiume Department of Computer Science University of Toronto.
EDACC Quality Characterization for Various Epigenetic Assays
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
Alistair Chalk, Elisabet Andersson Stem Cell Biology and Bioinformatic Tools, DBRM, Karolinska Institutet, September Day 5-2 What bioinformatics.
Starting Monday M Oct 29 –Back to BLAST and Orthology (readings posted) will focus on the BLAST algorithm, different types and applications of BLAST; in.
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics | Saurabh Sinha | PowerPoint by Casey Hanson.
Introduction to RNAseq
Trinity College Dublin, The University of Dublin GE3M25: Data Analysis, Class 4 Karsten Hokamp, PhD Genetics TCD, 07/12/2015
Trinity College Dublin, The University of Dublin Data download: bioinf.gen.tcd.ie/GE3M25/project Get.fastq.gz file associated with your student ID
 CHANGE!! MGL Users Group meetings will now be on the 1 st Monday of each month 3:00-4:00 Room Note the change of time and room.
Analysis of ChIP-Seq Data Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers.
No reference available
Introduction of the ChIP-seq pipeline Shigeki Nakagome November 16 th, 2015 Di Rienzo lab meeting.
Chip – Seq Peak Calling in Galaxy Lisa Stubbs Lisa Stubbs | Chip-Seq Peak Calling in Galaxy1.
Short Read Workshop Day 5: Mapping and Visualization
A brief guide to sequencing Dr Gavin Band Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for Health.
User-friendly Galaxy interface and analysis workflows for deep sequencing data Oskari Timonen and Petri Pölönen.
HOMER – a one stop shop for ChIP-Seq analysis
High Throughput Sequence (HTS) data analysis 1.Storage and retrieving of HTS data. 2.Representation of HTS data. 3.Visualization of HTS data. 4.Discovering.
Canadian Bioinformatics Workshops
Practice:submit the ChIP_Streamline.pbs 1.Replace with your 2.Make sure the.fastq files are in your GMS6014 directory.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Short Read Workshop Day 1 - Experimental Design Example 1: How to log in to vieques.
From Reads to Results Exome-seq analysis at CCBR
Using command line tools to process sequencing data
Day 5 Mapping and Visualization
Short Read Sequencing Analysis Workshop
Short Read Sequencing Analysis Workshop
Chip – Seq Peak Calling in Galaxy
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
GE3M25: Data Analysis, Class 4
Day 5 Session 29: Questions and follow-up…. James C. Fleet, PhD
2nd (Next) Generation Sequencing
ChIP-Seq Data Processing and QC
Exploring and Understanding ChIP-Seq data
Epigenetics System Biology Workshop: Introduction
Maximize read usage through mapping strategies
ChIP-seq Robert J. Trumbly
BF528 - Sequence Analysis Fundamentals
Chip – Seq Peak Calling in Galaxy
Chromatin basics & ChIP-seq analysis
The Variant Call Format
Presentation transcript:

ChIP-seq hands-on Iros Barozzi, Campus IFOM-IEO (Milan) Saverio Minucci, Gioacchino Natoli Labs

Main goals  Becoming familiar with essential tools and formats  Visualizing and contextualizing raw data  Understand biases at each step of the analysis  If something went wrong, identify which experimental step could have risen the issue  FAQs

Overview  Quality control  Alignment  Raw data visualization  Peak calling  Experimental validation  At each step: - critical evaluation - understanding possible issues Illumina website

Overview Alignment Raw data visualization Peak calling

FASTQ Read/Tag +description Qscores

Quality control   critical evaluation of good/bad fastqc output  what to really expect from a HiSeq lane: - trimming - contaminants evaluation  Before we start: don’t get scared, some biases can be intrinsic to the regions of the DNA you are IPing and not a technical problem

Quality control

Quality control: HiSeq

TTAATTGATTTTATGGATAAAAGTATTGTTTAGATTTGTTTTTTTAATATTGGATTTATATAAAAAAAGATGTTTTTTTTTGTATTTTTTTTGTAGTTTAA TGGAGGAATGAGAGGAAAGGTGGAAGTAATTTTGTGTTAATTAAAGGTATTTTTAAAAATTTAATAATATGTATTGAGAAGGAGGTAATTTTTGTTTAATT TTTATTTTTAATTTTTTTTTGTTTTTTGGTGTTTTGTTTTTTTTTGGGGTATTTGAGTTTAAATTTTTTAGAATAATTGTAACTGTGAGAAAATTGGGATT GTTTTGGGTTTTTAAGGATTTTTTTTTATTTATGTTGTAATTTTTTTAAATGTTAGATTATTTTAGTGATGAGTTTAGGATATATGTGTATTAAGGTGGAT GAAAGAAAAAAGAAAGAAAGAAATAAAGAAAGAAAGAAGAAAGGAAGGAAGGTTAAGGGAGTGGAGTAGGTAGAGTGAGAAATGTAGAGATCGGAATAGTT Align only these substrings

Quality control

JunB, TF, 12M reads more than enough!

Quality control

NTGTTGATGACATGTATCCGGATGTTTGGAGTGGGACATGTATCCGGATGTGTTTTGTTGACATGTATCCGGATGTT NTTATGGTGGACATGTATCCGGATGTTGTGAATTGGACATGTATCCGGATGTCCTATAGTGGACATTATCCGGATGT NAGAGAATGTGGACATGTATCCGGATGTGTTGGCTTGACACATATCCGGATGTTCCACACAGGACAAGATCGGAAGG NTTATGCAGAAACGGACATGTATCCGGATGTGTCTTGTGTGACATGTATCCGGATGTCTGCAACGTCCGCTGAGCCA NTTCATAGGTGCGGACATGTATCCGGATGTGTCGCAATGTGATATGATCCGGATGTCGAGGCTCTACATCCGGATAC NTGATGTGGGGTGACATGTATCCGGATGGATTTGGGGACATGTATCCGGATGCATCCGGATACATGTCATGTCACAG NTAATGCTGGACATGTATCCGGATGTCTTTCTGTGGGGCAGCCACTGGCTCATCGAAAGATCGGAAGAGCTCGTATG NATGGTACGGACATGTATCCGGATGTCATCCGGATACATGTCCAGGACGACAGATCGGAAGAGCTCGTATGCCGTCT NTGGTGCTGGACATGTATCCGGATGTATGTTCTCGATAAAACATCCGGATACATGACAACACCAGAGATCGGAAGAG NTGGTGGTTGGGACATGTATCCGGATGTTTACAGTATCTGCAACGTCCGCTGAGCCAGTGGCTGCCCCACCAGTGTT NACATCATGGACATGTATCCGGATGAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAATA Some primer contamination!

Alignment GAATGAGAGGAAAGGTGGAAGTAATTTTGTGTTAATTAAAGGTATTTTTAAAAATTTAATAATATGTATTGAGGAAAGGTTAATT I am a reference genome! AAAGGGAT GGAAAGGT unique mapping multi mapping *

Alignment  uniqueness depends on which regions are enriched for the protein you are immuno-precipitating H3K27ac: # reads processed: 39’897’199 # reads with at least one reported alignment: 33’984’279 (85.18%) # reads that failed to align: 2’568’674 (6.44%) # reads with alignments suppressed due to -m: 3’344’246 (8.38%) H3K9me3 # reads processed: 82’402’674 # reads with at least one reported alignment: 29’278’881 (35.53%) # reads that failed to align: 11’211’423 (13.61%) # reads with alignments suppressed due to -m: 41’912’370 (50.86%)

Alignment  I use Bowtie  no gaps  no clipping  ~35-40’ for 25 millions reads using 4 cores REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGA READ: AAAGTGTCGCC-GACTAGCTCC

Alignment GapsClipping Bowtieno Bowtie2yes Bwayes Novoalignyes

Alignment Hard trimming impossible BowtieNovoalign 673,1261,129,146 1,076,9662,167,620

Alignment  bowtie -t -v 2 -m 1 -S -p 8 --phred33-quals /db/bowtie/mm9/mm9 sample.fastq sample.SAM  -m 1 uniquely alignable reads only  -v 2 up to two mismatches  -S SAM output  -p 8 uses 8 CPUs  --phred33-quals quality scores  /db/bowtie/mm9/mm9 the genomic index

SAM/BAM formats

HWI-ST880:111:D1101ACXX:5:1101:1145:2177_1:N:0:ATCACG 16 chr M * 0 ACTTCACTCATGAAGAATGGGCTTTGCTGGATTCTTCCCAGAAGAGTNTCT XA:i:1 MD:Z:47C3 NM:i:1

SAM/BAM formats HWI-ST880:111:D1101ACXX:5:1101:1145:2177_1:N:0:ATCACG 16 chr M * 0 ACTTCACTCATGAAGAATGGGCTTTGCTGGATTCTTCCCAGAAGAGTNTCT XA:i:1 MD:Z:47C3 NM:i:1

SAM/BAM formats HWI-ST880:111:D1101ACXX:5:1101:1145:2177_1:N:0:ATCACG 16 chr M * 0 ACTTCACTCATGAAGAATGGGCTTTGCTGGATTCTTCCCAGAAGAGTNTCT XA:i:1 MD:Z:47C3 NM:i:1 Example: 3S8M1D6M4S 3 soft, 8 match, 1 deletion, 6 match and 4 soft

PCR duplicates GAATGAGAGGAAAGGTGGAAGTAATTTTGTGTTAATTAAAGGTATTTTTAAAAATTTAATAATATGTATTGAGGAAAGGTTAATT These reads are likely to have been generated by a non-random amplification process (PCR) rather than random fragmentation (unless you have a very low genomic coverage or a very high sequencing depth) Consider just one (or a number estimated using a statistics)

Data visualization  UCSC genome browser:  Very important to get acquainted with your data!  bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Iros barozzi&hgS_otherUserSessionName=Epigen_ bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Iros barozzi&hgS_otherUserSessionName=Epigen_

Peak calling  I use MACS and RSEG Adapted from PMID:

Peak calling  MACS is very good in: - finding SHARP signals in IP versus a control (e.g. TFs) - finding small but reproducible differences among IPs (e.g. changes in chromatin marks after challenging the cells with toxin or drugs that span down to less than 1 kbp, Nos2 example: chr11:78,683, ,694,401)  RSEG is very good at: - finding BROAD signals in IP versus a control (e.g. H3K9me3) - finding domain-level chromatin changes between different IPs

Peak calling: MACS  macs14 –t IP.SAM –c input.SAM --name=IP_vs_input --format=SAM --tsize=51 -- gsize=2.72e9 --wig --single-wig --pvalue=1e-5 --format=SAM --space=10  P-value threshold: --pvalue=1e-5  --wig --single-wig  --format=SAM  If you do not have any input/IgG: --nolambda

Peak calling: MACS

Peak calling: RSEG  IP versus random expectation (no control): rseg -c mouse-mm9-size.bed -o $PWD -i 20 -v -d deadzones-k36-mm9.bed H2AK5ac_UT.tags.bed  IP versus control: rseg-diff -c mouse-mm9-size.bed -o $PWD -i 20 -v -mode 2 -d deadzones-k36- mm9.bed H2AK5ac_UT.tags.bed input.tags.bed  IP versus IP: rseg-diff -c mouse-mm9-size.bed -o $PWD -i 20 -v –mode 3 -d deadzones-k36- mm9.bed H2AK5ac_LPS_2h.tags.bed H2AK5ac_UT.tags.bed

Peak calling: RSEG

Peak calling PMID:

Peak calling PMID:

Normalization  Can we compare datasets with a different sequencing depth?  How do we normalize on sequencing depth?

Assumption: linearity

Summary  Quality control  Alignment to the reference genome  Dealing with PCR duplicates  Data visualization  Peak calling

Summary H3K27ac: # reads after quality filtering: 39’897’199 # reads with a unique alignment on the genome: 33’984’279 (85.18%) # reads after PCR duplicates: 31’069’231 (77.87%) H3K9me3 # reads after quality filtering: 82’402’674 # reads with a unique alignment on the genome: 29’278’881 (35.53%) # reads after PCR duplicates: 26’273’699 (31.88%) H3K4me3 (with strong PCR bias) # reads after quality filtering: 28’681’583 # reads with a unique alignment on the genome: 16’928’963 (59.02%) # reads after PCR duplicates: 2’715’994 (9.47%)

Experimental (qPCR) validation Most significant Least significant Negative controls

FAQs: how deep?  What is the efficiency of your antibody (SNR)?  What is the fraction of the genome potentially covered by the protein of interest?  Saturation plots

FAQs: how deep?

PU.1 (TF)H3K4me1H3K9me3 ~50 Mbp~140 Mbp~150 Mbp (>250 Mbp)

FAQs: how deep?  You always have the opportunity to sequence your library again and increase the depth! PU.1 (TF)H3K4me1H3K9me3 100% ~ 19M (PCR bias < 15%) (30M raw reads) 100% ~ 20M (PCR bias < 15%) (30M raw reads) 100% ~ 30M (PCR bias < 15%) (60M raw reads)

FAQs: what about the control?  Which is the best control? Ideally the best control is the IP performed in the same cells in which the protein is not expressed. This is rarely feasible. Input of the IP and IgG are equally good control.  What if no experimental control is available? Don’t worry you can run your analysis without estimating local biases. In most cases artifacts are a small fraction on the total number of enriched regions and won’t dramatically affect the results.

Galaxy 

Fish the ChIPs  A Mac GUI for ChIP-seq analysis 

UCSC session  UCSC session at: bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Iros barozzi&hgS_otherUserSessionName=Epigen_ bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Iros barozzi&hgS_otherUserSessionName=Epigen_  Murine macrophages  Untreated and LPS 4h  Pu.1, H3K4me1, H3K4me3  Peaks coordinates in BED for Pu.1 UT can be downloaded from:

Supplementary: manipulating files  Samtools  Bedtools  Picard

Supplementary: useful literature  Nature Methods 6, S22 - S32 (2009) Computation for ChIP-seq and RNA-seq studies Shirley Pepke, Barbara Wold & Ali Mortazavi  Nature Reviews Genetics 10, (October 2009) ChIP–seq: advantages and challenges of a maturing technology Park PJ  Nat Immunol Sep 20;12(10): doi: /ni ChIP-Seq: technical considerations for obtaining high-quality data. Kidder BL, Hu G, Zhao K.  PLoS One Jul 8;5(7):e Evaluation of algorithm performance in ChIP-seq peak detection. Wilbanks EG, Facciotti MT.

Supplementary: FASTQ format - replace the Read/Tag +description Qscores