ChIP-seq Methods & Analysis

ChIP-seq Methods & Analysis
Gavin Schnitzler Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC

ChIP-seq COURSE OUTLINE
Day 1: ChIP techniques, library production, USCS browser tracks Day 2: QC on reads, Mapping binding site peaks, examining read density maps. Day 3: Analyzing peaks in relation to genomic feature, etc. Day 4: Analyzing peaks for transcription factor binding site consensus sequences. Day 5: Variants & advanced approaches.

DAY 3 LECTURE OUTLINE (finishing up peak mapping, from Tuesday).
Exploring your peak data with Galaxy & Cistrome Analyzing overlaps between peak sets, with galaxy and in UNIX

DAY 2 LECTURE OUTLINE FASTQC (quality control on reads)
Getting your raw data -Exercise: Getting around UNIX, downloading & unpacking Mapping reads to the genome & identifying binding site peaks -Exercise: Running Bowtie & MACs Visualizing your results -Exercise: Custom UCSC browser tracks

Let’s try that again (w/ a streamlined proven command set)
Open putty & login to cluster.uit.tufts.edu “mkdir chip” “cd chip” “cp /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/*.gz .“ [make sure to add the final space & period, this tells UNIX to keep the same filename & put it in the current directory] Now repeat this for …/*.txt Do “ls” to see what you got … ip19.fastq --- is the raw data for ERalpha ChIP seq from mouse liver on chrom 19 input19.fastq --- is the raw data for the corresponding input sample workflow1.txt --- This file lists all of the commands you will use to process your raw sequence data, map reads to the genome & map peaks. Do “cat *.txt” Now you have all the commands you will use & can copy & paste (by selecting & then right clicking in Putty).

All the commands needed to go from sequence to peaks
gunzip *.fastq.gz bsub -oo ip19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m best --strata mm9 ip19.fastq ip19.map bsub -oo input19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m best --strata mm9 input19.fastq input19.map awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' input19.map > input19.bed awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' ip19.map > ip19.bed module add python/2.6.5 export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH export PATH=/cluster/shared/gschni01/bin:$PATH bsub -oo ipvinput19.macsinfo macs14 --format=BED --bw=210 --keep-dup=1 -B -S -c input19.bed -t ip19.bed --name ipvinput19 cd *aph/treat gunzip *.gz bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl ip19_trim_norm.bdg all ipvinput19_treat_afterfiting_all.bdg cd ../control bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl input19_trim_norm.bdg all ipvinput19_control_afterfiting_all.bdg

Mapping reads to a genome
Understanding the bowtie command (which you’ll have cut & pasted from your screen): bsub -oo ip19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m best --strata mm9 ip19.fastq ip19.map bsub –oo ip19.bowtieinfo, submits the batch process and names an output/error file. /cluster/shared/gschni01/bowtie*/bowtie gives the path to the bowtie program & tells unix to run it -n 1 tells Bowtie to accept no more than 1 mismatch between a the first 25 bp of a sequence read & its best homologue in the genome -m 1 tells Bowtie to reject any reads that are identical to more than 1 sequence in the genome (since we wouldn’t know which locus our read really came from) -5 8 tells Bowtie to trim the first 8 (lower quality) bases from the read before mapping --best & --strata tell bowtie to try hard to find the best match [name].fastq is your input file & [name].map specifies the name of the output file.

How did Bowtie do? Check your .bowtie info bsub output files:
“head *.bowtieinfo“ … The lines you’re interested in are the ones before the line (after which info of the bsub run itself is given) ==> LiE_ERaIP_chr19.bowtieinfo <== # reads processed: # reads with at least one reported alignment: (99.48%) # reads that failed to align: 554 (0.15%) # reads with alignments suppressed due to -m: 1368 (0.37%) Note that most of the reads aligned to some other sequence in the genome, very few failed to & map also very few had matched more than 1 genomic sequence (-m 1). This is great - but atypical - it only looks this good because I filtered the .fastq files for things that mapped to chr19… The actual data for all chromosomes looks like: # reads processed: # reads with at least one reported alignment: (70.49%) # reads that failed to align: (6.14%) # reads with alignments suppressed due to -m: (23.37%) Reported alignments to 1 output stream(s) Should be very low, unless you have contamination of non-mouse sequence. Typical level due to repeat sequences in mammalian genome

cp /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/*.fastq.gz .

Using awk change from .map to .bed format
Understanding your awk command: awk 'OFS='\t' {print $4, $5, $5+length($6),$1,".",$3}' ip19.map > ip19.bed OFS=‘\t’ tells awk to output tab delimited data The print command says: print these data columns in order: #4:chromosome, #5:start_bp, #5:start_bp+length(#6:sequence)=end_bp, #1:identifier, “.” as a placeholder & #3:strand awk would normally print to the screen, but here we redirect the output to create a new .bed file (> can be used for any other UNIX command too!).

How do peak-finders map binding sites?
Fragments are of a range of sizes & contain the TF binding site at a (mostly) random position within them. Reads are read (randomly) from left or right edges (sense or antisense) of fragments. Thus peak for sense tags will be 1/2 the fragment length upstream… Binding site position = mid-way between sense tag peak & antisense tag peak. To get binding site peak, shift sense downstream by ½ fragsize & antisense upstream by ½ fragsize. Adapted from slide set by: Stuart M. Brown, Ph.D., Center for Health Informatics & Bioinformatics, NYU School of Medicine & from Jothi, et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008), 36:

cp /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/*.fastq.gz .

Mapping binding peaks w/ MACs
Understanding the commands used for MACS module load python/2.6.5 … This tells the cluster to use an optional version of python. export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH export PATH=/cluster/shared/gschni01/bin:$PATH These tell UNIX where to find the necessary libraries to run MACS: Using MACS to identify peaks from ChIP-Seq data. Feng J, Liu T, Zhang Y. Curr Protoc Bioinformatics Jun;Chapter 2:Unit doi: / bi0214s34.

MACs parameters Now, let’s run MACs using our input file as control (after –c) and our ip file as the ‘treatment’ or experimental file (after –t). bsub -oo ipvinput19.macsinfo macs14 --format=BED --bw=210 --keep-dup=1 -B -S -c input19.bed -t ip19.bed --name ipvinput19 --format=BED tells MACs that the input file is in .bed format --bw=210 tells MACs the expected size of sequenced fragments (before addition of linkers, which add an additional ~90 bp) from which value it attempts to build a model from sense and antisense sequence reads --keep-dup=1 instructs MACS to consider only the first instance of a read starting at any given genomic base pair coordinate & pointing in the same direction – assuming that additional reads starting at the same base pair are due to amplified copies of the same ChIP fragment in the library (by default MACS estimates the number of duplicates that are likely to arise by linear amplification of all fragments from a limited starting sample, and sets the threshold to cut out replicate reads with a much higher number – likely artifacts, but keep-dup=1 is even cleaner) -B tells MACS to make a bedgraph file of read density at each base pair (which can be used to visualize the results on the UCSC browser) & -S tells MACS to make a single .bedgraph file instead of one for each chromosome --name gives the prefix name for all output files.

Examine your MACS output
Start with your .macsinfo bsub -oo file. vi LiE_ERaIPvINPUT_chr19.macsinfo Use the arrow keys to go to the top, where you’ll see all of the parameters you put in to run MACs. After some runtime info (including possible warnings, that you can ignore if there are not millions of them), you’ll see: INFO @ Sun, 10 Feb :27:51: #1 total tags in treatment: INFO @ Sun, 10 Feb :27:51: #1 user defined the maximum tags... INFO @ Sun, 10 Feb :27:51: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) INFO @ Sun, 10 Feb :27:51: #1 tags after filtering in treatment: INFO @ Sun, 10 Feb :27:51: #1 Redundant rate of treatment: 0.26 This is useful information. It tells you how many different reads you had (out of all of the reads which mapped to only one place in the mouse genome- from Bowtie). You want this number to be high and the “redundant rate” to be low. (You’ll need the tags after filtering numbers later, so jot them down somewhere)

Using duplication levels to estimate your library size
Assuming you have 100 initial fragments in your library (before amplification) & which fragment gets read is random: #seqs read: # diff reads: % duplicated: 9% 27% 33% 43% 55% 69% x-more left in lib: x-more than prev: Thus, if you have low % duplicates (e.g. 9%) in one lane, adding an additional run of the same number of reads will give you 1.6x more, or 2 additional runs will give you 2.2x more (1.6*1.4). …but if you have a high % duplicates (e.g. 43%) adding one more lane will only give you 1.37x more unique reads than you had initially. This indicates that your library has low complexity – probably because too few fragments from your ChIP survived to the library amplification step.

MACs ‘shiftsize’ model
Keep scrolling down your .macsinfo file… INFO @ Sun, 10 Feb :27:51: #2 Build Peak Model... INFO @ Sun, 10 Feb :27:51: #2 number of paired peaks: 0 Sun, 10 Feb :27:51: Too few paired peaks (0) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, please use --nomodel and --shiftsize 100 instead. Sun, 10 Feb :27:51: Process for pairing-model is terminated! Sun, 10 Feb :27:51: #2 Skipped... Sun, 10 Feb :27:51: #2 Use 100 as shiftsize, 200 as fragment length Here MACs tried to estimate the “shift size” for moving sense & antisense reads to get a final peak position, by identifying sets of strong + & - strand peaks at a certain distance from each other. There wasn’t enough info on chromosome 9 to do this, so it made a guess that the fragment size was 200 & shiftsize was is close enough to the actual fragment size of ~150 bp that we can go with this.

MACs model file This is the result I got when I ran MACs with all chromosomes #2 Build Peak Model... #2 number of paired peaks: 683 Fewer paired peaks (683) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 683 pairs to build model! finished! predicted fragment length is 125 bps Generate R script for model : LiE_IP_v_INPUT_11_2012_dup1_model.r Call peaks... use control data to filter peak candidates... Finally, 9504 peaks are called! find negative peaks by swapping treat and control Finally, 337 peaks are called! d = estimated fragment size. Actual size ~150 bp, so this is not perfect, suggesting a bit more tweaking could b useful. To generate this file you will need to go into R, and enter: Source(“MACS_output_file.r”), which will generate a .pdf

Peaks & negative peaks Keep scrolling down your .macsinfo file until you find… … INFO @ Sun, 10 Feb :36:47: #3 Finally, 364 peaks are called! INFO @ Sun, 10 Feb :36:47: #3 find negative peaks by swapping treat and control INFO @ Sun, 10 Feb :36:52: #3 Finally, 36 peaks are called! INFO @ Sun, 10 Feb :36:52: #4 Write output… This is the pay-off, where MACS identifies your ER alpha peak locations! 364 peaks on chromosome 19 (which is ~1/50th of the genome) suggests ~20,000 peaks for the whole genome, which is not bad! Equally critical, MACS now swaps treat & control (pretending your INPUT data is your IP & your ChIP data is your input) and looks again for peaks. The number of “negative” peaks found in this way should be far less than the positive peaks, and the 10:1 ratio here is fine.

WinSCP (SFTP/FTP software for Windows): http://winscp. net/eng/index

Looking at MACS data in Excel
Using WinSCP move the _peaks.xls file to the PC & open it.

Toubleshooting MACs For details on how to troubleshoot many problems in MACs, see the file ChIPseq_analysis_methods_2013_02_11.doc on the cbi website. Briefly… MACs can’t build a model: - Adjust the mfold values (the fold over background ranges MACs considers for paired peaks) - Tell MACs to not build a model, but instead use the shiftsize you specify. Peaks/Negative Peaks ratio is poor or too few peaks are detected: - Adjust model settings to see if you can improve both. Otherwise, you may have to conclude that 1) your library was no good or 2) the factor just doesn’t bind to many places in the genome.

Trimming .bdg files With the –B & -S commands, MACS generated a bedGraph file that can be used to visualize your combined read density information (with + & - reads shifted by shiftsize) in the UCSC browser MACS gets too enthusiastic, however, and occasionally places the end of a read past the what the UCSC browser thinks is the end of a chromosome (causing the UCSC browser to reject the whole file). To avoid this, you need to trim your .bdg files to remove anything past chromosome ends.

Normalizing .bdg files If you sequenced 100 M reads (A) you may have a peak that is 200 reads at its apex. But if you only took a subsample 10 M reads (B), that peak would be only ~20 reads at its apex. To compare (A) & (B), just divide by the # of million mapped reads… now both peaks have a max of 2. The same is true when comparing across samples: normalizing to “reads per million mapped reads” (RPMR) lets you directly compare peak intensity across samples & conditions.

Millions of non-duplicated mapped reads reported in .macsinfo
cp /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/*.fastq.gz . gunzip *.fastq.gz bsub -oo ip19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m best --strata mm9 ip19.fastq ip19.map bsub -oo input19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m best --strata mm9 input19.fastq input19.map awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' input19.map > input19.bed awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' ip19.map > ip19.bed module add python/2.6.5 export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH export PATH=/cluster/shared/gschni01/bin:$PATH bsub -oo ipvinput19.macsinfo macs14 --format=BED --bw=210 --keep-dup=1 -B -S -c input19.bed -t ip19.bed --name ipvinput19 cd *aph/treat gunzip *.gz bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl ip19_trim_norm.bdg all ipvinput19_treat_afterfiting_all.bdg gzip *.bdg cd ../control bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl input19_trim_norm.bdg all ipvinput19_control_afterfiting_all.bdg Millions of non-duplicated mapped reads reported in .macsinfo

Uploading to UCSC browser
Use WinSCP to move your .gz compacted .bdg files & the …peaks.bed file the MACs generated to your PC. Go to Select mouse mm9 genome & hit enter Click on add custom tracks Select each of these files & upload them Explore! Get a sense of what the data looks like. Important sanity check: called peaks should be clearly evident in the .bdg data.

Galaxy & Cistrome MAIN GALAXY SITE: https://main.g2.bx.psu.edu/
GALAXY/CISTROME (specialized for ChIP-seq data):

Useful Galaxy Tools https://main.g2.bx.psu.edu/
Get data-> upload file: to get your data into Galaxy. For .fastq data files it’s best to give ftp server URL (from right click or control click (for mac) on link provided by your core, will need to sign up for a free account for .ftp file transfers. Liftover: To convert coordinates between genomes or different builds in the same genome (can also do in the UCSC browser) Text manipulation: add lines, rearrange columns, etc. – functional but limited and very unwieldy. Convert formats: Useful, but doesn’t cover everything. Fetch sequences: Get DNA sequence just like USCC table browser Operate on genomic Intervals-> determine intersections between sets of regions, etc.

Useful Galaxy Tools https://main.g2.bx.psu.edu/ NGS Tools:
QC and manipulation -> run FASTQC Mapping -> map reads to genome with Bowtie or BWA SAM & BAM: Convert between & manipulate SAM and BAM format files often required for certain programs. Peak Calling: MACS & a few others BED Tools: Convert between BAM & BED and manipulate .bed files. On the face of it, this looks powerful, but it is VERY slow. My quick benchmark, downloading a 1.5 Gb .fastq.gz raw data file that took 13 secs to download to the cluster took >30 minutes to upload to Galaxy.

GALAXY/CISTROME http://cistrome.dfci.harvard.edu/ap/root
Galaxy tools specially designed for ChIP-seq analysis. Most things you can find elsewhere, but Cistrome allows easy access to many analyses that give you some quick insights into your data. Sign up for the free account, so we can explore what Cistrome. …“cistrome” is a term coined by Myles Brown’s lab at DFCI, which the genomewide distribution of a transcription factor on chromatin.

Cistrome-specific tools
CISTROME TOOLBOX Data Preprocessing: Run MACS & variants, as well as some designed for ChIP-chip. Integrative analysis: CORRELATION-> Venn Diagram (overlaps of 2 to 3 peak coordinate datasets) Use WinSCP to move the 3 .bed files from: /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/ERa_cistrome_beds …to your PC & then upload them to Cistrome. Select Venn Diagram & select these 3 files in the drop-down menus.

CISTROME TOOLBOX: ASSOCIATION STUDY CEAS (cis element annotation system) Provides quick info on distribution of your TF peaks relative to chromsomes, gene start & end sites, exons, etc. Select CEAS & then select one of your uploaded files (by number) in the dropdown menu.

CISTROME TOOLBOX: ASSOCIATION STUDY GCA: Gene centered annotation Find the nearest interval in the given intervals set for every annotated coding gene (e.g. where’s the nearest ER binding site for each gene in the genome) peak2gene: Peak Center Annotation Input a peak file, and It will search each peak on UCSC GeneTable to get the refGenes near the peak center (e.g. where’s the nearest gene to each ERa binding site in the genome).

CISTROME TOOLBOX: ASSOCIATION STUDY Conservation Plot Calculates the PhastCons scores in several intervals sets Functional transcription factor binding sites are expected to have consensus elements for that factor (and/or partnering factors that help recruit it or stabilize it on chromatin). This is less true, of course, for histone modifications, which may spread for some distance from initial recruiting factors. Choose conservation plot & feed one or more of your bed files to it. One way to tell if you have improved your peak calling (e.g. by tweaking MACS parameters) is if the conservation at the center of your peaks goes up).

Overlaps in Cistrome or Galaxy
The Venn Diagram function gave you some indication of the degree of overlap between your .bed file datasets – but this is only a top level analysis. Operate On Genomic Intervals-> Intersect This lets you create a new .bed file which has only the regions that intersect between two datasets. Overlapping Pieces of Intervals: (saves only the regions shared between 1 & 2) Overlapping Intervals: (saves complete intervals from file 1 that overlap anything in file 2)

How can we tell whether overlaps are significantly greater than chance?
Go to the cluster & move those same 3 files into your chip folder in your /cluster/shared/userID/chip folder: cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.bed . Now, let’s assess whether the overlap between ERalpha binding sites between liver and aorta is greater than expected by chance using: bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl AoE_all.bed LiE_all.bed –outfile AoE_v_LiE.overlap This program identifies all times when .bed regions in file1 overlap bed regions in file2 & estimates the frequency expected by chance.

Assigning a p value p.=.74 p.=2e-10
We have 3 bits of count frequency information: The number of overlaps, the number of regions compared, and the expected background frequency: This type of data is like coin tosses & is ideally suited for a binomial test, which uses “number of matches”, “number of tests” and “expected background frequency” to calculate p. values. If you flip a coin, say 10 times and it comes up heads 6 out of 10 (frequency 0.6 vs. expected 0.5), that would not seem unlikely – and a binomial test would tell you this. However if you flip a coin 1000 times & get heads 600 out of 1000, that would seem a bit odd, and the binomial test would indicate this by saying that the probability of the null hypothesis (that the frequency of heads is 0.5) is low. p.=.74 p.=2e-10

A brief forray into R Looking at the overlap program results, we know that there were 1653 overlaps between Aorta & Liver ER sites, out of 8260 Aorta regions tested (our number of tests), and the background frequency was 95.11/8260. To run our binomial test we’ll want to start up the R statistical programming language, by typing: module load R If you just type R now, you get this message: To run R please invoke the following command to run it via LSF's interactive queue: bsub -Ip -q int_public6 R Do what it suggests & you’ll get welcome information in R.

Binomial tests for overlaps
Now ask R to run your binomial test by typing: binom.test(1653, 8260, 95.11/8260) The p.value is <2e-16. Very low. So, yes, ER binding sites in liver and aorta overlap more than expected by chance… but ERa is still binding to ~80% different places between these two tissues. Now exit R by typing “q()” & saying “n” to the question about saving. Binomial tests are useful for many different types of count data & they will also give you probabilities for ANTI-enrichment as well as enrichment.

Getting R (for your PC) R: http://cran.r-project.org/ RStudio:
Install RStudio after you have installed R. For more info on using R & Unix see: UNIX resources & R resources

Overlaps between peaks & genes
In that same file are also .txt files listing the transcription start sites (TSSes) of genes that were up- or down-regulated by estrogen in aorta or liver. Get them by typing: cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.txt . Take a look at one of them using head [name].txt chr Dnahc6 chr C8g chr Tnip3 … The file format is (tab-delimited) chromosome, TSS, transcription direction (+=sense) & geneID. You can get all this info easily from the UCSC browser, for individual genes (by hand)… … or you can get this information for all genes & extract what you want for your gene set of interest.. Check out the RNA-seq module for info on making & handling .gtf files.

Overlaps between peaks & genes 2
The overlap program can recognize this type of file & will test for overlaps between ChIP-seq peaks and regions around the listed TSSes (default +/-1000 bp). You can also change this range by specifying a –range variable. Find the overlaps between 10-kb regions around TSSes of genes up- or downregulated in each tissue & the corresponding ER binding site data using variations on: bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl Ao_up_TSS.txt AoE_all.bed –outfile Ao_up_v_AiE.overlap Note the number of overlaps, number of genes (= number of tests) and the number of overlaps expected by chance, then start up R and use binomial tests to determine whether there is significant enrichment for each comparison. What conclusions can you draw?

An important note on Data Storage
.fastq files are huge (too big for CDs or, for more than a few, your PC hard drive). So are many of the analysis files (like your .map & .bed files). You can request extra storage space on the cluster – for more info go to: Even that fills up fast: I’d recommend buying an external >1 Terabyte hard drive (~$200 or less).

Broad IGV (“Integrative Genomics Viewer”), an alternative to UCSC browser
You will need to register, but they don’t send you spam.

ChIP-seq Methods & Analysis

Similar presentations

Presentation on theme: "ChIP-seq Methods & Analysis"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

ChIP-seq Methods & Analysis

Similar presentations

Presentation on theme: "ChIP-seq Methods & Analysis"— Presentation transcript:

Similar presentations

About project

Feedback